[gpfsug-discuss] Joining RDMA over different networks?

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Thu Aug 24 19:42:33 BST 2023


On 24/08/2023 17:46, Alec wrote:

> So why not use the built in QOS features of Spectrum Scale to adjust the 
> performance of a particular fileset, that way you can ensure you have 
> appropriate bandwidth?
> 

Because all the users' files are in the same fileset would be the simple 
answer. Way, way too much administration overhead for that to change. 
There is a huge amount of KISS involved in the cluster design. Also, it's 
only a subset of John's jobs that peg the network. Oh, and at tender we 
didn't know we would get GPFS, so we had to account for that in the 
system design.

As a side note, GPU nodes get 40Gbps network connections, so I am 
bandwidth limiting by node type.

The flip side is that the high speed network (Omnipath in this case) has 
been reserved for MPI (or similar) traffic.

Basically we observed that core counts were growing at a faster rate than 
Infiniband/Omnipath bandwidth. We went from 12 cores a node to 40 cores, 
but only from 40Gbps Infiniband to 100Gbps Omnipath, so per-core 
bandwidth actually fell from roughly 3.3Gbps to 2.5Gbps. So rather than 
mixing both storage and MPI on the same fabric, we moved the storage out 
onto 10Gbps Ethernet, which for >99% of users is adequate, and freed up 
capacity on the low latency, high speed network for the MPI traffic. I 
stand by that design choice 110%.

Then, because the low latency/high speed network is only for MPI traffic, 
we don't need to equip all nodes with Omnipath (as the tender turned 
out), which saved $$$$ that could be spent elsewhere. A login node, for 
example, does just fine with plain Ethernet. As does a large memory (3TB 
RAM) node which doesn't run multinode jobs. Same for GPU nodes, and it 
worked in our favour again when we added a whole bunch of refurbished, 
Ethernet-only connected standard nodes last year because we had capacity 
problems. Most of our jobs run on a single node, so topology aware 
scheduling in Slurm comes to the rescue (see the sketch below). That's a 
cheap addition if your storage is on commodity Ethernet; it would have 
been horrendously expensive for Omnipath.
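
For what it is worth, the topology aware bit is only a couple of lines 
of Slurm configuration, along these lines (switch and node names are 
made up, and your layout will differ):

    # slurm.conf
    TopologyPlugin=topology/tree

    # topology.conf -- one entry per edge switch; the Ethernet-only
    # nodes get their own entry so multi-node jobs prefer not to
    # span fabrics
    SwitchName=opa1  Nodes=node[001-040]
    SwitchName=opa2  Nodes=node[041-080]
    SwitchName=eth1  Nodes=rnode[001-020]
    SwitchName=core  Switches=opa[1-2],eth1

Slurm then packs a multi-node job under as few switches as it can, so 
the refurb Ethernet boxes mop up the single node work.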

There are also other considerations. Running GPFS is already enough of a 
minority sport that running it over the likes of Omnipath or Infiniband, 
or even with RDMA, is just asking for trouble and fails the KISS test IMHO.


JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
