[gpfsug-discuss] Joining RDMA over different networks?

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Tue Aug 22 11:20:38 BST 2023


On 22/08/2023 10:51, Kidger, Daniel wrote:
> 
> Jonathan,
> 
> Thank you for the great answer!
> Just to be clear though - are you talking about TCP/IP mounting of the filesystem(s) rather than RDMA ?
> 

Yes, for a few reasons. Firstly, a bunch of our Ethernet adaptors don't 
support RDMA. Secondly, there are a lot of ducks to be got in line, and 
kept in line, for RDMA to work, and that's too much effort IMHO. Thirdly, 
the nodes can peg the 10Gbps interface they have, which is a hard QoS 
that we are happy with. Though if specifying today we would have 25Gbps 
to the compute nodes and 100, possibly 200Gbps, on the DSS-G nodes. 
Basically we don't want one node to go nuts and monopolize the file 
system :-) The DSS-G nodes don't have an issue keeping up, so I am not 
sure there is much performance benefit to be had from RDMA.
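To illustrate why the node link speed works as a hard QoS, here is a 
back-of-envelope sketch. The client link speed is the 10Gbps mentioned 
above; the DSS-G aggregate bandwidth is an assumed figure, not our 
actual number.

    # Back-of-envelope: how much of the storage bandwidth can one client grab?
    # Client link speed is from the discussion above; the DSS-G aggregate
    # bandwidth is a hypothetical assumption.

    CLIENT_LINK_GBPS = 10   # per compute node (hard cap = its NIC speed)
    DSSG_LINK_GBPS = 200    # assumed aggregate bandwidth of the DSS-G servers

    # One runaway node can never take more than its own link allows.
    one_node_share = CLIENT_LINK_GBPS / DSSG_LINK_GBPS
    print(f"A single node flat out uses {one_node_share:.0%} of the servers' bandwidth")

    # It would take this many nodes all saturating their NICs at once
    # before the DSS-G end becomes the limiting factor.
    nodes_to_saturate = DSSG_LINK_GBPS / CLIENT_LINK_GBPS
    print(f"It takes ~{nodes_to_saturate:.0f} nodes going flat out to fill the servers")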

That said, you are supposed to be able to do IPoIB over the RDMA 
hardware's network, and I had presumed the same could be said of plain 
TCP/IP over RDMA-capable Ethernet.

> I think routing of RDMA is perhaps something only Lustre can do?
> 

Possibly. Something else is that we have our DSS-G nodes doing MLAGs 
over a pair of switches. I need to be able to do firmware updates on the 
network switches the DSS-G nodes are connected to without shutting down 
the cluster, and reading the switch manuals I don't think you can do 
that with RDMA, so that's another reason not to do it IMHO. In the 2020s 
the mantra is patch, baby, patch, and everything is focused on making 
that quick and easy to achieve. Your expensive HPC system is good for 
nothing if hackers have taken it over because you didn't patch it in a 
timely fashion. Also I would have a *lot* of explaining to do, which I 
would rather not.

Also, in our experience storage is rarely the bottleneck, and when it is 
(e.g. Gromacs creating a ~1TB temp file at 10Gbps, which is a real thing 
we have observed on a fairly regular basis) that's an intended QoS so 
everyone else can get work done and I don't get a bunch of tickets from 
users complaining about the file system performing badly. We have seen 
enough simultaneous Gromacs jobs that without the 10Gbps hard QoS the 
filesystem would have been brought to its knees.
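For a rough sense of what that Gromacs case looks like on the wire, here 
is a quick worked example; the ~1TB size and 10Gbps link are the figures 
above, while the protocol-overhead allowance is an assumption.

    # How long does a ~1TB temp file keep a 10Gbps client NIC pegged?
    # File size and link speed are the figures discussed above; the
    # efficiency factor is an assumed allowance for protocol overhead.

    FILE_BYTES = 1e12      # ~1 TB temp file
    LINK_GBPS = 10         # compute node NIC speed
    EFFICIENCY = 0.9       # assumed usable fraction after protocol overhead

    usable_bytes_per_sec = LINK_GBPS * 1e9 / 8 * EFFICIENCY
    seconds = FILE_BYTES / usable_bytes_per_sec
    print(f"~{seconds / 60:.0f} minutes with the node's link saturated")
    # roughly a quarter of an hour per file, confined to that one node's link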

We can't keep the temp files local to the node because we only spec'ed 
the nodes with 1TB local disks and the Gromacs temp files regularly 
exceed the available local space. Also getting users to do it would be a 
nightmare :-)


JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

