[gpfsug-discuss] Joining RDMA over different networks?

Alec anacreo at gmail.com
Tue Aug 22 11:52:29 BST 2023


To be frank, if I didn't want my nodes to be able to go nuts, I
wouldn't want to use GPFS at all; why bother?

I had tested a configuration with a single x86 box and 4 x 100GbE
adapters talking to an ESS, and it delivered amazing performance, in
excess of 25 GB/s over Ethernet.  If you have a node that needs that
performance, build for it.  Spend your time configuring QoS to
fair-share your bandwidth rather than baking bottlenecks into your
configuration.
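On the Scale side, one of the knobs for that is the built-in QoS
(mmchqos), which throttles I/O by class (maintenance vs. normal user
I/O) rather than per node.  A minimal sketch, assuming a filesystem
called gpfs0; the names and IOPS numbers are placeholders:

  # Cap maintenance traffic (restripe, rebalance, etc.) so it cannot
  # starve user I/O; leave normal user I/O unlimited.  Filesystem name
  # and IOPS value are examples only.
  mmchqos gpfs0 --enable pool=*,maintenance=10000IOPS,other=unlimited

  # Check the resulting per-class throughput afterwards
  mmlsqos gpfs0

If the goal is fair-sharing bandwidth between compute nodes rather
than between maintenance and user I/O, that is more a job for QoS on
the network side.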

The reasoning behind holding end nodes to less bandwidth than the
backend doesn't make sense.  You want to clear "the work" as
efficiently as possible, not just keep IT from ever seeing a
constraint pop up.  That thinking leads to endless dithering and
diluting of infrastructure until no one can figure out how to get
real performance.

So yeah, 95% of workloads don't care about their performance and can
live on dithered and diluted infrastructure, which ends up costing a
zillion times more than what the 5% of workload that does care about
bandwidth would need to spend to actually deliver.

Build your storage infrastructure with as much bandwidth per node as
possible, because compared to all the other costs it's a drop in the
bucket...  Don't cheap out on "cables".

The real joke is that the masses are running what big iron pioneered,
can't even fathom how much work that last 5% of the data center is
doing, and then try to dictate how to "engineer" platforms by not
engineering.  God help you if you have a SharePoint list with 5,000+
entries; you'll likely break the internets with that high-volume
workload.

Alec

On Tue, Aug 22, 2023, 3:23 AM Jonathan Buzzard <
jonathan.buzzard at strath.ac.uk> wrote:

> On 22/08/2023 10:51, Kidger, Daniel wrote:
> >
> > Jonathan,
> >
> > Thank you for the great answer!
> > Just to be clear though - are you talking about TCP/IP mounting of the
> filesystem(s) rather than RDMA ?
> >
>
> Yes, for a few reasons.  Firstly, a bunch of our Ethernet adaptors
> don't support RDMA.  Secondly, there are a lot of ducks to be got in
> line, and kept in line, for RDMA to work, and that's too much effort
> IMHO.  Thirdly, the nodes can peg the 10Gbps interface they have,
> which is a hard QoS that we are happy with.  Though if we were
> specifying today we would have 25Gbps to the compute nodes and 100,
> possibly 200, Gbps on the DSS-G nodes.  Basically we don't want one
> node to go nuts and monopolize the file system :-)  The DSS-G nodes
> don't have an issue keeping up, so I am not sure there is much
> performance benefit to be had from RDMA.
>
> That said, you are supposed to be able to do IPoIB over the RDMA
> hardware's network, and I had presumed that the same could be said
> of TCP/IP over RDMA-capable Ethernet.
>
> > I think routing of RDMA is perhaps something only Lustre can do?
> >
>
> Possibly.  Something else is that we have our DSS-G nodes doing
> MLAGs over a pair of switches.  I need to be able to do firmware
> updates on the network switches the DSS-G nodes are connected to
> without shutting down the cluster.  Reading the switch manuals, I
> don't think you can do that with RDMA, so that's another reason not
> to do it IMHO.  In the 2020s the mantra is "patch, baby, patch" and
> everything is focused on making that quick and easy to achieve.
> Your expensive HPC system is worth jack if hackers have taken it
> over because you didn't patch it in a timely fashion.  Also, I would
> have a *lot* of explaining to do, which I would rather not.
>
> Also, in our experience storage is rarely the bottleneck, and when
> it is, e.g. Gromacs creating a ~1TB temp file at 10Gbps (yeah,
> that's a real thing we observe on a fairly regular basis), that's an
> intended QoS so everyone else can get work done and I don't get a
> bunch of tickets from users complaining about the file system
> performing badly.  We have seen enough simultaneous Gromacs jobs
> that without the 10Gbps hard QoS the filesystem would have been
> brought to its knees.
>
> We can't do the temp files locally on the node because we only spec'ed
> them with 1TB local disks and the Gromacs temp files regularly exceed
> the available local space. Also getting users to do it would be a
> nightmare :-)
>
>
> JAB.
>
> --
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
>
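For anyone wondering what the RDMA "ducks in line" look like on the
Scale side, a rough sketch of the daemon configuration, with made-up
adapter names and node class, and ignoring the switch/NIC side of
RoCE (PFC/ECN lossless configuration), which is a whole extra set of
ducks:

  # Enable verbs RDMA only on the nodes that have capable adapters
  # ("rdmaNodes" is a hypothetical node class)
  mmchconfig verbsRdma=enable -N rdmaNodes

  # Tell the daemon which HCA ports to use (device/port names here
  # are examples)
  mmchconfig verbsPorts="mlx5_0/1 mlx5_1/1" -N rdmaNodes

  # Takes effect after the GPFS daemon is restarted on those nodes

which is exactly the sort of thing that has to be kept in line across
every node, firmware update and adapter swap.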

