<div dir="auto"><div>I wouldn't want to use GPFS if I didn't want my nodes to be able to go nuts, why bother to be frank.<div dir="auto"><br></div><div dir="auto">I had tested a configuration with a single x86 box and 4 x 100Gbe adapters talking to an ESS, that thing did amazing performance in excess of 25 GB/s over Ethernet. If you have a node that needs that performance build to it. Spend more time configuring QoS to fair share your bandwidth than baking bottlenecks into your configuration.</div><div dir="auto"><br></div><div dir="auto">The reasoning of holding end nodes to a smaller bandwidth than the backend doesn't make sense. You want to clear "the work" as efficiently as possible, more than keep IT from having any constraints popping up. That's what leads to just endless dithering and diluting of infrastructure until no one can figure out how to get real performance.</div><div dir="auto"><br></div><div dir="auto">So yeah 95% of the workloads don't care about their performance and can live on dithered and diluted infrastructure that costs a zillion times more money than what the 5% of workload that does care about bandwidth needs to spend to actually deliver.</div><div dir="auto"><br></div>Build your infrastructure storage as high bandwidth as possible per node because compared to all the other costs it's a drop in the bucket... Don't cheap out on "cables".</div><div dir="auto"><br></div><div dir="auto">The real joke is the masses are running what big iron pioneered, can't even fathom how much workload that last 5% of the data center is doing, and then trying to dictate how to "engineer" platforms by not engineering. Just god help you if you have a SharePoint list with 5000+ entries, you'll likely break the internets with that high volume workload.</div><div dir="auto"><br></div><div dir="auto">Alec<br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Tue, Aug 22, 2023, 3:23 AM Jonathan Buzzard <<a href="mailto:jonathan.buzzard@strath.ac.uk">jonathan.buzzard@strath.ac.uk</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 22/08/2023 10:51, Kidger, Daniel wrote:<br>
> <br>
> Jonathan,<br>
> <br>
> Thank you for the great answer!<br>
> Just to be clear though - are you talking about TCP/IP mounting of the filesystem(s) rather than RDMA ?<br>
> <br>
<br>
Yes, for a few reasons. Firstly, a bunch of our Ethernet adaptors don't <br>
support RDMA. Secondly, there are a lot of ducks to be got in line and <br>
kept in line for RDMA to work, and that's too much effort IMHO. Thirdly, <br>
the nodes can peg the 10Gbps interface they have, which is a hard QOS <br>
that we are happy with. Though if specifying today we would have 25Gbps <br>
to the compute nodes and 100, possibly 200Gbps, on the DSS-G nodes. <br>
Basically we don't want one node to go nuts and monopolize the file <br>
system :-) The DSS-G nodes don't have an issue keeping up, so I am not <br>
sure there is much performance benefit from RDMA to be had.<br>
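<br>
To put rough numbers on that hard QOS (back-of-the-envelope in Python, <br>
using nominal line rates; the 200Gbps figure is the would-be re-spec <br>
mentioned above, not what we actually run today):<br>
<br>
node_gbps = 10                   # per compute node Ethernet, the hard QOS<br>
server_gbps = 200                # hypothetical DSS-G uplink if re-specified today<br>
node_gbytes = node_gbps / 8      # ~1.25 GB/s ceiling per node<br>
nodes_to_fill = server_gbps / node_gbps   # ~20 nodes flat out per server link<br>
print(f"per node ceiling ~{node_gbytes:.2f} GB/s, "<br>
      f"~{nodes_to_fill:.0f} nodes to fill one server link")<br>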
<br>
That said, you are supposed to be able to do IPoIB over the RDMA <br>
hardware's network, and I had presumed that the same could be said for <br>
TCP/IP over RDMA-capable Ethernet.<br>
<br>
> I think routing of RDMA is perhaps something only Lustre can do?<br>
> <br>
<br>
Possibly. Something else is that we have our DSS-G nodes doing MLAGs <br>
over a pair of switches. I need to be able to do firmware updates on the <br>
network switches the DSS-G nodes are connected to without shutting down <br>
the cluster. Reading the switch manuals, I don't think you can do that <br>
with RDMA, so that's another reason not to do it IMHO. In the 2020s the <br>
mantra is patch, baby, patch, and everything is focused on making that <br>
quick and easy to achieve. Your expensive HPC system is for jack if <br>
hackers have taken it over because you didn't patch it in a timely <br>
fashion. Also I would have a *lot* of explaining to do, which I would <br>
rather avoid.<br>
<br>
Also, in our experience storage is rarely the bottleneck, and when it <br>
is, e.g. Gromacs creating a ~1TB temp file at 10Gbps (yeah, that's a <br>
real thing we have observed on a fairly regular basis), that's an <br>
intended QOS so everyone else can get work done and I don't get a bunch <br>
of tickets from users complaining about the file system performing <br>
badly. We have seen enough simultaneous Gromacs runs that without the <br>
10Gbps hard QOS the filesystem would have been brought to its knees.<br>
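<br>
For a sense of scale (rough Python arithmetic only; the eight <br>
simultaneous runs below are an arbitrary illustration, not a measured <br>
count):<br>
<br>
temp_gb = 1000                    # ~1TB Gromacs temp file<br>
link_gbs = 10 / 8                 # 10Gbps hard QOS, ~1.25 GB/s<br>
minutes_per_run = temp_gb / link_gbs / 60   # ~13 minutes at full line rate<br>
runs = 8                          # hypothetical simultaneous Gromacs jobs<br>
aggregate_gbs = runs * link_gbs   # ~10 GB/s offered load on the DSS-G pair<br>
print(f"one run pegs its link for ~{minutes_per_run:.0f} min; "<br>
      f"{runs} runs offer ~{aggregate_gbs:.0f} GB/s to the backend")<br>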
<br>
We can't put the temp files on node-local storage because we only <br>
spec'd the nodes with 1TB local disks and the Gromacs temp files <br>
regularly exceed the available local space. Also, getting users to do <br>
it would be a nightmare :-)<br>
<br>
<br>
JAB.<br>
<br>
-- <br>
Jonathan A. Buzzard Tel: +44141-5483420<br>
HPC System Administrator, ARCHIE-WeSt.<br>
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG<br>
<br>
<br>
_______________________________________________<br>
gpfsug-discuss mailing list<br>
gpfsug-discuss at <a href="http://gpfsug.org" rel="noreferrer noreferrer" target="_blank">gpfsug.org</a><br>
<a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org" rel="noreferrer noreferrer" target="_blank">http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org</a><br>
</blockquote></div></div></div>