<div dir="auto">So why not use the built-in QoS features of Spectrum Scale to adjust the performance of a particular fileset? That way you can ensure you have appropriate bandwidth.<div dir="auto"><br></div><div dir="auto"><a href="https://www.ibm.com/docs/en/storage-scale/5.1.1?topic=reference-mmqos-command">https://www.ibm.com/docs/en/storage-scale/5.1.1?topic=reference-mmqos-command</a><br></div><div dir="auto"><br></div><div dir="auto">What you're saying is that you don't want to build a system to meet John's demands because you're worried about Tom not having bandwidth for his process, when in fact there is a way to guarantee a minimum quality of service for every user and still allow the system to perform exceptionally well for those who need or want it.</div><div dir="auto"><br></div><div dir="auto">You can also set hard caps if you want.  I haven't tested it, but you should also be able to set a maxbps for a node so that it won't exceed a certain limit, if you really need to.</div><div dir="auto"><br></div><div dir="auto">Not sure if you're using LSF, but you can even tie LSF queues to Spectrum Scale QoS. I haven't really tried it, but I think it has some great possibilities.</div><div dir="auto"><br></div><div dir="auto">I would say don't hurt John to keep Tom happy; make both of them happy.</div><div dir="auto"><br></div><div dir="auto">In this scenario you don't have to intimately know the CPU vs I/O characteristics of a job.  You just need to know that reserving 1 GB/s of I/O per filesystem is fair, and that letting jobs consume max I/O when available is efficient.  In Linux you have other mechanisms, such as cgroups, to refine workload distribution within the node.</div><div dir="auto"><br></div><div dir="auto">Another way to think about it is that in a system that is trying to get work done, any unused capacity is costing someone somewhere something.
At the same time, if a system can't perform reliably and predictably, that is a problem, but QoS is there to solve that problem.</div><div dir="auto"><br></div><div dir="auto">Alec</div><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Thu, Aug 24, 2023, 8:28 AM Jonathan Buzzard &lt;<a href="mailto:jonathan.buzzard@strath.ac.uk">jonathan.buzzard@strath.ac.uk</a>&gt; wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On 22/08/2023 11:52, Alec wrote:<br>

<br>

> I wouldn't want to use GPFS if I didn't want my nodes to be able to go <br>

> nuts, why bother to be frank.<br>

> <br>

<br>

Because there are multiple users to the system. Do you want to be the <br>

one explaining to 50 other users that they can't use the system today <br>

because John from Chemistry is pounding the filesystem to death for his <br>

jobs? Didn't think so.<br>

<br>

There is not an infinite amount of money available and it is not <br>

possible with a reasonable amount of money to make a file system that <br>

all the nodes can max out their network connection at once.<br>

<br>

> I had tested a configuration with a single x86 box and 4 x 100Gbe <br>

> adapters talking to an ESS, that thing did amazing performance in excess <br>

> of 25 GB/s over Ethernet.  If you have a node that needs that <br>

> performance build to it.  Spend more time configuring QoS to fair share <br>

> your bandwidth than baking bottlenecks into your configuration.<br>

> <br>

<br>

There are finite budgets and compromises have to be made. The <br>

compromises we made back in 2017 when the specification was written and <br>

put out to tender have held up really well.<br>

<br>

> The reasoning of holding end nodes to a smaller bandwidth than the <br>

> backend doesn't make sense.  You want to clear "the work" as efficiently <br>

> as possible, more than keep IT from having any constraints popping up.  <br>

> That's what leads to just endless dithering and diluting of <br>

> infrastructure until no one can figure out how to get real performance.<br>

> <br>

<br>

It does because a small number of jobs can hold the system to ransom for <br>

lots of other users. I have to balance things across a large number of <br>

nodes. There is only a finite amount of bandwidth to the storage and it <br>

has to be shared out fairly. I could attempt to do it with QoS on the <br>

switches, or I could go sod that for a lark, 10Gbps is all you get and <br>

let's keep it simple. Though, like I said, today it would be 25Gbps, but <br>

this was a specification written six years ago when 25Gbps Ethernet was <br>

rather exotic and too expensive.<br>

<br>

> So yeah 95% of the workloads don't care about their performance and can <br>

> live on dithered and diluted infrastructure that costs a zillion times <br>

> more money than what the 5% of workload that does care about bandwidth <br>

> needs to spend to actually deliver.<br>

> <br>

<br>

They do care about performance, they just don't need to max out the <br>

allotted performance per node. However if performance of the file system <br>

is bad the performance of their jobs will also be bad and the total <br>

FLOPS I get from the system will plummet through the floor.<br>

<br>

Note it is more like 0.1% of jobs that peg the 10Gbps network interface <br>

for any period of time at all.<br>

<br>

> Build your infrastructure storage as high bandwidth as possible per node <br>

> because compared to all the other costs it's a drop in the bucket... <br>

> Don't cheap out on "cables".<br>

<br>

No it's not. The Omnipath network (which by the way is reserved <br>

deliberately for MPI) cost a *LOT* of money. We are having serious <br>

conversations that, with current core counts per node, an <br>

Infiniband/Omnipath network doesn't make sense any more, and that 25Gbps <br>

Ethernet will do just fine for a standard compute node.<br>

<br>

Around 85% of our jobs run on 40 cores (aka one node) or less. If you go <br>

to 128 cores a node it's more like 95% of all jobs. If you go to 192 <br>

cores it's about 98% of all jobs. The maximum job size we allow <br>

currently is 400 cores.<br>

<br>

Better to ditch the expensive interconnect and use the hundreds of <br>

thousands of dollars saved to buy more compute nodes is the current <br>

thinking. The 2% of users can just have longer runtimes but hey there <br>

will be a lot more FLOPS available in total and they rarely have just <br>

one job in the queue so it will all balance out in the wash and be <br>

positive for most users.<br>

<br>

In consultation the users are on board with this direction of travel. <br>

From our perspective if a user absolutely needs more than 192 cores on <br>

a modern system it would not be unreasonable to direct them to a <br>

national facility that can handle the really huge jobs. We are an <br>

institutional HPC facility after all. We don't claim to be able to <br>

handle a 1000 core job for example.<br>

<br>

<br>

JAB.<br>

<br>

-- <br>

Jonathan A. Buzzard                         Tel: +44141-5483420<br>

HPC System Administrator, ARCHIE-WeSt.<br>

University of Strathclyde, John Anderson Building, Glasgow. G4 0NG<br>

<br>

<br>


</blockquote></div></div>
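<div dir="auto">As a rough illustration of the policy argued for at the top of this message (a guaranteed minimum for everyone, with idle capacity handed to whoever can use it), here is a minimal Python sketch. It is only a model of the idea, not how Spectrum Scale QoS is implemented or configured. The function name <code>allocate_bandwidth</code>, the 25 GB/s total and the per-consumer demands are all hypothetical numbers chosen for the example; only the 1 GB/s reservation comes from the discussion.</div>

```python
def allocate_bandwidth(total, demands, reserve):
    """Toy model: guaranteed minimum per consumer, spare capacity shared out.

    total   -- total bandwidth available (GB/s), hypothetical
    demands -- dict of consumer name -> bandwidth it would like (GB/s)
    reserve -- minimum guaranteed to every consumer (GB/s)
    """
    if reserve * len(demands) > total:
        raise ValueError("reservations exceed total bandwidth")

    # Step 1: everyone gets what they ask for, up to the reserved minimum.
    alloc = {name: min(want, reserve) for name, want in demands.items()}
    spare = total - sum(alloc.values())

    # Step 2: hand out the spare capacity in equal shares to consumers that
    # still want more, repeating until capacity or demand is exhausted.
    eps = 1e-9  # tolerance for floating-point leftovers
    unmet = {n for n, want in demands.items() if want - alloc[n] > eps}
    while unmet and spare > eps:
        share = spare / len(unmet)
        for name in sorted(unmet):
            grant = min(share, demands[name] - alloc[name])
            alloc[name] += grant
            spare -= grant
        unmet = {n for n in unmet if demands[n] - alloc[n] > eps}
    return alloc


# Hypothetical example: John wants far more than his share, Tom wants little.
example = allocate_bandwidth(
    total=25.0,
    demands={"john": 30.0, "tom": 0.5, "ann": 2.0},
    reserve=1.0,
)
print(example)
```

<div dir="auto">In this model Tom and Ann get everything they ask for and John absorbs the rest, so no capacity sits idle. If every consumer asks for the maximum at once, each still gets at least the reserved minimum and the remainder is split evenly.</div>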