[gpfsug-discuss] Joining RDMA over different networks?

Alec anacreo at gmail.com
Thu Aug 24 17:46:04 BST 2023


So why not use the built-in QoS features of Spectrum Scale to adjust the
performance of a particular fileset? That way you can ensure you have
appropriate bandwidth.

https://www.ibm.com/docs/en/storage-scale/5.1.1?topic=reference-mmqos-command
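
As a rough sketch only (the filesystem name and numbers are made up, and
this is the older pool-level mmchqos interface rather than the newer
per-fileset classes described in the mmqos docs above, so check the man
pages before copying anything):

    # Enable QoS on filesystem "gpfs01" and throttle the "maintenance"
    # class (restripes, rebalances, backups) so user I/O in the "other"
    # class keeps most of the bandwidth. Numbers are illustrative only.
    mmchqos gpfs01 --enable pool=*,maintenance=300IOPS,other=unlimited

    # Watch what each class actually consumed over the last 60 seconds.
    mmlsqos gpfs01 --seconds 60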

What you're saying is that you don't want to build a system to meet John's
demands because you're worried about Tom not having bandwidth for his
process, when in fact there is a way to guarantee a minimum quality of
service for every user and still allow the system to perform exceptionally
well for those who need or want it.

You can also set hard caps if you want. I haven't tested it, but you should
also be able to set a maxbps for a node so that it won't exceed a certain
limit if you really need to.
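
Something along these lines should give you a per-node cap, if I'm
remembering the mmchqos syntax right (the node name and the IOPS figure
are invented, and whether you can express the limit directly in bytes per
second is something to verify in the docs):

    # Give a specific node its own allocation of the "other" (normal user
    # I/O) class on gpfs01, so that node can't exceed roughly this rate.
    # Verify the -N option and the units against the mmchqos/mmqos docs.
    mmchqos gpfs01 --enable -N bignode01 other=2000IOPS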

Not sure if you're using LSF, but you can even tie LSF queues to Spectrum
Scale QoS. I haven't really tried it, but it seems to have some great
possibilities.

I would say don't hurt John to keep Tom happy; make both of them happy.

In this scenario you don't have to intimately know the CPU vs I/O
characteristics of a job. You just need to know that reserving 1 GB/s of
I/O per filesystem is fair, and that letting jobs consume the maximum
available I/O when it is free is efficient. On Linux you also have other
mechanisms, such as cgroups, to refine workload distribution within the
node.
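
For example, with cgroup v2 you can pin a greedy job's CPU share so it
can't starve everything else on the node while it hammers I/O (the paths
and numbers here are just an illustration):

    # Create a cgroup for the job and cap it at 8 CPUs (800000us of CPU
    # time per 100000us period), then move the job's process into it.
    mkdir /sys/fs/cgroup/job1234
    echo "800000 100000" > /sys/fs/cgroup/job1234/cpu.max
    echo "$JOB_PID" > /sys/fs/cgroup/job1234/cgroup.procs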

Another way to think about it: in a system that is trying to get work
done, any unused capacity is costing someone somewhere something. At the
same time, a system that can't perform reliably and predictably is a
problem, but QoS is there to solve exactly that problem.

Alec

On Thu, Aug 24, 2023, 8:28 AM Jonathan Buzzard <
jonathan.buzzard at strath.ac.uk> wrote:

> On 22/08/2023 11:52, Alec wrote:
>
> > I wouldn't want to use GPFS if I didn't want my nodes to be able to go
> > nuts; why bother, to be frank.
> >
>
> Because there are multiple users to the system. Do you want to be the
> one explaining to 50 other users that they can't use the system today
> because John from Chemistry is pounding the filesystem to death for his
> jobs? Didn't think so.
>
> There is not an infinite amount of money available, and it is not
> possible, for a reasonable amount of money, to build a file system that
> lets every node max out its network connection at once.
>
> > I had tested a configuration with a single x86 box and 4 x 100GbE
> > adapters talking to an ESS; it delivered amazing performance, in excess
> > of 25 GB/s over Ethernet. If you have a node that needs that
> > performance, build to it. Spend more time configuring QoS to fair-share
> > your bandwidth than baking bottlenecks into your configuration.
> >
>
> There are finite budgets and compromises have to be made. The
> compromises we made back in 2017 when the specification was written and
> put out to tender have held up really well.
>
> > The reasoning behind holding end nodes to a smaller bandwidth than the
> > backend doesn't make sense. You want to clear the work as efficiently
> > as possible, more than you want to keep IT from having any constraints
> > pop up. That's what leads to endless dithering and diluting of
> > infrastructure until no one can figure out how to get real performance.
> >
>
> It does, because a small number of jobs can hold the system to ransom for
> lots of other users. I have to balance things across a large number of
> nodes. There is only a finite amount of bandwidth to the storage and it
> has to be shared out fairly. I could attempt to do it with QoS on the
> switches, or I could go "sod that for a lark", 10Gbps is all you get,
> and let's keep it simple. Though like I said, today it would be 25Gbps,
> but this was a specification written six years ago, when 25Gbps Ethernet
> was rather exotic and too expensive.
>
> > So yeah, 95% of the workloads don't care about their performance and can
> > live on dithered and diluted infrastructure that costs a zillion times
> > more money than what the 5% of workloads that do care about bandwidth
> > need to spend to actually deliver.
> >
>
> They do care about performance, they just don't need to max out the
> allotted performance per node. However, if the performance of the file
> system is bad, the performance of their jobs will also be bad and the
> total FLOPS I get from the system will plummet through the floor.
>
> Note it is more like 0.1% of jobs that peg the 10Gbps network interface
> for any period of time at all.
>
> > Build your infrastructure storage with as much bandwidth as possible per
> > node, because compared to all the other costs it's a drop in the
> > bucket... Don't cheap out on "cables".
>
> No it's not. The Omnipath network (which, by the way, is deliberately
> reserved for MPI) cost a *LOT* of money. We are having serious
> conversations about whether, with current core counts per node, an
> Infiniband/Omnipath network makes sense any more, and whether 25Gbps
> Ethernet will do just fine for a standard compute node.
>
> Around 85% of our jobs run on 40 cores (aka one node) or less. If you go
> to 128 cores a node it's more like 95% of all jobs. If you go to 192
> cores it's about 98% of all jobs. The maximum job size we allow
> currently is 400 cores.
>
> The current thinking is that it is better to ditch the expensive
> interconnect and use the hundreds of thousands of dollars saved to buy
> more compute nodes. The 2% of users will just have longer runtimes, but
> hey, there will be a lot more FLOPS available in total, and they rarely
> have just one job in the queue, so it will all balance out in the wash
> and be positive for most users.
>
> In consultation the users are on board with this direction of travel.
> From our perspective, if a user absolutely needs more than 192 cores on
> a modern system, it would not be unreasonable to direct them to a
> national facility that can handle the really huge jobs. We are an
> institutional HPC facility after all. We don't claim to be able to
> handle a 1000 core job, for example.
>
>
> JAB.
>
> --
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
>