[gpfsug-discuss] WAS: alternative path; Now: RDMA

Alec anacreo at gmail.com
Sun Dec 12 23:00:21 GMT 2021


So I never said this node wasn't in an HPC cluster; it has partners...  For
our use case, however, some nodes carry very expensive per-core software
licensing, and we have to weigh the human cost of empowering traditional
monolithic code to do the job against bringing in more people to rewrite and
maintain distributed code (someone is going to spend the money to get this
work done!).  So, to get the most out of those licensed cores, we have
designed our virtual compute machine(s) with 128Gbps+ of SAN fabric.  Just
to achieve our average business-day reads it would take three of your cluster
nodes maxed out for 24 hours, or nine of them within a business day, to match
the same read rates... and another four nodes to handle the writes.  I guess
HPC is in the eye of the business...  In my experience cables and ports are
cheaper than servers.
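Rough arithmetic behind those node counts, assuming 10Gbps links running at
line rate, the ~300TB/day of reads mentioned earlier in the thread, and an
8-hour business day (all round numbers on my part):

    # Back-of-envelope: how many 10Gbps nodes does ~300TB/day of reads need?
    daily_reads_tb = 300                             # TB read per day (approximate)
    node_gb_per_s = 10 / 8                           # 10Gbps link at line rate = 1.25 GB/s
    node_tb_per_hour = node_gb_per_s * 3600 / 1000   # ~4.5 TB per hour per node
    print(daily_reads_tb / (node_tb_per_hour * 24))  # ~2.8 nodes flat out for 24 hours
    print(daily_reads_tb / (node_tb_per_hour * 8))   # ~8.3 nodes within an 8-hour business day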

The classic shared HPC design you have is being up-ended by the fact that
there is now so much compute power (CPU and memory) in the nodes that you
can't simply build a system with two storage connections (Noah's ark) and
call it a day.  If you look at the spec, 25Gbps Ethernet only delivers ~3GB/s
(which is just above USB 3.2 and below USB4).
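For the record, the ~3GB/s figure is just the line rate divided by eight,
less some protocol overhead (the overhead factor below is my rough guess):

    # 25Gbps Ethernet expressed in GB/s
    line_rate_gbps = 25
    raw_gb_per_s = line_rate_gbps / 8       # 3.125 GB/s at line rate
    usable_gb_per_s = raw_gb_per_s * 0.95   # ~3.0 GB/s after framing/TCP overhead (assumed factor)
    print(raw_gb_per_s, usable_gb_per_s)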

Spectrum Scale does very well for us when met with a fully saturated
workload.  We maintain one node for SLA workload and one node for ad-hoc
workload, and like clockwork the SLA box takes exactly half the bandwidth
when a job fires, so that the one SLA job gets half the bandwidth and
completes while the 40 ad-hoc jobs on the other node share the rest.  In
newer releases IBM has introduced fileset throttling... this is very
exciting, as we can design the biggest, fattest pipes from VM to storage and
then software-define both the storage AND the bandwidth, from the standard
nobody-cares workloads all the way up to the most critical workloads...

I don't buy the "smaller bandwidth is better" argument; I see that as just
one band-aid for a problem with more elegant solutions, such as applying
resource constraints (you can't push the bandwidth if you can't get the
CPU...) or using a workload orchestrator such as LSF with limits set.  But I
also won't say it never makes sense, as I only know my own problems and my
own solutions.  For years the network team wouldn't give users more than
10Mbps, then 100Mbps, networking because they were always worried about
their backend being overwhelmed... at one point I literally had faster home
internet service than my work desktop connection.  It was all a fallacy: the
workload should drive the technology, the technology shouldn't hinder the
workload.

You can do a simple exercise: try scaling up... imagine your cluster is
asked to start computing 100x more work, and that work must be completed on
time.  Do you simply say "let me buy 100x more of everything"?  Or do you
start to look at where you can gain efficiency and which actual bottlenecks
you need to lift... for some of us it's CPU, for some it's memory, for some
it's disk, depending on the work.  I'd say it's extremely rare to need 100x
more of EVERYTHING, but you have to get past the performance of the basic
building blocks baked into the cake before it makes practical and financial
sense to dig deeper into the bottlenecks.  If your main bottleneck were
storage, you'd be asking far different questions about RDMA.
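To make that concrete, here is a toy version of the exercise (the
utilization numbers are invented purely for illustration):

    # Toy bottleneck check for the 100x thought experiment.
    current_utilization = {"cpu": 0.30, "memory": 0.10, "disk_bw": 0.70, "network": 0.05}
    scale_factor = 100
    for resource, used in current_utilization.items():
        needed = used * scale_factor   # multiples of today's capacity required at 100x load
        print(f"{resource}: ~{needed:.0f}x today's capacity")
    # Whichever resource tops the list is the bottleneck worth lifting first;
    # the others may already have the headroom baked in.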

Alec


On Sun, Dec 12, 2021 at 3:19 AM Jonathan Buzzard <
jonathan.buzzard at strath.ac.uk> wrote:

> On 12/12/2021 02:19, Alec wrote:
>
> > I feel the need to respond here...  I see many responses on this
> > User Group forum that are dismissive of the fringe / extreme use
> > cases and of the "what do you need that for" mindset.  The thing is
> > that Spectrum Scale is for the extreme, just take the word "Parallel"
> > in the old moniker that was already an extreme use case.
>
> I wasn't being dismissive, I was asking what the benefits of using RDMA
> were. There is very little information about it out there and not a lot
> of comparative benchmarking on it either. Without the benefits being
> clearly laid out I am unlikely to consider it and might be missing a trick.
>
> IBM's literature on the topic is underwhelming to say the least.
>
> [SNIP]
>
>
> > I have an AIX LPAR that traverses more than 300TB+ of data a day on a
> > Spectrum Scale file system, it is fully virtualized, and handles a
> > million files.  If that performance level drops, regulatory reports
> > will be late, business decisions won't be current. However, the
> > systems of today and the future have to traverse this much data and
> > if they are slow then they can't keep up with real-time data feeds.
>
> I have this nagging suspicion that modern all flash storage systems
> could deliver that sort of performance without the overhead of a
> parallel file system.
>
> [SNIP]
>
> >
> > Douglas's response is the right one, how much IO does the
> > application / environment need, it's nice to see Spectrum Scale have
> > the flexibility to deliver.  I'm pretty confident that if I can't
> > deliver the required I/O performance on Spectrum Scale, nobody else
> > can on any other storage platform within reasonable limits.
> >
>
> I would note here that in our *shared HPC* environment I made a very
> deliberate design decision to attach the compute nodes with 10Gbps
> Ethernet for storage. Though I would probably pick 25Gbps if we were
> procuring the system today.
>
> There were many reasons behind that, the main one being that
> historical file system performance showed that greater than 99% of the
> time the file system never got above 20% of its benchmarked speed.
> Using 10Gbps Ethernet was not going to be a problem.
>
> Secondly, by limiting the connection to 10Gbps it stops one person
> hogging the file system to the detriment of other users. We have seen
> individual nodes peg their 10Gbps link from time to time, even several
> nodes at once (jobs from the same user), and had they had access to a
> 100Gbps storage link that would have been curtains for everyone else's
> file system usage.
>
> At this juncture I would note that the GPFS admin traffic is handled on
> a separate IP address space on a separate VLAN which we prioritize with
> QoS on the switches. So even when a node floods its 10Gbps link for
> extended periods of time it doesn't get ejected from the cluster. A
> separate physical network for admin traffic is not necessary
> in my experience.
>
> That said, you can do RDMA with Ethernet... Unfortunately the teaching
> cluster and protocol nodes are on Intel X520s, which I don't think do
> RDMA. Everything else is X710s or Mellanox ConnectX-4, which definitely
> do RDMA. I could upgrade the protocol nodes but the teaching cluster
> would be a problem.
>
>
> JAB.
>
> --
> Jonathan A. Buzzard                         Tel: +44141-5483420
> HPC System Administrator, ARCHIE-WeSt.
> University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
>

