[gpfsug-discuss] WAS: alternative path; Now: RDMA

Douglas O'Flaherty douglasof at us.ibm.com
Fri Dec 10 04:24:21 GMT 2021


Jonathan:

You posed a reasonable question, which was "when is RDMA worth the 
hassle?" I agree with part of your premise, which is that RDMA only 
matters when the bottleneck isn't somewhere else. With a parallel file 
system like Scale/GPFS, the absolute performance bottleneck is not the 
throughput of a single drive. In the majority of Scale/GPFS clusters the 
network data path is the performance limitation. Once a site deploys HDR 
InfiniBand or 100/200/400Gbps Ethernet, the network stops being the 
limit, and at that point the buffer copy time inside the server matters. 
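
As a concrete illustration of that in-server copy, here is a minimal 
sketch (not our benchmark code; the file path and sizes are placeholders 
and error handling is omitted) of the conventional path when the data's 
final destination is a device such as a GPU - the data lands in a host 
buffer first and is then copied out again:

    /* Conventional "bounce copy" path: storage -> host buffer -> GPU. */
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 64UL << 20;            /* 64 MiB, placeholder */
        int fd = open("/gpfs/fs1/train/shard.bin", O_RDONLY);

        void *host_buf, *dev_buf;
        cudaMallocHost(&host_buf, len);           /* pinned host buffer  */
        cudaMalloc(&dev_buf, len);

        read(fd, host_buf, len);                  /* copy 1: FS -> host  */
        cudaMemcpy(dev_buf, host_buf, len,
                   cudaMemcpyHostToDevice);       /* copy 2: host -> GPU */

        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
        close(fd);
        return 0;
    }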

When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) 
is easily demonstrated because it eliminates the bounce copy through 
system memory. In our NVIDIA DGX A100 server testing we were able to get 
around 2x the per-system throughput by using RDMA direct to the GPU 
(GPUDirect Storage). (Tested on two DGX systems with 4x HDR links per 
storage node.) 
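
For reference, here is a minimal sketch of the same read using cuFile 
(the GPUDirect Storage API) - again a sketch rather than our test 
harness: it assumes a GDS-enabled stack (MOFED, nvidia-fs, and a 
supporting file system), the path and sizes are placeholders, and error 
handling is omitted:

    /* GDS path: the file system DMAs straight into GPU memory; the host
     * bounce buffer disappears. */
    #define _GNU_SOURCE                           /* for O_DIRECT */
    #include <cufile.h>
    #include <cuda_runtime.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        const size_t len = 64UL << 20;            /* 64 MiB, placeholder */

        cuFileDriverOpen();
        int fd = open("/gpfs/fs1/train/shard.bin", O_RDONLY | O_DIRECT);

        CUfileDescr_t descr = {0};
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
        descr.handle.fd = fd;

        CUfileHandle_t fh;
        cuFileHandleRegister(&fh, &descr);

        void *dev_buf;
        cudaMalloc(&dev_buf, len);
        cuFileBufRegister(dev_buf, len, 0);       /* register GPU buffer */

        /* Single transfer: storage -> GPU memory, no host copy. */
        cuFileRead(fh, dev_buf, len, 0 /* file offset */, 0 /* dev offset */);

        cuFileBufDeregister(dev_buf);
        cuFileHandleDeregister(fh);
        cudaFree(dev_buf);
        close(fd);
        cuFileDriverClose();
        return 0;
    }

The difference between the two sketches - the eliminated host copy and 
the host memory bandwidth it consumes - is where that roughly 2x 
per-system gain comes from.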

However, your question remains. Synthetic benchmarks are good indicators 
of technical benefit, but do your users and applications need that extra 
performance? 

There are probably only a handful of codes in most organizations that 
need this. However, they are high-value use cases. We have client 
applications that either read a lot of data semi-randomly and uncached - 
think mini-Epics for scaling ML training - or demand the lowest response 
time, like production inference for voice recognition and NLP. 

If anyone has use cases for GPU-accelerated codes with truly demanding 
data needs, please reach out directly. We are looking for more use cases 
to characterize the benefit for a new paper. If you can provide some code 
examples, we can help test whether RDMA direct to the GPU (GPUDirect 
Storage) is a benefit. 

Thanks,

doug

Douglas O'Flaherty
douglasof at us.ibm.com






----- Message from Jonathan Buzzard <jonathan.buzzard at strath.ac.uk> on 
Fri, 10 Dec 2021 00:27:23 +0000 -----
To:
gpfsug-discuss at spectrumscale.org
Subject:
Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration
On 09/12/2021 16:04, Douglas O'Flaherty wrote:
> 
> Though not directly about your design, our work with NVIDIA on GPUdirect
> Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) can be to
> both MOFED and firmware version compatibility.
>
> I would suggest anyone debugging RDMA issues should look at those
> closely.
> 
May I ask what are the alleged benefits of using RDMA in GPFS?

I can see there would be lower latency over a plain IP Ethernet or IPoIB 
solution but surely disk latency is going to swamp that?

I guess SSD drives might change that calculation but I have never seen 
proper benchmarks comparing the two, or even better yet all four 
connection options.

Just seems a lot of complexity and fragility for very little gain to me.


JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


 
 