<div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10pt" ><div dir="ltr" >Jonathan,</div>
<div dir="ltr" > </div>
<div dir="ltr" >What is the main outcome or business requirement of the teaching cluster ( i notice your specific in the use of defining it as a teaching cluster)</div>
<div dir="ltr" >It is entirely possible that the use case for this cluster does not warrant the use of high speed low latency networking, and it simply needs the benefits of a parallel filesystem.</div>
<div dir="ltr" > </div>
<div dir="ltr" >As a Storage vendor / solution architect, there are generally two problems that i'm trying to solve when it comes to Storage design. (Capacity - generally pretty easy)</div>
<div dir="ltr" >and Performance (after that there are things like, Access type, resiliency, and all the other data management components). but Capacity and Performance primarily, and performance can be very different for different use cases. however in most use cases its a derivative of "how do i reduce CPU or GPU wait times" </div>
<div dir="ltr" > </div>
<div dir="ltr" >Sometimes that answer will be - Big pipe feed a big chuck of data in as fast as I can and let the servers go off and do their thing</div>
<div dir="ltr" >Sometimes that answer will be - Lots of nodes with lots of single load request let the servers go off and do their thing (once a day / week / month) type job</div>
<div dir="ltr" > </div>
<div dir="ltr" >All too often though, it's now about how many jobs / queries can I complete in x period of time. And this is where RDMA really starts to make a difference for my clients in Australia and New Zealand.</div>
<div dir="ltr" > </div>
<div dir="ltr" >By reducing the TCP/IP latency and improving the data transfer rates, we can massively reduce the CPU / GPU IO wait times between jobs, which can translate into results like</div>
<div dir="ltr" >30% more jobs in the same 24 hour period, or 2 jobs per server per day more on the same compute / memory resources scaled by number of nodes, or any other number of outcomes depending on the workload and the measurement factor.</div>
<div dir="ltr" > </div>
<div dir="ltr" >For some of my commercial / fintech clients this is the ability to run multiple iterations of the same report per day rather than once per day, or potentially even real time query processing. </div>
<div dir="ltr" >For some of my research clients this is the ability to run 20-30% more compute jobs on the same HPC resources in the same 24H period, which means that they can reduce the amount of time they need on the HPC cluster to get the data results that they are looking for. </div>
<div dir="ltr" >For some of my other research clients this is the ability to run 1 single multi iteration job significantly faster across multiple nodes within the cluster - reducing what historically might have been a 3-5day job down to <24 hours.</div>
<div dir="ltr" > </div>
<div dir="ltr" >I can quantifiably verify that every RoCE / Infiniband deployment that i have been involved in has seen user benefits that are measurable outcomes, in many different industries</div>
<div dir="ltr" > </div>
<div dir="ltr" >Platforms such as SAS Grid Analytics clients who've switched from block storage or NFS based storage are always impressed by the 3-5x benefit of using a parallel filesystem</div>
<div dir="ltr" >but then to pick up another 2x benefit from using RoCE or Infiniband interconnects can often be a significant bonus. </div>
<div dir="ltr" > </div>
<div dir="ltr" >The same is often true for Splunk / Spark and many similar platforms, delivering an architecture that is designed end to end to give wide open pipes, specifically tuned to reduce latency, and CPU IO wait times delivers massive business benefits in terms of time to data access. </div>
<div dir="ltr" > </div>
<div dir="ltr" ><div>As for IBM documentation on RDMA, its not an IBM product, so no we don't have a massive amount of documentation on RDMA, There is documentation on how Spectrum Scale Implements the RDMA commands, there is documentation on some of the known issues ( ie there are requirements to use IPV6 with RDMA-CM for RoCE, and IBM does not default to enabling IPV6 especially on ESS based hardware)</div>
<div> </div>
<div>But as RDMA is really a networking feature, the majority of the documentation is provided by the networking vendors. The key infrastructure component in deciding whether or not to enable RDMA, particularly on an Ethernet network, is actually your choice of switch and how well it implements the requirements for delivering RoCE without inhibiting other functionality in the switching. NVIDIA / Mellanox have two different implementations depending on whether you are running the Onyx or Cumulus switch OS, and the Arista implementation is different again. I actively try to avoid RoCE on Cisco switching; I'm sure there are plenty of commercial users in VMware land who are using it happily, but I have yet to have a positive experience with RoCE on Cisco networking gear.</div>
<div> </div>
<div>At the end of the day, the need for RDMA really comes down to what you are trying to achieve. InfiniBand networking at 56, 100, 200, or 400Gbit is all about delivering data at the fastest possible rate with the lowest possible latency, and there RDMA is a mandatory requirement. 100Gbit / 200Gbit Ethernet is usually chasing the same outcome, with more variety in terms of networking vendor, and will also want the benefits of RDMA. 10/25/40/50Gbit networking tends to be where the focus is less about storage performance delivery and more about cost optimisation, and in that space RDMA becomes a more difficult implementation decision: if you already have bottlenecks elsewhere in your solution design, does implementing RDMA add enough value? Possibly not.</div></div>
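<div dir="ltr" > </div>
<div dir="ltr" >And if you are unsure whether the adapters in a given node can do RDMA at all, or whether they would do it as InfiniBand or as RoCE over Ethernet, a quick way to check on Linux is the rdma-core sysfs tree - a minimal sketch, assuming the devices are exposed under /sys/class/infiniband:</div>
<div dir="ltr" ><pre>
#!/usr/bin/env python3
# Sketch: list RDMA-capable adapters and whether each port runs as
# InfiniBand or Ethernet (i.e. a RoCE candidate). Linux only.
from pathlib import Path

sysfs = Path("/sys/class/infiniband")
if not sysfs.is_dir():
    print("No RDMA-capable devices visible - RoCE / InfiniBand is not an option on this node.")
else:
    for dev in sorted(sysfs.iterdir()):
        for port in sorted((dev / "ports").iterdir()):
            link = (port / "link_layer").read_text().strip()  # "InfiniBand" or "Ethernet"
            rate = (port / "rate").read_text().strip()        # e.g. "100 Gb/sec (4X EDR)"
            print(f"{dev.name} port {port.name}: {link}, {rate}")
</pre></div>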
<div dir="ltr" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial, Helvetica, sans-serif;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt" ><div class="socmaildefaultfont" dir="ltr" style="font-family:Arial;font-size:10.5pt" ><div dir="ltr" style="margin-top: 20px;" ><div style="font-size: 12pt; font-weight: bold; font-family: sans-serif; color: #7C7C5F;" >Andrew Beattie</div>
<div><strong>Technical Sales - Storage for Big Data and AI</strong></div>
<div><strong>IBM Systems - Storage </strong></div>
<div><strong>IBM Australia & New Zealand</strong></div>
<div style="font-size: 8pt; font-family: sans-serif; margin-top: 10px;" ><div><span style="font-weight: bold; color: #336699;" >Phone: </span>614-2133-7927</div>
<div><span style="font-weight: bold; color: #336699;" >E-mail: </span><a href="mailto:abeattie@au1.ibm.com" style="color: #555">abeattie@au1.ibm.com</a></div></div></div></div></div></div></div></div></div>
<div dir="ltr" > </div>
<div dir="ltr" > </div>
<blockquote data-history-content-modified="1" dir="ltr" style="border-left:solid #aaaaaa 2px; margin-left:5px; padding-left:5px; direction:ltr; margin-right:0px" >----- Original message -----<br>From: "Jonathan Buzzard" <jonathan.buzzard@strath.ac.uk><br>Sent by: gpfsug-discuss-bounces@spectrumscale.org<br>To: gpfsug-discuss@spectrumscale.org<br>Cc:<br>Subject: [EXTERNAL] Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA<br>Date: Sun, Dec 12, 2021 21:19<br>
<div><font face="Default Monospace,Courier New,Courier,monospace" size="2" >On 12/12/2021 02:19, Alec wrote:<br><br>> I feel the need to respond here... I see many responses on this<br>> User Group forum that are dismissive of the fringe / extreme use<br>> cases and of the "what do you need that for '' mindset. The thing is<br>> that Spectrum Scale is for the extreme, just take the word "Parallel"<br>> in the old moniker that was already an extreme use case.<br><br>I wasn't been dismissive, I was asking what the benefits of using RDMA<br>where. There is very little information about it out there and not a lot<br>of comparative benchmarking on it either. Without the benefits being<br>clearly laid out I am unlikely to consider it and might be missing a trick.<br><br>IBM's literature on the topic is underwhelming to say the least.<br><br>[SNIP]<br><br><br>> I have an AIX LPAR that traverses more than 300TB+ of data a day on a<br>> Spectrum Scale file system, it is fully virtualized, and handles a<br>> million files. If that performance level drops, regulatory reports<br>> will be late, business decisions won't be current. However, the<br>> systems of today and the future have to traverse this much data and<br>> if they are slow then they can't keep up with real-time data feeds.<br><br>I have this nagging suspicion that modern all flash storage systems<br>could deliver that sort of performance without the overhead of a<br>parallel file system.<br><br>[SNIP]<br><br>><br>> Douglas's response is the right one, how much IO does the<br>> application / environment need, it's nice to see Spectrum Scale have<br>> the flexibility to deliver. I'm pretty confident that if I can't<br>> deliver the required I/O performance on Spectrum Scale, nobody else<br>> can on any other storage platform within reasonable limits.<br>><br><br>I would note here that in our *shared HPC* environment I made a very<br>deliberate design decision to attach the compute nodes with 10Gbps<br>Ethernet for storage. Though I would probably pick 25Gbps if we where<br>procuring the system today.<br><br>There where many reasons behind that, but the main ones being that<br>historical file system performance showed that greater than 99% of the<br>time the file system never got above 20% of it's benchmarked speed.<br>Using 10Gbps Ethernet was not going to be a problem.<br><br>Secondly by limiting the connection to 10Gbps it stops one person<br>hogging the file system to the detriment of other users. We have seen<br>individual nodes peg their 10Gbps link from time to time, even several<br>nodes at once (jobs from the same user) and had they had access to a<br>100Gbps storage link that would have been curtains for everyone else's<br>file system usage.<br><br>At this juncture I would note that the GPFS admin traffic is handled by<br>on separate IP address space on a separate VLAN which we prioritize with<br>QOS on the switches. So even when a node floods it's 10Gbps link for<br>extended periods of time it doesn't get ejected from the cluster. The<br>need for a separate physical network for admin traffic is not necessary<br>in my experience.<br><br>That said you can do RDMA with Ethernet... Unfortunately the teaching<br>cluster and protocol nodes are on Intel X520's which I don't think do<br>RDMA. Everything is X710's or Mellanox Connect-X4 which definitely do do<br>RDMA. I could upgrade the protocol nodes but the teaching cluster would<br>be a problem.<br><br><br>JAB.<br><br>--<br>Jonathan A. 
Buzzard Tel: +44141-5483420<br>HPC System Administrator, ARCHIE-WeSt.<br>University of Strathclyde, John Anderson Building, Glasgow. G4 0NG<br>_______________________________________________<br>gpfsug-discuss mailing list<br>gpfsug-discuss at spectrumscale.org<br><a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss" target="_blank">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a> </font></div></blockquote>
<div dir="ltr" > </div></div><BR>
<BR>