[gpfsug-discuss] Network switches/architecture for GPFS

Fri Mar 20 21:18:42 GMT 2020

Hello All,

I would like to discuss which Ethernet switches and network architectures work best with GPFS.
We considered InfiniBand, but we are not yet ready to move to it because of the complexity, upgrade effort, and debugging issues that come with it.

Current hardware:

We are currently using an Arista 7328x 100G core switch for networking between the GPFS clusters and the compute nodes.

It is a heterogeneous network, with servers on 10G, 25G, or 100G, some with LACP and some without.

For example: 

GPFS storage clusters either have 25G LACP, or 10G LACP, or a single 100G network port.
Compute nodes range from 10G to 100G.
Login nodes, transfer servers, etc. have 25G bonded.

Most of the servers have Mellanox ConnectX-4 or ConnectX-5 adapters, but we also have a few older Intel, Broadcom, and Chelsio network cards in the clusters.

Most of the transceivers that we use are from Mellanox, Finisar, and Intel.

Issue:

We upgraded to the above switch recently and found that it cannot handle the network traffic, because the NSD servers have higher bandwidth than the compute nodes.

We saw a lot of network discards on the switch side, along with network congestion and slow I/O performance on the affected compute nodes.
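As a rough illustration of why the speed mismatch stresses the switch buffers, the sketch below estimates how many bytes must be queued (or discarded) when one sender bursts at line rate towards a slower receiver. It is a back-of-the-envelope model only: it assumes a single sender, ignores TCP congestion control and pacing, and borrows the 16M and 512K block sizes given later in this message as example burst sizes. The function name and parameters are illustrative, not part of any GPFS or switch tooling.

```python
MIB = 1024 * 1024
KIB = 1024


def excess_bytes(burst_bytes, in_gbps, out_gbps):
    """Peak backlog when a burst arrives at in_gbps and drains at out_gbps.

    The burst takes burst_bytes / in_rate to arrive; during that time the
    egress port drains burst_bytes * out_rate / in_rate, and the remainder
    has to sit in the switch buffer or be dropped.
    """
    if in_gbps <= out_gbps:
        return 0.0  # the receiver keeps up, so nothing accumulates
    return burst_bytes * (1 - out_gbps / in_gbps)


if __name__ == "__main__":
    # Assumed example: a 100G NSD server sending to 10G and 25G clients.
    for label, burst in (("16 MiB", 16 * MIB), ("512 KiB", 512 * KIB)):
        for client_gbps in (10, 25):
            backlog = excess_bytes(burst, 100, client_gbps)
            print(f"{label} burst, 100G -> {client_gbps}G: "
                  f"{backlog / MIB:.2f} MiB to buffer")
```

Under these assumptions, a single 16 MiB burst from a 100G server to a 10G client leaves about 14.4 MiB to be buffered, and several servers answering the same client at once would multiply that.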

Once we enabled ECN, the network congestion was reduced.
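For ECN marking on a switch to have any effect, the TCP endpoints also have to negotiate ECN. The sketch below is one hedged way to check the setting on a Linux host, assuming the standard `net.ipv4.tcp_ecn` sysctl exposed under `/proc`. The helper names are illustrative only, and the mode descriptions paraphrase the Linux kernel's ip-sysctl documentation.

```python
from pathlib import Path

# Meanings of net.ipv4.tcp_ecn, paraphrased from the kernel documentation.
TCP_ECN_MODES = {
    0: "disabled: ECN is neither requested nor accepted",
    1: "enabled: requested on outgoing and accepted on incoming connections",
    2: "passive: accepted on incoming connections, not requested on outgoing",
}


def describe_tcp_ecn(raw):
    """Translate the raw sysctl text (e.g. '2\n') into a readable description."""
    value = int(raw.strip())
    return TCP_ECN_MODES.get(value, f"unknown mode {value}")


def read_tcp_ecn(path="/proc/sys/net/ipv4/tcp_ecn"):
    """Return the raw sysctl text, or None if the file cannot be read."""
    try:
        return Path(path).read_text()
    except OSError:
        return None


if __name__ == "__main__":
    raw = read_tcp_ecn()
    if raw is None:
        print("tcp_ecn is not available on this host")
    else:
        print(describe_tcp_ecn(raw))
```

In passive mode a host accepts ECN but never asks for it, so whether ECN is actually used on a given connection depends on which side initiates it.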

We do see expels once in a while, but those are mostly related to network errors or a host not responding. We observed that bonding/LACP makes expels much trickier, so we have decided to go without LACP until the GPFS code gets better at handling it, which I think is being worked on.

We have heard that our current switch is a shallow-buffer switch, and that we would need a deep-buffer Arista switch to get less congestion, lower latency, and more throughput.

On the other hand, Mellanox claims that a better ASIC design and buffer architecture in a spine-leaf design, instead of a single deep-buffer core switch, would give better performance than Arista.

Most of the applications that run on the clusters are either genomic applications on CPUs or deep learning applications on GPUs.

All of our GPFS storage clusters run versions above 5.0.2, with the compute filesystems at 16M block size on nearline rotating disks and the flash storage at 512K block size.

Could anyone who is using Arista or Mellanox switches on their clusters share feedback on the pros and cons, stability, and performance numbers?

Thank you,
Lohit