[gpfsug-discuss] Network switches/architecture for GPFS

Christopher Black cblack at nygenome.org
Sun Mar 22 00:58:42 GMT 2020


We’ve had good luck moving from older Mellanox 1710 ethernet switches to newer Arista ethernet switches.
Our core is a pair of Arista 7508s primarily with 100G cards.
Leaf switches are Arista 7280QR for racks with 40Gb-connected servers and 7280SR for racks w/ 10Gb-connected servers.
Uplinks from leaf switches to core are multiple 100G connections.
Our nsd servers are connected with dual 40Gb connections, each connection on a separate Mellanox ConnectX-3 card, to spread load and failure domains across separate PCIe slots.
Our compute nodes are primarily connected with dual-10Gb connections on Intel x520 or x710 nics (dual-port on a single nic).
We also have some Cisco UCS nodes going through Cisco fabric interconnects (FIs); these do not perform nearly as well, and we've had some trouble with them and high-bandwidth network storage, especially with the default settings.
We have some data transfer nodes connected at 2x40Gb, but other than that our only 40Gb-connected nodes are nsd servers.
Every server, nsd or compute, uses LACP bonding and has MTU set to 9000. We also set:
BONDING_OPTS="mode=4 miimon=100 xmit_hash_policy=layer3+4"
For ECN, we set sysctl net.ipv4.tcp_ecn = 2.
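For anyone who wants a concrete starting point, here is a minimal sketch of the shape of that config in a RHEL-style ifcfg file plus a sysctl drop-in. The interface name, addressing, and paths are placeholders rather than our actual config, and NetworkManager or netplan setups will look different:

# /etc/sysconfig/network-scripts/ifcfg-bond0  (illustrative only)
DEVICE=bond0
TYPE=Bond
BONDING_MASTER=yes
BONDING_OPTS="mode=4 miimon=100 xmit_hash_policy=layer3+4"
MTU=9000
BOOTPROTO=none
ONBOOT=yes
IPADDR=192.0.2.10
PREFIX=24
# each slave NIC gets its own ifcfg with MASTER=bond0, SLAVE=yes, MTU=9000

# /etc/sysctl.d/90-ecn.conf -- enable ECN when requested by the peer
net.ipv4.tcp_ecn = 2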

We also run primarily genomics applications.
We have no experience with more recent Mellanox switches, but the ethernet software implementation on their older switches gave us plenty of problems. I'm not the network expert at our site, but our network team seems to like the Arista software much more than Mellanox's.

We run some non-default tcp/ip and ethernet settings, primarily from the fasterdata.es.net recommendations. IBM's older wiki notes about Linux sysctls sometimes do not match the es.net recommendations, and in those cases we generally go with es.net, especially as some of the IBM docs were written for older Linux kernels. However, there are some sysctl recommendations in the IBM docs that are unique to gpfs (e.g., net.core.somaxconn).
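To make that concrete, here is the general shape of such a sysctl drop-in. The values below are illustrative, roughly in the spirit of the es.net host-tuning pages for 10G-100G hosts, and should be checked against the current fasterdata.es.net and IBM Spectrum Scale guidance for your kernel and link speed rather than copied as-is:

# /etc/sysctl.d/91-net-tuning.conf  (illustrative values only)
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_mtu_probing = 1
net.core.netdev_max_backlog = 250000
# es.net suggests htcp (or bbr where available) over the cubic default
net.ipv4.tcp_congestion_control = htcp
# gpfs-specific listen backlog, from the IBM tuning docs; size to taste
net.core.somaxconn = 8192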
Regarding non-net tuning to improve gpfs stability, we’ve found the following are also important:
vm.min_free_kbytes
vm.dirty_bytes
vm.dirty_background_bytes

It took us a long time to figure out that on systems with lots of memory, many dirty pages could be buffered before being flushed out to the network, resulting in a storm of heavy traffic that could prevent gpfs disk lease renewals and other control traffic from getting through quickly enough to avoid expels.
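As an illustration of the shape of those settings (the absolute numbers below are placeholders; the right values depend on node memory size and how fast the storage network can drain writeback, so treat this as a sketch rather than a recommendation):

# /etc/sysctl.d/92-vm.conf  (placeholder values, tune per node)
# Keep a reserve of free memory for atomic/network allocations.
vm.min_free_kbytes = 1048576
# Setting the *_bytes knobs overrides the *_ratio ones, fixing the thresholds
# in absolute terms so a large-memory node cannot accumulate tens of GB of
# dirty pages before writeback kicks in.
vm.dirty_background_bytes = 268435456   # start background writeback around 256 MiB dirty
vm.dirty_bytes = 2147483648             # block writers around 2 GiB dirty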

For client NIC tuning, we set txqueuelen to 10000, but I've read that this may not be necessary on newer kernels.
On older NICs, or even current Intel NICs with older firmware, we found that turning some offload optimizations off (gro, lro, gso, lso) made things better.
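A sketch of the corresponding commands (the interface name is a placeholder, and which offloads help or hurt depends on the NIC, driver, and firmware, so verify with your own testing):

# replace eth0 with the real interface (or each bond slave)
ip link set dev eth0 txqueuelen 10000
# LSO is exposed by ethtool as tso/gso; turn offloads off one at a time and measure
ethtool -K eth0 gro off lro off gso off tso off
ethtool -k eth0   # confirm which offloads are actually in effect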

I hope this helps you or others running gpfs on ethernet!
-Chris


From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of "Valleru, Lohit/Information Systems" <valleru at cbio.mskcc.org>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Friday, March 20, 2020 at 5:18 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: [gpfsug-discuss] Network switches/architecture for GPFS

Hello All,

I would like to discuss and understand which ethernet switches/architectures seem to work best with GPFS.
We had thought about InfiniBand, but are not yet ready to move to it because of the complexity, upgrade, and debugging issues that come with it.

Current hardware:

We are currently using Arista 7328x 100G core switch for networking among the GPFS clusters and the compute nodes.

It is a heterogeneous network, with servers on 10G/25G/100G, both with and without LACP.

For example:

GPFS storage clusters either have 25G LACP, or 10G LACP, or a single 100G network port.
Compute nodes range from 10G to 100G.
Login nodes, transfer servers, etc. have 25G bonded.

Most of the servers have Mellanox ConnectX-4 or ConnectX-5 adapters, but we also have a few older Intel, Broadcom, and Chelsio network cards in the clusters.

Most of the transceivers that we use are from Mellanox, Finisar, or Intel.

Issue:

We upgraded to the above switch recently, and we have seen that it is not able to handle the network traffic, because the NSD servers have much higher bandwidth than the compute nodes.

One issue that we did see was a lot of network discards on the switch side, along with network congestion and slow IO performance on the affected compute nodes.

Once we enabled ECN, we did see that it reduced the network congestion.

We do see expels once in a while, but those are mostly related to network errors or a host not responding. We observed that bonding/LACP makes expels much trickier to debug, so we have decided to go without LACP until the GPFS code gets better at handling it, which I think they are working on.

We have heard that our current switch is a shallow-buffer switch, and that we would need a deeper-buffer Arista switch to perform better, with less congestion, lower latency, and more throughput.

On the other side, Mellanox promises better performance than Arista through a better ASIC design and buffer architecture in a spine-leaf design, instead of a single deep-buffer core switch.

Most of the applications that run on the clusters are either genomics applications on CPUs or deep learning applications on GPUs.

All of our GPFS storage clusters are above version 5.0.2, with the compute filesystems at a 16M block size on nearline rotating disks and the flash storage at a 512K block size.


May I ask for feedback from anyone who is using Arista or Mellanox switches with their clusters, to understand the pros and cons, stability, and performance numbers of each?


Thank you,
Lohit