[gpfsug-discuss] infiniband fabric instability effects

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Fri Sep 13 10:48:52 BST 2019

On Fri, 2019-09-13 at 05:14 -0400, david_johnson at brown.edu wrote:


> Moving a non-HA subnet manager from primary to backup and back again
> has worked for us without disruption, but I would try to do this in a
> maintenance window.

Not on GPFS, but in the past I have moved from one subnet manager to
another with dozens of MPI jobs running and Lustre running over the
fabric, and not missed a beat. My current cluster uses 10 and 40Gbps
Ethernet for GPFS, with Omnipath exclusively for MPI traffic.

To be honest I just cannot wrap my head around the idea that you would
not be running two subnet managers in the first place. Just fire up two
subnet managers (whether on a switch or a node) and forget about it.
They will automatically work together to give you an HA solution. It is
the same with Omnipath too.

I would also note that you can fire up more than two fabric managers
and it all "just works".

If it were me and I didn't have fabric managers running on at least
two of my switches and I was doing GPFS over InfiniBand, I would fire
up fabric managers on all of my NSD servers.


Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
