[gpfsug-discuss] infiniband fabric instability effects
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Fri Sep 13 10:48:52 BST 2019
On Fri, 2019-09-13 at 05:14 -0400, david_johnson at brown.edu wrote:
[SNIP]
> Moving a non ha subnet manager from primary to backup and back again
> has worked for us without disruption, but I would try to do this in a
> maintenance window.
>
Not on GPFS, but in the past I have moved from one subnet manager to
another with dozens of running MPI jobs and Lustre running over the
fabric, and not missed a beat. My current cluster uses 10 and 40Gbps
Ethernet for GPFS, with Omnipath exclusively for MPI traffic.
To be honest I just cannot wrap my head around the idea that you would
not be running two subnet managers in the first place. Just fire up two
subnet managers (whether on a switch or a node) and forget about it.
They will automatically work together to give you a HA solution. It is
the same with Omnipath too.
I would also note that you can fire up more than two fabric managers
and it all "just works".
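As a rough illustration of why this works: the InfiniBand election rule
is that the subnet manager with the highest priority becomes master, and
ties go to the lowest port GUID; the rest sit in standby and one takes
over if the master goes away. The sketch below models only that rule.
The names, priorities and GUIDs are made up for the example, and it
ignores polling, handover timing and everything else a real subnet
manager does.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SubnetManager:
    name: str      # hypothetical label, for illustration only
    priority: int  # 0-15, higher priority wins the election
    guid: int      # port GUID, lower GUID wins a priority tie


def elect_master(managers: Sequence[SubnetManager]) -> Optional[SubnetManager]:
    """Return the manager that would be master, or None if none are running."""
    if not managers:
        return None
    # Highest priority first, then lowest GUID as the tie-breaker.
    return min(managers, key=lambda sm: (-sm.priority, sm.guid))


# Three managers, e.g. two on NSD servers and one on a switch (made-up values).
sms = [
    SubnetManager("nsd01", priority=10, guid=0x0002C90300A1B2C3),
    SubnetManager("nsd02", priority=10, guid=0x0002C90300A1B2C4),
    SubnetManager("switch1", priority=5, guid=0x0002C90300000001),
]

master = elect_master(sms)
print(master.name)  # nsd01: tied on priority with nsd02, lower GUID

# Simulate losing the master: a standby is elected from the survivors.
survivors = [sm for sm in sms if sm is not master]
print(elect_master(survivors).name)  # nsd02: higher priority than switch1
```

With any number of managers running, the same rule always leaves exactly
one master as long as at least one is still alive.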
If it were me and I didn't have fabric managers running on at least
two of my switches and I was doing GPFS over Infiniband, I would fire
up fabric managers on all of my NSD servers.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG