[gpfsug-discuss] Services on DSS/ESS nodes

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Fri Oct 2 17:14:12 BST 2020

What, if any, are the rules around running additional services on DSS/ESS 
nodes with regard to support? Let me outline our scenario.

Our main cluster uses 10Gbps ethernet for storage with the DSS-G nodes 
hooked up with redundant 40Gbps ethernet.

However we have an older cluster that is used for undergraduate teaching 
that only has 1Gbps ethernet and QDR Infiniband. With no money to 
upgrade it to 10Gbps ethernet, we flipped one of the ports on the 
ConnectX4 cards on each DSS-G node to Infiniband and serve the teaching 
nodes over IPoIB.

However it means that we need an Ethernet to Infiniband gateway, as the 
ethernet only connected nodes need to talk to the Infiniband connected 
ones on their Infiniband address. Not a problem: we grabbed an old spare 
machine, installed CentOS, configured it to act as a bridge, and 
deployed a custom route to all the ethernet only connected nodes. It has 
been working fine for a couple of years now.

The problem is firstly that this is a single point of failure, on 
hardware that is six years old now. Secondly, applying updates to the 
gateway machine means all the teaching nodes have to be drained and GPFS 
unmounted so the machine can be rebooted afterwards. It is currently not 
getting patched as frequently as I would like (and as required by the 
Scottish government).

So thinking about it I have come to the conclusion that the ideal 
situation would be to use the DSS-G nodes as the gateway and run 
keepalived to move the gateway ethernet IP address between the two 
machines. It is ideal because as long as one DSS-G node is up then there 
is a functioning gateway and nodes don't get ejected from the cluster. 
If both DSS-G nodes are down then there is no GPFS to mount anyway and 
lack of a gateway is a moot point.
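
As a rough sketch of the keepalived side, something like the following 
on each DSS-G node would do it. The instance name, interface, router ID, 
priorities and address are placeholders for illustration, not our real 
values, and it assumes IP forwarding (net.ipv4.ip_forward) is already 
enabled on both nodes.

```conf
# Hypothetical keepalived.conf fragment; all names and numbers are placeholders.
vrrp_instance GW_TEACHING {
    state BACKUP            # both nodes start as BACKUP; priority picks the master
    interface eth0          # placeholder: Ethernet interface facing the teaching nodes
    virtual_router_id 51    # placeholder: same on both nodes, unique on the LAN
    priority 100            # use a lower value (e.g. 90) on the second node
    advert_int 1            # VRRP advertisement interval in seconds
    virtual_ipaddress {
        192.0.2.1/24        # placeholder: gateway address the teaching nodes route via
    }
}
```

The custom route on the ethernet only nodes then points at the virtual 
address rather than at either physical machine.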

I grabbed a couple of the teaching compute nodes in the summer and 
trialed it out. It works a treat.

I now need to check IBM are not going to throw a wobbler down the line 
if I need to get support before deploying it to the DSS-G nodes :-)


Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
