[gpfsug-discuss] Services on DSS/ESS nodes
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Fri Oct 2 17:14:12 BST 2020
What, if any, are the rules around running additional services on DSS/ESS
nodes with regard to support? Let me outline our scenario.
Our main cluster uses 10Gbps ethernet for storage with the DSS-G nodes
hooked up with redundant 40Gbps ethernet.
However, we have an older cluster, used for undergraduate teaching, that
only has 1Gbps ethernet and QDR Infiniband. With no money to upgrade it
to 10Gbps ethernet, we flipped one of the ports on the ConnectX4 cards
on each DSS-G node to Infiniband and run the teaching nodes over IPoIB.
However, it means that we need an Ethernet to Infiniband gateway, as the
ethernet only connected nodes want to talk to the Infiniband connected
ones on their Infiniband address. Not a problem: we grabbed an old spare
machine, installed CentOS, configured it to act as a bridge, and
deployed a custom route to all the ethernet only connected nodes. It has
been working fine for a couple of years now.
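To make the custom route concrete, here is a minimal sketch in Python that
builds the iproute2 command an ethernet only node would need. The subnet
10.20.0.0/16 (standing in for the IPoIB network) and the gateway address
192.0.2.1 are made-up placeholders, not our real addressing, and the
function name is purely illustrative.

```python
import ipaddress


def static_route_command(ib_subnet: str, gateway_ip: str) -> str:
    """Build the 'ip route add' command for an ethernet only node.

    ib_subnet  -- the IPoIB network the node needs to reach (hypothetical)
    gateway_ip -- the gateway's ethernet-side address (hypothetical)
    """
    # ip_network() raises ValueError for a malformed or non-network address.
    network = ipaddress.ip_network(ib_subnet)
    gateway = ipaddress.ip_address(gateway_ip)
    # The next hop has to be on the ethernet side, so it cannot sit
    # inside the very subnet we are trying to route to.
    if gateway in network:
        raise ValueError("gateway must not be inside the IPoIB subnet")
    return f"ip route add {network} via {gateway}"


# Placeholder addresses only; substitute the site's real values.
print(static_route_command("10.20.0.0/16", "192.0.2.1"))
```

The gateway machine itself would also need IP forwarding enabled
(net.ipv4.ip_forward=1) for a route like this to be of any use.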
The problem is firstly that this is a single point of failure, on
hardware that is now six years old. Secondly, applying updates to the
gateway machine means all the teaching nodes have to be drained and GPFS
unmounted so the machine can be rebooted once the updates are installed.
It is currently not getting patched as frequently as I would like (or as
required by the Scottish government).
So thinking about it, I have come to the conclusion that the ideal
situation would be to use the DSS-G nodes as the gateway and run
keepalived to move the gateway ethernet IP address between the two
machines. It is ideal because as long as one DSS-G node is up there is
a functioning gateway and nodes don't get ejected from the cluster.
If both DSS-G nodes are down then there is no GPFS to mount anyway and
the lack of a gateway is a moot point.
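As a rough illustration of the keepalived side, the Python sketch below
renders a vrrp_instance block for each of the two nodes. It is not the
configuration I trialled: the interface name, virtual address, router id
and priorities are all hypothetical, and authentication and health checks
are left out. Both nodes start in state BACKUP and the higher priority
one takes the address, which is one common way of setting it up.

```python
def vrrp_config(interface: str, virtual_ip: str, priority: int,
                router_id: int = 51) -> str:
    """Render a keepalived vrrp_instance block for one gateway node.

    All values passed in are placeholders chosen for illustration.
    """
    # 255 is reserved for the address owner, so stay within 1..254 here.
    if not 1 <= priority <= 254:
        raise ValueError("VRRP priority must be in 1..254")
    return (
        "vrrp_instance GATEWAY {\n"
        "    state BACKUP\n"
        f"    interface {interface}\n"
        f"    virtual_router_id {router_id}\n"
        f"    priority {priority}\n"
        "    advert_int 1\n"
        "    virtual_ipaddress {\n"
        f"        {virtual_ip}\n"
        "    }\n"
        "}\n"
    )


# Hypothetical pair: the first node is preferred, the second takes over
# the gateway address when the first stops sending adverts.
for node, prio in (("dssg1", 150), ("dssg2", 100)):
    print(f"# keepalived.conf fragment for {node}")
    print(vrrp_config("eth0", "192.0.2.1/24", prio))
```

The virtual_router_id has to match on both nodes so they are treated as
the same VRRP group, and the ethernet only nodes would point their route
at the virtual address rather than at either node's own address.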
I grabbed a couple of the teaching compute nodes in the summer and
trialled it. It works a treat.
Before deploying it to the DSS-G nodes, I now need to check that IBM are
not going to throw a wobbler down the line if I need to get support :-)
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG