[gpfsug-discuss] Services on DSS/ESS nodes

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Mon Oct 5 12:44:48 BST 2020

On 05/10/2020 07:27, Jordi Caubet Serrabou wrote:

 > Coming to the routing point, is there any reason why you need it ? I
 > mean, this is because GPFS trying to connect between compute nodes or
 > a reason outside GPFS scope ?
 > If the reason is GPFS,  imho best approach - without knowledge of the
 > licensing you have - would be to use separate clusters: a storage
 > cluster and two compute clusters.

The issue is that individual nodes want to talk to one another on the 
data interface, which caught me by surprise, as the cluster is set to 
admin mode central.
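For what it's worth, adminMode central only governs which nodes may issue administrative commands; the mmfsd daemons still open connections among themselves over the daemon (data) interface regardless. A quick way to confirm what a cluster is actually doing, assuming standard Spectrum Scale commands:

```shell
# adminMode only controls command traffic; daemon traffic is still
# node-to-node over whatever interface each node's daemon address
# resolves to.
mmlsconfig adminMode    # shows "adminMode central" on this cluster
mmlscluster             # lists each node's daemon vs admin node name
```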

The admin interface runs over Ethernet for all nodes on a specific VLAN 
which is given 802.1p priority 5 (that's Voice, < 10 ms latency and 
jitter). That saved a bunch of switching and cabling, as you don't need 
an extra physical interface for the admin traffic. The cabling already 
significantly restricts airflow in a compute rack as it is, without 
adding a whole bunch more for a barely used admin interface.
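For illustration, that kind of 802.1p marking can be done entirely on the hosts with iproute2 (switches can also remark at ingress); the interface name, VLAN ID, and address below are made up:

```shell
# Hypothetical names: VLAN 42 on eth0, with every skb priority mapped
# to 802.1p PCP 5 on egress (egress-qos-map takes SKB:PCP pairs).
ip link add link eth0 name eth0.42 type vlan id 42 \
    egress-qos-map 0:5 1:5 2:5 3:5 4:5 5:5 6:5 7:5
ip addr add 192.0.2.10/24 dev eth0.42   # example address
ip link set eth0.42 up
```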

To be frank, it's as if the people who wrote the best practice about a 
separate interface for the admin traffic know very little about 
networking. That is all last-century technology.

The nodes for undergraduate teaching only have a couple of 1Gb Ethernet 
ports, which would suck for storage usage. However, they also have QDR 
InfiniBand: even though undergraduates can't run multinode jobs, the 
Lustre storage on the old cluster was delivered over InfiniBand, so 
they got InfiniBand cards.

 > Both compute clusters join using multicluster setup the storage
 > cluster. There is no need both compute clusters see each other, they
 > only need to see the storage cluster. One of the clusters using the
 > 10G, the other cluster using the IPoIB interface.
 > You need at least three quorum nodes in each compute cluster but if
 > licensing is per drive on the DSS, it is covered.
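For reference, the multicluster arrangement Jordi describes is wired up with mmauth/mmremotecluster/mmremotefs; the cluster names, contact nodes, key file paths, and device name below are all placeholders:

```shell
# On the storage cluster: generate a key and authorise each compute
# cluster (names and key file paths are placeholders).
mmauth genkey new
mmchconfig cipherList=AUTHONLY
mmauth add compute1.example -k /tmp/compute1_id_rsa.pub
mmauth grant compute1.example -f gpfs0 -a rw

# On each compute cluster: register the storage cluster and its
# filesystem, then mount it everywhere.
mmremotecluster add storage.example -n dssg1,dssg2 -k /tmp/storage_id_rsa.pub
mmremotefs add gpfs0 -f gpfs0 -C storage.example -T /gpfs -A yes
mmmount gpfs0 -a
```

The two compute clusters never need to exchange keys with each other, only with the storage cluster, which is what makes the routing question go away.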

Three clusters is starting to get complicated from an admin perspective. 
The biggest issue is coordinating maintenance and keeping sufficient 
quorum nodes up.

Maintenance on compute nodes is done via the job scheduler. I know some 
people think this is crazy, but it is in reality extremely elegant.

We can schedule a reboot on a node as soon as the current job has 
finished (usually used for firmware upgrades). Or we can schedule a job 
to run as root (usually for applying updates) as soon as the current job 
has finished. As such we have no way of knowing when that will be for a 
given node, and there is a potential for all three quorum nodes to be 
down at once.
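If the scheduler happens to be Slurm, for example, the two operations look roughly like this (node names and the script path are made up; other schedulers have equivalents):

```shell
# Reboot each node as soon as its running job finishes, then return
# it to service (typically used for firmware upgrades).
scontrol reboot ASAP nextstate=RESUME reason="firmware" node[001-010]

# Or queue a root-run maintenance job at top priority so it starts
# the moment the current job ends (script path is a placeholder).
sbatch --uid=root --priority=TOP -w node001 /root/apply_updates.sh
```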

Using this scheme we can seamlessly upgrade the nodes, safe in the 
knowledge that a node is either busy and running on the current 
configuration, or has been upgraded and is running the new 
configuration. Consequently multinode jobs are guaranteed to have all 
the nodes in the job running on the same configuration.

The alternative is to drain the node, but there is only a 23% chance the 
node will become available during working hours, leading to a 
significant loss of compute time during maintenance compared to our 
existing scheme, where the loss is only as long as the upgrade takes to 
install. Pretty much the only time we have idle nodes is when the 
scheduler is reserving nodes ready to schedule a multinode job.

Right now we have a single cluster, with the quorum nodes being the two 
DSS-G nodes and the node used for backup. It is easy to ensure that 
quorum is maintained on these; they also all run real RHEL, whereas the 
compute nodes run CentOS.


Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
