[gpfsug-discuss] Updating a medium size cluster efficiently

Wed Apr 27 21:21:07 BST 2022

On 27/04/2022 14:19, Hannappel, Juergen wrote:
> 
> Hi,
> we have a medium size gpfs client cluster (a few hundred nodes)
> and want to update the gpfs version in an efficient way in a
> rolling update, i.e. update each node when it can be rebooted.
> 
> Doing so via a slurm script when the node is drained just before
> the reboot works only most of the time, because in some cases even
> when the node is drained the file systems are still busy and can't be unmounted,
> so the update fails.
> 

I assume this is because you have dead jobs?

The general trick is to submit a job as a special user that has sudo 
privileges, that runs as the next job on every node. That way you don't 
need to wait for the node to drain. Last "user" job on the node finishes 
and then the "special" job runs. It does its magic and reboots the node. 
Winner winner chicken dinner.
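
As a rough sketch of the above, assuming Slurm's sbatch and sinfo are on the path, one exclusive job per node could be queued as below. The script path /usr/local/sbin/gpfs-upgrade.sh, the job name and the use of --exclusive are assumptions for illustration only. Whether the job really runs ahead of already queued user jobs depends on your priority/QOS setup, which is not shown here.

```shell
#!/bin/bash
# Sketch: queue one exclusive upgrade job per node.
# UPGRADE_SCRIPT is a hypothetical path; adjust for your site.
UPGRADE_SCRIPT=${UPGRADE_SCRIPT:-/usr/local/sbin/gpfs-upgrade.sh}
SBATCH=${SBATCH:-sbatch}   # set SBATCH=echo for a dry run

submit_upgrade() {
    # $1 = node name. --exclusive keeps user jobs off the node while we run.
    "$SBATCH" --nodelist="$1" --exclusive --job-name="gpfs-upgrade" \
        --wrap="sudo $UPGRADE_SCRIPT"
}

submit_all() {
    # Reads node names, one per line, from stdin and submits a job for each.
    local node
    while read -r node; do
        if [ -n "$node" ]; then
            submit_upgrade "$node"
        fi
    done
}

# Typical use (not run automatically here):
#   sinfo -h -N -o '%N' | sort -u | submit_all
```

Running it first with SBATCH=echo prints the sbatch arguments that would be used without submitting anything.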

> Therefore I tried to trigger the update on the reboot, before gpfs starts.
> To do so I added a systemd service that is scheduled before the gpfs.service,
> which does a yum update (we run CentOS 7.9) but:
> 

> In the postinstall script of gpfs.base the gpfs.service is disabled and re-enabled
> via systemctl, and systemd apparently gets that wrong, so that if
> the update really happens it afterwards will not start the gpfs.service.
> 
> Does anyone have a clever way to do a rolling update that really works
> without manually hunting after some per cent of machines that don't manage
> it on the first go?
> 

What you could do to find the nodes that don't work is have the upgrade 
script do an mmshutdown first before attempting the upgrade. Then check 
that it actually managed to shut down, and if it didn't, send an email to 
an appropriate person saying there is an issue, before, say, putting the 
node in drain.

The man page for mmshutdown says it has an exit code of zero on success, 
and non-zero on failure, so it should be trivial to script.
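
Something along these lines would do it, as a sketch only. It assumes mmshutdown lives in the usual /usr/lpp/mmfs/bin, that a working mail command is available on the node, and that the script is allowed to run scontrol. The recipient, the drain reason and the function names are made up for illustration.

```shell
#!/bin/bash
# Sketch: shut GPFS down, and flag the node if that fails.
MMSHUTDOWN=${MMSHUTDOWN:-/usr/lpp/mmfs/bin/mmshutdown}
ADMIN_MAIL=${ADMIN_MAIL:-root}      # hypothetical recipient
NODE=${NODE:-$(hostname -s)}

notify_admin() {
    # Assumes a working local mail(1) command.
    echo "mmshutdown failed on $NODE, node has been drained" |
        mail -s "GPFS upgrade problem on $NODE" "$ADMIN_MAIL"
}

drain_node() {
    scontrol update NodeName="$NODE" State=DRAIN Reason="gpfs upgrade failed"
}

stop_gpfs_or_flag() {
    # mmshutdown exits zero on success and non-zero on failure.
    if "$MMSHUTDOWN"; then
        return 0        # GPFS is down, safe to upgrade
    fi
    notify_admin
    drain_node
    return 1
}

# Typical use in the upgrade script (the yum line is illustrative):
#   stop_gpfs_or_flag && yum --assumeyes update 'gpfs.*'
```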

Being really clever I think you could then have the script submit a 
second copy of itself to the node that again will run as the next job 
and then reboot the node. That way when it comes back up it should be 
able to unmount GPFS and install the upgrade as the reboot will have 
cleared the issues that prevented the mmshutdown from working. You would 
obviously need to trial this out.
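
A minimal sketch of that retry, again untested against a real cluster. The marker file is my own addition so that a node which still fails after one reboot does not loop forever; its path is hypothetical, and the upgrade script would need to remove it after a successful upgrade. How Slurm treats a node that reboots underneath a running job depends on your configuration.

```shell
#!/bin/bash
# Sketch: if GPFS would not shut down, queue the script again and reboot once.
MARKER=${MARKER:-/var/tmp/gpfs-upgrade.retried}   # hypothetical marker file
SBATCH=${SBATCH:-sbatch}
REBOOT=${REBOOT:-systemctl reboot}
NODE=${NODE:-$(hostname -s)}

retry_after_reboot() {
    # $1 = path of the upgrade script to resubmit for this node.
    if [ -e "$MARKER" ]; then
        echo "already retried once on $NODE, giving up" >&2
        return 1
    fi
    touch "$MARKER" || return 1
    # Queue the second copy first, so it is waiting when the node returns.
    "$SBATCH" --nodelist="$NODE" --exclusive --wrap="sudo $1" || return 1
    $REBOOT
}
```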

If you are just looking to upgrade gpfs.gplbin and don't want to have it 
being recompiled on every node, then there is a trick with systemd. What 
you do is create /etc/systemd/system/gpfs.service.d/install-module.conf 
with the following contents:

[Service]
ExecStartPre=-/usr/bin/yum --assumeyes install gpfs.gplbin-%v

Then every time GPFS starts up it attempts to install the module for the 
currently running kernel (the special magic %v, which systemd expands to 
the running kernel release). This presumes you have a repository set up 
with the appropriate gpfs.gplbin RPM. Basically I take a node out, build 
the RPM, test it is working and then deploy.

I have a special RPM that installs the above local customization to the 
GPFS service unit file.
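
If you would rather not package it, a sketch of putting the drop-in in place by hand is below. The function name and the root-prefix argument are just for illustration, the prefix being there so it can be tried out in a scratch directory first.

```shell
#!/bin/bash
# Sketch: write the gpfs.service drop-in described above.
install_dropin() {
    # $1 = root prefix: empty for the live system, or a scratch dir for testing.
    local dir="$1/etc/systemd/system/gpfs.service.d"
    mkdir -p "$dir" || return 1
    # Quoted heredoc so %v reaches systemd untouched.
    cat > "$dir/install-module.conf" <<'EOF'
[Service]
ExecStartPre=-/usr/bin/yum --assumeyes install gpfs.gplbin-%v
EOF
}

# On a live node, as root:
#   install_dropin "" && systemctl daemon-reload
# To check that systemd has picked it up:
#   systemctl cat gpfs.service
```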

JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG