[gpfsug-discuss] Updating a medium size cluster efficiently

Wed Apr 27 14:19:02 BST 2022

Hi,
we have a medium-sized GPFS client cluster (a few hundred nodes)
and want to update the GPFS version efficiently as a rolling
update, i.e. update each node whenever it can be rebooted.

Doing so via a slurm script when the node is drained, just before
the reboot, works only most of the time: in some cases, even
when the node is drained, the file systems are still busy and can't be unmounted,
so the update fails.

Therefore I tried to trigger the update on reboot, before gpfs starts.
To do so I added a systemd service that is scheduled before gpfs.service
and does a yum update (we run CentOS 7.9), but:

In the postinstall script of gpfs.base, gpfs.service is disabled and re-enabled
via systemctl, and systemd apparently gets that wrong, so that if
the update really happens, gpfs.service is not started afterwards.
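
For reference, the pre-start unit looks roughly like the sketch below.
The unit name (gpfs-update.service), the network-online ordering and the
'gpfs*' package glob are illustrative assumptions, not a verbatim copy
of the file in use:

```
# /etc/systemd/system/gpfs-update.service  (hypothetical name and path)
[Unit]
Description=Update GPFS packages before gpfs.service starts
# Order this unit ahead of GPFS so the update finishes first
Before=gpfs.service
# yum needs the repositories to be reachable (assumed to require network)
Wants=network-online.target
After=network-online.target

[Service]
# oneshot: units ordered after this one wait until ExecStart has exited
Type=oneshot
# Package glob is an assumption; adjust to the packages actually installed
ExecStart=/usr/bin/yum -y update 'gpfs*'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```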

Does anyone have a clever way to do a rolling update that really works,
without manually chasing the few per cent of machines that don't manage
it on the first go?

-- 
Dr. Jürgen Hannappel  DESY/IT    Tel.  : +49 40 8998-4616