[gpfsug-discuss] mmhealth alerts too quickly
Dietrich, Stefan
stefan.dietrich at desy.de
Fri Sep 13 13:15:12 BST 2024
Hello Peter,
> Since we upgraded to 5.1.9-5 we're getting random nodes moaning about lost
> connections whenever another machine is rebooted or stops working. This is
> great, however there does not seem to be any great way to acknowledge the
> alerts, or close the connections gracefully if the machine is turned off rather
> than actually failing.
It's possible to resolve events manually in mmhealth:
# mmhealth event resolve
Missing arguments.
Usage:
mmhealth event resolve {EventName} [Identifier]
-> `mmhealth event resolve cluster_connections_down AFFECTED_IP` should do the trick.
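If several peers are affected at once, the same command can be wrapped in a small loop. The following is only a sketch: the function name `resolve_connections_down` and the `MMHEALTH` override variable are made up for illustration, and it assumes the identifier is the affected IP as shown in the usage above. Check `mmhealth node show` on the node first to see which identifiers are actually reported.

```shell
#!/bin/sh
# Hypothetical helper (not part of Storage Scale): resolve the
# cluster_connections_down event for each identifier given as an argument.
# MMHEALTH is an illustrative override; it defaults to the real command.
# Setting MMHEALTH="echo mmhealth" prints the commands instead of running them.
resolve_connections_down() {
    for ip in "$@"; do
        # Same invocation as above: mmhealth event resolve {EventName} [Identifier]
        ${MMHEALTH:-mmhealth} event resolve cluster_connections_down "$ip"
    done
}
```

For a dry run, set `MMHEALTH="echo mmhealth"` and call `resolve_connections_down 192.0.2.10 192.0.2.11` (placeholder addresses); unset the variable to run the real command on the node that reports the event.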
In our clusters, a regular reboot doesn't seem to trigger this event. All our nodes are running Scale >= 5.2.0.
Regards,
Stefan
--
------------------------------------------------------------------------
Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems)
Ein Forschungszentrum der Helmholtz-Gemeinschaft
Notkestr. 85
phone: +49-40-8998-4696 22607 Hamburg
e-mail: stefan.dietrich at desy.de Germany
------------------------------------------------------------------------