[gpfsug-discuss] mmhealth alerts too quickly

Dietrich, Stefan stefan.dietrich at desy.de
Fri Sep 13 13:15:12 BST 2024


Hello Peter,

> Since we upgraded to 5.1.9-5 we're getting random nodes moaning about lost
> connections whenever another machine is rebooted or stops working. This is
> great, however there does not seem to be any great way to acknowledge the
> alerts, or close the connections gracefully if the machine is turned off rather
> than actually failing.

It's possible to resolve an event in mmhealth:

# mmhealth event resolve
Missing arguments.
Usage:
  mmhealth event resolve {EventName} [Identifier]

-> `mmhealth event resolve cluster_connections_down AFFECTED_IP` should do the trick.
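
If several peers are affected, the same command can be wrapped in a small loop. This is only a minimal sketch, assuming the identifier is the affected peer's IP as in the command above. The function name `resolve_connection_events`, the `DRY_RUN` variable and the sample addresses are hypothetical and not part of mmhealth. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: resolve the cluster_connections_down event for a list of peers.
# Assumption: the event identifier is the affected peer's IP address.
# DRY_RUN (hypothetical variable) defaults to 1, so commands are only printed.

resolve_connection_events() {
    for ip in "$@"; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show the command instead of executing it.
            echo "mmhealth event resolve cluster_connections_down $ip"
        else
            mmhealth event resolve cluster_connections_down "$ip"
        fi
    done
}
```

Calling `resolve_connection_events 10.0.0.5 10.0.0.6` prints one command per address. Prefixing the call with `DRY_RUN=0` would run them for real, which should be checked against `mmhealth node show` output first.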

In our clusters, a regular reboot doesn't seem to trigger this event. All our nodes are running Scale >= 5.2.0.

Regards,
Stefan

-- 
------------------------------------------------------------------------
Stefan Dietrich            Deutsches Elektronen-Synchrotron (IT-Systems)
                        Ein Forschungszentrum der Helmholtz-Gemeinschaft
                                                            Notkestr. 85
phone:  +49-40-8998-4696                                   22607 Hamburg
e-mail: stefan.dietrich at desy.de                                  Germany
------------------------------------------------------------------------