[gpfsug-discuss] mmhealth alerts too quickly
Peter Childs
p.childs at qmul.ac.uk
Fri Sep 13 09:25:47 BST 2024
We have a Nagios alert that watches the output of mmhealth and alerts us if Scale is unhappy on a node. It's fairly simple and straightforward, and is very good at letting us know quickly about simple issues.
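For anyone wanting to do something similar, a minimal sketch of such a check might look like the one below. It is Python and makes a couple of assumptions: that mmhealth lives in the usual /usr/lpp/mmfs/bin path, and that "mmhealth node show" lists one component per line with its state in the second column. Treat the parsing as a starting point rather than a definitive implementation.

    #!/usr/bin/env python3
    # Minimal Nagios-style check sketch: run "mmhealth node show" and map
    # any DEGRADED/FAILED component to WARNING/CRITICAL exit codes.
    # Assumes the standard Scale binary path and a simple column layout.
    import subprocess
    import sys

    OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3   # Nagios exit codes

    try:
        out = subprocess.run(
            ["/usr/lpp/mmfs/bin/mmhealth", "node", "show"],
            capture_output=True, text=True, check=True,
        ).stdout
    except Exception as exc:
        print(f"UNKNOWN: could not run mmhealth: {exc}")
        sys.exit(UNKNOWN)

    bad = []
    for line in out.splitlines():
        fields = line.split()
        # Component name in column 1, state in column 2 (assumption).
        if len(fields) >= 2 and fields[1] in ("DEGRADED", "FAILED"):
            bad.append(f"{fields[0]}={fields[1]}")

    if any(b.endswith("FAILED") for b in bad):
        print("CRITICAL: " + ", ".join(bad))
        sys.exit(CRITICAL)
    elif bad:
        print("WARNING: " + ", ".join(bad))
        sys.exit(WARNING)

    print("OK: all mmhealth components healthy")
    sys.exit(OK)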
Since we upgraded to 5.1.9-5 we're getting random nodes complaining about lost connections whenever another machine is rebooted or stops working. This is useful in itself, but there does not seem to be any good way to acknowledge the alerts, or to close the connections gracefully if the machine was turned off rather than actually failing.
I'm aware of the "mmhealth --refresh" method, but I've never actually seen it achieve anything, and I normally end up running "mmsysmoncontrol restart" to get the message to clear. The problem is that we don't want to lose the alerts altogether; they are useful when there is a real problem, but it would be nice if they were a little more helpful and could be acknowledged. Maybe mmshutdown just needs to close all the cluster connections gracefully so that other nodes don't complain; I've always found it a little abrupt.
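For reference, the workaround we end up using amounts to something like the sketch below: try a refresh first and only restart the health monitor if stale events remain. The "mmhealth node show --refresh" spelling and the binary paths are how we invoke it here and should be checked against your release's documentation; this is a hedged sketch, not a recipe.

    #!/usr/bin/env python3
    # Sketch of the "nudge mmhealth" workaround described above: refresh
    # the health monitor's view first, and only restart the system health
    # monitor (mmsysmoncontrol restart) if stale events are still shown.
    # Paths and the --refresh spelling are assumptions; verify locally.
    import subprocess

    MMHEALTH = "/usr/lpp/mmfs/bin/mmhealth"
    MMSYSMONCONTROL = "/usr/lpp/mmfs/bin/mmsysmoncontrol"

    def stale_events_present() -> bool:
        """Return True if mmhealth still reports a non-healthy component."""
        out = subprocess.run([MMHEALTH, "node", "show"],
                             capture_output=True, text=True).stdout
        return any(state in out for state in ("DEGRADED", "FAILED"))

    # Gentle option first: ask the monitor to refresh its state.
    subprocess.run([MMHEALTH, "node", "show", "--refresh"], check=False)

    # Heavier option: restart the monitor daemon if that did not help.
    if stale_events_present():
        subprocess.run([MMSYSMONCONTROL, "restart"], check=False)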
The other main issue we have is the old memory leak on our ESS5000 (https://www.ibm.com/support/pages/node/7027786). We have been working with IBM on this for the last 18 months, but no resolution is in sight and I'm not sure the workaround is still relevant.
Also, we found that going straight from 5.1.2 to 5.2.0 (and then to 5.2.1) is not a stable upgrade path, and it's best to pass through 5.1.9 first. I'm not sure whether there is a genuine issue here and something needs adding to the release notes, but that is certainly what we discovered.
I hope our findings help others.
Peter Childs