[gpfsug-discuss] OOM Killer killing off GPFS 3.5

Sanchez, Paul Paul.Sanchez at deshaw.com
Wed May 25 15:00:35 BST 2016


I'm sure that Yuri is right about the corner-case complexity across all Linux and Spectrum Scale/GPFS versions.

In situations where lots of outstanding tokens exist and there are few token managers, we have seen the assassination of a large-footprint mmfsd in GPFS 4.1 appear to impact entire clusters, potentially due to serialization in the recovery of so many tokens and overlapping access among nodes. We're looking forward to fixes in 4.2.1 to address some of this too.

But for what it's worth, on RHEL 6/7 with GPFS 4.1, we have seen the end of OOM kills impacting GPFS since implementing the callback. One item I forgot: we don't set the adjustment to -500, but to OOM_SCORE_ADJ_MIN, which on our systems is -1000. That causes the heuristic oom_badness to return the lowest possible score, more thoroughly immunizing mmfsd against selection.
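In other words, the only change from the script quoted below is the value written (the -1000 literal is what OOM_SCORE_ADJ_MIN resolves to on our kernels; check yours):

<pre>
# e.g. instead of:
echo -500 > /proc/$pid/oom_score_adj
# we write the floor value, OOM_SCORE_ADJ_MIN (-1000 on our kernels):
echo -1000 > /proc/$pid/oom_score_adj
</pre>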

Thx
Paul

Sent with Good Work (www.good.com)


From: Yuri L Volobuev <volobuev at us.ibm.com>
Date: Tuesday, May 24, 2016, 12:17 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5


This problem is more complex than it may seem. The thing is, mmfsd runs as root, and thus already possesses a certain amount of natural immunity to the OOM killer. So adjusting mmfsd oom_score_adj doesn't radically change the ranking of OOM killer victims, only tweaks it.

The way things are supposed to work is: a user process eats up a lot of memory, and once a threshold is hit, the OOM killer picks off the memory hog, and the memory is released. Unprivileged processes inherently have a higher OOM score and should be killed off first. If that doesn't work for some reason, the OOM killer gets desperate and starts going after root processes. Once things get to this point, it's tough. If you somehow manage to spare mmfsd per se, what's going to happen next? The OOM killer still needs a victim. What we've seen happen in such a situation is semi-random privileged process killing. mmfsd stays alive, but various other system processes are picked off, and pretty quickly the node is a basket case. A Linux node is not very resilient to random process killing. And it doesn't help that those other privileged processes usually don't use much memory, so killing them doesn't release much, and the carnage keeps on going.

The real problem is: why wasn't the non-privileged memory hog process killed off first, before root processes became fair game? This is where things get pretty complicated, and depends heavily on the Linux version. There's one specific issue that did get diagnosed. If a process is using mmap and has page faults going that result in GPFS IO, on older versions of GPFS the process would fail to error out after a SIGKILL, due to locking complications spanning Linux kernel VMM and GPFS mmap code. This means the OOM killer would attempt to kill a process, but that wouldn't produce the desired result (the process is still around), and the OOM killer keeps moving down the list. This problem has been fixed in the current GPFS service levels.

It is possible that a similar problem may exist that prevents a memory hog process from erroring out. I strongly encourage opening a PMR to investigate such a situation, instead of trying to work around it without understanding why mmfsd was targeted in the first place.
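As an aside, the kernel's view of a candidate victim can be inspected directly via the standard /proc interface (this snippet is illustrative, not part of Yuri's original note):

<pre>
# Inspect the OOM killer's ranking inputs for mmfsd:
pid=$(pgrep -o mmfsd)           # -o: oldest matching process
cat /proc/$pid/oom_score        # effective badness score computed by the kernel
cat /proc/$pid/oom_score_adj    # adjustment applied on top (default 0)
</pre>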

This is a case of prevention being the best cure. Where we've seen success is customers using cgroups to prevent user processes from running a node out of memory in the first place. This has been shown to work well. Dealing with the fallout from running out of memory is a much harder task.
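For illustration only (the group name and limit below are arbitrary examples, and the path assumes a RHEL 7-style cgroup-v1 layout), confining interactive user processes might look like:

<pre>
# Create a memory cgroup for user processes and cap it, so the OOM
# killer acts inside the group instead of machine-wide.
mkdir -p /sys/fs/cgroup/memory/users
echo 48G > /sys/fs/cgroup/memory/users/memory.limit_in_bytes
# Move the current shell (and hence its children) into the group.
echo $$ > /sys/fs/cgroup/memory/users/tasks
</pre>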

The post-mmfsd-kill symptoms that are described in the original note are not normal. If an mmfsd process is killed, other nodes will become aware of this fact fairly quickly, and the node is going to be expelled from the cluster (yes, expels *can* be a good thing). In the normal case, TCP/IP sockets are closed as soon as mmfsd is killed, other nodes immediately receive TCP RST packets, and they close their connection endpoints. In the worst case, if a node just becomes catatonic but RST is not sent out, the troubled node is going to be expelled from the cluster after about 2 minutes of pinging (in a default configuration). There should definitely not be a permanent hang that necessitates manual intervention.

Again, older versions of GPFS had no protection against surprise OOM thread kills, but in the current code some self-monitoring capabilities have been added, and a single troubled node won't have a lasting impact on the cluster. If you aren't running with a reasonably current level of GPFS 3.5 service, I strongly recommend upgrading. If you see the symptoms originally described with the current code, that's a bug that we need to fix, so please open a PMR to address the issue.

yuri


From: "Sanchez, Paul" <Paul.Sanchez at deshaw.com>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date: 05/24/2016 07:33 AM
Subject: Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi Peter,

This is mentioned explicitly in the Spectrum Scale docs (http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.pdg.doc/bl1pdg_kerncfg.htm?lang=en) as a problem for the admin to consider, and many of us have been bitten by it. There are references on developerWorks going back at least to GPFS 3.1 in 2008 complaining about this situation.

While the answer you described below is essentially what we do as well, I would argue that this is a problem which IBM should just own and fix for everyone. I cannot think of a situation in which you would want GPFS to be sacrificed on a node due to out-of-memory conditions, and I have seen several terrible consequences of this, including loss of cached, user-acknowledged writes.

I don't think there are any real gotchas. But in addition, our own implementation also does the following (a sketch combining all three appears after the list):

* uses "--event preStartup" instead of "startup", since it runs earlier and reduces the risk of a race

* reads the score back out and complains if it hasn't been set

* includes "set -e" to ensure that errors will terminate the script and return a non-zero exit code to the callback parent
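A minimal sketch along those lines (illustrative only, not our production script; the path and messages are examples):

<pre>
#!/bin/bash
# set -e: any failure terminates the script and returns a non-zero
# exit code to the callback parent.
set -e

for pid in $(pgrep mmfs); do
    echo -1000 > /proc/$pid/oom_score_adj
    # Read the adjustment back and complain if it didn't stick.
    if [ "$(cat /proc/$pid/oom_score_adj)" != "-1000" ]; then
        echo "oom_score_adj not set for pid $pid" >&2
        exit 1
    fi
done
</pre>

registered with something like "mmaddcallback oomprotect --command /usr/local/sbin/gpfs-oom_score_adj --event preStartup" rather than the startup event.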

Thx
Paul

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs
Sent: Tuesday, May 24, 2016 10:01 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] OOM Killer killing off GPFS 3.5

Hi All,

We have an issue where Linux kills off GPFS first when a computer runs out of memory. We are running GPFS 3.5.

We believe this happens when user processes have exhausted memory and swap, and the out-of-memory killer in Linux chooses to kill the GPFS daemon as the largest user of memory, due to its large pinned memory footprint.

This means that GPFS is killed and the whole cluster blocks for about a minute before it resumes operation. This is not ideal, and it causes issues across most of the cluster.

What we see is users unable to log in elsewhere on the cluster until we have powered off the node. We believe this is because, while the node is still pingable, GPFS doesn't expel it from the cluster.

This issue mainly occurs on the login nodes of our HPC cluster but can affect the rest of the cluster when it occurs.

I've seen others on the list with this issue.

We've come up with a solution to adjust the OOM score of GPFS, so that it is unlikely to be the first thing to be killed, and hopefully the OOM killer picks a user process instead.

We've tested this and it seems to work. I'm asking here firstly to share our knowledge and secondly to ask if there is anything we've missed with this solution.

It's short, which is part of its beauty.

/usr/local/sbin/gpfs-oom_score_adj

<pre>
#!/bin/bash

# Lower the OOM badness score of every GPFS (mmfs*) process so the
# OOM killer prefers other victims.
for proc in $(pgrep mmfs); do
    echo -500 > /proc/$proc/oom_score_adj
done
</pre>

This can then be called automatically on GPFS startup with the following:

<pre>
mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj --event startup
</pre>

and then either restart GPFS or just run the script once on all nodes.
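For example, something like the following could apply it cluster-wide, assuming the script is already in place on every node (the mmdsh invocation is a sketch; verify the options on your release):

<pre>
# Run the adjustment on every node without restarting GPFS.
mmdsh -N all /usr/local/sbin/gpfs-oom_score_adj
</pre>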

Peter Childs
ITS Research Infrastructure
Queen Mary, University of London
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

