[gpfsug-discuss] OOM Killer killing off GPFS 3.5

Ben De Luca bdeluca at gmail.com
Wed May 25 15:09:06 BST 2016


Not now, but in a previous role, we would specifically increase the OOM
score on compute processes on our cluster that could consume a large
amount of RAM, to protect system processes. Once we did this, we had zero
system processes die.
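
For reference, a minimal sketch of that approach, assuming compute jobs are
started through a site wrapper script (the wrapper itself is hypothetical):

<pre>
#!/bin/bash
# Hypothetical job wrapper: raise the OOM score adjustment of the compute
# process so the kernel prefers it over system daemons when memory runs out.
# Raising the value needs no special privileges, and children inherit it.
echo 500 > /proc/self/oom_score_adj
exec "$@"
</pre>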



On 25 May 2016 at 17:00, Sanchez, Paul <Paul.Sanchez at deshaw.com> wrote:

> I'm sure that Yuri is right about the corner-case complexity across all
> Linux and Spectrum/GPFS versions.
>
> In situations where lots of outstanding tokens exist, and there are few
> token managers, we have seen the assassination of a large footprint mmfsd
> in GPFS 4.1 seem to impact entire clusters, potentially due to
> serialization in recovery of so many tokens, and overlapping access among
> nodes. We're looking forward to fixes in 4.2.1 to address some of this too.
>
> But for what it's worth, on RH6/7 with 4.1, we have seen the end of OOM
> impacting GPFS since implementing the callback. One item I forgot is that
> we don't set it to -500, but to OOM_SCORE_ADJ_MIN, which on our systems is
> -1000. That causes the heuristic oom_badness to return the lowest possible
> score, more thoroughly immunizing it against selection.
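>
> For illustration, a minimal sketch of that adjustment (-1000 is
> OOM_SCORE_ADJ_MIN on our systems; check the value on yours):
>
> <pre>
> for pid in $(pgrep mmfs); do
>     echo -1000 > /proc/$pid/oom_score_adj
>     # With the adjustment at its minimum, oom_score should read 0 on these kernels
>     cat /proc/$pid/oom_score
> done
> </pre>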
>
> Thx
> Paul
>
> Sent with Good Work (www.good.com)
>
>
> *From: *Yuri L Volobuev <volobuev at us.ibm.com>
> *Date: *Tuesday, May 24, 2016, 12:17 PM
> *To: *gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> *Subject: *Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
>
> This problem is more complex than it may seem. The thing is, mmfsd runs as
> root, and thus already possesses a certain amount of natural immunity to the
> OOM killer. So adjusting mmfsd oom_score_adj doesn't radically change the
> ranking of OOM killer victims, only tweaks it. The way things are supposed
> to work is: a user process eats up a lot of memory, and once a threshold is
> hit, the OOM killer picks off the memory hog, and the memory is released.
> Unprivileged processes inherently have a higher OOM score, and should be
> killed off first. If that doesn't work, for some reason, the OOM killer
> gets desperate and starts going after root processes.
>
> Once things get to this point, it's tough. If you somehow manage to spare
> mmfsd per se, what's going to happen next? The OOM killer still needs a
> victim. What we've seen happen in such a situation is semi-random privileged
> process killing. mmfsd stays alive, but various other system processes are
> picked off, and pretty quickly the node is a basket case. A Linux node is
> not very resilient to random process killing. And it doesn't help that those
> other privileged processes usually don't use much memory, so killing them
> doesn't release much, and the carnage keeps on going.
>
> The real problem is: why wasn't the non-privileged memory hog process killed
> off first, before root processes became fair game? This is where things get
> pretty complicated, and depend heavily on the Linux version. There's one
> specific issue that did get diagnosed. If a process is using mmap and has
> page faults going that result in GPFS IO, on older versions of GPFS the
> process would fail to error out after a SIGKILL, due to locking
> complications spanning the Linux kernel VMM and GPFS mmap code. This means
> the OOM killer would attempt to kill a process, but that wouldn't produce
> the desired result (the process is still around), and the OOM killer keeps
> moving down the list. This problem has been fixed in the current GPFS
> service levels. It is possible that a similar problem may exist that
> prevents a memory hog process from erroring out. I strongly encourage
> opening a PMR to investigate such a situation, instead of trying to work
> around it without understanding why mmfsd was targeted in the first place.
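>
> As a quick way to see how the OOM killer currently ranks the processes on a
> node, something along these lines can be run (a diagnostic sketch, not
> GPFS-specific):
>
> <pre>
> # List the ten highest-scoring processes, i.e. the likeliest OOM victims,
> # together with their PID and the start of their command line
> for p in /proc/[0-9]*; do
>     printf '%s %s %s\n' "$(cat $p/oom_score 2>/dev/null)" "${p#/proc/}" \
>         "$(tr '\0' ' ' < $p/cmdline 2>/dev/null | cut -c1-60)"
> done | sort -rn | head
> </pre>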
>
> This is the case of prevention being the best cure. Where we've seen
> success is customers using cgroups to prevent user processes from running a
> node out of memory in the first place. This has been shown to work well.
> Dealing with the fallout from running out of memory is a much harder task.
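>
> As an illustration of that approach, a minimal cgroup-v1 sketch for a
> RHEL 7 style layout (the group name and the limit below are placeholders,
> and most sites would drive this through cgconfig or the batch scheduler
> rather than by hand):
>
> <pre>
> # Create a memory cgroup for user jobs and cap it below total RAM, leaving
> # headroom for mmfsd's pinned pagepool and other system daemons
> mkdir -p /sys/fs/cgroup/memory/userjobs
> echo 96G > /sys/fs/cgroup/memory/userjobs/memory.limit_in_bytes
> # (optionally also cap memory+swap via memory.memsw.limit_in_bytes
> # if swap accounting is enabled)
>
> # Move a user process into the group (12345 is a placeholder PID)
> echo 12345 > /sys/fs/cgroup/memory/userjobs/cgroup.procs
> </pre>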
>
> The post-mmfsd-kill symptoms that are described in the original note are
> not normal. If an mmfsd process is killed, other nodes will become aware of
> this fact fairly quickly, and the node is going to be expelled from the
> cluster (yes, expels *can* be a good thing). In the normal case, TCP/IP
> sockets are closed as soon as mmfsd is killed, and other nodes immediately
> receive TCP RST packets, and close their connection endpoints. In the worst
> case, if a node just becomes catatonic, but RST is not sent out, the
> troubled node is going to be expelled from the cluster after about 2
> minutes of pinging (in a default configuration). There should definitely
> not be a permanent hang that necessitates a manual intervention. Again,
> older versions of GPFS had no protection against surprise OOM thread kills,
> but in the current code some self-monitoring capabilities have been added,
> and a single troubled node won't have a lasting impact on the cluster. If
> you aren't running with a reasonably current level of GPFS 3.5 service, I
> strongly recommend upgrading. If you see the symptoms originally described
> with the current code, that's a bug that we need to fix, so please open a
> PMR to address the issue.
>
> yuri
>
> From: "Sanchez, Paul" <Paul.Sanchez at deshaw.com>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
> Date: 05/24/2016 07:33 AM
> Subject: Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> Hi Peter,
>
> This is mentioned explicitly in the Spectrum Scale docs (
> http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.pdg.doc/bl1pdg_kerncfg.htm?lang=en)
> as a problem for the admin to consider, and many of us have been bitten by
> this. There are references going back at least to GPFS 3.1 in 2008 on
> developerworks complaining about this situation.
>
> While the answer you described below is essentially what we do as well, I
> would argue that this is a problem which IBM should just own and fix for
> everyone. I cannot think of a situation in which you would want GPFS to
> be sacrificed on a node due to out-of-memory conditions, and I have seen
> several terrible consequences of this, including loss of cached,
> user-acknowledged writes.
>
> I don't think there are any real gotchas. But in addition, our own
> implementation also does the following (see the sketch after this list):
>
> * uses "--event preStartup" instead of "startup", since it runs earlier
> and reduces the risk of a race
>
> * reads the score back out and complains if it hasn't been set
>
> * includes "set -e" to ensure that errors will terminate the script and
> return a non-zero exit code to the callback parent
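>
> A sketch of what such a callback script could look like (the path and
> callback name follow Peter's example below; details may differ from our
> actual script):
>
> <pre>
> #!/bin/bash
> # Exit non-zero (and fail the callback) if anything below goes wrong
> set -e
>
> for pid in $(pgrep mmfs); do
>     echo -1000 > /proc/$pid/oom_score_adj
>     # Read the score adjustment back and complain if it didn't stick
>     if [ "$(cat /proc/$pid/oom_score_adj)" != "-1000" ]; then
>         echo "failed to set oom_score_adj for PID $pid" >&2
>         exit 1
>     fi
> done
> </pre>
>
> registered with:
>
> <pre>
> mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj --event preStartup
> </pre>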
>
> Thx
> Paul
>
> -----Original Message-----
> From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs
> Sent: Tuesday, May 24, 2016 10:01 AM
> To: gpfsug main discussion list
> Subject: [gpfsug-discuss] OOM Killer killing off GPFS 3.5
>
> Hi All,
>
> We have an issue where Linux kills off GPFS first when a computer runs
> out of memory. We are running GPFS 3.5.
>
> We believe this happens when user processes have exhausted memory and swap
> and the out-of-memory killer in Linux chooses to kill the GPFS daemon as
> the largest user of memory, due to its large pinned memory footprint.
>
> This means that GPFS is killed and the whole cluster blocks for a minute
> before it resumes operation. This is not ideal, and it causes issues with
> most of the cluster.
>
> What we see is users unable to log in elsewhere on the cluster until we
> have powered off the node. We believe this is because while the node is
> still pingable, GPFS doesn't expel it from the cluster.
>
> This issue mainly occurs on the login nodes of our HPC cluster, but it can
> affect the rest of the cluster when it occurs.
>
> I've seen others on the list with this issue.
>
> We've come up with a solution to adjust the OOM score of GPFS, so that it
> is unlikely to be the first thing to be killed, and hopefully the OOM
> killer picks a user process instead.
>
> We've tested this and it seems to work. I'm asking here firstly to share
> our knowledge and secondly to ask if there is anything we've missed with
> this solution.
>
> It's short, which is part of its beauty.
>
> /usr/local/sbin/gpfs-oom_score_adj
>
> <pre>
> #!/bin/bash
>
> for proc in $(pgrep mmfs); do
>     echo -500 >/proc/$proc/oom_score_adj
> done
> </pre>
>
> This can then be called automatically on GPFS startup with the following:
>
> <pre>
> mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj --event startup
> </pre>
>
> and either restart gpfs or just run the script on all nodes.
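>
> For example, to run it across the cluster and spot-check the result (this
> assumes mmdsh is available, and the check just looks at the oldest mmfsd
> process on each node):
>
> <pre>
> mmdsh -N all /usr/local/sbin/gpfs-oom_score_adj
> # Spot-check: each node's mmfsd should now report -500
> mmdsh -N all 'cat /proc/$(pgrep -o mmfsd)/oom_score_adj'
> </pre>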
>
> Peter Childs
> ITS Research Infrastructure
> Queen Mary, University of London
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>

