<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=us-ascii">
</head>
<body>
<div>I'm sure that Yuri is right about the corner-case complexity across all Linux and Spectrum/GPFS versions.</div>
<div><br>
</div>
<div>In situations where lots of outstanding tokens exist and there are few token managers, we have seen the assassination of a large-footprint mmfsd in GPFS 4.1 appear to impact entire clusters, potentially due to serialization in the recovery of so many tokens and to overlapping access among nodes. We're looking forward to fixes in 4.2.1 to address some of this too.</div>
<div><br>
</div>
<div>But for what it's worth, on RH6/7 with 4.1, we have not seen OOM impact GPFS since implementing the callback. One item I forgot is that we don't set it to -500, but to OOM_SCORE_ADJ_MIN, which on our systems is -1000. That causes the heuristic oom_badness to return the lowest possible score, more thoroughly immunizing mmfsd against selection.</div>
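<div><br>
</div>
<div>For reference, a minimal sketch of that variant (the same pgrep-based loop as the script quoted below, with OOM_SCORE_ADJ_MIN in place of -500):
<pre>
#!/bin/bash
# Sketch only: pin all mmfs processes to the minimum possible OOM score.
for proc in $(pgrep mmfs); do
    echo -1000 > /proc/$proc/oom_score_adj
done
</pre>
</div>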
<div><br>
</div>
<div>Thx</div>
<div>Paul</div>
<br>
Sent with Good Work (www.good.com)<br>
<br>
<br>
<div style="border-top:#b5c4df 1pt solid; padding-top:6px; font-size:14px">
<div><b>From: </b><span>Yuri L Volobuev <<a href="mailto:volobuev@us.ibm.com">volobuev@us.ibm.com</a>></span></div>
<div><b>Date: </b><span>Tuesday, May 24, 2016, 12:17 PM</span></div>
<div><b>To: </b><span>gpfsug main discussion list <<a href="mailto:gpfsug-discuss@spectrumscale.org">gpfsug-discuss@spectrumscale.org</a>></span></div>
<div><b>Subject: </b><span>Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5</span></div>
</div>
<br>
<div>
<p>This problem is more complex than it may seem. The thing is, mmfsd runs as root, and thus already possesses a certain amount of natural immunity to the OOM killer. So adjusting mmfsd oom_score_adj doesn't radically change the ranking of OOM killer victims, only
tweaks it. The way things are supposed to work is: a user process eats up a lot of memory, and once a threshold is hit, OOM killer picks off the memory hog, and the memory is released. Unprivileged processes inherently have a higher OOM score, and should be
killed off first. If that doesn't work, for some reason, the OOM killer gets desperate and starts going after root processes. Once things get to this point, it's tough. If you somehow manage to spare mmfsd per se, what's going to happen next? The OOM killer
still needs a victim. What we've seen happen in such a situation is semi-random privileged process killing. mmfsd stays alive, but various other system processes are picked off, and pretty quickly the node is a basket case. A Linux node is not very resilient
to random process killing. And it doesn't help that those other privileged processes usually don't use much memory, so killing them doesn't release much, and the carnage keeps on going. The real problem is: why wasn't the non-privileged memory hog process
killed off first, before root processes became fair game? This is where things get pretty complicated, and depend heavily on the Linux version. There's one specific issue that did get diagnosed. If a process is using mmap and has page faults going that result
in GPFS IO, on older versions of GPFS the process would fail to exit after a SIGKILL, due to locking complications spanning the Linux kernel VMM and GPFS mmap code. This means the OOM killer would attempt to kill a process, but that wouldn't produce the desired
result (the process is still around), and the OOM killer keeps moving down the list. This problem has been fixed in current GPFS service levels. It is possible that a similar problem exists that prevents a memory hog process from exiting. I strongly
encourage opening a PMR to investigate such a situation, instead of trying to work around it without understanding why mmfsd was targeted in the first place.<br>
<br>
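As an aside, a quick way to see how the OOM killer currently ranks candidates is to read the per-process oom_score values from procfs. A minimal sketch (the head count of 20 is arbitrary):<br>
<pre>
# List the top OOM-killer candidates by current badness score:
for p in /proc/[0-9]*; do
    score=$(cat "$p/oom_score" 2>/dev/null) || continue
    printf '%6s  %6s  %s\n' "$score" "${p#/proc/}" "$(cat "$p/comm" 2>/dev/null)"
done | sort -rn | head -20
</pre>
<br>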
This is the case of prevention being the best cure. Where we've seen success is customers using cgroups to prevent user processes from running a node out of memory in the first place. This has been shown to work well. Dealing with the fallout from running out
of memory is a much harder task.<br>
<br>
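For illustration, a minimal sketch of that kind of containment with the cgroup v1 memory controller (the "userjobs" group name, the 48G limit, and $JOB_PID are placeholders; size the limit to leave room for the GPFS pagepool and the OS):<br>
<pre>
# Create a memory cgroup for user workloads and cap its usage:
mkdir -p /sys/fs/cgroup/memory/userjobs
echo 48G > /sys/fs/cgroup/memory/userjobs/memory.limit_in_bytes

# Move a user process (e.g. a batch job step) into the group:
echo "$JOB_PID" > /sys/fs/cgroup/memory/userjobs/cgroup.procs
</pre>
<br>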
The post-mmfsd-kill symptoms described in the original note are not normal. If an mmfsd process is killed, other nodes will become aware of this fact fairly quickly, and the node is going to be expelled from the cluster (yes, expels *can* be a good
thing). In the normal case, TCP/IP sockets are closed as soon as mmfsd is killed, other nodes immediately receive TCP RST packets, and they close their connection endpoints. In the worst case, if a node just becomes catatonic but RST is not sent out, the troubled
node is going to be expelled from the cluster after about 2 minutes of pinging (in a default configuration). There should definitely not be a permanent hang that necessitates a manual intervention. Again, older versions of GPFS had no protection against surprise
OOM thread kills, but in the current code some self-monitoring capabilities have been added, and a single troubled node won't have a lasting impact on the cluster. If you aren't running with a reasonably current level of GPFS 3.5 service, I strongly recommend
upgrading. If you see the symptoms originally described with the current code, that's a bug that we need to fix, so please open a PMR to address the issue.<br>
<br>
yuri<br>
<br>
<img width="16" height="16" src="cid:1__=07BBF52EDFC577758f9e8a93df938690918c07B@" border="0" alt="Inactive hide details for "Sanchez, Paul" ---05/24/2016 07:33:18 AM---Hi Peter, This is mentioned explicitly in the Spectrum Sc"><font color="#424282">"Sanchez,
Paul" ---05/24/2016 07:33:18 AM---Hi Peter, This is mentioned explicitly in the Spectrum Scale docs (<a href="http://www.ibm.com/support/knowle">http://www.ibm.com/support/knowle</a></font><br>
<br>
<font size="2" color="#5F5F5F">From: </font><font size="2">"Sanchez, Paul" <Paul.Sanchez@deshaw.com></font><br>
<font size="2" color="#5F5F5F">To: </font><font size="2">gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>,
</font><br>
<font size="2" color="#5F5F5F">Date: </font><font size="2">05/24/2016 07:33 AM</font><br>
<font size="2" color="#5F5F5F">Subject: </font><font size="2">Re: [gpfsug-discuss] OOM Killer killing off GPFS 3.5</font><br>
<font size="2" color="#5F5F5F">Sent by: </font><font size="2">gpfsug-discuss-bounces@spectrumscale.org</font><br>
</p>
<hr width="100%" size="2" align="left" noshade="" style="color:#8091A5">
<br>
<br>
<br>
<font face="Consolas">Hi Peter, </font><br>
<font face="Consolas"></font><br>
<font face="Consolas">This is mentioned explicitly in the Spectrum Scale docs (</font><font face="Consolas"><a href="http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.pdg.doc/bl1pdg_kerncfg.htm?lang=en">http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.pdg.doc/bl1pdg_kerncfg.htm?lang=en</a></font><font face="Consolas">)
as a problem for the admin to consider, and many of us have been bitten by this. There are references going back at least to GPFS 3.1 in 2008 on developerworks complaining about this situation.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">While the answer you described below is essentially what we do as well,
</font><font color="#FF0000" face="Consolas">I would argue that this is a problem which IBM should just own and fix for everyone.</font><font face="Consolas"> I cannot think of a situation in which you would want GPFS to be sacrificed on a node due to out-of-memory
conditions, and I have seen several terrible consequences of this, including loss of cached, user-acknowledged writes.
</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">I don't think there are any real gotchas. But in addition, our own implementation also:</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">* uses "--event preStartup" instead of "startup", since it runs earlier and reduces the risk of a race</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">* reads the score back out and complains if it hasn't been set</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">* includes "set -e" to ensure that errors will terminate the script and return a non-zero exit code to the callback parent</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">Thx<br>
Paul</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">-----Original Message-----<br>
From: gpfsug-discuss-bounces@spectrumscale.org [</font><font face="Consolas"><a href="mailto:gpfsug-discuss-bounces@spectrumscale.org">mailto:gpfsug-discuss-bounces@spectrumscale.org</a></font><font face="Consolas">] On Behalf Of Peter Childs<br>
Sent: Tuesday, May 24, 2016 10:01 AM<br>
To: gpfsug main discussion list<br>
Subject: [gpfsug-discuss] OOM Killer killing off GPFS 3.5</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">Hi All,</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">We have an issue where the Linux kills off GPFS first when a computer runs out of memory, this happens when user processors have exhausted memory and swap and the out of memory killer in Linux kills the GPFS daemon as the largest user
of memory, due to its large pinned memory foot print.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">We have an issue where the Linux kills off GPFS first when a computer runs out of memory. We are running GPFS 3.5</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">We believe this happens when user processes have exhausted memory and swap and the out of memory killer in Linux chooses to kill the GPFS daemon as the largest user of memory, due to its large pinned memory footprint.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">This means that GPFS is killed and the whole cluster blocks for a minute before it resumes operation, this is not ideal, and kills and causes issues with most of the cluster.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">What we see is users unable to login elsewhere on the cluster until we have powered off the node. We believe this is because while the node is still pingable, GPFS doesn't expel it from the cluster.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">This issue mainly occurs on our frontend nodes of our HPC cluster but can effect the rest of the cluster when it occurs.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">This issue mainly occurs on the login nodes of our HPC cluster but can affect the rest of the cluster when it occurs.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">I've seen others on list with this issue.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">We've come up with a solution where by the gpfs is adjusted so that is unlikely to be the first thing to be killed, and hopefully the user process is killed and not GPFS.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">We've come up with a solution to adjust the OOM score of GPFS, so that it is unlikely to be the first thing to be killed, and hopefully the OOM killer picks a user process instead.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">Out testing says this solution works, but I'm asking here firstly to share our knowledge and secondly to ask if there is anything we've missed with this solution and issues with this.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">We've tested this and it seems to work. I'm asking here firstly to share our knowledge and secondly to ask if there is anything we've missed with this solution.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">Its short which is part of its beauty.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">/usr/local/sbin/gpfs-oom_score_adj</font><br>
<font face="Consolas"></font><br>
<font face="Consolas"><pre></font><br>
<font face="Consolas">#!/bin/bash</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">for proc in $(pgrep mmfs); do</font><br>
<font face="Consolas">echo -500 >/proc/$proc/oom_score_adj done </pre></font><br>
<font face="Consolas"></font><br>
<font face="Consolas">This can then be called automatically on GPFS startup with the following:</font><br>
<font face="Consolas"></font><br>
<font face="Consolas"><pre></font><br>
<font face="Consolas">mmaddcallback startupoomkiller --command /usr/local/sbin/gpfs-oom_score_adj --event startup </pre></font><br>
<font face="Consolas"></font><br>
<font face="Consolas">and either restart gpfs or just run the script on all nodes.</font><br>
<font face="Consolas"></font><br>
<font face="Consolas">Peter Childs</font><br>
<font face="Consolas">ITS Research Infrastructure</font><br>
<font face="Consolas">Queen Mary, University of London</font><br>
<font face="Consolas">_______________________________________________</font><br>
<font face="Consolas">gpfsug-discuss mailing list</font><br>
<font face="Consolas">gpfsug-discuss at spectrumscale.org</font><br>
<a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"><font color="#0000FF" face="Consolas">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</font></a><tt>_______________________________________________<br>
gpfsug-discuss mailing list<br>
gpfsug-discuss at spectrumscale.org<br>
</tt><tt><a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a></tt><tt><br>
</tt><br>
<br>
</div>
</body>
</html>