<html>


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


</head>


<body style="word-wrap: break-word; -webkit-nbsp-mode: space; -webkit-line-break: after-white-space; color: rgb(0, 0, 0); font-size: 14px; font-family: Helvetica, sans-serif;">


<div>


<div>


<div>Some general thoughts on “deadlocks” and automated deadlock detection.</div>


<div><br>


</div>


<div>I personally don’t like the term “deadlock” because it implies a condition that will never resolve itself. In GPFS terms, a deadlock is really a “long RPC waiter” that has exceeded a certain threshold. RPCs that wait on certain events can and do occur, and they can take some time to complete. This is not necessarily a problem, but you should look into them.</div>


<div><br>


</div>


<div>GPFS does have automated deadlock detection and collection, but in the early releases it was … well, not very “robust”. With later releases (4.2) it’s MUCH better. I personally don’t rely on it because in larger clusters it can be too aggressive and, depending on what’s really going on, it can make things worse. This is my opinion, and it doesn’t mean it’s not a good thing to have. :-) </div>


<div><br>


</div>


<div>On the point of what commands to execute and what to collect: be careful about long-running callback scripts and about executing commands on other nodes. Depending on what the issue is, you could end up causing a deadlock or making one worse. Some basic data collection, local to the node with the long RPC waiter, is a good thing. Test your scripts well before deploying them, and make sure that they don’t conflict with the automated collections (which you might consider turning off).</div>
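<div><br></div>
<div>As a rough illustration of “basic, local, and short-running” collection, here is a minimal Python sketch. It runs a single command on the local node with a hard timeout and saves the output. The function name, output directory, and timeout value are hypothetical choices, and the default command assumes mmdiag is installed under /usr/lpp/mmfs/bin; adjust all of these for your site, and test before wiring anything like this into a callback.</div>

```python
import subprocess
import time
from pathlib import Path

# Assumption: mmdiag is installed in the usual GPFS bin directory.
DEFAULT_COMMAND = ("/usr/lpp/mmfs/bin/mmdiag", "--waiters")


def collect_local(command=DEFAULT_COMMAND, out_dir="/tmp/waiter-dumps", timeout_s=10):
    """Run one local command with a hard timeout and save its stdout.

    Returns the path of the file written, or None if the command could
    not be started or did not finish within timeout_s seconds.
    Nothing here contacts another node.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / ("waiters-" + time.strftime("%Y%m%d-%H%M%S") + ".txt")
    try:
        # subprocess.run kills the child process if the timeout expires,
        # so a hung command cannot keep the callback running forever.
        result = subprocess.run(
            list(command), capture_output=True, text=True, timeout=timeout_s
        )
    except (subprocess.TimeoutExpired, OSError):
        return None
    target.write_text(result.stdout)
    return target
```

<div>The timeout is the important part of the design: if the node is already struggling, the collection gives up instead of piling more waiting work on top of the problem.</div>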


<div><br>


</div>


<div>For my larger clusters, I dump the cluster waiters on a regular basis (once a minute: mmlsnode -N waiters -L), count the types, and load the counts into a database for graphing via Grafana. This doesn’t help me with true deadlock alerting, but it does give me insight into overall cluster behavior. If I see large numbers of long waiters I will (usually) go and investigate them on a case-by-case basis. If you have large numbers of long RPC waiters on an ongoing basis, it's an indication of a larger problem that should be investigated. A few here and there are not a cause for real alarm in my experience.</div>
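<div><br></div>
<div>The “count the types” step could be sketched roughly as follows. This is only an illustration: the regular expression assumes each waiter line contains text of the form “waiting N seconds, ThreadName:”, the sample lines are made up, and the 30-second cutoff for a “long” waiter is an arbitrary placeholder. Check all three against the real output on your cluster.</div>

```python
import re
from collections import Counter

# Assumed line shape: "... waiting <seconds> seconds, <ThreadName>: ..."
WAITER_RE = re.compile(
    r"waiting\s+(?P<secs>\d+(?:\.\d+)?)\s+seconds,\s+(?P<kind>[^:]+):"
)


def summarize_waiters(text, long_threshold_s=30.0):
    """Count waiters per thread type; also count those at or over the threshold."""
    all_counts = Counter()
    long_counts = Counter()
    for line in text.splitlines():
        match = WAITER_RE.search(line)
        if match is None:
            continue  # skip headers and anything that is not a waiter line
        kind = match.group("kind").strip()
        all_counts[kind] += 1
        if float(match.group("secs")) >= long_threshold_s:
            long_counts[kind] += 1
    return all_counts, long_counts


# Made-up sample lines, for illustration only.
SAMPLE = """\
node1: 0x1 waiting 0.012 seconds, NSDThread: for I/O completion
node2: 0x2 waiting 45.300 seconds, RPCThread: on ThCond 0x3, reason 'example'
node2: 0x4 waiting 120.000 seconds, NSDThread: for I/O completion
"""

all_counts, long_counts = summarize_waiters(SAMPLE)
```

<div>The two counters can then be written to whatever time-series database feeds your graphs, one row per type per minute.</div>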


<div><br>


</div>


<div>Lastly, if you have a chance to upgrade to 4.1.1 or 4.2, I would encourage you to do so, as the deadlock detection has improved quite a bit.</div>


<div>


<div id="">


<div style="color: rgb(0, 0, 0); font-family: Calibri, sans-serif; font-size: 14px;">


<span style="font-family: Calibri; font-size: medium;"><br>


</span></div>


<div style="color: rgb(0, 0, 0); font-family: Helvetica, sans-serif; font-size: 14px;">


<font face="Helvetica">Bob Oesterlin<br>


Sr Storage Engineer, Nuance HPC Grid<br>


</font></div>


<div style="color: rgb(0, 0, 0); font-family: Helvetica, sans-serif; font-size: 14px;">


<font face="Helvetica">robert.oesterlin@nuance.com</font></div>


</div>


</div>


</div>


</div>


<div><br>


</div>


<span id="OLK_SRC_BODY_SECTION">


<div style="font-family:Calibri; font-size:12pt; text-align:left; color:black; BORDER-BOTTOM: medium none; BORDER-LEFT: medium none; PADDING-BOTTOM: 0in; PADDING-LEFT: 0in; PADDING-RIGHT: 0in; BORDER-TOP: #b5c4df 1pt solid; BORDER-RIGHT: medium none; PADDING-TOP: 3pt">


<span style="font-weight:bold">From: </span>&lt;<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a>&gt; on behalf of Roland Pabel &lt;<a href="mailto:dr.roland.pabel@gmail.com">dr.roland.pabel@gmail.com</a>&gt;<br>


<span style="font-weight:bold">Organization: </span>RRZK Uni Köln<br>


<span style="font-weight:bold">Reply-To: </span>gpfsug main discussion list &lt;<a href="mailto:gpfsug-discuss@spectrumscale.org">gpfsug-discuss@spectrumscale.org</a>&gt;<br>


<span style="font-weight:bold">Date: </span>Tuesday, April 12, 2016 at 3:03 AM<br>


<span style="font-weight:bold">To: </span>gpfsug main discussion list &lt;<a href="mailto:gpfsug-discuss@spectrumscale.org">gpfsug-discuss@spectrumscale.org</a>&gt;<br>


<span style="font-weight:bold">Subject: </span>[gpfsug-discuss] Executing Callbacks on other Nodes<br>


</div>


<div><br>


</div>


<div>


<div>


<div>Hi everyone,</div>


<div><br>


</div>


<div>We are using GPFS 4.1.0.8 with 4 servers and 850 clients. Our GPFS setup is </div>


<div>fairly new; we are still in the testing phase. A few days ago, we had some </div>


<div>problems in the cluster which seemed to have started with deadlocks on a small


</div>


<div>number of nodes. To be better prepared for this scenario, I would like to </div>


<div>install a callback for the deadlockDetected event. But this is a local event and


</div>


<div>the callback is executed on the client nodes, from which I cannot even send an


</div>


<div>email.</div>


<div><br>


</div>


<div>Is it possible using mm-commands to instead delegate the callback to the </div>


<div>servers (Nodeclass nsdNodes)?</div>


<div><br>


</div>


<div>I guess it would be possible to use a callback of the form "ssh nsd0 </div>


<div>/root/bin/deadlock-callback.sh", but then it is contingent upon server nsd0 </div>


<div>being available. The mm-command style "-N nsdNodes" would be more reliable in my


</div>


<div>opinion, because it would be run on all servers. On the servers, I can then </div>


<div>add a check so that the script actually runs only on the cluster manager.</div>


<div><br>


</div>


<div>Thanks</div>


<div><br>


</div>


<div>Roland</div>


<div>-- </div>


<div>Dr. Roland Pabel</div>


<div>Regionales Rechenzentrum der Universität zu Köln (RRZK)</div>


<div>Weyertal 121, Raum 3.07</div>


<div>D-50931 Köln</div>


<div><br>


</div>


<div>Tel.: +49 (221) 470-89589</div>


<div>E-Mail: <a href="mailto:pabel@uni-koeln.de">pabel@uni-koeln.de</a></div>




<div><br>


</div>


</div>


</div>


</span>


</body>


</html>