<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Hi All,
<div class=""><br class="">
</div>
<div class="">Time for a daily update on this saga…</div>
<div class=""><br class="">
</div>
<div class="">First off, responses to those who have responded to me:</div>
<div class=""><br class="">
</div>
<div class="">Yaron - we have QLogic switches, but I’ll RTFM and figure out how to clear the counters … with a quick look via the CLI on one of them I don’t see how to even look at those counters, much less clear them, but I’ll do some digging.  QLogic
 does have a GUI app, but given that the Mac version is PowerPC only I think that’s a dead end!  :-O</div>
<div class=""><br class="">
</div>
<div class="">Jonathan - understood.  We just wanted to rule out as much hardware as we could as potential culprits.  The storage arrays will all get a power-cycle this Sunday when we take a downtime to do firmware upgrades on them … the vendor is basically
 refusing to assist further until we get on the latest firmware.</div>
<div class=""><br class="">
</div>
<div class="">So … we had noticed that things seemed to calm down starting Friday evening and continuing throughout the weekend.  We have a script that runs every half hour, and if there are any NSD servers where “mmdiag --iohist” shows an I/O > 1,000 ms, we get
 an alert (again, designed to alert us to a CBM failure).  We only got three all weekend long (as opposed to last week, when the alerts were coming every half hour round the clock).</div>
<div class=""><br class="">
</div>
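<div class="">For anyone wanting to set up something similar, the threshold check in a script like that boils down to roughly the following.  This is only a minimal Python sketch, not the script described above: the column layout in SAMPLE, the TIME_MS_COLUMN index and the slow_ios name are assumptions made for illustration, so check them against the “mmdiag --iohist” output on your own NSD servers before relying on it.</div>
<pre class="">
```python
# Assumed layout (hypothetical sample, not verified mmdiag output):
# whitespace-separated columns, service time in milliseconds in column 5.
TIME_MS_COLUMN = 5
THRESHOLD_MS = 1000.0


def slow_ios(lines, threshold_ms=THRESHOLD_MS, time_col=TIME_MS_COLUMN):
    """Return the rows (as field lists) whose service time exceeds threshold_ms."""
    slow = []
    for line in lines:
        fields = line.split()
        if len(fields) > time_col:
            try:
                elapsed_ms = float(fields[time_col])
            except ValueError:
                continue  # header, separator, or other non-data line
            if elapsed_ms > threshold_ms:
                slow.append(fields)
    return slow


# Hypothetical sample rows, made up for this sketch.
SAMPLE = """\
I/O start time RW Buf type disk:sectorNum nSec time ms Type Device/NSD ID
11:17:58.005621 R data 2:1325056 8 0.262 srv dm-2
11:18:03.114250 W data 7:9912320 2048 1843.517 srv dm-7
"""

if __name__ == "__main__":
    for row in slow_ios(SAMPLE.splitlines()):
        print("slow I/O:", " ".join(row))
```
</pre>
<div class="">In a real deployment the lines would come from running the command on each NSD server, and a non-empty result would trigger the alert.</div>
<div class=""><br class="">
</div>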
<div class="">Then, this morning I repeated the “dd” test that I had run before and after replacing the FC cables going to “eon34”, which had shown very typical I/O rates for all the NSDs except for the 4 in eon34, which were quite poor (~1.5 - 10 MB/sec).
  I ran the new tests this morning from different NSD servers and with a higher “count” passed to dd to eliminate any potential caching effects.  I ran the test twice from two different NSD servers and this morning all NSDs - including those on eon34 - showed
 normal I/O rates!</div>
<div class=""><br class="">
</div>
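<div class="">The dd test amounts to timing a fixed number of sequential block reads.  Here is a minimal Python sketch of that measurement, under stated assumptions: the names read_throughput_mb_s, block_size and count are made up for illustration, and a temporary file stands in for the NSD device.  These reads go through the page cache, which is why a larger count helps.</div>
<pre class="">
```python
import os
import tempfile
import time


def read_throughput_mb_s(path, block_size=1024 * 1024, count=16):
    """Read up to `count` blocks of `block_size` bytes; return (bytes read, MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb", buffering=0) as dev:
        for _ in range(count):
            chunk = dev.read(block_size)
            if not chunk:
                break  # reached the end of the device or file
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = total / (1024 * 1024) / elapsed if elapsed > 0 else float("inf")
    return total, rate


if __name__ == "__main__":
    # A temporary file stands in for a device path (hypothetical stand-in).
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"\0" * (4 * 1024 * 1024))
    try:
        nbytes, rate = read_throughput_mb_s(tmp.name, count=4)
        print(f"read {nbytes} bytes at {rate:.1f} MB/s")
    finally:
        os.unlink(tmp.name)
```
</pre>
<div class="">Running the same read against each NSD from more than one server, as in the test above, is what separates a slow array from a slow server or cable.</div>
<div class=""><br class="">
</div>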
<div class="">Argh - so do we have a hardware problem or not?!?</div>
<div class=""><br class="">
</div>
<div class="">I still think we do, but am taking *nothing* for granted at this point!   So today we also used another script we’ve written to do some investigation … basically we took the script which runs “mmdiag --iohist” and added some options to it so that
 for every I/O greater than the threshold it looks up which client issued the I/O.  It then queries SLURM to see what jobs are running on that client.</div>
<div class=""><br class="">
</div>
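<div class="">The client-to-job correlation can be sketched as follows.  This is a hedged illustration rather than the script described above: squeue_users and tally_users are hypothetical names, the node and user names in the usage note are made up, and the squeue call reports jobs running at query time, which may differ from what was running when the slow I/O was issued.</div>
<pre class="">
```python
import subprocess
from collections import Counter


def squeue_users(node):
    """Return the users with jobs on `node`, as reported by squeue."""
    out = subprocess.run(
        ["squeue", "-h", "-w", node, "-o", "%u"],  # no header, one node, user only
        capture_output=True, text=True, check=True,
    ).stdout
    return sorted(set(out.split()))


def tally_users(slow_clients, users_on_node=squeue_users):
    """Count, per user, how many slow-I/O reports came from a node they were on.

    `slow_clients` holds one node name per slow I/O; `users_on_node` is
    injectable so the tally can be exercised without a SLURM cluster.
    """
    counts = Counter()
    for node in slow_clients:
        counts.update(users_on_node(node))
    return counts
```
</pre>
<div class="">For example, tally_users(["node01", "node02", "node02"]) would query squeue once per report, and counts.most_common() would then list the users who appear most often.</div>
<div class=""><br class="">
</div>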
<div class="">Interestingly enough, one user showed up waaaayyyyyy more often than anybody else.  And many times she was on a node with only one other user who we know doesn’t access the GPFS filesystem and other times she was the only user on the node.  </div>
<div class=""><br class="">
</div>
<div class="">We certainly recognize that correlation is not causation (she could be a victim and not the culprit), but she was on so many of the reported clients that we decided to investigate further … but her jobs seem to have fairly modest I/O requirements.
  Each one processes 4 input files, which are basically just gzip’d text files of 1.5 - 5 GB in size.  This, however, is what prompted my other query to the list about determining which NSDs a given file has its blocks on.  I couldn’t see how files of that
 size could have all their blocks on only a couple of NSDs in the pool (out of 19 total!) but wanted to verify that.  The files that I have looked at are evenly spread out across the NSDs.</div>
<div class=""><br class="">
</div>
<div class="">So given that her files are spread across all 19 NSDs in the pool and the high I/O wait times are almost always only on LUNs in eon34 (and, more specifically, on two of the four LUNs in eon34) I’m pretty well convinced it’s not her jobs causing
 the problems … I’m back to thinking it’s a weird hardware issue.</div>
<div class=""><br class="">
</div>
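<div class="">That argument rests on how the slow I/Os are distributed across arrays, which can be tallied along these lines.  This is a minimal sketch under assumptions: slow_io_share is a hypothetical name, and the NSD-to-array mapping is something you would have to supply from your own configuration.</div>
<pre class="">
```python
from collections import Counter


def slow_io_share(slow_nsds, nsd_to_array):
    """Return each array's fraction of the slow I/Os listed in `slow_nsds`.

    `slow_nsds` holds one NSD name per slow I/O; `nsd_to_array` maps an NSD
    name to the storage array that hosts it (supplied by the caller).
    """
    per_array = Counter(nsd_to_array.get(nsd, "unknown") for nsd in slow_nsds)
    total = sum(per_array.values())
    if not total:
        return {}
    return {array: n / total for array, n in per_array.items()}
```
</pre>
<div class="">If one array holds only a few of the NSDs in the pool but accounts for nearly all of the slow I/Os, that points at the array rather than at any one workload.</div>
<div class=""><br class="">
</div>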
<div class="">But if anyone wants to try to convince me otherwise, I’ll listen…</div>
<div class=""><br class="">
</div>
<div class="">Thanks!</div>
<div class=""><br class="">
</div>
<div class="">Kevin<br class="">
<div><br class="">
<blockquote type="cite" class="">
<div class="">On Jul 8, 2018, at 12:32 PM, Yaron Daniel <<a href="mailto:YARD@il.ibm.com" class="">YARD@il.ibm.com</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class=""><span style=" font-size:10pt;font-family:sans-serif" class="">Hi</span><br class="">
<br class="">
<span style=" font-size:10pt;font-family:sans-serif" class="">Clear all counters on the FC switches and see which ports have errors.</span><br class="">
<br class="">
<span style=" font-size:10pt;font-family:sans-serif" class="">For brocade run :</span><br class="">
<br class="">
<span style=" font-size:10pt;font-family:sans-serif" class="">slotstatsclear</span><br class="">
<span style=" font-size:10pt;font-family:sans-serif" class="">statsclear</span><br class="">
<span style=" font-size:10pt;font-family:sans-serif" class="">porterrshow</span><br class="">
<br class="">
<span style=" font-size:10pt;font-family:sans-serif" class="">For cisco run:</span><br class="">
<span style=" font-size:9pt;font-family:Arial" class=""> </span><br class="">
<span style=" font-size:9pt;font-family:Arial" class="">clear countersall</span><br class="">
<br class="">
<span style=" font-size:9pt;font-family:Arial" class="">There might be a bad GBIC, cable, or storage-side GBIC, which can affect performance. If there is something like that, you will see which ports have error counts that grow over time.</span><br class="">
<span style=" font-size:10pt;font-family:Arial" class="">Regards</span><br class="">
<span style=" font-size:9pt;font-family:Arial" class=""> </span><br class="">
<table width="780" style="border-collapse:collapse;" class="">
<tbody class="">
<tr height="8" class="">
<td width="780" colspan="4" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<div align="center" class="">
<hr noshade="" class="">
</div>
<br class="">
<span style=" font-size:1pt;font-family:Arial" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="780" colspan="4" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:1pt;font-family:Arial" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="516" colspan="2" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:blue;font-family:Arial" class=""><b class="">Yaron Daniel</b></span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class=""> 94 Em Ha'Moshavot Rd</span></td>
<td width="96" rowspan="3" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
</td>
</tr>
<tr height="8" class="">
<td width="516" colspan="2" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:blue;font-family:Arial" class=""><b class="">Storage Architect – IL Lab Services (Storage)</b></span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class=""> Petach Tiqva, 49527</span></td>
</tr>
<tr height="8" class="">
<td width="516" colspan="2" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:blue;font-family:Arial" class=""><b class="">IBM Global Markets, Systems HW Sales</b></span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class=""> Israel</span></td>
</tr>
<tr height="8" class="">
<td width="516" colspan="2" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:blue;font-family:Arial" class=""><b class=""> </b></span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class=""> </span></td>
<td width="96" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:9pt;font-family:Arial" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="90" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">Phone:</span></td>
<td width="426" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">+972-3-916-5672</span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class=""> </span></td>
<td width="96" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="90" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">Fax:</span></td>
<td width="426" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">+972-3-916-5672</span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">  </span></td>
<td width="96" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="90" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">Mobile:</span></td>
<td width="426" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">+972-52-8395593</span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">  </span></td>
<td width="96" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="90" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">e-mail:</span></td>
<td width="426" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class=""><a href="mailto:yard@il.ibm.com" class="">yard@il.ibm.com</a></span></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">  </span></td>
<td width="96" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="516" colspan="2" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<a href="https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fwww.ibm.com%2Fil%2Fhe%2F&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C7c1ced16f6d44055c63408d5e4fa7d2e%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636666686866046739&sdata=fpkPC3%2FjrhpFp1iLq3THOlRQTCGFdAInnRjsIs9zFEc%3D&reserved=0" class=""><span style=" font-size:10pt;color:blue;font-family:Arial" class=""><u class="">IBM
 Israel</u></span></a></td>
<td width="168" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt;color:#5f5f5f;font-family:Arial" class="">  </span></td>
<td width="96" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:10pt" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="780" colspan="4" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:9pt;color:#5f5f5f;font-family:Arial" class=""> </span></td>
</tr>
<tr height="8" class="">
<td width="780" colspan="4" style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;" class="">
<span style=" font-size:9pt;color:#5f5f5f;font-family:Arial" class=""> </span></td>
</tr>
</tbody>
</table>
<p style="margin-top:0px;margin-Bottom:0px" class=""></p>
<br class="">
<br class="">
<br class="">
<br class="">
<span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif" class="">From:        </span><span style=" font-size:9pt;font-family:sans-serif" class="">Jonathan Buzzard <<a href="mailto:jonathan.buzzard@strath.ac.uk" class="">jonathan.buzzard@strath.ac.uk</a>></span><br class="">
<span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif" class="">To:        </span><span style=" font-size:9pt;font-family:sans-serif" class=""><a href="mailto:gpfsug-discuss@spectrumscale.org" class="">gpfsug-discuss@spectrumscale.org</a></span><br class="">
<span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif" class="">Date:        </span><span style=" font-size:9pt;font-family:sans-serif" class="">07/07/2018 11:43 AM</span><br class="">
<span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif" class="">Subject:        </span><span style=" font-size:9pt;font-family:sans-serif" class="">Re: [gpfsug-discuss] High I/O wait times</span><br class="">
<span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif" class="">Sent by:        </span><span style=" font-size:9pt;font-family:sans-serif" class=""><a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">gpfsug-discuss-bounces@spectrumscale.org</a></span><br class="">
<hr noshade="" class="">
<br class="">
<br class="">
<br class="">
<tt class=""><span style=" font-size:10pt" class="">On 07/07/18 01:28, Buterbaugh, Kevin L wrote:<br class="">
<br class="">
[SNIP]<br class="">
<br class="">
> <br class="">
> So, to try to rule out everything but the storage array we replaced the <br class="">
> FC cables going from the SAN switches to the array, plugging the new <br class="">
> cables into different ports on the SAN switches.  Then we repeated the <br class="">
> dd tests from a different NSD server, which both eliminated the NSD <br class="">
> server and its FC cables as a potential cause … and saw results <br class="">
> virtually identical to the previous test.  Therefore, we feel pretty <br class="">
> confident that it is the storage array and have let the vendor know all <br class="">
> of this.<br class="">
<br class="">
I was not thinking of doing anything quite as drastic as replacing <br class="">
stuff, more look into the logs on the switches in the FC network and <br class="">
examine them for packet errors. The above testing didn't eliminate bad <br class="">
optics in the storage array itself for example, though it does appear to <br class="">
be the storage arrays themselves. Sounds like they could do with a power <br class="">
cycle...<br class="">
<br class="">
JAB.<br class="">
<br class="">
-- <br class="">
Jonathan A. Buzzard                         Tel: +44141-5483420<br class="">
HPC System Administrator, ARCHIE-WeSt.<br class="">
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG<br class="">
_______________________________________________<br class="">
gpfsug-discuss mailing list<br class="">
gpfsug-discuss at <a href="http://spectrumscale.org" class="">spectrumscale.org</a><br class="">
</span></tt><a href="https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Furldefense.proofpoint.com%2Fv2%2Furl%3Fu%3Dhttp-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss%26d%3DDwIGaQ%26c%3Djf_iaSHvJObTbx-siA1ZOg%26r%3DBn1XE9uK2a9CZQ8qKnJE3Q%26m%3DTM-kJsvzTX9cq_xmR5ITHclBCfO4FDvZ3ZxyugfJCfQ%26s%3DAss164qVEhb9fC4_VCmzfZeYd_BLOv9cZsfkrzqi8pM%26e%3D&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C7c1ced16f6d44055c63408d5e4fa7d2e%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636666686866046739&sdata=B%2F2Q9L1bwUvPHv858hLhTzt1hFT%2BMhCIOVeqGvLv3Rg%3D&reserved=0" originalsrc="https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=Bn1XE9uK2a9CZQ8qKnJE3Q&m=TM-kJsvzTX9cq_xmR5ITHclBCfO4FDvZ3ZxyugfJCfQ&s=Ass164qVEhb9fC4_VCmzfZeYd_BLOv9cZsfkrzqi8pM&e=" shash="bn6QhCapPNWgL4/t6rqwzZiOSdZ25Vvz5eXW6n1SyuHqq9ux+hKTXJOuUtgnuoP4KkCjDpbrH6SXN5y8rMiM5EKuVqxzNYHVmA0EUPaaPITt7VHgz07kEKG7xT2Wvc/vypw2FtTh461y4/CH7/uQQa4M42wQZuOnZIabLpXDTNQ=" class=""><tt class=""><span style=" font-size:10pt" class="">https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=Bn1XE9uK2a9CZQ8qKnJE3Q&m=TM-kJsvzTX9cq_xmR5ITHclBCfO4FDvZ3ZxyugfJCfQ&s=Ass164qVEhb9fC4_VCmzfZeYd_BLOv9cZsfkrzqi8pM&e=</span></tt></a><tt class=""><span style=" font-size:10pt" class=""><br class="">
<br class="">
</span></tt><br class="">
<br class="">
<br class="">
</div>
</blockquote>
</div>
<br class="">
</div>
</body>
</html>