<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

Hi All,

<div class=""><br class="">

</div>

<div class="">I had noticed some waiters on my NSD servers that I thought were unrelated to the mmchdisk.  However, I decided to try rebooting my NSD servers one at a time (mmshutdown failed!) to clear them up … and evidently one of them had things hung up, because the mmchdisk start then completed.</div>

<div class=""><br class="">

</div>

<div class="">Thanks…</div>

<div class=""><br class="">

</div>

<div class="">Kevin<br class="">

<div><br class="">

<blockquote type="cite" class="">

<div class="">On Jul 15, 2018, at 12:34 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] &lt;<a href="mailto:aaron.s.knister@nasa.gov" class="">aaron.s.knister@nasa.gov</a>&gt; wrote:</div>

<br class="Apple-interchange-newline">

<div class="">

<div class="">

<div dir="ltr" class="">Hmm... have you dumped waiters across the entire cluster, or just on the NSD servers/FS managers? Maybe there’s a slow node out there participating in the suspend effort? It might be worth running some quick tracing on the FS manager to see what it’s up to.</div>
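<div class="">A cluster-wide waiter dump like the one suggested above could be sketched roughly as follows. This is only an illustration under stated assumptions: passwordless ssh to every node, mmdiag installed under /usr/lpp/mmfs/bin (the usual location), and hypothetical node names. The dump_waiters helper and the DRY_RUN switch are made up for this sketch; with DRY_RUN left at its default of 1 it only prints the commands it would run.</div>
<pre class="">
```shell
#!/bin/sh
# Sketch: collect "mmdiag --waiters" from a list of nodes over ssh.
# Assumptions (not from the thread): passwordless ssh, mmdiag under
# /usr/lpp/mmfs/bin, hypothetical node names nsd01/nsd02/client01.
# dump_waiters and DRY_RUN are illustrative names, not GPFS features.
dump_waiters() {
    for node in "$@"; do
        cmd="ssh $node /usr/lpp/mmfs/bin/mmdiag --waiters"
        if [ "${DRY_RUN:-1}" = "1" ]; then
            # Dry run: show the command instead of executing it.
            printf '%s\n' "$cmd"
        else
            # Real run: label each node's output, then run the command.
            printf '=== %s ===\n' "$node"
            $cmd
        fi
    done
}

dump_waiters nsd01 nsd02 client01
```
</pre>
<div class="">Setting DRY_RUN=0 would actually contact the nodes; the node list should then cover clients as well as NSD servers, since the point is to spot a slow node anywhere in the cluster.</div>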

<span id="draft-break" class=""></span><br class="">


<div class="">

<div class="null" dir="auto">On July 15, 2018 at 13:27:54 EDT, Buterbaugh, Kevin L &lt;<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">Kevin.Buterbaugh@Vanderbilt.Edu</a>&gt; wrote:<br class="null">

</div>

<blockquote type="cite" style="border-left-style:solid;border-width:1px;margin-left:0px;padding-left:10px;" class="null">

<div class="null" dir="auto">

<div class="null">

<div class="null" nop="" style="word-wrap:break-word; line-break:after-white-space">

Hi All,

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">We are in a partial cluster downtime today to do firmware upgrades on our storage arrays.  It is a partial downtime because we have two GPFS filesystems:</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">1.  gpfs23 - 900+ TB, which corresponds to /scratch and /data and which I’ve unmounted across the cluster because it has data replication set to 1.</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">2.  gpfs22 - 42 TB, which corresponds to /home.  It has data replication set to 2, so what we’re doing is “mmchdisk gpfs22 suspend -d &lt;the gpfs22 NSD&gt;”, then doing the firmware upgrade, and once the array is back we’re doing a “mmchdisk gpfs22 resume -d &lt;NSD&gt;”, followed by “mmchdisk gpfs22 start -d &lt;NSD&gt;”.</div>
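<div nop="" class="null">The per-array sequence described above can be written out as a small dry-run sketch. It only prints the three mmchdisk commands in the order given in this message and executes nothing; the print_mmchdisk_cycle helper and the NSD name nsd_example are hypothetical placeholders, not names from our cluster.</div>
<pre class="null">
```shell
#!/bin/sh
# Dry-run sketch: prints the suspend / resume / start sequence from the
# message above. It does not run mmchdisk.
# print_mmchdisk_cycle and nsd_example are hypothetical names.
print_mmchdisk_cycle() {
    fs="$1"    # file system device, e.g. gpfs22
    nsd="$2"   # NSD backed by the array being upgraded
    printf 'mmchdisk %s suspend -d %s\n' "$fs" "$nsd"
    printf '# ... firmware upgrade on the array happens here ...\n'
    printf 'mmchdisk %s resume -d %s\n' "$fs" "$nsd"
    printf 'mmchdisk %s start -d %s\n' "$fs" "$nsd"
}

print_mmchdisk_cycle gpfs22 nsd_example
```
</pre>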

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">On the 1st storage array this went very smoothly … the mmchdisk took about 5 minutes, which is what I would expect.</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">But on the 2nd storage array the mmchdisk appears to be either hung or proceeding at a glacial pace.  For more than an hour it’s been stuck at:</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">

<div nop="" class="null">mmchdisk: Processing continues ...</div>

<div nop="" class="null">Scanning file system metadata, phase 1 …</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">There are no waiters of any significance and “mmdiag --iohist” doesn’t show any issues either.</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">Any ideas, anyone?  Unless I can figure this out I’m hosed for this downtime, as I’ve got 7 more arrays to do after this one!</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<div nop="" class="null">Thanks!</div>

<br nop="" class="null">

<div nop="" class="null">

<div nop="" class="null">—</div>

<div nop="" class="null">Kevin Buterbaugh - Senior System Administrator</div>

<div nop="" class="null">Vanderbilt University - Advanced Computing Center for Research and Education</div>

<div nop="" class="null"><a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" nop="" class="null">Kevin.Buterbaugh@vanderbilt.edu</a> - (615)875-9633</div>

<div nop="" class="null"><br nop="" class="null">

</div>

<br nop="Apple-interchange-newline" class="null">

</div>

<br nop="" class="null">

</div>

</div>

</div>

</div>

</blockquote>

</div>

</div>

_______________________________________________<br class="">

gpfsug-discuss mailing list<br class="">

gpfsug-discuss at <a href="http://spectrumscale.org" class="">spectrumscale.org</a><br class="">

<a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss" class="">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a><br class="">

</div>

</blockquote>

</div>

<br class="">

</div>

</body>

</html>