[gpfsug-discuss] mmchdisk hung / proceeding at a glacial pace?

Buterbaugh, Kevin L Kevin.Buterbaugh at Vanderbilt.Edu
Sun Jul 15 20:11:26 BST 2018


Hi All,

So I had noticed some waiters on my NSD servers that I thought were unrelated to the mmchdisk.  However, I decided to try rebooting my NSD servers one at a time (mmshutdown failed!) to clear that up … and evidently one of them had things hung up, because once it was rebooted the mmchdisk start completed.

Thanks…

Kevin

On Jul 15, 2018, at 12:34 PM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] <aaron.s.knister at nasa.gov> wrote:

Hmm...have you dumped waiters across the entire cluster or just on the NSD servers/fs managers? Maybe there’s a slow node out there participating in the suspend effort? Might be worth running some quick tracing on the FS manager to see what it’s up to.
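A minimal sketch of what that could look like, assuming the standard GPFS admin commands under /usr/lpp/mmfs/bin are available; <fsmgr-node> is a placeholder for whatever node mmlsmgr reports as the gpfs22 manager:

    # Dump waiters on every node in the cluster, not just the NSD servers,
    # to look for a slow node participating in the suspend.
    /usr/lpp/mmfs/bin/mmdsh -N all '/usr/lpp/mmfs/bin/mmdiag --waiters'

    # Find the file system manager for gpfs22 ...
    /usr/lpp/mmfs/bin/mmlsmgr gpfs22

    # ... and capture a short trace on it to see what it is busy with.
    /usr/lpp/mmfs/bin/mmtracectl --start -N <fsmgr-node>
    sleep 60
    /usr/lpp/mmfs/bin/mmtracectl --stop -N <fsmgr-node>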





On July 15, 2018 at 13:27:54 EDT, Buterbaugh, Kevin L <Kevin.Buterbaugh at Vanderbilt.Edu> wrote:
Hi All,

We are in a partial cluster downtime today to do firmware upgrades on our storage arrays.  It is a partial downtime because we have two GPFS filesystems:

1.  gpfs23 - 900+ TB, corresponding to /scratch and /data, which I’ve unmounted across the cluster because it has data replication set to 1.

2.  gpfs22 - 42 TB, corresponding to /home.  It has data replication set to 2, so what we’re doing is “mmchdisk gpfs22 suspend -d <the gpfs22 NSD>”, then doing the firmware upgrade, and once the array is back we’re doing a “mmchdisk gpfs22 resume -d <NSD>”, followed by “mmchdisk gpfs22 start -d <NSD>”.
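Roughly, the per-array sequence amounts to the following sketch (nsd_a and nsd_b are placeholders for the actual gpfs22 NSDs on the array being upgraded):

    # Suspend the gpfs22 NSDs on the array about to be upgraded;
    # with data replication set to 2 the file system stays available.
    mmchdisk gpfs22 suspend -d "nsd_a;nsd_b"

    # ... storage array firmware upgrade happens here ...

    # Bring the disks back and replay the updates they missed.
    mmchdisk gpfs22 resume -d "nsd_a;nsd_b"
    mmchdisk gpfs22 start -d "nsd_a;nsd_b"

    # Confirm nothing is left down or suspended before the next array.
    mmlsdisk gpfs22 -e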

On the 1st storage array this went very smoothly … the mmchdisk took about 5 minutes, which is what I would expect.

But on the 2nd storage array the mmchdisk appears to be either hung or proceeding at a glacial pace.  For more than an hour it’s been stuck at:

mmchdisk: Processing continues ...
Scanning file system metadata, phase 1 ...

There are no waiters of any significance and “mmdiag --iohist” doesn’t show any issues either.
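For reference, the checks so far amount to something like this, run on the NSD servers; <the gpfs22 NSD> is the same placeholder as above:

    # Long-running waiters on this node.
    mmdiag --waiters

    # Recent I/O history; slow or failing I/Os to the array would show up here.
    mmdiag --iohist

    # Status and availability of the disk being started.
    mmlsdisk gpfs22 -d "<the gpfs22 NSD>" -L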

Any ideas, anyone?  Unless I can figure this out I’m hosed for this downtime, as I’ve got 7 more arrays to do after this one!

Thanks!

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633



_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


