[gpfsug-discuss] mmchdisk suspend / stop (Buterbaugh, Kevin L)

Yaron Daniel YARD at il.ibm.com
Fri Feb 9 13:28:49 GMT 2018


Hi

Just make sure you have a backup, just in case ...


 
Regards,

Yaron Daniel
Storage Architect, IBM Global Markets, Systems HW Sales
94 Em Ha'Moshavot Rd, Petach Tiqva, 49527, Israel

Phone:  +972-3-916-5672
Fax:    +972-3-916-5672
Mobile: +972-52-8395593
e-mail: yard at il.ibm.com



From:   "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   02/08/2018 09:49 PM
Subject:        Re: [gpfsug-discuss] mmchdisk suspend / stop (Buterbaugh, Kevin L)
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi again all, 

It sounds like doing the "mmchconfig unmountOnDiskFail=meta -i" suggested 
by Steve and Bob, followed by using mmchdisk to stop the disks temporarily, 
is the way we need to go.  As an aside, we will also run an mmapplypolicy 
first to pull any files users have started accessing again back to the 
"regular" pool before doing any of this.
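
For the curious, the pullback I have in mind is just a simple migrate rule 
run ahead of the maintenance window; a rough sketch might look like the 
following (the file system name "gpfs0", the pool names, and the policy 
file name here are placeholders, not necessarily our exact configuration):

   # pullback.pol - move anything accessed within the last 90 days
   # out of the capacity pool and back to the regular pool
   RULE 'pullback' MIGRATE FROM POOL 'capacity' TO POOL 'system'
        WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 90

   # then run it against the file system:
   mmapplypolicy gpfs0 -P pullback.pol -I yes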

Given that this is our "capacity" pool and files have to have an atime > 
90 days to get migrated there in the 1st place, I think this is reasonable, 
especially since users will get an I/O error if they happen to try to 
access one of those NSDs during the brief maintenance window.

As to naming and shaming the vendor: I'm not going to do that at this 
point in time.  We've been using their stuff for well over a decade at 
this point and have had a generally positive experience with them.  In 
fact, I have spoken with them via phone since my original post today, and 
they have clarified that the mismatched firmware is only an issue because 
we are a major version behind what is current, since we chose not to take 
a downtime and therefore have not done any firmware upgrades in well over 
18 months.

Thanks, all...

Kevin

On Feb 8, 2018, at 11:17 AM, Steve Xiao <sxiao at us.ibm.com> wrote:

You can change the cluster configuration to unmount the file system 
online only when there is an error accessing metadata.   This can be done 
by running the following command:
   mmchconfig unmountOnDiskFail=meta -i 

After this configuration change, you should be able to stop all 5 NSDs 
with the mmchdisk stop command.    While these NSDs are in the down state, 
any user I/O to files residing on these disks will fail, but your file 
system should stay mounted and usable.
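
A rough sketch of the whole window might look like this (the file system 
name "gpfs0" and the NSD names below are only placeholders for your 
actual device and disks):

   mmchconfig unmountOnDiskFail=meta -i
   mmchdisk gpfs0 stop -d "nsd10;nsd11;nsd12;nsd13;nsd14"
   mmlsdisk gpfs0     # stopped disks should show availability "down"

   # ... perform the controller replacement and firmware upgrade ...

   mmchdisk gpfs0 start -d "nsd10;nsd11;nsd12;nsd13;nsd14"
   mmlsdisk gpfs0     # confirm all disks are back to "up"
   mmchconfig unmountOnDiskFail=no -i   # if "no" was your previous setting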

Steve Y. Xiao

> Date: Thu, 8 Feb 2018 15:59:44 +0000
> From: "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Subject: [gpfsug-discuss] mmchdisk suspend / stop
> Message-ID: <8DCA682D-9850-4C03-8930-EA6C68B41109 at vanderbilt.edu>
> Content-Type: text/plain; charset="utf-8"
> 
> Hi All,
> 
> We are in a bit of a difficult situation right now with one of our 
> non-IBM hardware vendors (I know, I know, I KNOW - buy IBM hardware!
> <grin>) and are looking for some advice on how to deal with this 
> unfortunate situation.
> 
> We have a non-IBM FC storage array with dual-"redundant" 
> controllers.  One of those controllers is dead and the vendor is 
> sending us a replacement.  However, the replacement controller will 
> have mis-matched firmware with the surviving controller and - long 
> story short - the vendor says there is no way to resolve that 
> without taking the storage array down for firmware upgrades. 
> Needless to say there's more to that story than what I've included 
> here, but I won't bore everyone with unnecessary details.
> 
> The storage array has 5 NSDs on it, but fortunately enough they are 
> part of our "capacity" pool; i.e. the only way a file lands here is
> if an mmapplypolicy scan moved it there because the *access* time is
> greater than 90 days.  Filesystem data replication is set to one.
> 
> So, what I was wondering is: could I use mmchdisk to either 
> suspend or (preferably) stop those NSDs, do the firmware upgrade, 
> and then resume the NSDs?  The problem I see is that suspend doesn't 
> stop I/O, it only prevents the allocation of new blocks; so, in 
> theory, if a user suddenly decided to start using a file they hadn't 
> needed for 3 months then I've got a problem.  Stopping all I/O to 
> the disks is what I really want to do.  However, according to the 
> mmchdisk man page, stop cannot be used on a filesystem with 
> replication set to one.
> 
> There's over 250 TB of data on those 5 NSDs, so restriping off of 
> them or setting replication to two are not options.
> 
> It is very unlikely that anyone would try to access a file on those 
> NSDs during the hour or so I'd need to do the firmware upgrades, but
> how would GPFS itself react to those (suspended) disks going away 
> for a while?  I'm thinking I could be OK if there was just a way to 
> actually stop them rather than suspend them.  Any undocumented 
> options to mmchdisk that I'm not aware of???
> 
> Are there other options - besides buying IBM hardware - that I am 
> overlooking?  Thanks...
> 
> --
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and Education
> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
> 
> 
> 


_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss







