[gpfsug-discuss] forcibly panic stripegroup everywhere?

Sven Oehme oehmes at us.ibm.com
Mon Jan 23 04:12:02 GMT 2017


What version of Scale/GPFS code is this cluster on?

------------------------------------------
Sven Oehme
Scalable Storage Research
email: oehmes at us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
------------------------------------------



From:	Aaron Knister <aaron.s.knister at nasa.gov>
To:	<gpfsug-discuss at spectrumscale.org>
Date:	01/23/2017 01:31 AM
Subject:	Re: [gpfsug-discuss] forcibly panic stripegroup everywhere?
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



I was afraid someone would ask :)

One possible use would be testing how monitoring reacts to and/or
corrects stale filesystems.
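
To give a feel for the kind of monitoring probe I mean, here's a minimal
sketch (the mount point and timeout are made-up placeholders, not from any
real config): it stats the mount point from a child process and treats a
hang or an error as "stale":

#!/usr/bin/env python3
# Minimal monitoring-probe sketch: flag a GPFS mount as stale or
# unresponsive if stat'ing the mount point doesn't come back within a
# timeout. Mount point and timeout are hypothetical placeholders.
import subprocess
import sys

MOUNT_POINT = "/gpfs/ttest"   # hypothetical mount point
TIMEOUT_SECS = 10             # hypothetical probe timeout

def probe(path, timeout):
    """Return True if 'stat -f' on the path completes successfully within
    'timeout' seconds. The stat runs in a child process so a hung
    filesystem strands the child rather than the monitor; a timed-out
    child is simply abandoned (it can't be killed while stuck in the
    kernel anyway)."""
    child = subprocess.Popen(["stat", "-f", path],
                             stdout=subprocess.DEVNULL,
                             stderr=subprocess.DEVNULL)
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        return False

if __name__ == "__main__":
    healthy = probe(MOUNT_POINT, TIMEOUT_SECS)
    print("%s: %s" % (MOUNT_POINT, "OK" if healthy else "STALE/UNRESPONSIVE"))
    sys.exit(0 if healthy else 1)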

The use in my case is an issue we see quite often where a filesystem
won't unmount when trying to shut down gpfs. Linux insists it's still
busy despite just about every process on the node except init having
been killed. It's a real pain because it complicates maintenance,
requiring, for example, a reboot of some nodes prior to patching.

I dug into it, and it appears that when this happens the filesystem's
mnt_count is ridiculously high (300,000+ in one case). I'm trying to
debug it further, but to do that I need to be able to make the condition
happen a few more times. A stripegroup panic isn't a surefire way to
reproduce it, but it's the only way I've found so far to trigger this
behavior somewhat on demand.
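
For what it's worth, the check I've been using to spot a node that's
wedged in that state is basically this (a sketch only; the mount point is
a made-up placeholder): attempt the umount, then see whether the mount is
still listed in /proc/mounts.

#!/usr/bin/env python3
# Sketch of the check described above: try to unmount a GPFS filesystem
# and report whether the node is in the "still busy with no users" state.
# The mount point is a placeholder, not from any real configuration.
import subprocess
import sys

MOUNT_POINT = "/gpfs/ttest"   # hypothetical mount point

def is_mounted(path):
    # /proc/mounts lists one mount per line; field 2 is the mount point.
    with open("/proc/mounts") as mounts:
        return any(line.split()[1] == path for line in mounts)

def try_umount(path):
    # Plain umount attempt; returns (succeeded, stderr text).
    proc = subprocess.run(["umount", path], capture_output=True, text=True)
    return proc.returncode == 0, proc.stderr.strip()

if __name__ == "__main__":
    if not is_mounted(MOUNT_POINT):
        print("%s is not mounted, nothing to check" % MOUNT_POINT)
        sys.exit(0)
    ok, err = try_umount(MOUNT_POINT)
    if ok or not is_mounted(MOUNT_POINT):
        print("%s unmounted cleanly" % MOUNT_POINT)
        sys.exit(0)
    # Still mounted and umount refused: the condition described above.
    print("%s still mounted after umount attempt: %s" % (MOUNT_POINT, err))
    sys.exit(1)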

One way I've found to trigger a mass stripegroup panic is to induce what
I call a "301 error":

loremds07: Sun Jan 22 00:30:03.367 2017: [X] File System ttest unmounted by the system with return code 301 reason code 0
loremds07: Sun Jan 22 00:30:03.368 2017: Invalid argument

and tickle a known race condition between nodes being expelled from the
cluster and a manager node joining the cluster. When this happens it
seems to cause a mass stripegroup panic that's over in a few minutes.
The trick is that it doesn't happen every time I go through the
exercise, and when it does there's no guarantee the filesystem that
panics is the one in use. If it's not an fs in use, it doesn't help me
reproduce the error condition. That's why I was trying to use the
"mmfsadm test panic" command for a more direct approach.

Hope that helps shed some light.

-Aaron

On 1/22/17 8:16 PM, Andrew Beattie wrote:
> Out of curiosity -- why would you want to?
> Andrew Beattie
> Software Defined Storage  - IT Specialist
> Phone: 614-2133-7927
> E-mail: abeattie at au1.ibm.com
>
>
>
>     ----- Original message -----
>     From: Aaron Knister <aaron.s.knister at nasa.gov>
>     Sent by: gpfsug-discuss-bounces at spectrumscale.org
>     To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>     Cc:
>     Subject: [gpfsug-discuss] forcibly panic stripegroup everywhere?
>     Date: Mon, Jan 23, 2017 11:11 AM
>
>     This is going to sound like a ridiculous request, but, is there a
>     way to cause a filesystem to panic everywhere in one "swell foop"?
>     I'm assuming the answer will come with an appropriate disclaimer of
>     "don't ever do this, we don't support it, it might eat your data,
>     summon Cthulhu, etc.". I swear I've seen the fs manager initiate
>     this type of operation before.
>
>     I can seem to do it on a per-node basis with "mmfsadm test panic <fs>
>     <error code>" but if I do that over all 1k nodes in my test cluster
>     at once it results in about 45 minutes of almost total deadlock
>     while each panic is processed by the fs manager.
>
>     -Aaron
>
>     --
>     Aaron Knister
>     NASA Center for Climate Simulation (Code 606.2)
>     Goddard Space Flight Center
>     (301) 286-2776
>

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


