[gpfsug-discuss] forcibly panic stripegroup everywhere?

Aaron Knister aaron.s.knister at nasa.gov
Mon Jan 23 04:22:38 GMT 2017


It's at 4.1.1.10.

On 1/22/17 11:12 PM, Sven Oehme wrote:
> What version of Scale/ GPFS code is this cluster on ?
>
> ------------------------------------------
> Sven Oehme
> Scalable Storage Research
> email: oehmes at us.ibm.com
> Phone: +1 (408) 824-8904
> IBM Almaden Research Lab
> ------------------------------------------
>
> From: Aaron Knister <aaron.s.knister at nasa.gov>
> To: <gpfsug-discuss at spectrumscale.org>
> Date: 01/23/2017 01:31 AM
> Subject: Re: [gpfsug-discuss] forcibly panic stripegroup everywhere?
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> ------------------------------------------------------------------------
>
>
>
> I was afraid someone would ask :)
>
> One possible use would be testing how monitoring reacts to and/or
> corrects stale filesystems.
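>
> (For instance, a probe along these lines -- the mount point is just an
> illustration -- is roughly the kind of stale-mount check I'd want to
> exercise:
>
>     # does a simple statfs of the mount complete in a sane amount of time?
>     if ! timeout 10 stat -f /gpfs/ttest >/dev/null 2>&1; then
>         echo "WARNING: /gpfs/ttest looks stale or hung"
>     fi
> )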
>
> The use in my case is that there's an issue we see quite often where a
> filesystem won't unmount when trying to shut down GPFS. Linux insists
> it's still busy even though just about every process on the node except
> init has been killed. It's a real pain because it complicates
> maintenance, requiring a reboot of some nodes prior to patching, for
> example.
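>
> For reference, this is roughly the state a stuck node ends up in (the
> mount point is just an example; "ttest" is the test filesystem from the
> log further down):
>
>     mmshutdown                        # the unmount step hangs, fs stays mounted
>     fuser -vm /gpfs/ttest             # shows nothing still holding the mount
>     lsof +f -- /gpfs/ttest            # ditto
>     grep ttest /proc/self/mountinfo   # yet the mount is still there and "busy"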
>
> I dug into it, and it appears that when this happens the filesystem's
> mnt_count is ridiculously high (300,000+ in one case). I'm trying to
> debug it further, but to do that I need to be able to make the condition
> happen a few more times. A stripegroup panic isn't a surefire way to
> reproduce it, but it's the only way I've found so far to trigger the
> behavior somewhat on demand.
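>
> (If anyone wants to poke at this themselves, one way to eyeball the mount
> reference count is the crash utility against the live kernel. This needs
> kernel-debuginfo installed, and the struct layout varies by kernel:
> 2.6.32-era kernels keep mnt_count in struct vfsmount, newer ones keep a
> per-cpu count in struct mount.
>
>     crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux
>     crash> mount                      # note the vfsmount address of the stuck fs
>     crash> struct vfsmount.mnt_count <vfsmount-address>
> )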
>
> One way I've found to trigger a mass stripegroup panic is to induce what
> I call a "301 error":
>
> loremds07: Sun Jan 22 00:30:03.367 2017: [X] File System ttest unmounted
> by the system with return code 301 reason code 0
> loremds07: Sun Jan 22 00:30:03.368 2017: Invalid argument
>
> and then tickle a known race condition between nodes being expelled from
> the cluster and a manager node joining the cluster. When this happens it
> seems to cause a mass stripegroup panic that's over in a few minutes.
> The trick is that it doesn't happen every time I go through the
> exercise, and when it does there's no guarantee the filesystem that
> panics is one that's in use. If it's not an fs in use then it doesn't
> help me reproduce the error condition. That's why I was hoping to use
> the "mmfsadm test panic" command as a more direct approach.
>
> Hope that helps shed some light.
>
> -Aaron
>
> On 1/22/17 8:16 PM, Andrew Beattie wrote:
>> Out of curiosity -- why would you want to?
>> Andrew Beattie
>> Software Defined Storage  - IT Specialist
>> Phone: 614-2133-7927
>> E-mail: abeattie at au1.ibm.com
>>
>>
>>
>>     ----- Original message -----
>>     From: Aaron Knister <aaron.s.knister at nasa.gov>
>>     Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>     To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>>     Cc:
>>     Subject: [gpfsug-discuss] forcibly panic stripegroup everywhere?
>>     Date: Mon, Jan 23, 2017 11:11 AM
>>
>>     This is going to sound like a ridiculous request, but is there a way to
>>     cause a filesystem to panic everywhere in one "swell foop"? I'm assuming
>>     the answer will come with an appropriate disclaimer of "don't ever do
>>     this, we don't support it, it might eat your data, summon Cthulhu, etc.".
>>     I swear I've seen the fs manager initiate this type of operation before.
>>
>>     I can do it on a per-node basis with "mmfsadm test panic <fs>
>>     <error code>", but if I do that across all 1k nodes in my test cluster at
>>     once it results in about 45 minutes of almost total deadlock while each
>>     panic is processed by the fs manager.
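>>
>>     (Purely as a sketch -- the node list, batch size, and sleep interval
>>     below are invented, and the fs name and error code are just the ones
>>     from my logs -- staggering the per-node panics with mmdsh rather than
>>     firing them all at once would look something like:
>>
>>         split -l 50 /tmp/nodes.txt /tmp/batch_
>>         for b in /tmp/batch_*; do
>>             mmdsh -N "$(paste -sd, "$b")" \
>>                 "/usr/lpp/mmfs/bin/mmfsadm test panic ttest 301"
>>             sleep 60   # let the fs manager digest each wave
>>         done
>>
>>     ...but that rather defeats the point of doing it everywhere at once.)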
>>
>>     -Aaron
>>
>>     --
>>     Aaron Knister
>>     NASA Center for Climate Simulation (Code 606.2)
>>     Goddard Space Flight Center
>>     (301) 286-2776
>>
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
>

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


