[gpfsug-discuss] FS freeze on client nodes with nbCores>workerThreads

Aaron Knister aaron.knister at gmail.com
Fri Jun 30 16:47:40 BST 2017


Nicolas,

By chance, do you have a Skylake or Kaby Lake based CPU?
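
(If it helps to check, something like the following should show the CPU model on the affected node; this is just a generic Linux sketch, nothing GPFS-specific:)

    lscpu | grep -i "model name"
    # or, equivalently:
    grep -m1 "model name" /proc/cpuinfo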

Sent from my iPhone

> On Jun 30, 2017, at 02:57, IBM Spectrum Scale <scale at us.ibm.com> wrote:
> 
> I'm not aware of this kind of defect; it should not behave that way. However, without more data we don't know what happened. I suggest you open a PMR for your issue. Thanks.
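> 
> (As a general suggestion, the data usually attached to a PMR is a gpfs.snap collected while, or shortly after, the problem is visible. A minimal sketch, assuming the -N option of gpfs.snap on your release; the node name is a placeholder:)
> 
>    gpfs.snap -N <affected-node>     # collect diagnostic data from the affected node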
> 
> Regards, The Spectrum Scale (GPFS) team
> 
> ------------------------------------------------------------------------------------------------------------------
> If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. 
> 
> If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. 
> 
> The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team.
> 
> From: "CAPIT, NICOLAS" <ncapit at atos.net>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date: 06/27/2017 02:59 PM
> Subject: Re: [gpfsug-discuss] FS freeze on client nodes with nbCores>workerThreads
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> 
> 
> 
> 
> Hello,
> 
> When the node is locked up there are no waiters ("mmdiag --waiters" and "mmfsadm dump waiters" both come back empty).
> There is nothing in the GPFS log file "/var/mmfs/gen/mmfslog", and nothing in the dmesg output or the system log.
> The "mmgetstate" command says that the node is "active".
> The only symptom is the freeze of the FS.
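> 
> (For reference, the checks above were roughly the following; the last line, dumping the kernel stack of a stuck process, is just a generic Linux technique and the PID is only an example:)
> 
>    mmdiag --waiters          # comes back empty
>    mmfsadm dump waiters      # comes back empty
>    mmgetstate                # reports the node as "active"
>    cat /proc/3105/stack      # kernel-side view of where the stuck process is blocked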
> 
> Best regards,
> Nicolas Capit
> ________________________________________
> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov]
> Sent: Tuesday, June 27, 2017 01:57
> To: gpfsug-discuss at spectrumscale.org
> Subject: Re: [gpfsug-discuss] FS freeze on client nodes with nbCores>workerThreads
> 
> That's a fascinating bug. When the node is locked up what does "mmdiag
> --waiters" show from the node in question? I suspect there's more
> low-level diagnostic data that's helpful for the gurus at IBM but I'm
> just curious what the waiters look like.
> 
> -Aaron
> 
> On 6/26/17 3:49 AM, CAPIT, NICOLAS wrote:
> > Hello,
> >
> > I don't know if this behavior/bug has already been reported on this mailing
> > list, so I'm posting it just in case.
> >
> > Context:
> >
> >    - SpectrumScale 4.2.2-3
> >    - client node with 64 cores
> >    - OS: RHEL7.3
> >
> > When an MPI job with 64 processes is launched on the node with 64 cores,
> > the FS freezes (only the output log file of the MPI job is written to
> > the GPFS, so it may be related to the 64 processes writing to the same
> > file?).
> >
> >    strace -p 3105         # mmfsd pid, stuck
> >    Process 3105 attached
> >    wait4(-1,              # stuck at this point
> >
> >    strace ls /gpfs
> >    stat("/gpfs", {st_mode=S_IFDIR|0755, st_size=131072, ...}) = 0
> >    openat(AT_FDCWD, "/gpfs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC   # stuck at this point
> >
> > I have no problem with the other nodes, which have 28 cores.
> > The GPFS command mmgetstate is working, and I am able to use mmshutdown
> > to recover the node.
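> >
> > (Roughly, the recovery looks like the following; the node name is a placeholder:)
> >
> >    mmshutdown -N <64-core-node>     # release the hung mount on the stuck node
> >    mmstartup  -N <64-core-node>     # bring GPFS back up on that node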
> >
> >
> > If I set workerThreads=72 on the 64-core node, then I am not able to
> > reproduce the freeze and I get the correct behavior.
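> >
> > (For the record, a sketch of how the workaround was applied, assuming the usual mmchconfig syntax; the daemon is restarted on the node afterwards so the new value is sure to take effect:)
> >
> >    mmchconfig workerThreads=72 -N <64-core-node>
> >    mmshutdown -N <64-core-node> && mmstartup -N <64-core-node>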
> >
> > Is this a known bug with a number of cores > workerThreads?
> >
> > Best regards,
> > --
> > *Nicolas Capit*
> >
> >
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >
> 
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> 
> 
> 
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss


More information about the gpfsug-discuss mailing list