[gpfsug-discuss] FS freeze on client nodes with nbCores>workerThreads

Aaron Knister aaron.s.knister at nasa.gov
Tue Jun 27 00:57:57 BST 2017


That's a fascinating bug. When the node is locked up what does "mmdiag 
--waiters" show from the node in question? I suspect there's more 
low-level diagnostic data that's helpful for the gurus at IBM but I'm 
just curious what the waiters look like.
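When eyeballing waiters it usually helps to sort them so the longest waits come first. A minimal sketch, assuming waiter lines of the general shape "Waiting N.NNN sec ..." (the sample lines and thread names below are made up for illustration, not real mmdiag output):

```shell
# Sort waiter-style lines numerically on the wait time (field 2),
# longest wait first. In practice you would pipe "mmdiag --waiters"
# into the sort instead of the sample text used here.
waiters='Waiting 0.012 sec threadA
Waiting 312.456 sec threadB
Waiting 1.734 sec threadC'
printf '%s\n' "$waiters" | sort -k2 -rn | head -20
```

On a live node that would be something like `mmdiag --waiters | sort -k2 -rn | head -20`, which surfaces the threads that have been blocked the longest.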

-Aaron

On 6/26/17 3:49 AM, CAPIT, NICOLAS wrote:
> Hello,
> 
> I don't know whether this behavior/bug has already been reported on this 
> mailing list, so I'm posting it just in case.
> 
> Context:
> 
>    - SpectrumScale 4.2.2-3
>    - client node with 64 cores
>    - OS: RHEL7.3
> 
> When an MPI job with 64 processes is launched on the node with 64 cores, 
> the FS freezes (only the output log file of the MPI job is written to 
> GPFS, so it may be related to the 64 processes writing to the same 
> file???).
> 
>    strace -p 3105         # mmfsd pid, stuck
>    Process 3105 attached
>    wait4(-1,              # stuck at this point
> 
>    strace ls /gpfs
>    stat("/gpfs", {st_mode=S_IFDIR|0755, st_size=131072, ...}) = 0
>    openat(AT_FDCWD, "/gpfs", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC   
> # stuck at this point
> 
> I have no problem on the other nodes, which have 28 cores.
> The GPFS command mmgetstate still works, and I am able to use mmshutdown 
> to recover the node.
> 
> 
> If I set workerThreads=72 on the 64-core node, I am no longer able to 
> reproduce the freeze and I get the expected behavior.
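> For reference, a minimal sketch of how that workaround could be applied 
> from the cluster side, assuming standard mmchconfig usage (the node name 
> "node64" is hypothetical, and workerThreads changes typically only take 
> effect after a daemon restart):
> 
>    mmchconfig workerThreads=72 -N node64
>    mmshutdown -N node64 && mmstartup -N node64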
> 
> Is this a known bug with a number of cores > workerThreads?
> 
> Best regards,
> -- 
> *Nicolas Capit*
> 
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> 

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
