[gpfsug-discuss] SpectrumScale / AFM / Singularity soft lockups

Venkateswara R Puvvada vpuvvada at in.ibm.com
Fri Mar 19 09:50:04 GMT 2021


Hi Robert,

So you might have started seeing the problem after upgrading the gateway 
nodes to 5.0.5.2. Upgrading the gateway nodes at the cache cluster to 
5.0.5.6 would resolve this problem.
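
As a rough aid, the check below compares a version string against the 
5.0.5.6 level mentioned above. It is only a sketch: the function names are 
made up for illustration, the version string is assumed to be supplied by 
hand for each gateway node, and it relies on a sort that supports -V 
(version sort), as GNU sort does.

```shell
#!/usr/bin/env bash
# Sketch: report whether a Scale version string is below the 5.0.5.6 level.
# Function names are illustrative; the version is passed in by the operator.

FIXED_LEVEL="5.0.5.6"

# Normalise "5.0.5-2" to "5.0.5.2" so both spellings compare the same way.
normalise() {
    printf '%s\n' "$1" | sed 's/-/./g'
}

# Returns 0 (true) if version $1 is strictly lower than version $2.
version_lt() {
    local a b lowest
    a="$(normalise "$1")"
    b="$(normalise "$2")"
    [ "$a" = "$b" ] && return 1
    # sort -V orders dotted version numbers numerically, field by field.
    lowest="$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)"
    [ "$lowest" = "$a" ]
}

# Print a one-line verdict for the given version string.
needs_upgrade() {
    if version_lt "$1" "$FIXED_LEVEL"; then
        echo "$1: below $FIXED_LEVEL"
    else
        echo "$1: at or above $FIXED_LEVEL"
    fi
}

needs_upgrade "5.0.5-2"
needs_upgrade "5.0.5.6"
```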

~Venkat (vpuvvada at in.ibm.com)



From:     Robert Horton <robert.horton at icr.ac.uk>
To:       "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Date:     03/19/2021 03:13 PM
Subject:  [EXTERNAL] Re: [gpfsug-discuss] SpectrumScale / AFM / Singularity soft lockups
Sent by:  gpfsug-discuss-bounces at spectrumscale.org



Hi Venkat,

Thanks for getting back to me.

On the cache side we're running 5.0.4-3 on the NSD servers and 5.0.5-2 
everywhere else, including the gateway nodes.
The home cluster is 4.2.3-22. Unfortunately we're stuck on 4.x due to 
licensing, but we're in the process of replacing that system.

The actual AFM seems to be behaving fine though so I'm not sure that's our 
issue. I guess our next job is to see if we can reproduce it in a non-AFM 
fileset.

Rob

On Fri, 2021-03-19 at 12:02 +0530, Venkateswara R Puvvada wrote:

Robert,

What is the Scale version? This issue may be related to these alerts:

https://www.ibm.com/support/pages/node/6355983
https://www.ibm.com/support/pages/node/6380740

These are the recommended steps to resolve the issue, but I need more 
details on the Scale version.

1. Stop all AFM filesets at the cache using the
   "mmafmctl device stop -j fileset" command.
2. Perform a rolling upgrade in parallel at both the cache and home clusters:
      a. All nodes on the home cluster to 5.0.5.6
      b. All gateway nodes in the cache cluster to 5.0.5.6
3. At the home cluster, for each fileset target path, repeat the steps below:
      a. Remove the .afmctl file
         mmafmlocal rm <fileset target path>/.afm/.afmctl
      b. Enable AFM
         mmafmconfig enable <fileset target path>
4. Start all AFM filesets at the cache using the
   "mmafmctl device start -j fileset" command.
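
The steps above can be sketched as a dry-run script that only prints the 
commands in order, so the sequence can be reviewed before anything is run. 
The function names, the device name fs1, the fileset name projects and the 
target path /gpfs/home/projects are hypothetical placeholders; only the 
command forms are taken from the steps above.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands from the steps above, executes nothing.
# All device, fileset and path names used here are hypothetical placeholders.

# Steps 1 and 4, to be run at the cache cluster.
# Usage: cache_cmds <stop|start> <device> <fileset>...
cache_cmds() {
    local action="$1" device="$2"
    shift 2
    local fileset
    for fileset in "$@"; do
        echo "mmafmctl ${device} ${action} -j ${fileset}"
    done
}

# Step 3, to be run at the home cluster, once per fileset target path.
# Usage: home_cmds <fileset target path>...
home_cmds() {
    local path
    for path in "$@"; do
        echo "mmafmlocal rm ${path}/.afm/.afmctl"
        echo "mmafmconfig enable ${path}"
    done
}

# Example run with placeholder names.
cache_cmds stop fs1 projects
echo "# step 2: rolling upgrade to 5.0.5.6 (home nodes, cache gateway nodes)"
home_cmds /gpfs/home/projects
cache_cmds start fs1 projects
```

Printing rather than executing is deliberate: steps 1 and 4 run at the cache 
and step 3 at home, so the output is meant to be copied to the right cluster 
one step at a time.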

~Venkat (vpuvvada at in.ibm.com)



From:     Robert Horton <robert.horton at icr.ac.uk>
To:       "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Date:     03/18/2021 09:17 PM
Subject:  [EXTERNAL] [gpfsug-discuss] SpectrumScale / AFM / Singularity soft lockups
Sent by:  gpfsug-discuss-bounces at spectrumscale.org



Hello,

We've recently started having an issue where processes running in a 
Singularity container get stuck in a soft lockup, and eventually the node 
needs to be forcibly rebooted. I have included a sample call trace below. 
Additionally, other (non-Singularity) processes on other nodes accessing 
the same fileset seem to get into the same state. It's also an AFM IW 
fileset, just to add to the complexity ;)

Does anyone have any thoughts on what might be happening / how to proceed? 
I'm not really sure if it's a GPFS issue or a Singularity / kernel issue, 
although the fact that it seems to spread to other nodes would suggest some 
GPFS involvement. It's possible the user is doing something inadvisable 
with Singularity (it's difficult to work out what's happening in the 
Nextflow pipeline), but even if they are, it would be good to find a way of 
preventing them from taking nodes down. I'm assuming the AFM is unlikely to 
be relevant - any views on that?

Thanks,
Rob

 Call Trace:
 ? _Z11kSFSGetattrP15KernelOperationP13gpfsVfsData_tP10gpfsNode_tiP10cxiVattr_tP12gpfs_iattr64+0x1e4/0x5d0 [mmfs26]
 _ZL17refreshCacheAttrsP13gpfsVfsData_tP15KernelOperationP9cxiNode_tP10pcacheAttriPcj+0x441/0x450 [mmfs26]
 _Z21pcacheHandleCollisionP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tS4_PcPvP9MMFSVInfoiP10pcacheAttriS5_10PcacheModej+0xa21/0x11b0 [mmfs26]
 ? _ZN6ThCond6signalEv+0x82/0x190 [mmfs26]
 ? _ZN10MemoryPool6shFreeEPv9MallocUse+0x1a5/0x2a0 [mmfs26]
 ? _ZL14kSFSPcacheSendP13gpfsVfsData_tP15KernelOperation7FileUIDS3_PciiPPv+0x387/0x570 [mmfs26]
 ? _ZL17pcacheNeedRefresh10PcacheModejlijj+0x206/0x230 [mmfs26]
 _Z12pcacheLookupP13gpfsVfsData_tP15KernelOperationP10gpfsNode_tPvPcP7FilesetjjjPS5_PS4_PyPjS9_+0x1dcf/0x25c0 [mmfs26]
 ? _Z15findFilesetByIdP15KernelOperationjjPP7Filesetj+0x4f/0xa0 [mmfs26]
 _Z10gpfsLookupP13gpfsVfsData_tPvP9cxiNode_tS1_S1_PcjPS1_PS3_PyP10cxiVattr_tPjP10ext_cred_tjS5_PiS4_SD_+0x65c/0xad0 [mmfs26]
 gpfs_i_lookup+0x189/0x3f0 [mmfslinux]
 ? _Z8gpfsLinkP13gpfsVfsData_tP9cxiNode_tS2_PvPcjjP10ext_cred_t+0x6e0/0x6e0 [mmfs26]
 ? d_alloc_parallel+0x99/0x4a0
 ? _Z33gpfsIsCifsBypassTraversalCheckingv+0xe2/0x130 [mmfs26]
 __lookup_slow+0x97/0x150
 lookup_slow+0x35/0x50
 walk_component+0x1bf/0x330
 ? _ZL12gpfsGetattrxP13gpfsVfsData_tP9cxiNode_tP10cxiVattr_tP12gpfs_iattr64i+0x147/0x390 [mmfs26]
 path_lookupat.isra.49+0x75/0x200
 filename_lookup.part.63+0xa0/0x170
 ? strncpy_from_user+0x4f/0x1b0
 vfs_statx+0x73/0xe0
 __do_sys_newlstat+0x39/0x70
 ? syscall_trace_enter+0x1d3/0x2c0
 ? __audit_syscall_exit+0x249/0x2a0
 do_syscall_64+0x5b/0x1a0
 entry_SYSCALL_64_after_hwframe+0x65/0xca
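
In case it helps with reading a trace like this one, here is a small sketch 
that keeps only the frames not prefixed with "?" (the ones the kernel 
unwinder reports as reliable) and strips the offsets and module names. It 
assumes one frame per line on stdin; the function names are made up for 
illustration, and demangling only happens if c++filt from binutils is 
installed.

```shell
#!/usr/bin/env bash
# Sketch: reduce a kernel call trace (one frame per line on stdin) to the
# frames that are not prefixed with "?", printed as bare symbol names.

reliable_frames() {
    # Drop "?" frames and the header line, then strip leading whitespace,
    # the "+0xoff/0xsize" suffix and any trailing "[module]" tag.
    grep -v -e '^[[:space:]]*?' -e 'Call Trace:' \
        | sed -e 's/^[[:space:]]*//' -e 's/+0x[0-9a-f]*\/0x[0-9a-f]*.*$//' \
        | grep -v '^$'
}

# Demangle C++ names when c++filt is available; otherwise pass through.
demangle() {
    if command -v c++filt >/dev/null 2>&1; then
        c++filt
    else
        cat
    fi
}

# Example using four frames copied from the trace above.
reliable_frames <<'EOF' | demangle
 ? _ZN6ThCond6signalEv+0x82/0x190 [mmfs26]
 gpfs_i_lookup+0x189/0x3f0 [mmfslinux]
 ? d_alloc_parallel+0x99/0x4a0
 __lookup_slow+0x97/0x150
EOF
```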
-- 
Robert Horton | Research Data Storage Lead
The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB
T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk| W www.icr.ac.uk| 
Twitter @ICR_London
Facebook: www.facebook.com/theinstituteofcancerresearch

The Institute of Cancer Research: Royal Cancer Hospital, a charitable 
Company Limited by Guarantee, Registered in England under Company No. 
534147 with its Registered Office at 123 Old Brompton Road, London SW7 
3RP.

This e-mail message is confidential and for use by the addressee only. If 
the message is received by anyone other than the addressee, please return 
the message to the sender by replying to it and then delete the message 
from your computer and network.
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=gHmKEtEM3EvdWRefAF0Cs8N2qXPZg5flGutpiJu_bfg&s=dnKFsINgU63_3b-7i3z3uDnxnij6iT-y8L_mmYHr8IE&e=



