<span style=" font-size:10pt;font-family:sans-serif">Hello Simon,</span><br><br><span style=" font-size:10pt;font-family:sans-serif">Sadly, that "1036"
is not a node ID, but just a counter.</span><br><br><span style=" font-size:10pt;font-family:sans-serif">These are tricky
to troubleshoot. Usually, by the time you realize it's happening and try
to collect some data, things have already timed out.</span><br><br><span style=" font-size:10pt;font-family:sans-serif">Since this mmdelsnapshot
isn't scheduled from cron or the GUI but is a command you run interactively,
you can try some heavy-handed data collection.</span><br><br><span style=" font-size:10pt;font-family:sans-serif">You suspect a
particular fileset already, so maybe have a 'mmdsh -N all lsof /path/to/fileset'
ready to go in one window, and the 'mmdelsnapshot' ready to go in another
window? When the mmdelsnapshot times out, you can find the nodes it was
waiting on in the file system manager mmfs.log.latest and see what matches
up with the open files identified by lsof.</span><br><br><span style=" font-size:10pt;font-family:sans-serif">It sounds like
you already know this, but the &lt;c0n42&gt; type of internal node names
in the log messages can be translated with 'mmfsadm dump tscomm' or also
plain old 'mmdiag --network'.<br></span><br><span style=" font-size:10pt;font-family:sans-serif">Thanks,</span><br><span style=" font-size:9pt;font-family:Arial"><br></span><table width=650 style="border-collapse:collapse;"><tr height=8><td width=650 style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;"><span style=" font-size:12pt;color:#8f8f8f;font-family:Arial"><b>Nate
Falk</b></span><span style=" font-size:9pt;font-family:Arial"><br>IBM Spectrum Scale Level 2 Support<br>Software Defined Infrastructure, IBM Systems</span></table><p style="margin-top:0px;margin-Bottom:0px"></p><table width=650 style="border-collapse:collapse;"><tr height=8><td width=650 colspan=2 style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;"><tr valign=top height=8><td width=363 style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;"><td width=286 style="border-style:none none none none;border-color:#000000;border-width:0px 0px 0px 0px;padding:0px 0px;"><div align=right></div></table><p style="margin-top:0px;margin-Bottom:0px"></p><br><br><br><br><span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif">From:
</span><span style=" font-size:9pt;font-family:sans-serif">Simon
Thompson &lt;S.J.Thompson@bham.ac.uk&gt;</span><br><span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif">To:
</span><span style=" font-size:9pt;font-family:sans-serif">gpfsug
main discussion list &lt;gpfsug-discuss@spectrumscale.org&gt;</span><br><span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif">Date:
</span><span style=" font-size:9pt;font-family:sans-serif">02/20/2020
03:14 PM</span><br><span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif">Subject:
</span><span style=" font-size:9pt;font-family:sans-serif">[EXTERNAL]
Re: [gpfsug-discuss] Unkillable snapshots</span><br><span style=" font-size:9pt;color:#5f5f5f;font-family:sans-serif">Sent
by: </span><span style=" font-size:9pt;font-family:sans-serif">gpfsug-discuss-bounces@spectrumscale.org</span><br><hr noshade><br><br><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">Hmm
... mmdiag --tokenmgr shows:</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><br><span style=" font-size:12pt;font-family:Calibri"> Server
stats: requests 195417431 ServerSideRevokes 120140</span><br><span style=" font-size:12pt;font-family:Calibri">
nTokens 2146923 nranges 4124507</span><br><span style=" font-size:12pt;font-family:Calibri">
designated mnode appointed 55481 mnode thrashing detected
1036</span><br><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">So
how do I convert "1036" to a node?</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">Simon</span></p><br><hr><br><span style=" font-size:11pt;font-family:Calibri"><b>From:</b> gpfsug-discuss-bounces@spectrumscale.org
&lt;gpfsug-discuss-bounces@spectrumscale.org&gt; on behalf of Simon Thompson
&lt;S.J.Thompson@bham.ac.uk&gt;<b><br>Sent:</b> 20 February 2020 19:45:02<b><br>To:</b> gpfsug main discussion list<b><br>Subject:</b> [gpfsug-discuss] Unkillable snapshots</span><span style=" font-size:12pt"></span><br><span style=" font-size:12pt"> </span><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">Hi,</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">We
have a snapshot which is stuck in the state "DeleteRequired".
When deleting, it goes through the motions but eventually gives up with:</span></p><br><span style=" font-size:12pt;font-family:Calibri">Unable to quiesce
all nodes; some processes are busy or holding required resources.</span><br><span style=" font-size:12pt;font-family:Calibri">mmdelsnapshot: Command
failed. Examine previous error messages to determine cause.</span><br><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">And
in the mmfs.log on the FS manager there are a bunch of retries and "failure
to quiesce" on nodes. However, in each retry it's never the same set
of nodes. I suspect we have one HPC job somewhere killing us.</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">What's
interesting is that we can delete other snapshots OK; it appears to be
one particular fileset.</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">My
old go-to "mmfsadm dump tscomm" isn't showing any particular node,
and the waiters just tend to point to the FS manager node.</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">So
... any suggestions? I'm assuming it's some workload holding a lock open
or some such, but tracking it down is proving elusive!</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">Generally
the FS is also "lumpy" ... at times it feels like using a terminal over
a Wi-Fi connection on a train; I guess it's all related though.</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">Thanks</span></p><p style="margin-top:0px;margin-Bottom:0px"></p><p style="margin-top:0px;margin-Bottom:0px"><span style=" font-size:12pt;font-family:Calibri">Simon
</span></p><br><tt><span style=" font-size:10pt">_______________________________________________<br>gpfsug-discuss mailing list<br>gpfsug-discuss at spectrumscale.org<br></span></tt><a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"><tt><span style=" font-size:10pt">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</span></tt></a><tt><span style=" font-size:10pt"><br></span></tt><br><br><BR>
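For reference, the two-window capture suggested above can be sketched as a small helper script. All filesystem, fileset, and snapshot names below are placeholders, and the script only prints the GPFS commands (mmdsh, mmdelsnapshot, mmdiag) rather than running them, since those exist only on a Scale node:

```shell
#!/bin/sh
# Sketch of the two-window capture; every name here is a placeholder.
FS=fs1                                  # placeholder device name
SNAP=snap_to_delete                     # placeholder snapshot name
FILESET_PATH=/gpfs/fs1/suspect-fileset  # placeholder fileset path

# Window 1: cluster-wide list of open files under the suspect fileset.
echo "window 1: mmdsh -N all lsof $FILESET_PATH"

# Window 2: retry the snapshot delete while window 1 is capturing.
echo "window 2: mmdelsnapshot $FS $SNAP"

# If it times out, pull the <c0nNN> names out of mmfs.log.latest on the
# FS manager and translate them to hostnames.
echo "after timeout: mmdiag --network"
```

Run the lsof capture just before kicking off the delete, so the open-file list overlaps the quiesce window that times out.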