[gpfsug-discuss] Mmhealth events longwaiters_found and deadlock_detected

Anna Greim Anna.Greim at de.ibm.com
Thu Apr 16 11:55:56 BST 2020


Hi Heiner,

I can't really give you insight into how the states of these events were 
decided. Maybe somebody else is able to answer that here.

As for your question about triggering debug data collection, please have a 
look at this documentation page:
https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.4/com.ibm.spectrum.scale.v5r04.doc/bl1adv_createscriptforevents.htm

This feature has been in the product since the 5.0.x versions and should be 
helpful here. 
It triggers your eventsCallback script when the event is raised. One of the 
script's arguments is the event name, so it is possible to create a script 
that checks for the event name longwaiters_found, then runs 
mmdiag --deadlock and writes the output to a text file. 

The script call has a hard timeout of 60 seconds so that it does not 
interfere too much with the mmsysmon internals, but a run time of less than 
1 second would be better.
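
A minimal sketch of such a callback script, written in Python, could look 
like the one below. It is only an illustration and is not taken from the 
documentation. The mmdiag path, the output directory and the file naming 
are assumptions, and the script simply looks for the event name anywhere in 
its arguments rather than relying on a fixed argument position. To keep the 
run time short, it starts mmdiag in the background and returns without 
waiting for it.

```python
#!/usr/bin/env python3
"""Sketch of an eventsCallback script reacting to longwaiters_found."""
import subprocess
import sys
import time
from pathlib import Path

EVENT_NAME = "longwaiters_found"
# Assumed install path of mmdiag; adjust to your system.
MMDIAG_CMD = ["/usr/lpp/mmfs/bin/mmdiag", "--deadlock"]
# Hypothetical output directory, chosen for this example only.
OUT_DIR = Path("/var/tmp/longwaiters")


def handle_event(args, command=MMDIAG_CMD, out_dir=OUT_DIR):
    """Start `command` in the background if EVENT_NAME is among `args`.

    Returns the path of the output file, or None if the event does not match.
    """
    # Search all arguments for the event name instead of assuming
    # which position it is passed in.
    if EVENT_NAME not in args:
        return None
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d_%H%M%S")
    out_file = out_dir / f"mmdiag_deadlock_{stamp}.txt"
    with open(out_file, "w") as fh:
        # Do not wait for the command to finish: the callback itself
        # should return as quickly as possible.
        subprocess.Popen(
            command,
            stdout=fh,
            stderr=subprocess.STDOUT,
            stdin=subprocess.DEVNULL,
            start_new_session=True,  # detach from the caller's session
        )
    return out_file


if __name__ == "__main__":
    handle_event(sys.argv[1:])
```

Whether a detached child process survives the end of the callback depends 
on how mmsysmon runs the script, so this should be tested on a 
non-production node first.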

Kind regards

Anna Greim

Software Engineer, Spectrum Scale Development
IBM Systems












From:   "Billich  Heinrich Rainer (ID SD)" <heinrich.billich at id.ethz.ch>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   16/04/2020 10:36
Subject:        [EXTERNAL] [gpfsug-discuss] Mmhealth events longwaiters_found and deadlock_detected
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hello,
 
I'm puzzled about the difference between the two mmhealth events

longwaiters_found    ERROR      Detected Spectrum Scale long-waiters

and

deadlock_detected    WARNING    The cluster detected a Spectrum Scale filesystem deadlock

Especially, why does the latter have level WARNING only, while the first has 
level ERROR? longwaiters_found is based on the output of "mmdiag --deadlock" 
and occurs much more often on our clusters, while the latter is probably 
triggered by an external event rather than an internal mmsysmon check? Is 
deadlock detection handled by mmfsd? Whenever a deadlock is detected, some 
debug data is collected, which is not true for longwaiters_found. Hm, so why 
is no deadlock detected whenever mmdiag --deadlock shows waiting threads? 
Shouldn't the severity be the other way round?

Finally: can we trigger some debug data collection whenever a 
longwaiters_found event happens? Just getting the output of 
"mmdiag --deadlock" on the single node could give some hints. Without it I 
don't see any real chance to take any action.
 
Thank you,
 
Heiner
-- 
=======================
Heinrich Billich
ETH Zürich
Informatikdienste
Tel.: +41 44 632 72 56
heinrich.billich at id.ethz.ch
========================
 
 




