[gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

Oesterlin, Robert Robert.Oesterlin at nuance.com
Thu Jan 31 15:46:38 GMT 2019


A better way to detect node expels is to install the expelnode script in /var/mmfs/etc/ (there is a sample in /usr/lpp/mmfs/samples/expelnode.sample) and put it on your manager nodes. It runs on every expel and is easy to customize. We generate a Slack message to a specific channel:

GPFS Node Expel nrg1 APP [1:56 AM] nrg1-gpfs01 Expelling node gnj-r05r05u30, other node cnt-r04r08u40
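
For readers who want a starting point, here is a minimal sketch of a notification helper that a customized expelnode script might call. It is not the script described above. The environment variable name `SLACK_WEBHOOK_URL` and the message wording are assumptions, and the arguments are passed through verbatim because their meaning (and how the script's output and exit status are interpreted) is defined by expelnode.sample, not by this sketch.

```python
import json
import os
import socket
import sys
import urllib.request


def build_payload(reporter, script_args):
    """Return the JSON body for a Slack incoming webhook as bytes."""
    # Arguments are reported verbatim; consult expelnode.sample for what they mean.
    text = "{}: expelnode invoked with: {}".format(reporter, " ".join(script_args))
    return json.dumps({"text": text}).encode("utf-8")


def post_to_slack(url, payload, timeout=10):
    """POST the payload to the webhook and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Assumption: the webhook URL is supplied via this (hypothetical) variable.
    webhook_url = os.environ.get("SLACK_WEBHOOK_URL")
    if webhook_url:
        try:
            post_to_slack(webhook_url, build_payload(socket.gethostname(), sys.argv[1:]))
        except OSError as exc:
            # A failed notification is logged to stderr and otherwise ignored.
            print("Slack notification failed: {}".format(exc), file=sys.stderr)
```

Keeping the network call in a separate helper with a short timeout means a slow or unreachable Slack endpoint delays the caller by at most a few seconds.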


Bob Oesterlin
Sr Principal Storage Engineer, Nuance


From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Thursday, January 31, 2019 at 9:19 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

Hi Bob,

We use the nodeLeave callback to detect node expels. For what you’re wanting to do, I wonder if nodeJoin might work? If a node joins the cluster and then has an uptime of a few minutes you could go looking in /tmp/mmfs. HTH...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633


On Jan 30, 2019, at 3:02 PM, Sanchez, Paul <Paul.Sanchez at deshaw.com> wrote:

There are some cases which I don’t believe can be caught with callbacks (e.g. DMS = Dead Man Switch). But you could possibly use preStartup to check the host uptime and infer whether GPFS was restarted long after the host booted. You could also peek in /tmp/mmfs and only report if you find something there. That said, the docs say that preStartup fires after the node joins the cluster. So if that means once the node is ‘active’, then you might miss out on nodes stuck in ‘arbitrating’ for a while due to a waiter problem.
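
A rough sketch of the uptime-plus-dump-directory check described above, written as a script that a callback could run. The 600-second threshold and the function names are assumptions chosen for illustration; reading /proc/uptime is Linux-specific.

```python
import os

DUMP_DIR = "/tmp/mmfs"        # dump location mentioned in this thread
BOOT_GRACE_SECONDS = 600      # assumed threshold; tune to your boot times


def host_uptime_seconds(proc_uptime="/proc/uptime"):
    """Return seconds since boot (first field of /proc/uptime, Linux only)."""
    with open(proc_uptime) as handle:
        return float(handle.read().split()[0])


def restarted_long_after_boot(uptime_seconds, grace=BOOT_GRACE_SECONDS):
    """True when the host has been up longer than the grace period."""
    return uptime_seconds > grace


def dump_files(dump_dir=DUMP_DIR):
    """Return a sorted list of entries in the dump directory, or [] if it is absent."""
    try:
        return sorted(os.listdir(dump_dir))
    except FileNotFoundError:
        return []


def should_report(uptime_seconds, dumps, grace=BOOT_GRACE_SECONDS):
    """Report only if the start is late AND the dump directory is not empty."""
    return restarted_long_after_boot(uptime_seconds, grace) and bool(dumps)


if __name__ == "__main__":
    try:
        found = dump_files()
        if should_report(host_uptime_seconds(), found):
            print("possible GPFS restart; found in {}: {}".format(DUMP_DIR, ", ".join(found)))
    except OSError as exc:
        print("uptime check skipped: {}".format(exc))
```

Requiring both conditions keeps a normal boot-time start, or a stale but empty dump directory, from raising an alert.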

We run a script with cron which monitors the myriad things which can go wrong, attempts to right those which are safe to fix, and raises alerts appropriately. Something like that, outside the reach of GPFS, is often a good choice if you don’t need to know something the moment it happens.
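
As an illustration of the cron-driven approach (not the script referred to above), the following sketch reports files that have appeared in /tmp/mmfs since the previous run. The state-file path and function names are hypothetical.

```python
import json
import os

DUMP_DIR = "/tmp/mmfs"                          # dump location mentioned in this thread
STATE_FILE = "/var/tmp/gpfs_dump_watch.json"    # hypothetical state-file location


def load_seen(state_file):
    """Return the set of file names recorded by the previous run."""
    try:
        with open(state_file) as handle:
            return set(json.load(handle))
    except (FileNotFoundError, ValueError):
        # Missing or corrupt state is treated as "nothing seen yet".
        return set()


def save_seen(state_file, seen):
    """Record the current set of file names for the next run."""
    with open(state_file, "w") as handle:
        json.dump(sorted(seen), handle)


def new_dumps(dump_dir, state_file):
    """Return names present now but not at the previous run, then update the state."""
    try:
        current = set(os.listdir(dump_dir))
    except FileNotFoundError:
        current = set()
    fresh = sorted(current - load_seen(state_file))
    save_seen(state_file, current)
    return fresh


if __name__ == "__main__":
    for name in new_dumps(DUMP_DIR, STATE_FILE):
        print("new file in {}: {}".format(DUMP_DIR, name))
```

Run from cron, the script prints only when something new shows up, so cron's usual mail-on-output behavior can serve as the alert.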

Thx
Paul

From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Oesterlin, Robert
Sent: Wednesday, January 30, 2019 3:52 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: [gpfsug-discuss] Node ‘crash and restart’ event using GPFS callback?

Has anyone crafted a good way to detect a node ‘crash and restart’ event using GPFS callbacks? I’m thinking “preShutdown” but I’m not sure if that’s the best. What I’m really looking for is: did the node shut down (abort) and create a dump in /tmp/mmfs?


Bob Oesterlin
Sr Principal Storage Engineer, Nuance

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
