[gpfsug-discuss] gpfs waiters debugging

Tue Jun 6 17:45:51 BST 2017

On Tue, 06 Jun 2017 15:06:57 +0200, Stijn De Weirdt said:
> oh sure, i meant waiters that last > 300 seconds or so (something that
> could trigger deadlock). obviously we're not interested in debugging the
> short ones, it's not that gpfs doesn't work or anything ;)

At least at one time, a lot of the mm(whatever) administrative commands
would leave one dangling waiter for the duration of the command - which
could be a while if the command was mmdeldisk or mmrestripefs. I admit
not having specifically checked for gpfs 4.2, but it was true for 3.2 through
4.1....

And my addition to the collective debugging knowledge:  A bash one-liner to
dump all the waiters across a cluster, sorted by wait time.  Note that
our clusters tend to be 5-8 servers; since it visits the nodes one at a
time over ssh, this may be painful for those of you who have 400+ node
clusters. :)

#!/bin/bash
for i in ` mmlsnode | tail -1 | sed 's/^[ ]*[^ ]*[ ]*//'`; do  ssh $i /usr/lpp/mmfs/bin/mmfsadm dump waiters | sed "s/^/$i /"; done | sort -n -r -k 3 -t' '
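
For anyone puzzling over the sed in the node-list part, here is a small sketch of what it does. The sample line is only an assumption about what the last line of mmlsnode output looks like (leading blanks, a nodeset name, then the node names), inferred from the sed expression itself; the names in it are made up.

```shell
# Assumed shape of the last line of `mmlsnode` output. "mycluster" and the
# nsd* node names are invented for illustration.
sample='   mycluster   nsd1 nsd2 nsd3'

# Strip the leading blanks, the first word, and the blanks after it,
# leaving only the node names for the for loop to iterate over.
nodes=$(printf '%s\n' "$sample" | sed 's/^[ ]*[^ ]*[ ]*//')
echo "$nodes"
```

If your mmlsnode output is laid out differently (several nodesets, say), the tail -1 would need adjusting.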

We've found it useful - if you have 1 waiter on one node that's 1278 seconds
old, and 3 other nodes have waiters that are 1275 seconds old, there's a good
chance the other 3 nodes' waiters are waiting on the first node's waiter to
resolve itself....
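
A rough sketch of that comparison, for those who would rather not eyeball it: pick out the oldest waiter and anything within a few seconds of it. It assumes the waiter list has already been boiled down to "node seconds" pairs, since the exact line format of mmfsadm dump waiters differs between releases. The function name oldest_cluster and the 5-second default window are made up for illustration and are not part of GPFS.

```shell
# Hypothetical helper, not a GPFS command. Reads "node seconds" pairs on
# stdin and prints the oldest waiter, followed by every other waiter whose
# age is within the given window (in seconds) of the oldest one.
oldest_cluster() {
    local window="${1:-5}"   # assumed tolerance; tune to taste
    sort -k2,2 -n -r | awk -v w="$window" '
        NR == 1          { oldest = $2; print $1, $2, "(oldest)"; next }
        oldest - $2 <= w { print $1, $2 }
    '
}

# Made-up sample data matching the scenario described above.
printf '%s\n' 'nsd2 1275' 'nsd1 1278' 'nsd3 1275' 'nsd4 12' 'nsd5 1275' |
    oldest_cluster 5
```

With the sample data, nsd1 comes out first, flagged as the oldest, followed by the three nodes at 1275 seconds; nsd4's short waiter is dropped. A tight group like that is only a hint about where to look first, not proof of who is blocking whom.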