[gpfsug-discuss] Hanging file-systems

Oesterlin, Robert Robert.Oesterlin at nuance.com
Tue Nov 27 16:02:44 GMT 2018


I have seen something like this in the past, and I have resorted to a cluster restart as well.  :-( IBM and I could never really track it down, because I could not get a dump at the time of occurrence. However, you might take a look at your NSD servers, one at a time. As I recall, we thought it was a stuck thread on one of the NSD servers, and when we restarted the “right” one it cleared the block.
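If you go down that route, you can cycle the NSD servers one at a time rather than bouncing the whole cluster; a minimal sketch, assuming a suspect server named nsd01 (substitute your own node names):

/usr/lpp/mmfs/bin/mmshutdown -N nsd01
/usr/lpp/mmfs/bin/mmstartup -N nsd01

Check the long waiters on the other nodes after each restart to see whether the block has cleared before moving on to the next server.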

The other thing I’ve done in the past to isolate problems like this (since this is token-related) is to look at the “token revokes” on each node, watching for any that stick around for a long time. I tossed together a quick script and ran it via mmdsh on all the nodes. Not pretty, but it got the job done. Run it a few times and see whether any of the revokes persist for a long time.

#!/bin/sh
# Grab any outstanding token revoke requests from the token manager dump.
rm -f /tmp/revokelist
/usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > /tmp/revokelist 2> /dev/null
# grep exits 0 only if at least one pending revoke was found
if [ $? -eq 0 ]; then
  /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out
  # For each pending revoke, print the latest matching message from the tscomm dump
  for n in $(awk '/msgHdr/ {print $5}' /tmp/revokelist); do
    grep "$n" /tmp/tscomm.out | tail -1
  done
  rm -f /tmp/tscomm.out
fi
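
To run it across the cluster, something like this works, assuming the script has been copied to the same path on every node (/tmp/revokecheck.sh here is just a placeholder name):

mmdsh -N all /tmp/revokecheck.sh

Compare the 'age' values between runs; a revoke whose age keeps climbing is the one worth chasing.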


Bob Oesterlin
Sr Principal Storage Engineer, Nuance



From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson <S.J.Thompson at bham.ac.uk>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Tuesday, November 27, 2018 at 9:27 AM
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Subject: [EXTERNAL] [gpfsug-discuss] Hanging file-systems

I have a file-system which has kept hanging over the past few weeks. Right now it's offline and has taken a bunch of services out with it.

(I have a ticket with IBM open about this as well)

We see for example:
Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), reason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 <c1n9>

and on that node:
Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C29297198 (TokenCondvar), reason 'wait for SubToken to become stable'
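
(For reference, waiter lines like these can be listed with, for example:

/usr/lpp/mmfs/bin/mmdiag --waiters

though mmfsadm dump waiters gives a similar view.)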

On this node, if you dump tscomm, you see entries like:
Pending messages:
  msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1
  this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec
    sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0)
    dest <c0n9>          status pending   , err 0, reply len 0 by TCP connection
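
(That dump comes from mmfsadm dump tscomm; if you only want the stuck requests, filtering the output is enough, e.g. something like:

/usr/lpp/mmfs/bin/mmfsadm dump tscomm | grep -A 4 'Pending messages'

adjusting the -A count to taste.)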

c0n9 is the node itself.

This morning when this happened, the only way to get the FS back online was to shutdown the entire cluster.

Any pointers on where to look next, or how to fix this?

Simon

