[gpfsug-discuss] Hanging file-systems

Tue Nov 27 15:24:25 GMT 2018

I have a file-system which has been hanging repeatedly over the past few weeks. Right now, it's offline and has taken a bunch of services out with it.

(I have a ticket with IBM open about this as well)

We see for example:
Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), reason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 <c1n9>

and on that node:
Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C29297198 (TokenCondvar), reason 'wait for SubToken to become stable'
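
To line up waiters across nodes, something like the following could help. It is only a minimal sketch: the regular expression is inferred from the two waiter lines quoted above (other GPFS releases may format them differently), it assumes each waiter has been saved as a single unwrapped line, and the names parse_waiter and long_waiters are hypothetical helpers, not part of any GPFS tooling.

```python
import re

# Assumed line shape, inferred from the two waiter lines quoted in this post.
WAITER_RE = re.compile(
    r"Waiting (?P<secs>[\d.]+) sec since (?P<since>[\d:]+), monitored, "
    r"thread (?P<tid>\d+) (?P<thread>\S+): on ThCond (?P<addr>0x[0-9A-Fa-f]+) "
    r"\((?P<cond>\w+)\), reason '(?P<reason>[^']*)'"
    # The RPC target is optional: only 'RPC wait' style waiters carry it.
    r"(?: for (?P<msg>\S+) on node (?P<ip>\S+) <(?P<node>[^>]+)>)?"
)


def parse_waiter(line):
    """Return a dict of fields for one waiter line, or None if it does not match."""
    m = WAITER_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["secs"] = float(fields["secs"])
    fields["tid"] = int(fields["tid"])
    return fields


def long_waiters(lines, threshold=60.0):
    """Return parsed waiters at or above `threshold` seconds, longest first."""
    parsed = (parse_waiter(line) for line in lines)
    hits = [w for w in parsed if w is not None and w["secs"] >= threshold]
    return sorted(hits, key=lambda w: w["secs"], reverse=True)
```

Feeding it the waiter text collected from each node and following the "node" field of each 'RPC wait' entry gives the next node to inspect; waiters without an RPC target have "node" set to None.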

On this node, if you dump tscomm, you see entries like:
Pending messages:
  msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1
  this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec
    sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0)
    dest <c0n9>          status pending   , err 0, reply len 0 by TCP connection

c0n9 is itself.
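
For spotting this "pending message to itself" pattern across several dumps, a rough sketch follows. The field layout is assumed from the excerpt above only, the local node name has to be supplied by hand, and parse_pending and pending_to_self are hypothetical helper names.

```python
import re

# Patterns assumed from the tscomm excerpt quoted in this post.
MSG_RE = re.compile(
    r"msg_id (?P<msg_id>\d+), service (?P<service>[\d.]+), "
    r"msg_type \d+ '(?P<name>[^']+)'"
)
AGE_RE = re.compile(r"age (?P<age>\d+) sec")
DEST_RE = re.compile(r"dest <(?P<node>[^>]+)>\s+status (?P<status>\w+)")


def parse_pending(text):
    """Group the lines of a 'Pending messages:' section into one dict per message."""
    messages = []
    current = None
    for line in text.splitlines():
        m = MSG_RE.search(line)
        if m:
            # A msg_id line starts a new message record.
            current = {
                "msg_id": int(m.group("msg_id")),
                "name": m.group("name"),
                "age": None,
                "dests": [],
            }
            messages.append(current)
            continue
        if current is None:
            continue  # skip anything before the first msg_id line
        a = AGE_RE.search(line)
        if a:
            current["age"] = int(a.group("age"))
        d = DEST_RE.search(line)
        if d:
            current["dests"].append((d.group("node"), d.group("status")))
    return messages


def pending_to_self(messages, local_node):
    """Return messages still pending on a destination equal to `local_node`."""
    return [
        msg for msg in messages
        if any(node == local_node and status == "pending"
               for node, status in msg["dests"])
    ]
```

Called as pending_to_self(parse_pending(dump_text), "c0n9"), it returns only the messages whose destination is the node the dump was taken on.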

This morning when this happened, the only way to get the FS back online was to shut down the entire cluster.

Any pointers on where to look next or how to fix this?

Simon