[gpfsug-discuss] Multiple nodes hung with 'waiting for the flush flag to commit metadata'

Oesterlin, Robert Robert.Oesterlin at nuance.com
Mon Dec 14 00:50:00 GMT 2015


Any idea what this hang condition is all about? I have several nodes all in a sort of deadlock, with the following long waiters. I know I’m probably looking at a PMR, but are there any other clues as to what might be at work? GPFS 4.1.0.7 on Linux, RHEL 6.6.

They all seem to trace back to nodes where 'waiting for the flush flag to commit metadata' and 'waiting for WW lock' are the wait reasons, with tmMsgRevoke and tmMsgTellAcquire1 as the outstanding RPCs.

0x7F418C0C07D0 (  18869) waiting 203445.829057195 seconds, InodePrefetchWorkerThread: on ThCond 0x7F41FC02A338 (0x7F41FC02A338) (MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.30.105.68 <c1n178>
  0x7F418C0C66D0 (  18876) waiting 196174.410095017 seconds, InodePrefetchWorkerThread: on ThCond 0x7F40AC8AB798 (0x7F40AC8AB798) (MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.30.86.102 <c1n373>
  0x7F9C5C0041F0 (  17394) waiting 218020.428801654 seconds, SyncHandlerThread: on ThCond 0x1801970D678 (0xFFFFC9001970D678) (InodeFlushCondVar), reason 'waiting for the flush flag to commit metadata'
  0x7FEAC0037F10 (  25547) waiting 158003.275282910 seconds, InodePrefetchWorkerThread: on ThCond 0x7FEBA400E398 (0x7FEBA400E398) (MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.30.86.159 <c2n312>
  0x7F04B0028E80 (  11757) waiting 159426.694691653 seconds, InodePrefetchWorkerThread: on ThCond 0x7F0400002A28 (0x7F0400002A28) (MsgRecordCondvar), reason 'RPC wait' for tmMsgTellAcquire1 on node 10.30.43.226 <c1n5>
  0x7F04D0013AA0 (  21781) waiting 157723.199692503 seconds, InodePrefetchWorkerThread: on ThCond 0x7F0454010358 (0x7F0454010358) (MsgRecordCondvar), reason 'RPC wait' for tmMsgTellAcquire1 on node 10.30.43.227 <c1n7>
  0x7F6F480041F0 (  12964) waiting 209491.171775225 seconds, SyncHandlerThread: on ThCond 0x18022F3C490 (0xFFFFC90022F3C490) (InodeFlushCondVar), reason 'waiting for the flush flag to commit metadata'
  0x7F03180041F0 (  12338) waiting 212486.480961641 seconds, SyncHandlerThread: on ThCond 0x18027186220 (0xFFFFC90027186220) (LkObjCondvar), reason 'waiting for WW lock'
  0x7F1EB00041F0 (  12598) waiting 215765.483202551 seconds, SyncHandlerThread: on ThCond 0x18026FDFDD0 (0xFFFFC90026FDFDD0) (InodeFlushCondVar), reason 'waiting for the flush flag to commit metadata'
  0x7F83540041F0 (  12605) waiting 75189.385741859 seconds, SyncHandlerThread: on ThCond 0x18021DAA7F8 (0xFFFFC90021DAA7F8) (InodeFlushCondVar), reason 'waiting for the flush flag to commit metadata'
  0x7FF10C20DA10 (  34836) waiting 202382.680544395 seconds, InodePrefetchWorkerThread: on ThCond 0x7FF1640026C8 (0x7FF1640026C8) (MsgRecordCondvar), reason 'RPC wait' for tmMsgRevoke on node 10.30.86.77 <c1n337>
  0x7F839806DBF0 (  49131) waiting 158295.556723453 seconds, InodePrefetchWorkerThread: on ThCond 0x7F82B0000FF8 (0x7F82B0000FF8) (MsgRecordCondvar), reason 'RPC wait' for tmMsgTellAcquire1 on node 10.30.43.226 <c2n5>
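For what it's worth, one quick way to triage a dump like this is to tally the waiters by reason and by RPC target node, which makes it obvious which nodes the revokes are stuck on. Below is a hypothetical Python sketch (not a GPFS tool, just a parser for lines in the shape shown above):

```python
import re
from collections import Counter

# Matches the waiter lines shown above, e.g.:
#   ... waiting 203445.829057195 seconds, InodePrefetchWorkerThread: ...
#   reason 'RPC wait' for tmMsgRevoke on node 10.30.105.68 <c1n178>
WAITER_RE = re.compile(
    r"waiting (?P<secs>[\d.]+) seconds, (?P<thread>\w+): "
    r".*reason '(?P<reason>[^']+)'"
    r"(?: for (?P<rpc>\w+) on node (?P<node>\S+))?"
)

def summarize(waiter_lines):
    """Count waiters by wait reason and by RPC target node."""
    by_reason, by_node = Counter(), Counter()
    for line in waiter_lines:
        m = WAITER_RE.search(line)
        if not m:
            continue
        by_reason[m.group("reason")] += 1
        if m.group("node"):  # only RPC waits name a target node
            by_node[m.group("node")] += 1
    return by_reason, by_node
```

Feeding it the output of `mmdiag --waiters` (or the dump above) should show the 'RPC wait' entries clustering on a handful of nodes, which are the ones worth chasing first.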

Bob Oesterlin
Sr Storage Engineer, Nuance Communications
507-269-0413

