<html>
<head>
<meta content="text/html; charset=ISO-8859-1"
http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
Another interesting case about a specific waiter:<br>
<br>
was looking the waiters on GSS until i found those( i got those info
collecting from all the servers with a script i did, so i was able
to trace hanging connection while they was happening):<br>
<blockquote>
<blockquote>
<blockquote>
<blockquote><tt>gss03b.ebi.ac.uk:</tt><tt><b> 235.373993397</b></tt><tt>
(MsgRecordCondvar), reason 'RPC wait' for getData on node
10.7.37.109 <c0n675></tt><br>
<tt>gss03b.ebi.ac.uk:</tt><tt><b> 235.152271998</b></tt><tt>
(MsgRecordCondvar), reason 'RPC wait' for getData on node
10.7.37.109 <c0n675></tt><br>
<tt>gss02a.ebi.ac.uk:</tt><tt><b> 214.079093620 </b></tt><tt>(MsgRecordCondvar),
reason 'RPC wait' for tmMsgRevoke on node 10.7.34.109
<c0n656></tt><br>
<tt>gss02a.ebi.ac.uk:</tt><tt><b> 213.580199240 </b></tt><tt>(MsgRecordCondvar),
reason 'RPC wait' for tmMsgRevoke on node 10.7.37.109
<c0n675></tt><br>
<tt>gss03b.ebi.ac.uk:</tt><tt><b> 132.375138082</b></tt><tt>
(MsgRecordCondvar), reason 'RPC wait' for getData on node
10.7.37.109 <c0n675></tt><br>
<tt>gss03b.ebi.ac.uk:</tt><tt><b> 132.374973884 </b></tt><tt>(MsgRecordCondvar),
reason 'RPC wait' for commMsgCheckMessages on node
10.7.37.109 <c0n675></tt><br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
<br>
the bolted number are seconds. looking at this page:<br>
<a class="moz-txt-link-freetext"
href="https://www.ibm.com/developerworks/community/wikis/home?lang=en#%21/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information">https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information</a><br>
<br>
The web page claim that's, probably a network congestion, but i
managed to login quick enough to the client and there the waiters
was:<br>
<blockquote>
<blockquote>
<blockquote>
<blockquote><tt>[root@ebi5-236 ~]# mmdiag --waiters</tt><br>
<br>
<tt>=== mmdiag: waiters ===</tt><br>
<tt>0x7F6690073460 waiting 147.973009173 seconds,
RangeRevokeWorkerThread: on ThCond 0x1801E43F6A0
(0xFFFFC9001E43F6A0) (LkObjCondvar), reason 'waiting for
LX lock'</tt><br>
<tt>0x7F65100036D0 waiting 140.458589856 seconds,
WritebehindWorkerThread: on ThCond 0x7F6500000F98
(0x7F6500000F98) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F63A0001080 waiting 245.153055801 seconds,
WritebehindWorkerThread: on ThCond 0x7F65D8002CF8
(0x7F65D8002CF8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F674C03D3D0 waiting 245.750977203 seconds,
CleanBufferThread: on ThCond 0x7F64880079E8
(0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason
'force wait for buffer write to complete'</tt><br>
<tt>0x7F674802E360 waiting 244.159861966 seconds,
WritebehindWorkerThread: on ThCond 0x7F65E0002358
(0x7F65E0002358) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F674C038810 waiting 251.086748430 seconds,
SGExceptionLogBufferFullThread: on ThCond 0x7F64EC001398
(0x7F64EC001398) (MsgRecordCondvar), reason 'RPC wait' for
I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F674C036230 waiting 139.556735095 seconds,
CleanBufferThread: on ThCond 0x7F65CC004C78
(0x7F65CC004C78) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F674C031670 waiting 144.327593052 seconds,
WritebehindWorkerThread: on ThCond 0x7F672402D1A8
(0x7F672402D1A8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F674C02A4D0 waiting 145.202712821 seconds,
WritebehindWorkerThread: on ThCond 0x7F65440018F8
(0x7F65440018F8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F674C0291E0 waiting 247.131569232 seconds,
PrefetchWorkerThread: on ThCond 0x7F65740016C8
(0x7F65740016C8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6748025BD0 waiting 11.631381523 seconds,
replyCleanupThread: on ThCond 0x7F65E000A1F8
(0x7F65E000A1F8) (MsgRecordCondvar), reason 'RPC wait'</tt><br>
<tt>0x7F6748022300 waiting 245.616267612 seconds,
WritebehindWorkerThread: on ThCond 0x7F6470001468
(0x7F6470001468) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6748021010 waiting 230.769670930 seconds,
InodeAllocRevokeWorkerThread: on ThCond 0x7F64880079E8
(0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason
'force wait for buffer write to complete'</tt><br>
<tt>0x7F674801B160 waiting 245.830554594 seconds,
UnusedInodePrefetchThread: on ThCond 0x7F65B8004438
(0x7F65B8004438) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F674800A820 waiting 252.332932000 seconds, Msg
handler getData: for poll on sock 109</tt><br>
<tt>0x7F63F4023090 waiting 253.073535042 seconds,
WritebehindWorkerThread: on ThCond 0x7F65C4000CC8
(0x7F65C4000CC8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F64A4000CE0 waiting 145.049659249 seconds,
WritebehindWorkerThread: on ThCond 0x7F6560000A98
(0x7F6560000A98) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6778006D00 waiting 142.124664264 seconds,
WritebehindWorkerThread: on ThCond 0x7F63DC000C08
(0x7F63DC000C08) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F67780046D0 waiting 251.751439453 seconds,
WritebehindWorkerThread: on ThCond 0x7F6454000A98
(0x7F6454000A98) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F67780E4B70 waiting 142.431051232 seconds,
WritebehindWorkerThread: on ThCond 0x7F63C80010D8
(0x7F63C80010D8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F67780E50D0 waiting 244.339624817 seconds,
WritebehindWorkerThread: on ThCond 0x7F65BC001B98
(0x7F65BC001B98) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6434000B40 waiting 145.343700410 seconds,
WritebehindWorkerThread: on ThCond 0x7F63B00036E8
(0x7F63B00036E8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F670C0187A0 waiting 244.903963969 seconds,
WritebehindWorkerThread: on ThCond 0x7F65F0000FB8
(0x7F65F0000FB8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F671C04E2F0 waiting 245.837137631 seconds,
PrefetchWorkerThread: on ThCond 0x7F65A4000A98
(0x7F65A4000A98) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F671C04AA20 waiting 139.713993908 seconds,
WritebehindWorkerThread: on ThCond 0x7F6454002478
(0x7F6454002478) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F671C049730 waiting 252.434187472 seconds,
WritebehindWorkerThread: on ThCond 0x7F65F4003708
(0x7F65F4003708) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F671C044B70 waiting 131.515829048 seconds, Msg
handler ccMsgPing: on ThCond 0x7F64DC1D4888
(0x7F64DC1D4888) (InuseCondvar), reason 'waiting for
exclusive use of connection for sending msg'</tt><br>
<tt>0x7F6758008DE0 waiting 149.548547226 seconds, Msg
handler getData: on ThCond 0x7F645C002458 (0x7F645C002458)
(InuseCondvar), reason 'waiting for exclusive use of
connection for sending msg'</tt><br>
<tt>0x7F67580071D0 waiting 149.548543118 seconds, Msg
handler commMsgCheckMessages: on ThCond 0x7F6450001C48
(0x7F6450001C48) (InuseCondvar), reason 'waiting for
exclusive use of connection for sending msg'</tt><br>
<tt>0x7F65A40052B0 waiting 11.498507001 seconds, Msg handler
ccMsgPing: on ThCond 0x7F644C103F88 (0x7F644C103F88)
(InuseCondvar), reason 'waiting for exclusive use of
connection for sending msg'</tt><br>
<tt>0x7F6448001620 waiting 139.844870446 seconds,
WritebehindWorkerThread: on ThCond 0x7F65F0003098
(0x7F65F0003098) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F63F4000F80 waiting 245.044791905 seconds,
WritebehindWorkerThread: on ThCond 0x7F6450001188
(0x7F6450001188) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F659C0033A0 waiting 243.464399305 seconds,
PrefetchWorkerThread: on ThCond 0x7F6554002598
(0x7F6554002598) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6514001690 waiting 245.826160463 seconds,
PrefetchWorkerThread: on ThCond 0x7F65A4004558
(0x7F65A4004558) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F64800012B0 waiting 253.174835511 seconds,
WritebehindWorkerThread: on ThCond 0x7F65E0000FB8
(0x7F65E0000FB8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6510000EE0 waiting 140.746696039 seconds,
WritebehindWorkerThread: on ThCond 0x7F647C000CC8
(0x7F647C000CC8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6754001BB0 waiting 246.336055629 seconds,
PrefetchWorkerThread: on ThCond 0x7F6594002498
(0x7F6594002498) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6420000930 waiting 140.606777450 seconds,
WritebehindWorkerThread: on ThCond 0x7F6578002498
(0x7F6578002498) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6744009110 waiting 137.466372831 seconds,
FileBlockReadFetchHandlerThread: on ThCond 0x7F65F4007158
(0x7F65F4007158) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F67280119F0 waiting 144.173427360 seconds,
WritebehindWorkerThread: on ThCond 0x7F6504000AE8
(0x7F6504000AE8) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F672800BB40 waiting 145.804301887 seconds,
WritebehindWorkerThread: on ThCond 0x7F6550001038
(0x7F6550001038) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6728000910 waiting 252.601993452 seconds,
WritebehindWorkerThread: on ThCond 0x7F6450000A98
(0x7F6450000A98) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F6744007E20 waiting 251.603329204 seconds,
WritebehindWorkerThread: on ThCond 0x7F6570004C18
(0x7F6570004C18) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F64AC002EF0 waiting 139.205774422 seconds,
FileBlockWriteFetchHandlerThread: on ThCond 0x18020AF0260
(0xFFFFC90020AF0260) (FetchFlowControlCondvar), reason
'wait for buffer for fetch'</tt><br>
<tt>0x7F6724013050 waiting 71.501580932 seconds, Msg handler
ccMsgPing: on ThCond 0x7F6580006608 (0x7F6580006608)
(InuseCondvar), reason 'waiting for exclusive use of
connection for sending msg'</tt><br>
<tt>0x7F661C000DA0 waiting 245.654985276 seconds,
PrefetchWorkerThread: on ThCond 0x7F6570005288
(0x7F6570005288) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F671C00F440 waiting 251.096002003 seconds,
FileBlockReadFetchHandlerThread: on ThCond 0x7F65BC002878
(0x7F65BC002878) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F671C00E150 waiting 144.034006970 seconds,
WritebehindWorkerThread: on ThCond 0x7F6528001548
(0x7F6528001548) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F67A02FCD20 waiting 142.324070945 seconds,
WritebehindWorkerThread: on ThCond 0x7F6580002A98
(0x7F6580002A98) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F67A02FA330 waiting 200.670114385 seconds,
EEWatchDogThread: on ThCond 0x7F65B0000A98
(0x7F65B0000A98) (MsgRecordCondvar), reason 'RPC wait'</tt><br>
<tt>0x7F67A02BF050 waiting 252.276161189 seconds,
WritebehindWorkerThread: on ThCond 0x7F6584003998
(0x7F6584003998) (MsgRecordCondvar), reason 'RPC wait' for
NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
<tt>0x7F67A0004160 waiting 251.173651822 seconds,
SyncHandlerThread: on ThCond 0x7F64880079E8
(0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason
'force wait on force active buffer write'</tt><br>
</blockquote>
</blockquote>
</blockquote>
</blockquote>
<br>
So from the client side its the client that's waiting the server. I
managed also to ping, ssh, and tcpdump each other before the node
got expelled and discovered that ping works fine, ssh work fine ,
beside my tests there are 0 packet passing between them, LITERALLY.
<br>
<br>
So there is no congestion, no network issues, but the server waits
for the client and the client waits the server. This happens until
we reach 350 secs ( 10 times the lease time) , then client get
expelled.<br>
There are no local io waiters that indicates that gss is struggling,
there is plenty of bandwith and CPU resources and no network
congestion.<br>
<br>
Seems some sort of deadlock to me, but how can this be explained and
hopefully fixed?<br>
<br>
Regards,<br>
Salvatore<br>
</body>
</html>