<html>
  <head>
    <meta content="text/html; charset=ISO-8859-1"
      http-equiv="Content-Type">
  </head>
  <body bgcolor="#FFFFFF" text="#000000">
    Another interesting case about a specific waiter:<br>
    <br>
    I was looking at the waiters on the GSS servers when I found these
    (I collected this information from all the servers with a script I
    wrote, so I was able to trace hanging connections while they were
    happening):<br>
    <blockquote>
      <blockquote>
        <blockquote>
          <blockquote><tt>gss03b.ebi.ac.uk:</tt><tt><b> 235.373993397</b></tt><tt>
              (MsgRecordCondvar), reason 'RPC wait' for getData on node
              10.7.37.109 <c0n675></tt><br>
            <tt>gss03b.ebi.ac.uk:</tt><tt><b> 235.152271998</b></tt><tt>
              (MsgRecordCondvar), reason 'RPC wait' for getData on node
              10.7.37.109 <c0n675></tt><br>
            <tt>gss02a.ebi.ac.uk:</tt><tt><b> 214.079093620 </b></tt><tt>(MsgRecordCondvar),

              reason 'RPC wait' for tmMsgRevoke on node 10.7.34.109
              <c0n656></tt><br>
            <tt>gss02a.ebi.ac.uk:</tt><tt><b> 213.580199240 </b></tt><tt>(MsgRecordCondvar),

              reason 'RPC wait' for tmMsgRevoke on node 10.7.37.109
              <c0n675></tt><br>
            <tt>gss03b.ebi.ac.uk:</tt><tt><b> 132.375138082</b></tt><tt>
              (MsgRecordCondvar), reason 'RPC wait' for getData on node
              10.7.37.109 <c0n675></tt><br>
            <tt>gss03b.ebi.ac.uk:</tt><tt><b> 132.374973884 </b></tt><tt>(MsgRecordCondvar),

              reason 'RPC wait' for commMsgCheckMessages on node
              10.7.37.109 <c0n675></tt><br>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
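    A minimal Python sketch of how a listing like the one above could be
    reduced to the longest wait per target node. This is not the script
    used to collect the data; the function name
    longest_wait_per_target and the one-waiter-per-line layout are
    assumptions taken from the lines shown here:<br>

```python
import re

# Assumed layout, copied from the listing above (one waiter per line):
#   host: seconds (Condvar), reason 'R' for MSG on node IP <name>
LINE_RE = re.compile(
    r"^(?P<host>\S+):\s+(?P<secs>\d+\.\d+)\s+\((?P<condvar>\w+)\),\s+"
    r"reason '(?P<reason>[^']+)'"
    r"(?: for (?P<msg>.+?) on node (?P<node>\d+\.\d+\.\d+\.\d+)"
    r"(?: <(?P<name>[^>]+)>)?)?\s*$"
)


def longest_wait_per_target(lines):
    """Return {target_ip: (seconds, reporting_host, message)} for the
    longest waiter seen against each target node."""
    worst = {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or m.group("node") is None:
            continue  # not a waiter line, or no target node named
        node = m.group("node")
        secs = float(m.group("secs"))
        if secs > worst.get(node, (0.0,))[0]:
            worst[node] = (secs, m.group("host"), m.group("msg"))
    return worst
```

    Fed the six lines above, this would show at a glance which node most
    of the long RPC waits point at.<br>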
    <br>
    <br>
    The bold numbers in the listing above are wait times in seconds.
    Looking at this page:<br>
    <a class="moz-txt-link-freetext"
href="https://www.ibm.com/developerworks/community/wikis/home?lang=en#%21/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information">https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+%28GPFS%29/page/Interpreting+GPFS+Waiter+Information</a><br>
    <br>
    The web page suggests that this is probably network congestion, but
    I managed to log in to the client quickly enough, and the waiters
    there were:<br>
    <blockquote>
      <blockquote>
        <blockquote>
          <blockquote><tt>[root@ebi5-236 ~]# mmdiag --waiters</tt><br>
            <br>
            <tt>=== mmdiag: waiters ===</tt><br>
            <tt>0x7F6690073460 waiting 147.973009173 seconds,
              RangeRevokeWorkerThread: on ThCond 0x1801E43F6A0
              (0xFFFFC9001E43F6A0) (LkObjCondvar), reason 'waiting for
              LX lock'</tt><br>
            <tt>0x7F65100036D0 waiting 140.458589856 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6500000F98
              (0x7F6500000F98) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F63A0001080 waiting 245.153055801 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65D8002CF8
              (0x7F65D8002CF8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F674C03D3D0 waiting 245.750977203 seconds,
              CleanBufferThread: on ThCond 0x7F64880079E8
              (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason
              'force wait for buffer write to complete'</tt><br>
            <tt>0x7F674802E360 waiting 244.159861966 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65E0002358
              (0x7F65E0002358) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F674C038810 waiting 251.086748430 seconds,
              SGExceptionLogBufferFullThread: on ThCond 0x7F64EC001398
              (0x7F64EC001398) (MsgRecordCondvar), reason 'RPC wait' for
              I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F674C036230 waiting 139.556735095 seconds,
              CleanBufferThread: on ThCond 0x7F65CC004C78
              (0x7F65CC004C78) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F674C031670 waiting 144.327593052 seconds,
              WritebehindWorkerThread: on ThCond 0x7F672402D1A8
              (0x7F672402D1A8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F674C02A4D0 waiting 145.202712821 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65440018F8
              (0x7F65440018F8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F674C0291E0 waiting 247.131569232 seconds,
              PrefetchWorkerThread: on ThCond 0x7F65740016C8
              (0x7F65740016C8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6748025BD0 waiting 11.631381523 seconds,
              replyCleanupThread: on ThCond 0x7F65E000A1F8
              (0x7F65E000A1F8) (MsgRecordCondvar), reason 'RPC wait'</tt><br>
            <tt>0x7F6748022300 waiting 245.616267612 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6470001468
              (0x7F6470001468) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6748021010 waiting 230.769670930 seconds,
              InodeAllocRevokeWorkerThread: on ThCond 0x7F64880079E8
              (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason
              'force wait for buffer write to complete'</tt><br>
            <tt>0x7F674801B160 waiting 245.830554594 seconds,
              UnusedInodePrefetchThread: on ThCond 0x7F65B8004438
              (0x7F65B8004438) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F674800A820 waiting 252.332932000 seconds, Msg
              handler getData: for poll on sock 109</tt><br>
            <tt>0x7F63F4023090 waiting 253.073535042 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65C4000CC8
              (0x7F65C4000CC8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F64A4000CE0 waiting 145.049659249 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6560000A98
              (0x7F6560000A98) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6778006D00 waiting 142.124664264 seconds,
              WritebehindWorkerThread: on ThCond 0x7F63DC000C08
              (0x7F63DC000C08) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F67780046D0 waiting 251.751439453 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6454000A98
              (0x7F6454000A98) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F67780E4B70 waiting 142.431051232 seconds,
              WritebehindWorkerThread: on ThCond 0x7F63C80010D8
              (0x7F63C80010D8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F67780E50D0 waiting 244.339624817 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65BC001B98
              (0x7F65BC001B98) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6434000B40 waiting 145.343700410 seconds,
              WritebehindWorkerThread: on ThCond 0x7F63B00036E8
              (0x7F63B00036E8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F670C0187A0 waiting 244.903963969 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65F0000FB8
              (0x7F65F0000FB8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F671C04E2F0 waiting 245.837137631 seconds,
              PrefetchWorkerThread: on ThCond 0x7F65A4000A98
              (0x7F65A4000A98) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F671C04AA20 waiting 139.713993908 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6454002478
              (0x7F6454002478) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F671C049730 waiting 252.434187472 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65F4003708
              (0x7F65F4003708) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F671C044B70 waiting 131.515829048 seconds, Msg
              handler ccMsgPing: on ThCond 0x7F64DC1D4888
              (0x7F64DC1D4888) (InuseCondvar), reason 'waiting for
              exclusive use of connection for sending msg'</tt><br>
            <tt>0x7F6758008DE0 waiting 149.548547226 seconds, Msg
              handler getData: on ThCond 0x7F645C002458 (0x7F645C002458)
              (InuseCondvar), reason 'waiting for exclusive use of
              connection for sending msg'</tt><br>
            <tt>0x7F67580071D0 waiting 149.548543118 seconds, Msg
              handler commMsgCheckMessages: on ThCond 0x7F6450001C48
              (0x7F6450001C48) (InuseCondvar), reason 'waiting for
              exclusive use of connection for sending msg'</tt><br>
            <tt>0x7F65A40052B0 waiting 11.498507001 seconds, Msg handler
              ccMsgPing: on ThCond 0x7F644C103F88 (0x7F644C103F88)
              (InuseCondvar), reason 'waiting for exclusive use of
              connection for sending msg'</tt><br>
            <tt>0x7F6448001620 waiting 139.844870446 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65F0003098
              (0x7F65F0003098) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F63F4000F80 waiting 245.044791905 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6450001188
              (0x7F6450001188) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F659C0033A0 waiting 243.464399305 seconds,
              PrefetchWorkerThread: on ThCond 0x7F6554002598
              (0x7F6554002598) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6514001690 waiting 245.826160463 seconds,
              PrefetchWorkerThread: on ThCond 0x7F65A4004558
              (0x7F65A4004558) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F64800012B0 waiting 253.174835511 seconds,
              WritebehindWorkerThread: on ThCond 0x7F65E0000FB8
              (0x7F65E0000FB8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6510000EE0 waiting 140.746696039 seconds,
              WritebehindWorkerThread: on ThCond 0x7F647C000CC8
              (0x7F647C000CC8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6754001BB0 waiting 246.336055629 seconds,
              PrefetchWorkerThread: on ThCond 0x7F6594002498
              (0x7F6594002498) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6420000930 waiting 140.606777450 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6578002498
              (0x7F6578002498) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6744009110 waiting 137.466372831 seconds,
              FileBlockReadFetchHandlerThread: on ThCond 0x7F65F4007158
              (0x7F65F4007158) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F67280119F0 waiting 144.173427360 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6504000AE8
              (0x7F6504000AE8) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F672800BB40 waiting 145.804301887 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6550001038
              (0x7F6550001038) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6728000910 waiting 252.601993452 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6450000A98
              (0x7F6450000A98) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F6744007E20 waiting 251.603329204 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6570004C18
              (0x7F6570004C18) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F64AC002EF0 waiting 139.205774422 seconds,
              FileBlockWriteFetchHandlerThread: on ThCond 0x18020AF0260
              (0xFFFFC90020AF0260) (FetchFlowControlCondvar), reason
              'wait for buffer for fetch'</tt><br>
            <tt>0x7F6724013050 waiting 71.501580932 seconds, Msg handler
              ccMsgPing: on ThCond 0x7F6580006608 (0x7F6580006608)
              (InuseCondvar), reason 'waiting for exclusive use of
              connection for sending msg'</tt><br>
            <tt>0x7F661C000DA0 waiting 245.654985276 seconds,
              PrefetchWorkerThread: on ThCond 0x7F6570005288
              (0x7F6570005288) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F671C00F440 waiting 251.096002003 seconds,
              FileBlockReadFetchHandlerThread: on ThCond 0x7F65BC002878
              (0x7F65BC002878) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F671C00E150 waiting 144.034006970 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6528001548
              (0x7F6528001548) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F67A02FCD20 waiting 142.324070945 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6580002A98
              (0x7F6580002A98) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F67A02FA330 waiting 200.670114385 seconds,
              EEWatchDogThread: on ThCond 0x7F65B0000A98
              (0x7F65B0000A98) (MsgRecordCondvar), reason 'RPC wait'</tt><br>
            <tt>0x7F67A02BF050 waiting 252.276161189 seconds,
              WritebehindWorkerThread: on ThCond 0x7F6584003998
              (0x7F6584003998) (MsgRecordCondvar), reason 'RPC wait' for
              NSD I/O completion on node 10.7.28.35 <c1n5></tt><br>
            <tt>0x7F67A0004160 waiting 251.173651822 seconds,
              SyncHandlerThread: on ThCond 0x7F64880079E8
              (0x7F64880079E8) (LogFileBufferDescriptorCondvar), reason
              'force wait on force active buffer write'</tt><br>
          </blockquote>
        </blockquote>
      </blockquote>
    </blockquote>
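    With this many waiters it helps to tally them by reason. A rough
    Python sketch follows, assuming each waiter sits on a single line of
    the real mmdiag --waiters output (the lines above are only wrapped
    by the mail client); summarize_waiters is a hypothetical helper, not
    part of GPFS:<br>

```python
import re
from collections import Counter

# Assumed layout, taken from the output above:
#   0xADDR waiting SECONDS seconds, THREAD: DETAILS
WAITER_RE = re.compile(
    r"^0x[0-9A-Fa-f]+ waiting (?P<secs>\d+\.\d+) seconds, "
    r"(?P<thread>[^:]+): (?P<rest>.*)$"
)
# Optional "reason '...'" part, with an optional "for ..." tail.
REASON_RE = re.compile(r"reason '(?P<reason>[^']+)'(?: for (?P<what>.+))?$")


def summarize_waiters(lines):
    """Count waiters per reason (plus target, when one is given)."""
    counts = Counter()
    for line in lines:
        m = WAITER_RE.match(" ".join(line.split()))  # normalize spacing
        if not m:
            continue  # header or blank line
        r = REASON_RE.search(m.group("rest"))
        if r:
            key = r.group("reason")
            if r.group("what"):
                key += " for " + r.group("what")
        else:
            key = m.group("rest")  # e.g. "for poll on sock 109"
        counts[key] += 1
    return counts
```

    Calling summarize_waiters(output.splitlines()).most_common() on the
    output above would put the NSD I/O completion waits against
    10.7.28.35 at the top of the list.<br>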
    <br>
    So from the client side, it is the client that is waiting for the
    server. I also managed to run ping, ssh, and tcpdump between the two
    nodes before the client got expelled, and found that ping works
    fine and ssh works fine, but apart from my tests there are 0 packets
    passing between them, LITERALLY.
    <br>
    <br>
    So there is no congestion and there are no network issues, but the
    server waits for the client and the client waits for the server.
    This goes on until we reach 350 secs (10 times the lease time),
    then the client gets expelled.<br>
    There are no local I/O waiters that would indicate that GSS is
    struggling, there is plenty of bandwidth and CPU resources, and
    there is no network congestion.<br>
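    <br>
    To make the "each side waits for the other" pattern concrete, here
    is a small Python sketch that flags mutual waits in a wait-for map.
    The map itself (which node has RPC waiters against which) is assumed
    to be built from the waiter output of every node; mutual_waits and
    the node labels are hypothetical:<br>

```python
def mutual_waits(waits_on):
    """waits_on maps a node to the set of nodes it has RPC waiters
    against. Return the sorted list of pairs that wait on each other."""
    pairs = set()
    for node, targets in waits_on.items():
        for target in targets:
            # A mutual wait: target also has waiters against node.
            if node != target and node in waits_on.get(target, set()):
                pairs.add(tuple(sorted((node, target))))
    return sorted(pairs)
```

    A pair reported here is only a hint of a possible deadlock, not
    proof of one, since both sides could still be making slow progress.<br>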
    <br>
    This looks like some sort of deadlock to me, but how can it be
    explained and, hopefully, fixed?<br>
    <br>
    Regards,<br>
    Salvatore<br>
  </body>
</html>