<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    Thanks for the feedback, but we managed to find a scenario that

    excludes network problems.<br>

    <br>

    We have a file called <b><i>input_file</i></b> of nearly 100GB:<br>

    <br>

    If, from <b>client A</b>, we do:<br>

    <br>

    cat input_file >> output_file<br>

    <br>

    it starts copying, and we see the waiters go up a bit, for seconds, but then
    they flush back to 0, so we can say that the copy is proceeding well.<br>

    <br>

    <br>

    If we now do the same from another client (or just another shell on
    the same client), <b>client B</b>:<br>

    <br>

    cat input_file >> output_file<br>

    <br>

    <br>

    (in other words, we are trying to write to the same destination),
    all the waiters go up until one node gets expelled.<br>
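    <br>
    For reference, a minimal sketch of the same test driven from Python, in
    case it helps anyone reproduce it. The function name
    <tt>concurrent_cat_append</tt> and its parameters are hypothetical (not
    part of any GPFS tooling); it simply starts two <tt>cat</tt> processes
    that append to the same destination at the same time and waits for both:<br>
    <pre>
```python
import subprocess


def concurrent_cat_append(src_path, dst_path, writers=2):
    """Run `cat src_path >> dst_path` in `writers` processes at once.

    Hypothetical helper for illustration only.
    Returns the exit code of each cat process, in start order.
    """
    procs = []
    for _ in range(writers):
        # Opening with "ab" hands the child an append-mode descriptor,
        # which is what the shell's `>>` redirection does.
        out = open(dst_path, "ab")
        procs.append((subprocess.Popen(["cat", src_path], stdout=out), out))

    codes = []
    for proc, out in procs:
        codes.append(proc.wait())  # block until this writer has finished
        out.close()
    return codes
```
</pre>
    Calling <tt>concurrent_cat_append("input_file", "output_file")</tt> from
    the GPFS mount corresponds to the "two shells on the same client" case;
    the two-client case still needs the command started on each node.<br>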

    <br>

    <br>

    Now, while it's understandable that the destination file is locked
    by one of the "cat" processes, so the other has to wait (and since the
    file is BIG, it has to wait for a while), it's not understandable why
    it stops the lease renewal.<br>

    Why doesn't it just return a timeout error on the copy instead of
    expelling the node? We can reproduce this every time, and since our
    users do operations like this on files over 100GB each, you can
    imagine the result.<br>

    <br>

    <br>

    <br>

    As you can imagine, even if it's a bit silly to write to the same
    destination at the same time, it's also quite common if we want to dump
    logs to a log file and, for some reason, one of the writers writes for
    a long time, keeping the file locked.<br>

    Our expels are not due to network congestion, but because one write
    attempt has to wait for another. What I really don't understand is
    why such an extreme measure as an expel is taken just because a
    process is waiting "too much time".<br>

    <br>

    <br>

    I have a ticket open with IBM for this and the issue is under
    investigation, but no luck so far.<br>

    <br>

    Regards,<br>

    Salvatore<br>

    <br>

    <br>

    <br>

    <div class="moz-cite-prefix">On 21/08/14 09:20, Jez Tucker (Chair)

      wrote:<br>

    </div>

    <blockquote cite="mid:53F5ABD7.80107@gpfsug.org" type="cite">


      Hi there,<br>

      <br>

        I've seen this on several 'stock'?  'core'? GPFS systems (we need
      a better term now GSS is out) and seen ping 'working', but
      alongside ejections from the cluster.<br>

      The GPFS internode 'ping' is somewhat more circumspect than unix

      ping - and rightly so.<br>

      <br>

      In my experience this has _always_ been a network issue of one

      sort or another.  If the network is experiencing issues, nodes

      will be ejected.<br>

      Of course it could be unresponsive mmfsd or high loadavg, but I've

      seen that only twice in 10 years over many versions of GPFS.<br>

      <br>

      You need to follow the logs through from each machine in time

      order to determine who could not see who and in what order.<br>

      Your best way forward is to log a SEV2 case with IBM support,

      directly or via your OEM and collect and supply a snap and traces

      as required by support.<br>
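      <br>
      A rough sketch of the time-ordering step in Python, assuming every
      node's log lines start with a timestamp in the form shown in the logs
      quoted below ("Tue Aug 19 11:03:04.980 2014: ..."), that the node
      clocks are in sync, and that all nodes log in the same timezone (the
      optional zone name such as "BST" is ignored). The names
      <tt>parse_line</tt> and <tt>merge_logs</tt> are illustrative only, not
      GPFS tools:<br>
      <pre>
```python
import re
from datetime import datetime

# Matches lines such as
#   "Tue Aug 19 11:03:04.980 2014: Timed out waiting ..."
#   "Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount invoked."
LINE_RE = re.compile(
    r"^\w{3} (\w{3}) +(\d{1,2}) (\d\d:\d\d:\d\d)(\.\d+)?(?: [A-Z]{2,5})? (\d{4}): (.*)$"
)


def parse_line(line):
    """Return (timestamp, message), or None if the line has no timestamp."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    month, day, clock, frac, year, message = m.groups()
    stamp = datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")
    if frac:
        # ".980" becomes 980000 microseconds
        stamp = stamp.replace(microsecond=int(frac[1:].ljust(6, "0")[:6]))
    return stamp, message


def merge_logs(logs):
    """Merge {node_name: [log lines]} into one list sorted by time.

    Each result item is (timestamp, node_name, message). The sort is
    stable, so lines with equal timestamps keep their input order.
    """
    events = []
    for node, lines in logs.items():
        for line in lines:
            parsed = parse_line(line)
            if parsed is not None:
                events.append((parsed[0], node, parsed[1]))
    events.sort(key=lambda e: e[0])
    return events
```
</pre>
      Feeding it one list of lines per node and printing the result gives a
      single timeline, which makes it easier to see who could not see who
      first. Lines without a leading timestamp (such as "...") are skipped.<br>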

      <br>

      Without knowing your full setup, it's hard to help further.<br>

      <br>

      Jez<br>

      <br>

      <div class="moz-cite-prefix">On 20/08/14 08:57, Salvatore Di Nardo

        wrote:<br>

      </div>

      <blockquote cite="mid:53F454E3.40803@ebi.ac.uk" type="cite">


        Still having problems. Here are some more detailed examples:<br>

        <br>

        <b>EXAMPLE 1:</b><br>

        <blockquote>

          <blockquote>

            <blockquote><b><tt>EBI5-220</tt></b><b><tt> ( CLIENT)</tt></b><b><br>

              </b><tt>Tue Aug 19 11:03:04.980 2014: <b>Timed out

                  waiting for a reply from node <GSS02B IP> gss02b</b></tt><br>

              <tt>Tue Aug 19 11:03:04.981 2014: Request sent to

                <GSS02A IP> (gss02a in GSS.ebi.ac.uk) to expel

                <GSS02B IP> (gss02b in GSS.ebi.ac.uk) from cluster

                GSS.ebi.ac.uk</tt><br>

              <tt>Tue Aug 19 11:03:04.982 2014: This node will be

                expelled from cluster GSS.ebi.ac.uk due to expel msg

                from <EBI5-220 IP> (ebi5-220)</tt><br>

              <tt>Tue Aug 19 11:03:09.319 2014: Cluster Manager

                connection broke. Probing cluster GSS.ebi.ac.uk</tt><br>

              <tt>Tue Aug 19 11:03:10.321 2014: Unable to contact any

                quorum nodes during cluster probe.</tt><br>

              <tt>Tue Aug 19 11:03:10.322 2014: Lost membership in

                cluster GSS.ebi.ac.uk. Unmounting file systems.</tt><br>

              <tt>Tue Aug 19 11:03:10 BST 2014: mmcommon preunmount

                invoked.  File system: gpfs1  Reason: SGPanic</tt><br>

              <tt>Tue Aug 19 11:03:12.066 2014: Connecting to <GSS02A

                IP> gss02a <c1p687></tt><br>

              <tt>Tue Aug 19 11:03:12.070 2014: Connected to <GSS02A

                IP> gss02a <c1p687></tt><br>

              <tt>Tue Aug 19 11:03:17.071 2014: Connecting to <GSS02B

                IP> gss02b <c1p686></tt><br>

              <tt>Tue Aug 19 11:03:17.072 2014: Connecting to <GSS03B

                IP> gss03b <c1p685></tt><br>

              <tt>Tue Aug 19 11:03:17.079 2014: Connecting to <GSS03A

                IP> gss03a <c1p684></tt><br>

              <tt>Tue Aug 19 11:03:17.080 2014: Connecting to <GSS01B

                IP> gss01b <c1p683></tt><br>

              <tt>Tue Aug 19 11:03:17.079 2014: Connecting to <GSS01A

                IP> gss01a <c1p1></tt><br>

              <tt>Tue Aug 19 11:04:23.105 2014: Connected to <GSS02B

                IP> gss02b <c1p686></tt><br>

              <tt>Tue Aug 19 11:04:23.107 2014: Connected to <GSS03B

                IP> gss03b <c1p685></tt><br>

              <tt>Tue Aug 19 11:04:23.112 2014: Connected to <GSS03A

                IP> gss03a <c1p684></tt><br>

              <tt>Tue Aug 19 11:04:23.115 2014: Connected to <GSS01B

                IP> gss01b <c1p683></tt><br>

              <tt>Tue Aug 19 11:04:23.121 2014: Connected to <GSS01A

                IP> gss01a <c1p1></tt><br>

              <tt>Tue Aug 19 11:12:28.992 2014: Node <GSS02A IP>

                (gss02a in GSS.ebi.ac.uk) is now the Group Leader.</tt><br>

              <br>

              <b><tt>GSS02B ( NSD SERVER)</tt></b><br>

              <tt>...<br>

                Tue Aug 19 11:03:17.070 2014: Killing connection from <b><EBI5-220


                  IP></b> because the group is not ready for it to

                rejoin, err 46<br>

                Tue Aug 19 11:03:25.016 2014: Killing connection from

                <EBI5-102 IP> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:03:28.080 2014: Killing connection from </tt><tt><tt><b><EBI5-220


                    IP></b></tt> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:03:36.019 2014: Killing connection from

                <EBI5-102 IP> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:03:39.083 2014: Killing connection from </tt><tt><tt><b><EBI5-220


                    IP></b></tt> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:03:47.023 2014: Killing connection from

                <EBI5-102 IP> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:03:50.088 2014: Killing connection from </tt><tt><tt><b><EBI5-220


                    IP></b></tt> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:03:52.218 2014: Killing connection from </tt><tt><tt><EBI5-043


                  IP></tt> because the group is not ready for it to

                rejoin, err 46<br>

                Tue Aug 19 11:03:58.030 2014: Killing connection from

                <EBI5-102 IP> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:04:01.092 2014: Killing connection from </tt><tt><tt><b><EBI5-220


                    IP></b></tt> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:04:03.220 2014: Killing connection from </tt><tt><tt><tt><EBI5-043


                    IP></tt></tt> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:04:09.034 2014: Killing connection from

                <EBI5-102 IP> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:04:12.096 2014: Killing connection from </tt><tt><tt><b><EBI5-220


                    IP></b></tt> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:04:14.224 2014: Killing connection from </tt><tt><tt><tt><EBI5-043


                    IP></tt></tt> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:04:20.037 2014: Killing connection from

                <EBI5-102 IP> because the group is not ready for

                it to rejoin, err 46<br>

                Tue Aug 19 11:04:23.103 2014: Accepted and connected to

              </tt><tt><tt><b><EBI5-220 IP></b></tt> ebi5-220

                <c0n618><br>

                ...</tt><br>

              <br>

              <b><tt>GSS02a ( NSD SERVER)</tt></b><br>

              <tt>Tue Aug 19 11:03:04.980 2014: Expel <GSS02B IP>

                (gss02b) request from <EBI5-220 IP> (ebi5-220 in

                ebi-cluster.ebi.ac.uk). Expelling: <EBI5-220 IP>

                (ebi5-220 in ebi-cluster.ebi.ac.uk)</tt><br>

              <tt>Tue Aug 19 11:03:12.069 2014: Accepted and connected

                to <EBI5-220 IP> ebi5-220 <c0n618></tt><br>

              <br>

              <br>

            </blockquote>

          </blockquote>

        </blockquote>

        ===============================================<br>

        <b>EXAMPLE 2</b>:<br>

        <br>

        <blockquote>

          <blockquote>

            <blockquote><b><tt>EBI5-038</tt></b><br>

              <tt>Tue Aug 19 11:32:34.227 2014: <b>Disk lease period

                  expired in cluster GSS.ebi.ac.uk. Attempting to

                  reacquire lease.</b></tt><br>

              <tt>Tue Aug 19 11:33:34.258 2014: <b>Lease is overdue.

                  Probing cluster GSS.ebi.ac.uk</b></tt><br>

              <tt>Tue Aug 19 11:35:24.265 2014: Close connection to

                <GSS02A IP> gss02a <c1n2> (Connection reset

                by peer). Attempting reconnect.</tt><br>

              <tt>Tue Aug 19 11:35:24.865 2014: Close connection to

                <EBI5-014 IP> ebi5-014 <c1n457> (Connection

                reset by peer). Attempting reconnect.</tt><br>

              <tt>...</tt><br>

              <tt>LOT MORE RESETS BY PEER</tt><br>

              <tt>...</tt><br>

              <tt>Tue Aug 19 11:35:25.096 2014: Close connection to

                <EBI5-167 IP> ebi5-167 <c1n155> (Connection

                reset by peer). Attempting reconnect.</tt><br>

              <tt>Tue Aug 19 11:35:25.267 2014: Connecting to <GSS02A

                IP> gss02a <c1n2></tt><br>

              <tt>Tue Aug 19 11:35:25.268 2014: Close connection to

                <GSS02A IP> gss02a <c1n2> (Connection failed

                because destination is still processing previous node

                failure)</tt><br>

              <tt>Tue Aug 19 11:35:26.267 2014: Retry connection to

                <GSS02A IP> gss02a <c1n2></tt><br>

              <tt>Tue Aug 19 11:35:26.268 2014: Close connection to

                <GSS02A IP> gss02a <c1n2> (Connection failed

                because destination is still processing previous node

                failure)</tt><br>

              <tt>Tue Aug 19 11:36:24.276 2014: Unable to contact any

                quorum nodes during cluster probe.</tt><br>

              <tt>Tue Aug 19 11:36:24.277 2014: <b>Lost membership in

                  cluster GSS.ebi.ac.uk. Unmounting file systems.</b></tt><br>

              <br>

              <b><tt>GSS02a</tt></b><br>

              <tt>Tue Aug 19 11:35:24.263 2014: Node <EBI5-038 IP>

                (ebi5-038 in ebi-cluster.ebi.ac.uk) <b>is being

                  expelled because of an expired lease.</b> Pings sent:

                60. Replies received: 60.</tt><br>

            </blockquote>

          </blockquote>

        </blockquote>

        <br>

        <br>

        <br>

        In example 1 it seems that an NSD server was not replying to the
        client, but the servers seem to be working fine. How can I better
        trace (and solve) the problem? <br>

        <br>

        In example 2 it seems to me that, for some reason, the managers are
        not renewing the lease in time. When this happens, it's not a
        single client. <br>
        Loads of them fail to get the lease renewed. Why is this
        happening? How can I trace it to the source of the problem?<br>

        <br>

        <br>

        <br>

        Thanks in advance for any tips.<br>

        <br>

        Regards,<br>

        Salvatore<br>

        <br>

        <br>

        <br>

        <br>

        <br>

        <br>

        <br>

        <fieldset class="mimeAttachmentHeader"></fieldset>

        <br>

        <pre wrap="">_______________________________________________

gpfsug-discuss mailing list

gpfsug-discuss at gpfsug.org

<a moz-do-not-send="true" class="moz-txt-link-freetext" href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a>

</pre>

      </blockquote>

      <br>

      <br>


    </blockquote>

    <br>

  </body>

</html>