<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html;
      charset=windows-1252">
  </head>
  <body text="#000000" bgcolor="#FFFFFF">
    <p>Hi Simon,</p>
    <p>We've had to disable the offloads for Intel cards in many
      situations with the i40e driver - Red Hat have an article about
      it: <a class="moz-txt-link-freetext" href="https://access.redhat.com/solutions/3662011">https://access.redhat.com/solutions/3662011</a></p>
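    <p>For anyone wanting to try this, here is a minimal sketch of
      building the ethtool call from Python. The feature list (tso, gso,
      gro), the interface name eth0 and the helper names are assumptions
      for illustration only; check the article for which offloads
      actually matter on your driver and kernel.</p>

```python
import shlex
import subprocess

# Assumption: this feature list is illustrative only; consult the Red Hat
# article for the offloads that need disabling on your driver/kernel.
DEFAULT_FEATURES = ("tso", "gso", "gro")


def build_offload_command(interface, features=DEFAULT_FEATURES):
    """Return the ethtool argv that switches the given offloads off."""
    if not interface:
        raise ValueError("interface name must not be empty")
    command = ["ethtool", "-K", interface]
    for feature in features:
        command += [feature, "off"]
    return command


def disable_offloads(interface, features=DEFAULT_FEATURES, dry_run=True):
    """Print the command; only execute it (needs root) when dry_run is False."""
    command = build_offload_command(interface, features)
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)
    return command


if __name__ == "__main__":
    # "eth0" is a placeholder interface name.
    disable_offloads("eth0")
```

    <p>The default is a dry run, so nothing changes on the node until
      you pass dry_run=False. Settings made this way do not persist
      across reboots.</p>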
    <p>-------<br>
      Orlando<br>
    </p>
    <p><br>
    </p>
    <div class="moz-cite-prefix">On 17/01/2019 19:02, Simon Thompson
      wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:CF45EE16DEF2FE4B9AA7FF2B6EE2654501297B40A5@EX13.adf.bham.ac.uk">
      <meta http-equiv="Content-Type" content="text/html;
        charset=windows-1252">
      <style type="text/css" id="owaParaStyle"></style>
      <div style="direction: ltr;font-family: Tahoma;color:
        #000000;font-size: 10pt;">So we've backed out a bunch of network
        tuning parameters we had set (based on the GPFS wiki pages).
        They've been set for a while, but maybe they are causing
        issues.
        <div><br>
        </div>
        <div>Secondly, we've noticed in dump tscomm that we see a
          broken connection to a node, and the node ID is usually
          the same node, which seems a bit weird to me.</div>
        <div><br>
        </div>
        <div>We've also just updated firmware on the Intel NICs (the
          X722), which are part of the Skylake board. And specifically it's
          the newer Skylake kit we see this problem on. We've had a number
          of issues with the X722 firmware (like it won't even bring a
          link up when plugged into some of our 10GbE switches, but
          that's another story).</div>
        <div><br>
        </div>
        <div>We've also dropped the bonded links from these nodes, just
          in case it's related...</div>
        <div><br>
        </div>
        <div>Simon</div>
        <div><br>
        </div>
        <div>
          <div style="font-family: Times New Roman; color: #000000;
            font-size: 16px">
            <hr tabindex="-1">
            <div id="divRpF662645" style="direction: ltr;"><font
                size="2" face="Tahoma" color="#000000"><b>From:</b>
                <a class="moz-txt-link-abbreviated" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a>
                [<a class="moz-txt-link-abbreviated" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a>] on behalf of
                <a class="moz-txt-link-abbreviated" href="mailto:jlewars@us.ibm.com">jlewars@us.ibm.com</a> [<a class="moz-txt-link-abbreviated" href="mailto:jlewars@us.ibm.com">jlewars@us.ibm.com</a>]<br>
                <b>Sent:</b> 17 January 2019 14:30<br>
                <b>To:</b> Tomer Perry; gpfsug main discussion list<br>
                <b>Cc:</b> Yong Ze Chen<br>
                <b>Subject:</b> Re: [gpfsug-discuss] Node expels<br>
              </font><br>
            </div>
            <div><font size="2" face="sans-serif">></font><font
                size="2" face="Calibri">They always appear to be to a
                specific type of hardware with the same Ethernet
                controller,
              </font><br>
              <br>
              <font size="2" face="sans-serif">That makes me think you
                might be seeing packet loss that could require ring
                buffer tuning (the defaults and limits will differ with
                different ethernet adapters).  </font><br>
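              <br>
              <font size="2" face="sans-serif">A minimal sketch of
                checking ring buffer headroom, assuming the usual
                'ethtool -g' output layout. The sample text is
                illustrative only (it was not captured from the nodes in
                this thread) and the function names are
                hypothetical.</font><br>

```python
import re

# Illustrative output only (not captured from the nodes in this thread).
SAMPLE_OUTPUT = """\
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             512
RX Mini:        0
RX Jumbo:       0
TX:             512
"""


def parse_ring_parameters(text):
    """Split `ethtool -g` output into 'maximum' and 'current' dicts."""
    sections = {"maximum": {}, "current": {}}
    active = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Pre-set maximums"):
            active = "maximum"
        elif line.startswith("Current hardware settings"):
            active = "current"
        elif active:
            # Only the plain RX and TX rings are considered here.
            match = re.match(r"^(RX|TX):\s+(\d+)$", line)
            if match:
                sections[active][match.group(1)] = int(match.group(2))
    return sections


def ring_headroom(sections):
    """Return {ring: (current, maximum)} for rings below their maximum."""
    return {
        ring: (current, sections["maximum"][ring])
        for ring, current in sections["current"].items()
        if ring in sections["maximum"] and sections["maximum"][ring] > current
    }


if __name__ == "__main__":
    print(ring_headroom(parse_ring_parameters(SAMPLE_OUTPUT)))
```

              <font size="2" face="sans-serif">Any ring listed has room
                to grow with 'ethtool -G'; whether a larger ring actually
                helps depends on the adapter and the workload.</font><br>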
              <br>
              <font size="2" face="sans-serif">The expel section in the
                slides on this page has been expanded to include a
                'debugging expels section' (slides 19-20, which also
                reference ring buffer tuning):<br>
              </font><a
href="https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels/comment/7e4f9433-7ca3-430f-b40b-94777c507381"
                target="_blank" rel="noopener noreferrer"
                moz-do-not-send="true"><font size="2" face="sans-serif"
                  color="blue">https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/DEBUG%20Expels/comment/7e4f9433-7ca3-430f-b40b-94777c507381</font></a><font
                size="2" face="sans-serif"><br>
              </font><br>
              <font size="2" face="sans-serif">Regards,<br>
                John Lewars <br>
                Spectrum Scale Performance, IBM Poughkeepsie<br>
              </font><br>
              <br>
              <br>
              <br>
              <font size="1" face="sans-serif" color="#5f5f5f">From:    
                   </font><font size="1" face="sans-serif">Tomer
                Perry/Israel/IBM</font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">To:      
                 </font><font size="1" face="sans-serif">gpfsug main
                discussion list <a class="moz-txt-link-rfc2396E" href="mailto:gpfsug-discuss@spectrumscale.org"><gpfsug-discuss@spectrumscale.org></a></font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">Cc:      
                 </font><font size="1" face="sans-serif">John
                Lewars/Poughkeepsie/IBM@IBMUS, Yong Ze
                Chen/China/IBM@IBMCN</font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">Date:    
                   </font><font size="1" face="sans-serif">01/17/2019
                08:28 AM</font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">Subject:
                       </font><font size="1" face="sans-serif">Re:
                [gpfsug-discuss] Node expels</font><br>
              <hr noshade="noshade">
              <br>
              <br>
              <font size="2" face="sans-serif">Hi,</font><br>
              <br>
              <font size="2" face="sans-serif">I was asked to elaborate
                a bit (thus also adding John and Yong Ze Chen).</font><br>
              <br>
              <font size="2" face="sans-serif">As written on the slide:</font><br>
              <font size="3">One of the best ways to determine if a
                network layer problem is root cause for an expel is to
                look at the low-level socket details dumped in the
                ‘extra’ log data (mmfs dump all) saved as part of
                automatic data collection on Linux GPFS nodes.
              </font><br>
              <br>
              <font size="2" face="sans-serif">So, the idea is that in
                an expel situation, we dump the socket state from the OS
                (you can see the same using 'ss -i', for example).</font><br>
              <font size="2" face="sans-serif">In your example, it shows
                that the ca_state is 4, there are retransmits and a high
                rto, all of which point to a network problem.</font><br>
              <font size="2" face="sans-serif">You can find more details
                here: </font><a
                href="http://www.yonch.com/tech/linux-tcp-congestion-control-internals"
                target="_blank" rel="noopener noreferrer"
                moz-do-not-send="true"><font size="2" face="sans-serif"
                  color="blue">http://www.yonch.com/tech/linux-tcp-congestion-control-internals</font></a><br>
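              <br>
              <font size="2" face="sans-serif">As a rough sketch, the
                fields in that warning can be pulled apart and checked
                like this. The field values are copied from the message
                quoted in this thread; the function names and the
                one-second rto threshold are arbitrary choices for
                illustration, and the code assumes rto is reported in
                microseconds, as in Linux's struct tcp_info.</font><br>

```python
import re

# Key=value portion of the warning quoted in this thread.
LOG_FIELDS = (
    "state=1 ca_state=4 snd_cwnd=1 snd_ssthresh=5 unacked=5 probes=0 "
    "backoff=7 retransmits=7 rto=26496000 rcv_ssthresh=102828 "
    "rtt=6729 rttvar=12066 sacked=0 retrans=1 reordering=3 lost=5"
)

# Linux congestion-avoidance states (enum tcp_ca_state in the kernel).
CA_STATES = {0: "Open", 1: "Disorder", 2: "CWR", 3: "Recovery", 4: "Loss"}


def parse_tcp_fields(text):
    """Turn 'key=123 other=4' into {'key': 123, 'other': 4}."""
    return {key: int(value) for key, value in re.findall(r"(\w+)=(\d+)", text)}


def network_symptoms(fields, rto_limit_us=1_000_000):
    """List the fields that hint at packet loss.

    rto_limit_us is an arbitrary illustrative threshold, and rto is
    assumed to be in microseconds.
    """
    symptoms = []
    ca_state = fields.get("ca_state", 0)
    if ca_state != 0:
        name = CA_STATES.get(ca_state, "unknown")
        symptoms.append("ca_state=%d (%s)" % (ca_state, name))
    if fields.get("retransmits", 0) > 0:
        symptoms.append("retransmits=%d" % fields["retransmits"])
    if fields.get("rto", 0) > rto_limit_us:
        symptoms.append("rto=%.1fs" % (fields["rto"] / 1_000_000))
    if fields.get("lost", 0) > 0:
        symptoms.append("lost=%d" % fields["lost"])
    return symptoms


if __name__ == "__main__":
    for symptom in network_symptoms(parse_tcp_fields(LOG_FIELDS)):
        print(symptom)
```

              <font size="2" face="sans-serif">On a healthy connection
                you would normally expect ca_state 0 (Open), no pending
                retransmits and an rto well under a second.</font><br>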
              <br>
              <font size="2" face="sans-serif"><br>
                Regards,<br>
                <br>
                Tomer Perry<br>
                Scalable I/O Development (Spectrum Scale)<br>
                email: <a class="moz-txt-link-abbreviated" href="mailto:tomp@il.ibm.com">tomp@il.ibm.com</a><br>
                1 Azrieli Center, Tel Aviv 67021, Israel<br>
                Global Tel:    +1 720 3422758<br>
                Israel Tel:      +972 3 9188625<br>
                Mobile:         +972 52 2554625<br>
              </font><br>
              <br>
              <br>
              <br>
              <br>
              <font size="1" face="sans-serif" color="#5f5f5f">From:    
                   </font><font size="1" face="sans-serif">"Tomer Perry"
                <a class="moz-txt-link-rfc2396E" href="mailto:TOMP@il.ibm.com"><TOMP@il.ibm.com></a></font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">To:      
                 </font><font size="1" face="sans-serif">gpfsug main
                discussion list <a class="moz-txt-link-rfc2396E" href="mailto:gpfsug-discuss@spectrumscale.org"><gpfsug-discuss@spectrumscale.org></a></font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">Date:    
                   </font><font size="1" face="sans-serif">17/01/2019
                13:46</font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">Subject:
                       </font><font size="1" face="sans-serif">Re:
                [gpfsug-discuss] Node expels</font><br>
              <font size="1" face="sans-serif" color="#5f5f5f">Sent by:
                       </font><font size="1" face="sans-serif"><a class="moz-txt-link-abbreviated" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a></font><br>
              <hr noshade="noshade">
              <br>
              <br>
              <br>
              <font size="2" face="sans-serif">Simon,</font><font
                size="3"><br>
              </font><font size="2" face="sans-serif"><br>
                Take a look at </font><a
href="http://files.gpfsug.org/presentations/2018/USA/Scale_Network_Flow-0.8.pdf"
                target="_blank" rel="noopener noreferrer"
                moz-do-not-send="true"><font size="2" face="sans-serif"
                  color="blue"><u>http://files.gpfsug.org/presentations/2018/USA/Scale_Network_Flow-0.8.pdf</u></font></a><font
                size="2" face="sans-serif"> slide 13.</font><font
                size="3"><br>
              </font><font size="2" face="sans-serif"><br>
                <br>
                Regards,<br>
                <br>
                Tomer Perry<br>
                Scalable I/O Development (Spectrum Scale)<br>
                email: <a class="moz-txt-link-abbreviated" href="mailto:tomp@il.ibm.com">tomp@il.ibm.com</a><br>
                1 Azrieli Center, Tel Aviv 67021, Israel<br>
                Global Tel:    +1 720 3422758<br>
                Israel Tel:      +972 3 9188625<br>
                Mobile:         +972 52 2554625</font><font size="3"><br>
                <br>
                <br>
                <br>
              </font><font size="1" face="sans-serif" color="#5f5f5f"><br>
                From:        </font><font size="1" face="sans-serif">Simon
                Thompson <a class="moz-txt-link-rfc2396E" href="mailto:S.J.Thompson@bham.ac.uk"><S.J.Thompson@bham.ac.uk></a></font><font
                size="1" face="sans-serif" color="#5f5f5f"><br>
                To:        </font><font size="1" face="sans-serif"><a class="moz-txt-link-rfc2396E" href="mailto:gpfsug-discuss@spectrumscale.org">"gpfsug-discuss@spectrumscale.org"</a>
                <a class="moz-txt-link-rfc2396E" href="mailto:gpfsug-discuss@spectrumscale.org"><gpfsug-discuss@spectrumscale.org></a></font><font
                size="1" face="sans-serif" color="#5f5f5f"><br>
                Date:        </font><font size="1" face="sans-serif">17/01/2019
                13:35</font><font size="1" face="sans-serif"
                color="#5f5f5f"><br>
                Subject:        </font><font size="1" face="sans-serif">[gpfsug-discuss]
                Node expels</font><font size="1" face="sans-serif"
                color="#5f5f5f"><br>
                Sent by:        </font><font size="1" face="sans-serif"><a class="moz-txt-link-abbreviated" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a></font><font
                size="3"><br>
              </font>
              <hr noshade="noshade">
              <font size="3"><br>
                <br>
              </font><font size="2" face="Calibri"><br>
                We’ve recently been seeing quite a few node expels with
                messages of the form:<br>
                <br>
                2019-01-17_11:19:30.882+0000: [W] The TCP connection to
                IP address 10.20.0.58 proto-pg-pf01.bear.cluster
                <c0n236> (socket 153) state is unexpected: state=1
                ca_state=4 snd_cwnd=1 snd_ssthresh=5 unacked=5 probes=0
                backoff=7 retransmits=7 rto=26496000 rcv_ssthresh=102828
                rtt=6729 rttvar=12066 sacked=0 retrans=1 reordering=3
                lost=5<br>
                2019-01-17_11:19:30.882+0000: [I] tscCheckTcpConn:
                Sending debug data collection request to node 10.20.0.58
                proto-pg-pf01.bear.cluster<br>
                2019-01-17_11:19:30.882+0000: Sending request to collect
                TCP debug data to proto-pg-pf01.bear.cluster localNode<br>
                2019-01-17_11:19:30.882+0000: [I] Calling user exit
                script gpfsSendRequestToNodes: event sendRequestToNodes,
                Async command /usr/lpp/mmfs/bin/mmcommon.<br>
                2019-01-17_11:24:52.611+0000: [E] Timed out in 300
                seconds waiting for a commMsgCheckMessages reply from
                node 10.20.0.58 proto-pg-pf01.bear.cluster. Sending
                expel message.<br>
                <br>
                On the client node, we see messages of the form:<br>
                <br>
                2019-01-17_11:19:31.101+0000: [N] sdrServ: Received Tcp
                data collection request from 10.10.0.33<br>
                2019-01-17_11:19:31.102+0000: [N] GPFS will attempt to
                collect Tcp debug data on this node.<br>
                2019-01-17_11:24:52.838+0000: [N] sdrServ: Received
                expel data collection request from 10.10.0.33<br>
                2019-01-17_11:24:52.838+0000: [N] GPFS will attempt to
                collect debug data on this node.<br>
                2019-01-17_11:25:02.741+0000: [N] This node will be
                expelled from cluster rds.gpfs.servers due to expel msg
                from 10.10.12.41 (b<br>
                ber-les-nsd01-data.bb2.cluster in rds.gpfs.server<br>
                2019-01-17_11:25:03.160+0000: [N] sdrServ: Received
                expel data collection request from 10.20.0.56<br>
                <br>
                They always appear to be to a specific type of hardware
                with the same Ethernet controller, though the nodes are
                split across three data centres and we aren’t seeing
                link congestion on the links between them.<br>
                <br>
                On the node I listed above, it’s not actually doing
                anything either as the software on it is still being
                installed (i.e. it’s not doing GPFS or any other IO
                other than a couple of home directories).<br>
                <br>
                Any suggestions on what “(socket 153) state is
                unexpected” means?<br>
                <br>
                Thanks<br>
                <br>
                Simon<br>
                <br>
              </font><tt><font size="2">_______________________________________________<br>
                  gpfsug-discuss mailing list<br>
                  gpfsug-discuss at spectrumscale.org</font></tt><font
                size="3" color="blue"><u><br>
                </u></font><a
                href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"
                target="_blank" rel="noopener noreferrer"
                moz-do-not-send="true"><tt><font size="2" color="blue"><u>http://gpfsug.org/mailman/listinfo/gpfsug-discuss</u></font></tt></a><font
                size="3"><br>
                <br>
                <br>
              </font><br>
              <br>
              <br>
              <br>
            </div>
          </div>
        </div>
      </div>
      <br>
    </blockquote>
  </body>
</html>
