<html>
  <head>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  </head>
  <body>
    <p>Hi, and thanks, Achim and Olaf, <br>
    </p>
    <p>mmdiag --iohist on all 4 NSD servers shows IO sizes to/from the
      data NSDs (i.e. to/from storage) of 16384 512-byte sectors
      throughout, i.e. 8 MiB, agreeing with the FS block size. (Having
      that information, I do not need to ask the clients ...)<br>
    </p>
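    <p>(For the record, that check boils down to something like the
      sketch below; the grep for the sector count is only a rough
      approximation, since the exact column layout of the iohist output
      differs a bit between releases.)</p>
    <pre>
# sanity check: 16384 sectors of 512 bytes are exactly the 8 MiB block size
echo $(( 16384 * 512 ))              # 8388608 bytes = 8 MiB

# rough count of full-block data IOs in the recent IO history of this NSD
# server (just greps for the sector count as a standalone token)
mmdiag --iohist | grep -c ' 16384 '
</pre>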
    <p>iostat on the NSD servers as well as the storage system counters
      say the IOs issued by the OS layer are 4 MiB, except on the one
      suspicious NSD server, where they were somewhat smaller than 4 MiB
      before the reboot but are now somewhat larger than 4 MiB (by a
      clearly noticeable amount). <br>
    </p>
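    <p>(Those numbers come from checks roughly like the following
      sketch; whether iostat reports avgrq-sz in 512-byte sectors or
      rareq-sz/wareq-sz in kB depends on the sysstat version, so treat
      the exact columns as an assumption.)</p>
    <pre>
# average request size per multipath device (older sysstat: avgrq-sz in
# 512-byte sectors; newer sysstat: rareq-sz / wareq-sz in kB)
iostat -xm 5 2 | grep '^dm-'

# block-layer limits that cap how large a merged request can get
for q in /sys/block/sd*/queue; do
    echo "$q: max_sectors_kb=$(cat $q/max_sectors_kb)" \
         "max_hw_sectors_kb=$(cat $q/max_hw_sectors_kb)"
done
</pre>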
    <p>The data piped through the NSD servers is well balanced between
      the 4 NSD servers; the IO system of the suspicious NSD server just
      issued a higher rate of IO requests while it was running smaller
      IOs, and now, with larger IOs, it has a lower IO rate than the
      other three NSD servers.</p>
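    <p>(That is just throughput = IO rate x IO size; with purely
      illustrative numbers:)</p>
    <pre>
# balanced MB/s with smaller IOs simply means proportionally more IOs per
# second (the IOPS figures below are made up for illustration)
awk 'BEGIN {
    printf "4.0 MiB x 1000 IOPS = %4.0f MiB/s\n", 4.0 * 1000
    printf "3.4 MiB x 1176 IOPS = %4.0f MiB/s\n", 3.4 * 1176
}'
</pre>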
    <p><br>
    </p>
    <p>So I am pretty sure it is not GPFS (see my initial post :-); but
      still, some people using GPFS might have encountered this as well,
      or might have an idea ;-)</p>
    <p>Cheers</p>
    <p>Uwe<br>
    </p>
    <div class="moz-cite-prefix">On 24.02.22 13:47, Olaf Weiser wrote:<br>
    </div>
    <blockquote type="cite"
cite="mid:OF57AF0CC7.AFEE5AA7-ON002587F3.00464315-002587F3.00464FCF@ibm.com">
      <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
      <div class="socmaildefaultfont" dir="ltr"
        style="font-family:Arial, Helvetica, sans-serif;font-size:10pt">
        <div dir="ltr">in addition to Achim,</div>
        <div dir="ltr">where do you see those "smaller IOs"?</div>
        <div dir="ltr">have you checked the IO sizes with mmfsadm dump
          iohist on each NSD client/server? If they are OK at that
          level, it's not GPFS</div>
        <div dir="ltr"> </div>
        <div dir="ltr">
          <div class="socmaildefaultfont" dir="ltr"
            style="font-family:Arial, Helvetica,
            sans-serif;font-size:10pt">
            <div class="socmaildefaultfont" dir="ltr"
              style="font-family:Arial, Helvetica,
              sans-serif;font-size:10pt">
              <div dir="ltr">
                <div> </div>
                <div> </div>
                <div>Mit freundlichen Grüßen / Kind regards</div>
                <div style="font-size: 8pt; font-family: sans-serif;
                  margin-top: 10px;">
                  <div>
                    <div> <br>
                      Olaf Weiser<br>
                       </div>
                  </div>
                </div>
              </div>
            </div>
          </div>
        </div>
        <div dir="ltr"> </div>
        <div dir="ltr"> </div>
        <blockquote data-history-content-modified="1" dir="ltr"
          style="border-left:solid #aaaaaa 2px; margin-left:5px;
          padding-left:5px; direction:ltr; margin-right:0px">-----
          Original Message -----<br>
          From: "Achim Rehor" <a class="moz-txt-link-rfc2396E" href="mailto:Achim.Rehor@de.ibm.com"><Achim.Rehor@de.ibm.com></a><br>
          Sent by: <a class="moz-txt-link-abbreviated" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a><br>
          To: "gpfsug main discussion list"
          <a class="moz-txt-link-rfc2396E" href="mailto:gpfsug-discuss@spectrumscale.org"><gpfsug-discuss@spectrumscale.org></a><br>
          CC:<br>
          Subject: [EXTERNAL] Re: [gpfsug-discuss] IO sizes<br>
          Date: Thu, 24 Feb 2022 13:41<br>
           
          <p><font size="2" face="sans-serif">Hi Uwe, </font><br>
            <br>
            <font size="2" face="sans-serif">first of all, glad to see
              you back in the GPFS space ;)<br>
              <br>
              agreed, groups of subblocks being written will end up in
              IO sizes smaller than the 8 MiB filesystem blocksize,</font><br>
            <font size="2" face="sans-serif">also agreed, this cannot be
              metadata, since metadata IOs are MUCH smaller, like 4k or
              less, mostly.<br>
              <br>
              But why would these grouped subblock reads/writes all end
              up on the same NSD server, while the others do full block
              writes?<br>
              <br>
              How is your NSD server setup per NSD? Did you set the
              preferred NSD server per NSD in a 'round-robin' fashion?<br>
              Are the client nodes transferring the data doing anything
              specific?<br>
              <br>
              Sorry for not having a solution for you, just sharing a
              few ideas ;) </font><br>
            <br>
            <br>
            <font size="1" face="Arial">Mit freundlichen Grüßen / Kind
              regards</font><br>
            <br>
            <font size="2" face="Arial"><b>Achim Rehor</b></font><br>
            <br>
            <font size="1" face="Arial" color="#0055AA">Technical
              Support Specialist Spectrum Scale and ESS (SME)</font><br>
            <font size="1" face="Arial" color="#0055AA">Advisory Product
              Services Professional</font><br>
            <font size="1" face="Arial" color="#0055AA">IBM Systems
              Storage Support - EMEA</font></p>
          <br>
          <tt><font size="3" face=""><a class="moz-txt-link-abbreviated" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a>
              wrote on 23/02/2022 22:20:11:<br>
              <br>
              > From: "Andrew Beattie" <a class="moz-txt-link-rfc2396E" href="mailto:abeattie@au1.ibm.com"><abeattie@au1.ibm.com></a></font></tt><br>
          <tt><font size="3" face="">> To: "gpfsug main discussion
              list" <a class="moz-txt-link-rfc2396E" href="mailto:gpfsug-discuss@spectrumscale.org"><gpfsug-discuss@spectrumscale.org></a></font></tt><br>
          <tt><font size="3" face="">> Date: 23/02/2022 22:20</font></tt><br>
          <tt><font size="3" face="">> Subject: [EXTERNAL] Re:
              [gpfsug-discuss] IO sizes</font></tt><br>
          <tt><font size="3" face="">> Sent by:
              <a class="moz-txt-link-abbreviated" href="mailto:gpfsug-discuss-bounces@spectrumscale.org">gpfsug-discuss-bounces@spectrumscale.org</a></font></tt><br>
          <tt><font size="3" face="">><br>
              > Alex,</font></tt><br>
          <tt><font size="3" face="">><br>
              > Metadata will be 4 KiB </font></tt><br>
          <tt><font size="3" face="">><br>
              > Depending on the filesystem version you will also
              have subblocks to<br>
              > consider: V4 filesystems have 1/32 subblocks, V5
              filesystems have<br>
              > 1/1024 subblocks (assuming metadata and data block
              size is the same)</font></tt><br>
          <tt><font size="3" face="">><br>
              > My first question would be: are you sure that the
              Linux OS is<br>
              > configured the same on all 4 NSD servers?</font></tt><br>
          <tt><font size="3" face="">><br>
              > My second question would be: do you know what your
              average file size<br>
              > is? If most of your files are smaller than your
              filesystem block<br>
              > size, then you are always going to be performing
              writes using groups<br>
              > of subblocks rather than full block writes.</font></tt><br>
          <tt><font size="3" face="">><br>
              > Regards, </font></tt><br>
          <tt><font size="3" face="">><br>
              > Andrew</font></tt><br>
          <tt><font size="3" face="">><br>
              > On 24 Feb 2022, at 04:39, Alex Chekholko
              <a class="moz-txt-link-rfc2396E" href="mailto:alex@calicolabs.com"><alex@calicolabs.com></a> wrote:</font></tt><br>
          <br>
          <tt><font size="3" face="">> Hi,</font></tt><br>
          <tt><font size="3" face="">><br>
              > Metadata I/Os will always be smaller than the usual
              data block size, right?</font></tt><br>
          <tt><font size="3" face="">> Which version of GPFS?</font></tt><br>
          <tt><font size="3" face="">><br>
              > Regards,</font></tt><br>
          <tt><font size="3" face="">> Alex</font></tt><br>
          <tt><font size="3" face="">><br>
              > On Wed, Feb 23, 2022 at 10:26 AM Uwe Falke
              <a class="moz-txt-link-rfc2396E" href="mailto:uwe.falke@kit.edu"><uwe.falke@kit.edu></a> wrote:</font></tt><br>
          <tt><font size="3" face="">> Dear all,<br>
              ><br>
              > sorry for asking a question which does not seem
              directly GPFS related:<br>
              ><br>
              > In a setup with 4 NSD servers (old-style, with
              storage controllers in<br>
              > the back end), 12 clients and 10 Seagate storage
              systems, I see in<br>
              > benchmark tests that just one of the NSD servers
              sends smaller IO<br>
              > requests to the storage than the other 3 (that is,
              both reads and<br>
              > writes are smaller).<br>
              ><br>
              > The NSD servers form 2 pairs; each pair is connected
              to 5 Seagate boxes<br>
              > (one server to the A controllers, the other one to
              the B controllers of<br>
              > the Seagates, resp.).<br>
              ><br>
              > All 4 NSD servers are set up similarly:<br>
              ><br>
              > kernel: 3.10.0-1160.el7.x86_64 #1 SMP<br>
              ><br>
              > HBA: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure
              SAS38xx<br>
              ><br>
              > driver : mpt3sas 31.100.01.00<br>
              ><br>
              > max_sectors_kb=8192 (max_hw_sectors_kb=16383 , not
              16384, as limited by<br>
              > mpt3sas) for all sd devices and all multipath (dm)
              devices built on top.<br>
              ><br>
              > scheduler: deadline<br>
              ><br>
              > multipath (actually we do have 3 paths to each
              volume, so there is some<br>
              > asymmetry, but that should not affect the IOs, should
              it? And if it<br>
              > did, we would see the same effect in both pairs of
              NSD servers, but we<br>
              > do not).<br>
              ><br>
              > All 10 storage systems are also configured the same
              way (2 disk groups /<br>
              > pools / declustered arrays, one managed by ctrl A,
              one by ctrl B, and<br>
              > 8 volumes out of each; that makes altogether 2 x 8 x
              10 = 160 NSDs).<br>
              ><br>
              ><br>
              > GPFS BS is 8 MiB; according to iohistory (mmdiag) we
              do see clean IO<br>
              > requests of 16384 disk sectors (i.e. 8192 KiB) from
              GPFS.<br>
              ><br>
              > The first question I have - but that is not my main
              one: I do see, both<br>
              > in iostat and on the storage systems, that the IO
              requests are typically<br>
              > about 4 MiB, not 8 MiB as I'd expect from the above
              settings (max_sectors_kb<br>
              > is really in terms of kiB, not sectors, cf.<br>
              > <a
                href="https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt"
                target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt</a>).<br>
              ><br>
              > But what puzzles me even more: one of the servers
              assembles even smaller<br>
              > IOs, varying between 3.2 MiB and 3.6 MiB mostly -
              both for reads and<br>
              > writes ... I just cannot see why.<br>
              ><br>
              > I have to suspect that this will (in writing to the
              storage) cause<br>
              > incomplete stripe writes on our erasure-coded volumes
              (8+2p) (as long as<br>
              > the controller is not able to re-coalesce the data
              properly; and it<br>
              > seems it cannot do that completely, at least).<br>
              ><br>
              ><br>
              > If someone of you has seen that already and/or knows
              a potential<br>
              > explanation I'd be glad to learn about.<br>
              ><br>
              ><br>
              > And if some of you wonder: yes, I (was) moved away
              from IBM and am now<br>
              > at KIT.<br>
              ><br>
              > Many thanks in advance<br>
              ><br>
              > Uwe<br>
              ><br>
              ><br>
              > --<br>
              > Karlsruhe Institute of Technology (KIT)<br>
              > Steinbuch Centre for Computing (SCC)<br>
              > Scientific Data Management (SDM)<br>
              ><br>
              > Uwe Falke<br>
              ><br>
              > Hermann-von-Helmholtz-Platz 1, Building 442, Room 187<br>
              > D-76344 Eggenstein-Leopoldshafen<br>
              ><br>
              > Tel: +49 721 608 28024<br>
              > Email: <a class="moz-txt-link-abbreviated" href="mailto:uwe.falke@kit.edu">uwe.falke@kit.edu</a><br>
              > <a class="moz-txt-link-abbreviated" href="http://www.scc.kit.edu">www.scc.kit.edu</a><br>
              ><br>
              > Registered office:<br>
              > Kaiserstraße 12, 76131 Karlsruhe, Germany<br>
              ><br>
              > KIT – The Research University in the Helmholtz
              Association<br>
              ><br>
              > _______________________________________________<br>
              > gpfsug-discuss mailing list<br>
              > gpfsug-discuss at spectrumscale.org<br>
              > <a
                href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"
                target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a></font></tt><br>
          <tt><font size="3" face="">><br>
              > _______________________________________________<br>
              > gpfsug-discuss mailing list<br>
              > gpfsug-discuss at spectrumscale.org<br>
              > <a
                href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"
                target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a></font></tt><br>
          <br>
           
          <div><font size="2" face="Default Monospace,Courier
              New,Courier,monospace">_______________________________________________<br>
              gpfsug-discuss mailing list<br>
              gpfsug-discuss at spectrumscale.org<br>
              <a
                href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"
                target="_blank" moz-do-not-send="true"
                class="moz-txt-link-freetext">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a> </font></div>
        </blockquote>
        <div dir="ltr"> </div>
      </div>
      <br>
      <br>
      <br>
      <fieldset class="moz-mime-attachment-header"></fieldset>
      <pre class="moz-quote-pre" wrap="">_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
<a class="moz-txt-link-freetext" href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a>
</pre>
    </blockquote>
    <pre class="moz-signature" cols="72">-- 
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Scientific Data Management (SDM)

Uwe Falke

Hermann-von-Helmholtz-Platz 1, Building 442, Room 187
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 28024
Email: <a class="moz-txt-link-abbreviated" href="mailto:uwe.falke@kit.edu">uwe.falke@kit.edu</a>
<a class="moz-txt-link-abbreviated" href="http://www.scc.kit.edu">www.scc.kit.edu</a>

Registered office:
Kaiserstraße 12, 76131 Karlsruhe, Germany

KIT – The Research University in the Helmholtz Association 
</pre>
  </body>
</html>