<html><body><p><font size="2" face="sans-serif">Hi Uwe, </font><br><br><font size="2" face="sans-serif">First of all, glad to see you back in the GPFS space ;) <br><br>Agreed, writing groups of subblocks results in IO sizes smaller than the 8 MiB filesystem block size.</font><br><font size="2" face="sans-serif">Also agreed, this cannot be metadata, since metadata IOs are MUCH smaller, mostly 4 KiB or less. <br><br>But why would these grouped subblock reads/writes all end up on the same NSD server, while the others do full-block writes? <br><br>How are your NSD servers set up per NSD? Did you set the preferred NSD server per NSD in 'round-robin' fashion? <br>Are the client nodes that transfer the data doing anything specific? <br><br>Sorry for not having a solution for you, just sharing a few ideas ;) </font><br><br><br><font size="1" face="Arial">Kind regards</font><br><br><font size="2" face="Arial"><b>Achim Rehor</b></font><br><br><font size="1" color="#0055AA" face="Arial">Technical Support Specialist Spectrum Scale and ESS (SME)</font><br><font size="1" color="#0055AA" face="Arial">Advisory Product Services Professional</font><br><font size="1" color="#0055AA" face="Arial">IBM Systems Storage Support - EMEA</font><table border="0" cellspacing="0" cellpadding="0"><tr valign="top"><td class="small_font" width="800" colspan="4" valign="middle"><img width="1" height="1" src="cid:1__=4EBB0D60DFD775728f9e8a93df938690@ibm.com" border="0" alt=""></td></tr>




<tr valign="top"><td class="foot" width="800" colspan="4" valign="middle"><img width="1" height="1" src="cid:1__=4EBB0D60DFD775728f9e8a93df938690@ibm.com" border="0" alt=""></td></tr></table><br><tt><font size="2">gpfsug-discuss-bounces@spectrumscale.org wrote on 23/02/2022 22:20:11:<br><br>> From: "Andrew Beattie" &lt;abeattie@au1.ibm.com&gt;</font></tt><br><tt><font size="2">> To: "gpfsug main discussion list" &lt;gpfsug-discuss@spectrumscale.org&gt;</font></tt><br><tt><font size="2">> Date: 23/02/2022 22:20</font></tt><br><tt><font size="2">> Subject: [EXTERNAL] Re: [gpfsug-discuss] IO sizes</font></tt><br><tt><font size="2">> Sent by: gpfsug-discuss-bounces@spectrumscale.org</font></tt><br><tt><font size="2">> <br>> Alex,</font></tt><br><tt><font size="2">> <br>> Metadata will be 4 KiB.</font></tt><br><tt><font size="2">> <br>> Depending on the filesystem version you will also have subblocks to <br>> consider: V4 filesystems have 1/32 subblocks, V5 filesystems have <br>> 1/1024 subblocks (assuming metadata and data block size are the same).</font></tt><br><tt><font size="2">> <br>> My first question would be: are you sure that the Linux OS is <br>> configured the same on all 4 NSD servers?</font></tt><br><tt><font size="2">> <br>> My second question would be: do you know what your average file size <br>> is? If most of your files are smaller than your filesystem block <br>> size, then you are always going to be performing writes using groups <br>> of subblocks rather than full-block writes.</font></tt><br><tt><font size="2">> <br>> Regards, </font></tt><br><tt><font size="2">> <br>> Andrew</font></tt><br><tt><font size="2">> <br>> On 24 Feb 2022, at 04:39, Alex Chekholko &lt;alex@calicolabs.com&gt; wrote:<br></font></tt><br><tt><font size="2">> Hi,</font></tt><br><tt><font size="2">> <br>> Metadata I/Os will always be smaller than the usual data block size, right?</font></tt><br><tt><font size="2">> Which version of GPFS?</font></tt><br><tt><font size="2">> <br>> Regards,</font></tt><br><tt><font size="2">> Alex</font></tt><br><tt><font size="2">> <br>> On Wed, Feb 23, 2022 at 10:26 AM Uwe Falke &lt;uwe.falke@kit.edu&gt; wrote:</font></tt><br><tt><font size="2">> Dear all,<br>> <br>> sorry for asking a question which seems not directly GPFS related:<br>> <br>> In a setup with 4 NSD servers (old-style, with storage controllers in <br>> the back end), 12 clients and 10 Seagate storage systems, I do see in <br>> benchmark tests that just one of the NSD servers sends smaller IO <br>> requests to the storage than the other 3 (that is, both reads and <br>> writes are smaller).<br>> <br>> The NSD servers form 2 pairs; each pair is connected to 5 Seagate boxes <br>> (one server to the controllers A, the other one to the controllers B of the <br>> Seagates, resp.).<br>> <br>> All 4 NSD servers are set up similarly:<br>> <br>> kernel: 3.10.0-1160.el7.x86_64 #1 SMP<br>> <br>> HBA: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx<br>> <br>> driver: mpt3sas 31.100.01.00<br>> <br>> max_sectors_kb=8192 (max_hw_sectors_kb=16383, not 16384, as limited by <br>> mpt3sas) for all sd devices and all multipath (dm) devices built on top.<br>> <br>> scheduler: deadline<br>> <br>> multipath (we actually have 3 paths to each volume, so there is some <br>> asymmetry, but that should not affect the IOs, should it? And if it <br>> did, we would see the same effect in both pairs of NSD servers, but we do <br>> not).<br>> <br>> All 10 storage systems are also configured the same way (2 disk groups / <br>> pools / declustered arrays, one managed by ctrl A, one by ctrl B, and <br>> 8 volumes out of each; makes altogether 2 x 8 x 10 = 
160 NSDs).<br>> <br>> <br>> GPFS block size is 8 MiB; according to iohistory (mmdiag) we do see clean IO <br>> requests of 16384 disk blocks (i.e. 8192 KiB) from GPFS.<br>> <br>> The first question I have, though it is not my main one: I do see, both <br>> in iostat and on the storage systems, that the default IO requests are <br>> about 4 MiB, not 8 MiB as I'd expect from the above settings (max_sectors_kb <br>> is really in terms of KiB, not sectors, cf. <br>> <a href="https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt">https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt</a>).<br>> <br>> But what puzzles me even more: one of the servers issues even smaller <br>> IOs, mostly varying between 3.2 MiB and 3.6 MiB, both for reads and <br>> writes ... I just cannot see why.<br>> <br>> I have to suspect that this will (in writing to the storage) cause <br>> incomplete stripe writes on our erasure-coded volumes (8+2p), as long as <br>> the controller is not able to re-coalesce the data properly; and it <br>> seems it cannot do so completely, at least.<br>> <br>> <br>> If any of you has seen this already and/or knows a potential <br>> explanation, I'd be glad to learn about it.<br>> <br>> <br>> And if some of you wonder: yes, I (was) moved away from IBM and am now <br>> at KIT.<br>> <br>> Many thanks in advance<br>> <br>> Uwe<br>> <br>> <br>> -- <br>> Karlsruhe Institute of Technology (KIT)<br>> Steinbuch Centre for Computing (SCC)<br>> Scientific Data Management (SDM)<br>> <br>> Uwe Falke<br>> <br>> Hermann-von-Helmholtz-Platz 1, Building 442, Room 187<br>> D-76344 Eggenstein-Leopoldshafen<br>> <br>> Tel: +49 721 608 28024<br>> Email: uwe.falke@kit.edu<br>> www.scc.kit.edu<br>> <br>> Registered office:<br>> Kaiserstraße 12, 76131 Karlsruhe, Germany<br>> <br>> KIT – The Research University in the Helmholtz Association<br>> <br>> _______________________________________________<br>> gpfsug-discuss mailing list<br>> gpfsug-discuss at spectrumscale.org<br>> <a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a></font></tt><BR>
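<br><font size="2" face="sans-serif">P.S. On Andrew's question of whether Linux is configured the same on all 4 NSD servers: below is a minimal sketch, not a tested tool, that dumps the block-queue settings from sysfs so the output of the four servers can be diffed. The function names are made up for illustration. The attribute names are the ones documented in the queue-sysfs.txt file Uwe linked. max_segments and max_segment_size were not mentioned in this thread; I only assume they might be worth comparing as well.</font><br>
<pre>
```python
import os

# Queue attributes as documented in the kernel's queue-sysfs.txt.
# max_segments / max_segment_size are an assumption: they were not
# discussed in the thread, but they are cheap to compare as well.
ATTRS = ("max_sectors_kb", "max_hw_sectors_kb", "max_segments",
         "max_segment_size", "scheduler")


def read_queue_settings(sys_block="/sys/block"):
    """Return {device: {attribute: value}} for every block device found."""
    settings = {}
    for dev in sorted(os.listdir(sys_block)):
        queue_dir = os.path.join(sys_block, dev, "queue")
        if not os.path.isdir(queue_dir):
            continue
        values = {}
        for attr in ATTRS:
            try:
                with open(os.path.join(queue_dir, attr)) as fh:
                    values[attr] = fh.read().strip()
            except OSError:
                values[attr] = None  # attribute missing or unreadable
        settings[dev] = values
    return settings


def distinct_profiles(settings):
    """Group devices that share identical settings.

    More than one group on a server that should be uniform is worth
    a closer look.
    """
    groups = {}
    for dev, values in settings.items():
        key = tuple(values[attr] for attr in ATTRS)
        groups.setdefault(key, []).append(dev)
    return groups


if __name__ == "__main__":
    for profile, devs in distinct_profiles(read_queue_settings()).items():
        print(dict(zip(ATTRS, profile)), "->", len(devs), "devices")
```
</pre>
<font size="2" face="sans-serif">Run it on each NSD server and compare the printed profiles. This only shows what the kernel reports, it does not explain where a difference comes from.</font><br>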


<BR>
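<font size="2" face="sans-serif">To put numbers on Uwe's concern about incomplete stripe writes, here is a small arithmetic sketch. The 16384 disk blocks, the 8+2p layout and the observed IO sizes are from his mail. The 512-byte sector size and, above all, the 512 KiB chunk size per data disk are my assumptions for illustration only; the real chunk size of the Seagate volumes was not stated.</font><br>
<pre>
```python
SECTOR_BYTES = 512  # assumption: iohistory "disk blocks" are 512-byte sectors
KIB = 1024
MIB = 1024 * KIB


def sectors_to_kib(sectors):
    """Convert a count of 512-byte sectors to KiB."""
    return sectors * SECTOR_BYTES // KIB


def full_stripe_bytes(data_disks, chunk_bytes):
    """Data payload of one full stripe (parity is not counted)."""
    return data_disks * chunk_bytes


def covers_full_stripes(io_bytes, stripe_bytes):
    """True if an aligned IO of this size is a whole number of stripes."""
    return io_bytes >= stripe_bytes and io_bytes % stripe_bytes == 0


if __name__ == "__main__":
    print(sectors_to_kib(16384), "KiB per GPFS request")
    # Hypothetical chunk size of 512 KiB per data disk in an 8+2p volume.
    stripe = full_stripe_bytes(data_disks=8, chunk_bytes=512 * KIB)
    for size_mib in (8.0, 4.0, 3.6, 3.2):
        io_bytes = int(size_mib * MIB)
        print(size_mib, "MiB IO covers full stripes:",
              covers_full_stripes(io_bytes, stripe))
```
</pre>
<font size="2" face="sans-serif">Under that assumed chunk size, the 8 MiB and 4 MiB requests map onto whole stripes, while the 3.2 MiB to 3.6 MiB requests from the odd server do not. With a different chunk size the picture changes, so plug in the real value.</font><br>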


</body></html>