[gpfsug-discuss] IO sizes

Uwe Falke uwe.falke at kit.edu
Mon Feb 28 09:17:26 GMT 2022


Hi, Kumaran,

that would explain the smaller IOs before the reboot, but not the 
larger-than-4MiB IOs afterwards on that machine.

Besides, I have already seen that the numaMemoryInterleave setting seems 
to have no effect (on that very installation); I just have not yet opened 
a PMR for it. I had of course checked memory usage and saw that, 
regardless of this setting, one socket's memory is always almost 
completely consumed while the other's is rather empty - that looks like a 
bug to me, but it needs further investigation.
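
For the record, a quick way to verify the per-socket memory distribution 
(assuming the numactl tools are installed; the egrep patterns are just 
examples):

# numactl --hardware | egrep 'node [01] (size|free)'
# numastat -m | egrep 'Node|MemFree|MemUsed'
# pgrep mmfsd | xargs numastat -p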

Uwe


On 24.02.22 15:32, Kumaran Rajaram wrote:
>
> Hi Uwe,
>
> >> But what puzzles me even more: one of the servers compiles IOs even 
> >> smaller, varying between 3.2MiB and 3.6MiB mostly - both for reads 
> >> and writes ... I just cannot see why.
>
> IMHO, if GPFS on this particular NSD server was restarted often during 
> the setup, then it is possible that the GPFS pagepool is no longer 
> contiguous. As a result, a GPFS 8MiB buffer in the pagepool might be a 
> scatter-gather (SG) list with many small entries (in memory), 
> resulting in smaller I/Os when these buffers are issued to the disks. 
> The fix would be to reboot the server and start GPFS so that the 
> pagepool is contiguous and each 8MiB buffer comprises only one (or a 
> few) SG entries.
>
> >> In the current situation (i.e. with IOs a bit larger than 4MiB), 
> >> setting max_sectors_kb to 4096 might do the trick, but as I do not 
> >> know the cause of that behaviour it might well start to issue IOs 
> >> smaller than 4MiB again at some point, so that is not a nice solution.
>
> It is advisable not to restart GPFS often on the NSD servers (in 
> production), so as to keep the pagepool contiguous. Ensure that there is 
> enough free memory in the NSD server and do not run any memory-intensive 
> jobs, so that the pagepool is not impacted (e.g. swapped out).
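>
> A minimal check (just a sketch) that the mmfsd memory has not been 
> swapped out - VmSwap should report 0 kB:
>
> # pgrep mmfsd | xargs -I{} grep VmSwap /proc/{}/status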
>
> Also, enable GPFS numaMemoryInterleave=yes and verify that the pagepool 
> is equally distributed across the NUMA domains, for good performance. 
> GPFS numaMemoryInterleave=yes requires that the numactl packages are 
> installed and that GPFS is restarted afterwards.
>
> # mmfsadm dump config | egrep "numaMemory|pagepool "
>
> ! numaMemoryInterleave yes
> ! pagepool 282394099712
>
> # pgrep mmfsd | xargs numastat -p
>
> Per-node process memory usage (in MBs) for PID 2120821 (mmfsd)
>
>                            Node 0          Node 1            Total
>                   ---------------  --------------  ---------------
> Huge                         0.00            0.00             0.00
> Heap                         1.26            3.26             4.52
> Stack                        0.01            0.01             0.02
> Private                 137710.43       137709.96        275420.39
>                   ---------------  --------------  ---------------
> Total                   137711.70       137713.23        275424.92
>
> My two cents,
>
> -Kums
>
> Kumaran Rajaram
>
> *From:* gpfsug-discuss-bounces at spectrumscale.org 
> <gpfsug-discuss-bounces at spectrumscale.org> *On Behalf Of *Uwe Falke
> *Sent:* Wednesday, February 23, 2022 8:04 PM
> *To:* gpfsug-discuss at spectrumscale.org
> *Subject:* Re: [gpfsug-discuss] IO sizes
>
> Hi,
>
> the test bench is gpfsperf running on up to 12 clients with 1...64 
> threads doing sequential reads and writes; the file size per gpfsperf 
> process is 12TB (with 6TB I saw caching effects, in particular for 
> large thread counts ...).
>
> As I wrote initially: GPFS is issuing nothing but 8MiB IOs to the data 
> disks, as expected in that case.
>
> Interesting thing though:
>
> I have rebooted the suspicious node. Now it does not issue smaller 
> IOs than the others, but -- unbelievably -- larger ones (up to about 
> 4.7MiB). This is still harmful, as that size, too, is incompatible with 
> full stripe writes on the storage (8+2 disk groups, i.e. logically RAID6).
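>
> (For illustration, assuming a strip size of 512KiB - the actual strip 
> size of our arrays may differ: a full stripe on 8+2 is 8 x 512KiB = 
> 4MiB, so only IO sizes that are multiples of 4MiB avoid 
> read-modify-write cycles, and 4.7MiB is not such a multiple.)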
>
> Currently, I draw this information from the storage boxes; I have not 
> yet checked iostat data for that benchmark test after the reboot 
> (before, when IO sizes were smaller, we saw that both in iostat and in 
> the perf data retrieved from the storage controllers).
>
> And: we have a separate data pool, hence dataOnly NSDs; I am talking 
> only about these ...
>
> As for "Are you sure that Linux OS is configured the same on all 4 NSD 
> servers?." - of course there are not two boxes identical in the world. 
> I have actually not installed those machines, and, yes, i also 
> considered reinstalling them (or at least the disturbing one).
>
> However, I have no reason to assume or expect a difference; the 
> supplier has only recently implemented these systems from scratch.
>
> In the current situation (i.e. with IOs a bit larger than 4MiB), 
> setting max_sectors_kb to 4096 might do the trick, but as I do not know 
> the cause of that behaviour it might well start to issue IOs smaller 
> than 4MiB again at some point, so that is not a nice solution.
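>
> A sketch of how that setting could be applied persistently - the udev 
> KERNEL match is only an example and would need to be adapted to the 
> actual sd/dm device layout here:
>
> # echo 4096 > /sys/block/sdb/queue/max_sectors_kb    # single device, not persistent
> # cat /etc/udev/rules.d/99-max-sectors.rules
> ACTION=="add|change", SUBSYSTEM=="block", KERNEL=="sd*", ATTR{queue/max_sectors_kb}="4096"
> # udevadm control --reload && udevadm trigger --subsystem-match=block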
>
> Thanks
>
> Uwe
>
> On 23.02.22 22:20, Andrew Beattie wrote:
>
>     Alex,
>
>     Metadata will be 4KiB.
>
>     Depending on the filesystem version you will also have subblocks
>     to consider: V4 filesystems have 1/32 subblocks, V5 filesystems
>     have 1/1024 subblocks (assuming the metadata and data block sizes
>     are the same).
>
>
>     My first question would be: are you sure that the Linux OS is
>     configured the same on all 4 NSD servers?
>
>     My second question would be: do you know what your average file
>     size is? If most of your files are smaller than your filesystem
>     block size, then you are always going to be performing writes
>     using groups of subblocks rather than full-block writes.
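>
>     (Following from the 1/1024 ratio stated above: with an 8MiB block
>     size, a V5 filesystem would have 8MiB / 1024 = 8KiB subblocks, so
>     small files are written in multiples of 8KiB.)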
>
>     Regards,
>
>     Andrew
>
>
>
>         On 24 Feb 2022, at 04:39, Alex Chekholko <alex at calicolabs.com>
>         wrote:
>
>         Hi,
>
>         Metadata I/Os will always be smaller than the usual data block
>         size, right?
>
>         Which version of GPFS?
>
>         Regards,
>
>         Alex
>
>         On Wed, Feb 23, 2022 at 10:26 AM Uwe Falke <uwe.falke at kit.edu>
>         wrote:
>
>             Dear all,
>
>             sorry for asking a question which seems not directly GPFS
>             related:
>
>             In a setup with 4 NSD servers (old-style, with storage
>             controllers in the back end), 12 clients and 10 Seagate
>             storage systems, I see in benchmark tests that just one of
>             the NSD servers sends smaller IO requests to the storage
>             than the other 3 (that is, both reads and writes are
>             smaller).
>
>             The NSD servers form 2 pairs; each pair is connected to 5
>             Seagate boxes (one server to the A controllers, the other
>             one to the B controllers of the Seagates, respectively).
>
>             All 4 NSD servers are set up similarly:
>
>             kernel: 3.10.0-1160.el7.x86_64 #1 SMP
>
>             HBA: Broadcom / LSI Fusion-MPT 12GSAS/PCIe Secure SAS38xx
>
>             driver: mpt3sas 31.100.01.00
>
>             max_sectors_kb=8192 (max_hw_sectors_kb=16383, not 16384, as
>             limited by mpt3sas) for all sd devices and all multipath
>             (dm) devices built on top.
>
>             scheduler: deadline
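>
>             (For completeness, a one-liner to confirm these queue
>             settings across all sd and dm devices - grep prints each
>             file name along with its value:)
>
>             # grep . /sys/block/{sd*,dm-*}/queue/max_sectors_kb
>             # grep . /sys/block/{sd*,dm-*}/queue/scheduler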
>
>             multipath (actually we do have 3 paths to each volume, so
>             there is some asymmetry, but that should not affect the
>             IOs, should it? And if it did, we would see the same effect
>             in both pairs of NSD servers, which we do not).
>
>             All 10 storage systems are also configured the same way
>             (2 disk groups / pools / declustered arrays, one managed by
>             ctrl A, one by ctrl B, and 8 volumes out of each; that
>             makes altogether 2 x 8 x 10 = 160 NSDs).
>
>
>             The GPFS blocksize is 8MiB; according to iohistory (mmdiag)
>             we do see clean IO requests of 16384 disk blocks
>             (i.e. 8192KiB) from GPFS.
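>
>             (That can be seen, e.g., in the nSec column of the IO
>             history, which gives the IO size in 512-byte sectors, so
>             16384 corresponds to 8MiB:)
>
>             # mmdiag --iohist | head -20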
>
>             The first question I have - but that is not my main one: I
>             do see, both in iostat and on the storage systems, that the
>             IO requests are typically about 4MiB, not 8MiB as I'd
>             expect from the above settings (max_sectors_kb is really in
>             terms of KiB, not sectors, cf.
>             https://www.kernel.org/doc/Documentation/block/queue-sysfs.txt).
>
>             But what puzzles me even more: one of the servers compiles
>             IOs even smaller, varying between 3.2MiB and 3.6MiB mostly
>             - both for reads and writes ... I just cannot see why.
>
>             I have to suspect that this will (when writing to the
>             storage) cause incomplete stripe writes on our
>             erasure-coded volumes (8+2p) (as long as the controller is
>             not able to re-coalesce the data properly; and it seems it
>             cannot do that completely, at least).
>
>
>             If any of you have seen that already and/or know a
>             potential explanation, I'd be glad to learn about it.
>
>
>             And if some of you wonder: yes, I (was) moved away from
>             IBM and am now
>             at KIT.
>
>             Many thanks in advance
>
>             Uwe
>
>
>             -- 
>             Karlsruhe Institute of Technology (KIT)
>             Steinbuch Centre for Computing (SCC)
>             Scientific Data Management (SDM)
>
>             Uwe Falke
>
>             Hermann-von-Helmholtz-Platz 1, Building 442, Room 187
>             D-76344 Eggenstein-Leopoldshafen
>
>             Tel: +49 721 608 28024
>             Email: uwe.falke at kit.edu
>             www.scc.kit.edu
>
>             Registered office:
>             Kaiserstraße 12, 76131 Karlsruhe, Germany
>
>             KIT – The Research University in the Helmholtz Association
>
>             _______________________________________________
>             gpfsug-discuss mailing list
>             gpfsug-discuss at spectrumscale.org
>             http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>
>
>     _______________________________________________
>
>     gpfsug-discuss mailing list
>
>     gpfsug-discuss at spectrumscale.org
>
>     http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> -- 
> Karlsruhe Institute of Technology (KIT)
> Steinbuch Centre for Computing (SCC)
> Scientific Data Management (SDM)
> Uwe Falke
> Hermann-von-Helmholtz-Platz 1, Building 442, Room 187
> D-76344 Eggenstein-Leopoldshafen
> Tel: +49 721 608 28024
> Email: uwe.falke at kit.edu
> www.scc.kit.edu
> Registered office:
> Kaiserstraße 12, 76131 Karlsruhe, Germany
> KIT – The Research University in the Helmholtz Association
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-- 
Karlsruhe Institute of Technology (KIT)
Steinbuch Centre for Computing (SCC)
Scientific Data Management (SDM)

Uwe Falke

Hermann-von-Helmholtz-Platz 1, Building 442, Room 187
D-76344 Eggenstein-Leopoldshafen

Tel: +49 721 608 28024
Email: uwe.falke at kit.edu
www.scc.kit.edu

Registered office:
Kaiserstraße 12, 76131 Karlsruhe, Germany

KIT – The Research University in the Helmholtz Association