<html><body><p><font size="2">Originally, GPFS used a strict round robin, first over failure groups, then over volumes within each</font><br><font size="2">failure group. That had performance issues when one or more volumes was low on space. Then </font><br><font size="2">for a while there were a variety of weighted stripe methods including by free space and by capacity.</font><br><font size="2">The file system had an option allowing the user to change the stripe method. That option was</font><br><font size="2">removed when we switched to a "best effort" round robin, which does a round robin over the</font><br><font size="2">failure groups then volumes based on the allocation regions that a node owns. When the stripe width</font><br><font size="2">at a node drops below half of the failure groups or half of the volumes that node will acquire new</font><br><font size="2">allocation regions.</font><br><br><font size="2">Basically we vary the stripe width to avoid searching for free space on specific volumes. It will </font><br><font size="2">eventually even itself out or you could restripe the file system to even it out immediately.</font><br><br><font size="2">-Wayne</font><br><br><font size="2">ps. And of course, allocation in FPO is completely different.</font><br><br><br><tt><font size="2">gpfsug-discuss-bounces@spectrumscale.org wrote on 01/13/2018 09:26:51 AM:<br><br>> From: Aaron Knister <aaron.s.knister@nasa.gov></font></tt><br><tt><font size="2">> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org></font></tt><br><tt><font size="2">> Date: 01/13/2018 09:27 AM</font></tt><br><tt><font size="2">> Subject: Re: [gpfsug-discuss] pool block allocation algorithm</font></tt><br><tt><font size="2">> Sent by: gpfsug-discuss-bounces@spectrumscale.org</font></tt><br><tt><font size="2">> <br>> Thanks, Peter. That definitely makes sense and I was actually wondering<br>> if performance was a factor. Do you know where to look to see what GPFS'<br>> perception of "performance" is for a given NSD?<br>> <br>> -Aaron<br>> <br>> On 1/13/18 12:00 PM, Peter Serocka wrote:<br>> > Within reasonable capacity limits it would also expect<br>> > to direct incoming data to disks that are best “available”<br>> > from a current performance perspective — like doing least<br>> > IOPS, having lowest latency and shortest filled queue length.<br>> > <br>> > You new NSDs, filled only with recent data, might quickly have<br>> > become the most busy units before reaching capacity balance,<br>> > simply because recent data tends to be more active than older stuff.<br>> > <br>> > Makes sense?<br>> > <br>> > — Peter<br>> > <br>> >> On 2018 Jan 13 Sat, at 17:18, Aaron Knister <br>> <aaron.s.knister@nasa.gov> wrote:<br>> >> <br>> >> Thanks Everyone! I whipped up a script to dump the block layout of a<br>> >> file and then join that with mmdf information. As part of my exploration<br>> >> I wrote one 2GB file to each of this particular filesystem's 4 data<br>> >> pools last night (using "touch $file; mmchattr $file -P $pool; dd<br>> >> of=$file") and have attached a dump of the layout/nsd information for<br>> >> each file/pool. 
-Wayne

ps. And of course, allocation in FPO is completely different.


gpfsug-discuss-bounces@spectrumscale.org wrote on 01/13/2018 09:26:51 AM:

> From: Aaron Knister <aaron.s.knister@nasa.gov>
> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> Date: 01/13/2018 09:27 AM
> Subject: Re: [gpfsug-discuss] pool block allocation algorithm
> Sent by: gpfsug-discuss-bounces@spectrumscale.org
>
> Thanks, Peter. That definitely makes sense, and I was actually wondering
> if performance was a factor. Do you know where to look to see what GPFS's
> perception of "performance" is for a given NSD?
>
> -Aaron
>
> On 1/13/18 12:00 PM, Peter Serocka wrote:
> > Within reasonable capacity limits, I would also expect it to direct
> > incoming data to the disks that are best "available" from a current
> > performance perspective -- doing the least IOPS, having the lowest
> > latency and the shortest queue.
> >
> > Your new NSDs, filled only with recent data, might quickly have become
> > the busiest units before reaching capacity balance, simply because
> > recent data tends to be more active than older stuff.
> >
> > Makes sense?
> >
> > -- Peter
> >
> >> On 2018 Jan 13 Sat, at 17:18, Aaron Knister <aaron.s.knister@nasa.gov> wrote:
> >>
> >> Thanks, everyone! I whipped up a script to dump the block layout of a
> >> file and then join that with mmdf information. As part of my
> >> exploration I wrote one 2GB file to each of this particular
> >> filesystem's 4 data pools last night (using "touch $file; mmchattr
> >> $file -P $pool; dd of=$file") and have attached a dump of the
> >> layout/NSD information for each file/pool. The fields in the output are:
> >>
> >> diskId, numBlocksOnDisk, diskName, diskSize, failureGroup, freeBlocks,
> >> freePct, freeKbFragments, freeKbFragmentsPct
> >>
> >> Here's the highlight from pool1:
> >>
> >>  36 264 d13_06_006 23437934592 1213  4548935680 (19%)  83304320 (0%)
> >>  59  74 d10_41_025 23437934592 1011  6993759232 (30%)  58642816 (0%)
> >>
> >> For this file (and, anecdotally, what I've seen looking at NSD I/O
> >> data for other files written to this pool), the pattern of more blocks
> >> being allocated to the NSDs that are ~20% free than to the NSDs that
> >> are 30% free seems to be fairly consistent.
> >>
> >> Looking at a snippet of pool2:
> >>
> >> 101 238 d15_15_011 23437934592 1415  2008394752  (9%) 181699328 (1%)
> >> 102 235 d15_16_012 23437934592 1415  2009153536  (9%) 182165312 (1%)
> >> 116 248 d11_42_026 23437934592 1011  4146111488 (18%) 134941504 (1%)
> >> 117 249 d13_42_026 23437934592 1213  4147710976 (18%) 135203776 (1%)
> >>
> >> there are slightly more blocks allocated in general on the NSDs with
> >> twice the amount of free space, but it doesn't seem to be a
> >> significant amount relative to the delta in free space. The pattern
> >> from pool1 certainly doesn't hold true here.
> >>
> >> Pool4 isn't very interesting because all of the NSDs are well balanced
> >> in terms of free space (all 16% free).
> >>
> >> Pool3, however, *is* particularly interesting. Here's a snippet:
> >>
> >> 106 222 d15_24_016 23437934592 1415  2957561856 (13%)  37436768 (0%)
> >> 107 222 d15_25_017 23437934592 1415  2957537280 (13%)  37353984 (0%)
> >> 108 222 d15_26_018 23437934592 1415  2957539328 (13%)  37335872 (0%)
> >> 125 222 d11_44_028 23437934592 1011 13297235968 (57%)  20505568 (0%)
> >> 126 222 d12_44_028 23437934592 1213 13296712704 (57%)  20632768 (0%)
> >> 127 222 d12_45_029 23437934592 1213 13297404928 (57%)  20557408 (0%)
> >>
> >> GPFS consistently allocated the same number of blocks to the disks
> >> with ~4x the free space as it did to the other disks in the pool.
> >>
> >> Suffice it to say -- I'm *very* confused :)
> >>
> >> -Aaron
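(As an aside, for anyone poking at a dump like the one described above:
a minimal Python sketch that reads lines with those nine fields and
prints each NSD's share of the file's blocks next to its free-space
percentage. The parsing, and the assumption that freeBlocks is in the
same units as diskSize, are based only on the snippets quoted here --
this is not Aaron's actual script.)

import sys

def parse_line(line):
    # Field order in the dump, with the two percentages in parentheses:
    # diskId numBlocks diskName diskSize failureGroup freeBlocks (freePct)
    # freeKbFragments (freeKbFragmentsPct)
    tokens = [t for t in line.split() if not t.startswith("(")]
    return {"name": tokens[2], "fg": tokens[4], "blocks": int(tokens[1]),
            "size": int(tokens[3]), "free": int(tokens[5])}

def main(path):
    with open(path) as fh:
        disks = [parse_line(line) for line in fh if line.strip()]
    total = sum(d["blocks"] for d in disks)
    print("%-12s %6s %8s %9s %7s" % ("NSD", "FG", "blocks", "%ofFile", "%free"))
    for d in sorted(disks, key=lambda d: d["name"]):
        print("%-12s %6s %8d %8.1f%% %6.0f%%" % (
            d["name"], d["fg"], d["blocks"],
            100.0 * d["blocks"] / total,      # share of this file's blocks
            100.0 * d["free"] / d["size"]))   # free space on the NSD

if __name__ == "__main__":
    main(sys.argv[1])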
> >> On 1/13/18 8:18 AM, Daniel Kidger wrote:
> >>> Aaron,
> >>>
> >>> Also, are your new NSDs the same size as your existing ones? I.e.
> >>> might the NSDs that are at a higher percentage full still have more
> >>> free blocks than the other NSDs?
> >>> Daniel
> >>>
> >>> Dr Daniel Kidger
> >>> IBM Technical Sales Specialist
> >>> Software Defined Solution Sales
> >>> +44-(0)7818 522 266
> >>> daniel.kidger@uk.ibm.com
> >>>
> >>>
> >>> ----- Original message -----
> >>> From: Jan-Frode Myklebust <janfrode@tanso.net>
> >>> Sent by: gpfsug-discuss-bounces@spectrumscale.org
> >>> To: gpfsug main discussion list <gpfsug-discuss@spectrumscale.org>
> >>> Subject: Re: [gpfsug-discuss] pool block allocation algorithm
> >>> Date: Sat, Jan 13, 2018 9:25 AM
> >>>
> >>> I don't have documentation or a whitepaper, but as I recall, it will
> >>> first allocate round-robin over failureGroup, then round-robin over
> >>> nsdServers, and then round-robin over volumes. So if these new NSDs
> >>> are defined as a different failureGroup from the old disks, that
> >>> might explain it.
> >>>
> >>> -jf
> >>>
> >>> On Sat, 13 Jan 2018 at 00:15, Aaron Knister
> >>> <aaron.s.knister@nasa.gov> wrote:
> >>>
> >>> Apologies if this has been covered elsewhere (I couldn't find it if
> >>> it has). I'm curious how GPFS decides where to allocate new blocks.
> >>>
> >>> We've got a filesystem that we added some NSDs to a while back, and
> >>> it hurt there for a little while because it appeared as though GPFS
> >>> was choosing to allocate new blocks much more frequently on the ~100%
> >>> free LUNs than on the existing LUNs (I can't recall how free they
> >>> were at the time). Looking at it now, though, it seems GPFS is doing
> >>> the opposite. There's now a ~10% difference between the LUNs added
> >>> and the existing LUNs (20% free vs 30% free), and GPFS is choosing to
> >>> allocate new writes at a ratio of about 3:1 on the disks with *fewer*
> >>> free blocks than on the disks with more free blocks. That's
> >>> completely inconsistent with what we saw when we initially added the
> >>> disks, which makes me wonder how GPFS is choosing to allocate new
> >>> blocks (other than the obvious bits about failure group and
> >>> replication factor).
> >>> Could someone explain (or point me at a whitepaper describing) what
> >>> factors GPFS uses when allocating blocks, particularly as it pertains
> >>> to choosing one NSD over another within the same failure group?
> >>>
> >>> Thanks!
> >>>
> >>> -Aaron
> >>>
> >>> --
> >>> Aaron Knister
> >>> NASA Center for Climate Simulation (Code 606.2)
> >>> Goddard Space Flight Center
> >>> (301) 286-2776
> >>> _______________________________________________
> >>> gpfsug-discuss mailing list
> >>> gpfsug-discuss at spectrumscale.org
> >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss