[gpfsug-discuss] pool block allocation algorithm
Aaron Knister
aaron.s.knister at nasa.gov
Sun Jan 14 22:22:15 GMT 2018
Thanks, Wayne. What you said makes sense although I'm not sure I
completely grok it.
Can you comment on whether or not historic LUN performance factors into
allocation decisions?
-Aaron
On 1/13/18 2:43 PM, Wayne Sawdon wrote:
> Originally, GPFS used a strict round robin, first over failure groups,
> then over volumes within each
> failure group. That had performance issues when one or more volumes were
> low on space. Then
> for a while there were a variety of weighted stripe methods including by
> free space and by capacity.
> The file system had an option allowing the user to change the stripe
> method. That option was
> removed when we switched to a "best effort" round robin, which does a
> round robin over the
> failure groups then volumes based on the allocation regions that a node
> owns. When the stripe width
> at a node drops below half of the failure groups or half of the volumes
> that node will acquire new
> allocation regions.
>
> Basically we vary the stripe width to avoid searching for free space on
> specific volumes. It will
> eventually even itself out or you could restripe the file system to even
> it out immediately.
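Wayne's "best effort" round robin can be sketched in Python. This is a toy model for illustration only: the class name, the region representation, and the exact refill rule are assumptions, not GPFS internals.

```python
class BestEffortAllocator:
    """Toy model of a 'best effort' round robin: a node stripes over the
    allocation regions it owns, and acquires new regions when the stripe
    width drops below half the failure groups or half the volumes."""

    def __init__(self, region_pool, total_fgs, total_vols):
        # Each region is a mutable [failure_group, volume, free_blocks].
        self.pool = list(region_pool)   # regions not yet owned by any node
        self.total_fgs = total_fgs
        self.total_vols = total_vols
        self.owned = []                 # regions this node currently owns
        self.cursor = 0

    def _stripe_width_low(self):
        # Stripe width = distinct failure groups / volumes with free space.
        live = [r for r in self.owned if r[2] > 0]
        fgs = {r[0] for r in live}
        vols = {(r[0], r[1]) for r in live}
        return (len(fgs) < self.total_fgs / 2
                or len(vols) < self.total_vols / 2)

    def allocate_block(self):
        # Acquire new allocation regions only when the stripe gets too narrow;
        # this avoids searching for free space on specific volumes.
        while self.pool and self._stripe_width_low():
            self.owned.append(self.pool.pop(0))
        # Round robin over whatever regions we currently own.
        for _ in range(len(self.owned)):
            region = self.owned[self.cursor % len(self.owned)]
            self.cursor += 1
            if region[2] > 0:
                region[2] -= 1
                return region[0], region[1]
        raise RuntimeError("node owns no regions with free blocks")
```

Note how a node can legitimately stripe over only a subset of failure groups or volumes for a while, which would produce exactly the kind of short-term imbalance discussed in this thread.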
>
> -Wayne
>
> ps. And of course, allocation in FPO is completely different.
>
>
> gpfsug-discuss-bounces at spectrumscale.org wrote on 01/13/2018 09:26:51 AM:
>
>> From: Aaron Knister <aaron.s.knister at nasa.gov>
>> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> Date: 01/13/2018 09:27 AM
>> Subject: Re: [gpfsug-discuss] pool block allocation algorithm
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>
>> Thanks, Peter. That definitely makes sense and I was actually wondering
>> if performance was a factor. Do you know where to look to see what GPFS'
>> perception of "performance" is for a given NSD?
>>
>> -Aaron
>>
>> On 1/13/18 12:00 PM, Peter Serocka wrote:
>> > Within reasonable capacity limits, one would also expect it
>> > to direct incoming data to the disks that are best “available”
>> > from a current performance perspective — doing the least
>> > IOPS, having the lowest latency and the shortest queue length.
>> >
>> > Your new NSDs, filled only with recent data, might quickly have
>> > become the busiest units before reaching capacity balance,
>> > simply because recent data tends to be more active than older stuff.
>> >
>> > Makes sense?
>> >
>> > — Peter
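The heuristic Peter describes could be modeled roughly as below. The metric names and weights are invented purely for illustration; nothing in this thread establishes that GPFS uses such a policy, which is exactly what Aaron is asking about.

```python
def pick_least_busy(disks):
    """Score disks by current load and pick the least busy one.
    Field names and weights are illustrative assumptions; this models
    the heuristic Peter describes, not a documented GPFS policy."""
    def score(d):
        # Lower is better: few IOPS in flight, low latency, short queue.
        return d["iops"] + 100.0 * d["latency_ms"] + 10.0 * d["queue_len"]
    return min(disks, key=score)
```

Under such a policy, freshly added (and therefore hot) NSDs would be deprioritized for new writes, matching the 3:1 skew Aaron observes below.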
>> >
>> >> On 2018 Jan 13 Sat, at 17:18, Aaron Knister
>> <aaron.s.knister at nasa.gov> wrote:
>> >>
>> >> Thanks everyone! I whipped up a script to dump the block layout of a
>> >> file and then join that with mmdf information. As part of my exploration
>> >> I wrote one 2GB file to each of this particular filesystem's 4 data
>> >> pools last night (using "touch $file; mmchattr $file -P $pool; dd
>> >> of=$file") and have attached a dump of the layout/NSD information for
>> >> each file/pool. The fields in the output are:
>> >>
>> >> diskId, numBlocksOnDisk, diskName, diskSize, failureGroup, freeBlocks,
>> >> freePct, freeKbFragments, freeKbFragmentsPct
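As a quick sanity check of whether allocation tracks free space, the per-disk block counts can be joined against the mmdf free-space column. A minimal sketch using the two pool1 rows quoted below (units of freeBlocks left abstract, since only the ratio matters):

```python
# Two rows from the pool1 dump, in the field order listed above:
# (diskId, numBlocksOnDisk, diskName, diskSize, failureGroup,
#  freeBlocks, freePct, freeKbFragments, freeKbFragmentsPct)
pool1 = [
    (36, 264, "d13_06_006", 23437934592, 1213, 4548935680, 19, 83304320, 0),
    (59,  74, "d10_41_025", 23437934592, 1011, 6993759232, 30, 58642816, 0),
]

# If allocation simply tracked free space, numBlocksOnDisk / freeBlocks
# would be roughly constant across disks. Compute it per disk:
rates = {row[2]: row[1] / row[5] for row in pool1}
for name, rate in rates.items():
    print(f"{name}: {rate:.2e} blocks allocated per unit of free space")
```

For these two rows the fuller disk received blocks at roughly 5x the rate of the emptier one, which is the inverse of a free-space-weighted stripe.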
>> >>
>> >>
>> >> Here's the highlight from pool1:
>> >>
>> >>  36  264  d13_06_006  23437934592  1213  4548935680 (19%)  83304320 (0%)
>> >>  59   74  d10_41_025  23437934592  1011  6993759232 (30%)  58642816 (0%)
>> >>
>> >> For this file (And anecdotally what I've seen looking at NSD I/O data
>> >> for other files written to this pool) the pattern of more blocks being
>> >> allocated to the NSDs that are ~20% free vs the NSDs that are 30% free
>> >> seems to be fairly consistent.
>> >>
>> >>
>> >> Looking at a snippet of pool2:
>> >> 101  238  d15_15_011  23437934592  1415  2008394752  (9%)  181699328 (1%)
>> >> 102  235  d15_16_012  23437934592  1415  2009153536  (9%)  182165312 (1%)
>> >> 116  248  d11_42_026  23437934592  1011  4146111488 (18%)  134941504 (1%)
>> >> 117  249  d13_42_026  23437934592  1213  4147710976 (18%)  135203776 (1%)
>> >>
>> >> there are slightly more blocks allocated in general on the NSDs with
>> >> twice the free space, but the difference doesn't seem significant
>> >> relative to the delta in free space. The pattern from pool1 certainly
>> >> doesn't hold true here.
>> >>
>> >> Pool4 isn't very interesting because all of the NSDs are well balanced
>> >> in terms of free space (all 16% free).
>> >>
>> >> Pool3, however, *is* particularly interesting. Here's a snippet:
>> >>
>> >> 106  222  d15_24_016  23437934592  1415   2957561856 (13%)  37436768 (0%)
>> >> 107  222  d15_25_017  23437934592  1415   2957537280 (13%)  37353984 (0%)
>> >> 108  222  d15_26_018  23437934592  1415   2957539328 (13%)  37335872 (0%)
>> >> 125  222  d11_44_028  23437934592  1011  13297235968 (57%)  20505568 (0%)
>> >> 126  222  d12_44_028  23437934592  1213  13296712704 (57%)  20632768 (0%)
>> >> 127  222  d12_45_029  23437934592  1213  13297404928 (57%)  20557408 (0%)
>> >>
>> >> GPFS consistently allocated the same number of blocks to the disks with
>> >> ~4x the free space as it did to the other disks in the pool.
>> >>
>> >> Suffice it to say-- I'm *very* confused :)
>> >>
>> >> -Aaron
>> >>
>> >> On 1/13/18 8:18 AM, Daniel Kidger wrote:
>> >>> Aaron,
>> >>>
>> >>> Also, are your new NSDs the same size as your existing ones?
>> >>> i.e. might the NSDs that are at a higher %age full still have more free
>> >>> blocks than the other NSDs?
>> >>> Daniel
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> *Dr Daniel Kidger*
>> >>> IBM Technical Sales Specialist
>> >>> Software Defined Solution Sales
>> >>>
>> >>> +44-(0)7818 522 266
>> >>> daniel.kidger at uk.ibm.com
>> >>>
>> >>>
>> >>>
>> >>>
>> >>>
>> >>> ----- Original message -----
>> >>> From: Jan-Frode Myklebust <janfrode at tanso.net>
>> >>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> >>> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> >>> Cc:
>> >>> Subject: Re: [gpfsug-discuss] pool block allocation algorithm
>> >>> Date: Sat, Jan 13, 2018 9:25 AM
>> >>>
>> >>> Don’t have documentation/whitepaper, but as I recall, it will first
>> >>> allocate round-robin over failureGroups, then round-robin over
>> >>> nsdServers, and then round-robin over volumes. So if these new NSDs
>> >>> are defined with a different failureGroup from the old disks, that
>> >>> might explain it.
>> >>>
>> >>>
>> >>> -jf
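The nested round robin Jan-Frode recalls could be sketched as a generator. The layout shape and names below are illustrative assumptions, not GPFS data structures:

```python
from itertools import cycle

def round_robin_order(layout):
    """Yield (failure_group, volume) pairs in nested round-robin order:
    failure groups first, then NSD servers within a group, then volumes
    on a server. Assumed layout shape (for illustration only):
    {failure_group: {nsd_server: [volume, ...]}}."""
    server_cycles = {
        fg: cycle([cycle(vols) for vols in servers.values()])
        for fg, servers in layout.items()
    }
    fgs = cycle(server_cycles)
    while True:
        fg = next(fgs)                   # next failure group
        vols = next(server_cycles[fg])   # next server's volume cycle
        yield fg, next(vols)             # next volume on that server
```

Under such an ordering, a new failure group containing only a few (empty) NSDs would receive blocks at the same per-group rate as the full groups, concentrating writes on the new LUNs.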
>> >>> On Sat, 13 Jan 2018 at 00:15, Aaron Knister
>> >>> <aaron.s.knister at nasa.gov> wrote:
>> >>>
>> >>> Apologies if this has been covered elsewhere (I couldn't find it if it
>> >>> has). I'm curious how GPFS decides where to allocate new blocks.
>> >>>
>> >>> We've got a filesystem that we added some NSDs to a while back and it
>> >>> hurt there for a little while because it appeared as though GPFS was
>> >>> choosing to allocate new blocks much more frequently on the ~100% free
>> >>> LUNs than on the existing LUNs (I can't recall how free they were at
>> >>> the time). Looking at it now, though, it seems GPFS is doing the
>> >>> opposite. There's now a ~10% difference between the LUNs added and the
>> >>> existing LUNs (20% free vs. 30% free), and GPFS is choosing to allocate
>> >>> new writes at a ratio of about 3:1 on the disks with *fewer* free
>> >>> blocks than on the disks with more free blocks. That's completely
>> >>> inconsistent with what we saw when we initially added the disks, which
>> >>> makes me wonder how GPFS chooses where to allocate new blocks (other
>> >>> than the obvious bits about failure group and replication factor).
>> >>> Could someone explain (or point me at a whitepaper) what factors GPFS
>> >>> uses when allocating blocks, particularly as it pertains to choosing
>> >>> one NSD over another within the same failure group?
>> >>>
>> >>> Thanks!
>> >>>
>> >>> -Aaron
>> >>>
>> >>> --
>> >>> Aaron Knister
>> >>> NASA Center for Climate Simulation (Code 606.2)
>> >>> Goddard Space Flight Center
>> >>> (301) 286-2776
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776