[gpfsug-discuss] pool block allocation algorithm

Aaron Knister aaron.s.knister at nasa.gov
Sun Jan 14 22:22:15 GMT 2018


Thanks, Wayne. What you said makes sense, although I'm not sure I
completely grok it.

Can you comment on whether or not historic LUN performance factors into
allocation decisions?

-Aaron

On 1/13/18 2:43 PM, Wayne Sawdon wrote:
> Originally, GPFS used a strict round robin, first over failure groups, then over
> volumes within each failure group. That had performance issues when one or more
> volumes were low on space. Then, for a while, there were a variety of weighted
> stripe methods, including by free space and by capacity. The file system had an
> option allowing the user to change the stripe method. That option was removed
> when we switched to a "best effort" round robin, which does a round robin over
> the failure groups, then over volumes, based on the allocation regions that a
> node owns. When the stripe width at a node drops below half of the failure
> groups or half of the volumes, that node will acquire new allocation regions.
> 
> Basically we vary the stripe width to avoid searching for free space on
> specific volumes. It will
> eventually even itself out or you could restripe the file system to even
> it out immediately.
> 
> -Wayne
> 
> ps. And of course, allocation in FPO is completely different.
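
To make Wayne's "best effort" round robin concrete, here is a minimal, illustrative
Python sketch of the policy as described: stripe over the failure groups and volumes
for which a node currently holds allocation regions, and fetch more regions only when
the effective stripe width drops below half of the pool's failure groups or half of
its volumes. The class and method names are assumptions for illustration, not GPFS
internals.

    from collections import defaultdict

    class BestEffortAllocator:
        """Toy model of the best-effort round robin described above."""

        def __init__(self, failure_groups, volumes_by_fg):
            self.failure_groups = failure_groups      # all FGs in the pool
            self.volumes_by_fg = volumes_by_fg        # {fg: [volume, ...]}
            self.owned = defaultdict(list)            # FG -> volumes we hold regions on
            self.fg_cursor = 0
            self.vol_cursor = defaultdict(int)

        def stripe_width_ok(self):
            owned_fgs = sum(1 for vols in self.owned.values() if vols)
            owned_vols = sum(len(vols) for vols in self.owned.values())
            total_vols = sum(len(vols) for vols in self.volumes_by_fg.values())
            # "below half of the failure groups or half of the volumes"
            return (owned_fgs * 2 >= len(self.failure_groups)
                    and owned_vols * 2 >= total_vols)

        def acquire_regions(self):
            # Stand-in for asking the allocation-region manager for more
            # regions; here we simply take one region on every volume.
            for fg in self.failure_groups:
                self.owned[fg] = list(self.volumes_by_fg[fg])

        def next_volume(self):
            """Round robin over owned failure groups, then owned volumes."""
            if not self.stripe_width_ok():
                self.acquire_regions()
            fgs = [fg for fg in self.failure_groups if self.owned[fg]]
            fg = fgs[self.fg_cursor % len(fgs)]
            self.fg_cursor += 1
            vols = self.owned[fg]
            vol = vols[self.vol_cursor[fg] % len(vols)]
            self.vol_cursor[fg] += 1
            return vol

    # Example with two failure groups of two volumes each:
    alloc = BestEffortAllocator([1011, 1213],
                                {1011: ["d10_41_025", "d11_42_026"],
                                 1213: ["d13_06_006", "d13_42_026"]})
    print([alloc.next_volume() for _ in range(4)])

Note that nothing in this sketch looks at free space on individual volumes; that is
the point of the approach, and it is why the distribution only evens out over time
(or after a restripe).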
> 
> 
> gpfsug-discuss-bounces at spectrumscale.org wrote on 01/13/2018 09:26:51 AM:
> 
>> From: Aaron Knister <aaron.s.knister at nasa.gov>
>> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> Date: 01/13/2018 09:27 AM
>> Subject: Re: [gpfsug-discuss] pool block allocation algorithm
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>
>> Thanks, Peter. That definitely makes sense and I was actually wondering
>> if performance was a factor. Do you know where to look to see what GPFS'
>> perception of "performance" is for a given NSD?
>>
>> -Aaron
>>
>> On 1/13/18 12:00 PM, Peter Serocka wrote:
>> > Within reasonable capacity limits, one would also expect it
>> > to direct incoming data to the disks that are currently best “available”
>> > from a performance perspective, i.e. doing the fewest IOPS, having the
>> > lowest latency and the shortest queue length.
>> >
>> > Your new NSDs, filled only with recent data, might quickly have
>> > become the busiest units before reaching capacity balance,
>> > simply because recent data tends to be more active than older stuff.
>> >
>> > Makes sense?
>> >
>> > — Peter
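
As an aside, here is a toy Python sketch of the kind of "least busy disk" heuristic
Peter describes (fewest IOPS, lowest latency, shortest queue). It is purely
illustrative: GPFS does not document its allocation internals in these terms, and
the field names and equal weighting below are assumptions.

    from dataclasses import dataclass

    @dataclass
    class NsdStats:
        name: str
        iops: float        # recent I/O operations per second
        latency_ms: float  # recent average service time
        queue_len: float   # average outstanding requests

    def pick_least_busy(nsds):
        """Return the NSD that currently looks least busy, scoring each
        metric relative to the busiest disk so the three are comparable."""
        def norm(value, values):
            top = max(values) or 1.0
            return value / top
        iops = [n.iops for n in nsds]
        lat = [n.latency_ms for n in nsds]
        qln = [n.queue_len for n in nsds]
        return min(nsds, key=lambda n: (norm(n.iops, iops)
                                        + norm(n.latency_ms, lat)
                                        + norm(n.queue_len, qln)))

    # The empty, newly added NSD wins until its own load catches up.
    print(pick_least_busy([
        NsdStats("old_nsd", iops=1200, latency_ms=8.0, queue_len=12),
        NsdStats("new_nsd", iops=300,  latency_ms=2.5, queue_len=2),
    ]).name)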
>> >
>> >> On 2018 Jan 13 Sat, at 17:18, Aaron Knister
>> <aaron.s.knister at nasa.gov> wrote:
>> >>
>> >> Thanks Everyone! I whipped up a script to dump the block layout of a
>> >> file and then join that with mmdf information. As part of my exploration
>> >> I wrote one 2GB file to each of this particular filesystem's 4 data
>> >> pools last night (using "touch $file; mmchattr $file -P $pool; dd
>> >> of=$file") and have attached a dump of the layout/nsd information for
>> >> each file/pool. The fields for the output are:
>> >>
>> >> diskId, numBlocksOnDisk, diskName, diskSize, failureGroup, freeBlocks,
>> >> freePct, freeKbFragments, freeKbFragmentsPct
>> >>
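
For anyone wanting to reproduce this kind of join, here is a rough, hypothetical
sketch. The layout file format (one diskId/diskName pair per data block) is an
assumption standing in for whatever the dump script actually emits, and the mmdf
parsing below is deliberately loose, since column layout can differ between
releases; adjust to taste.

    import re
    import sys
    from collections import Counter

    def load_block_counts(layout_path):
        """Count blocks per disk from a 'diskId diskName' per-block listing."""
        counts = Counter()
        with open(layout_path) as fh:
            for line in fh:
                parts = line.split()
                if len(parts) >= 2:
                    counts[parts[1]] += 1          # key on disk name
        return counts

    def load_mmdf(mmdf_path):
        """Very loose parser for `mmdf <fsname> -d` output: keep lines that
        start with a disk name and pull out size, failure group and the
        'free in full blocks' percentage."""
        disks = {}
        with open(mmdf_path) as fh:
            for line in fh:
                m = re.match(r"^(\S+)\s+(\d+)\s+(-?\d+)\s+.*?\((\s*\d+)%\)", line)
                if m:
                    name, size, fg, free_pct = m.groups()
                    disks[name] = (int(size), int(fg), int(free_pct))
        return disks

    if __name__ == "__main__":
        blocks = load_block_counts(sys.argv[1])
        mmdf = load_mmdf(sys.argv[2])
        for disk, nblocks in blocks.most_common():
            size, fg, free_pct = mmdf.get(disk, (0, 0, 0))
            print(f"{disk:>12} blocks={nblocks:<6} fg={fg:<6} free={free_pct}%")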
>> >>
>> >> Here's the highlight from pool1:
>> >>
>> >>  36  264  d13_06_006  23437934592  1213   4548935680  (19%)   83304320  (0%)
>> >>  59   74  d10_41_025  23437934592  1011   6993759232  (30%)   58642816  (0%)
>> >>
>> >> For this file (And anecdotally what I've seen looking at NSD I/O data
>> >> for other files written to this pool) the pattern of more blocks being
>> >> allocated to the NSDs that are ~20% free vs the NSDs that are 30% free
>> >> seems to be fairly consistent.
>> >>
>> >>
>> >> Looking at a snippet of pool2:
>> >> 101  238  d15_15_011  23437934592  1415   2008394752   (9%)  181699328  (1%)
>> >> 102  235  d15_16_012  23437934592  1415   2009153536   (9%)  182165312  (1%)
>> >> 116  248  d11_42_026  23437934592  1011   4146111488  (18%)  134941504  (1%)
>> >> 117  249  d13_42_026  23437934592  1213   4147710976  (18%)  135203776  (1%)
>> >>
>> >> There are slightly more blocks allocated in general on the NSDs with
>> >> twice the amount of free space, but not by a significant amount relative
>> >> to the delta in free space. The pattern from pool1 certainly doesn't
>> >> hold true here.
>> >>
>> >> Pool4 isn't very interesting because all of the NSDs are well balanced
>> >> in terms of free space (all 16% free).
>> >>
>> >> Pool3, however, *is* particularly interesting. Here's a snippet:
>> >>
>> >> 106  222  d15_24_016  23437934592  1415    2957561856  (13%)   37436768  (0%)
>> >> 107  222  d15_25_017  23437934592  1415    2957537280  (13%)   37353984  (0%)
>> >> 108  222  d15_26_018  23437934592  1415    2957539328  (13%)   37335872  (0%)
>> >> 125  222  d11_44_028  23437934592  1011   13297235968  (57%)   20505568  (0%)
>> >> 126  222  d12_44_028  23437934592  1213   13296712704  (57%)   20632768  (0%)
>> >> 127  222  d12_45_029  23437934592  1213   13297404928  (57%)   20557408  (0%)
>> >>
>> >> GPFS consistently allocated the same number of blocks to disks with ~4x
>> >> the free space as it did to the other disks in the pool.
>> >>
>> >> Suffice it to say-- I'm *very* confused :)
>> >>
>> >> -Aaron
>> >>
>> >> On 1/13/18 8:18 AM, Daniel Kidger wrote:
>> >>> Aaron,
>> >>> 
>> >>> Also, are your new NSDs the same size as your existing ones?
>> >>> I.e., the NSDs that are at a higher percentage full might still have
>> >>> more free blocks than the other NSDs?
>> >>> Daniel
>> >>>
>> >>> 
>> >>> IBM Storage Professional Badge <https://www.youracclaim.com/user/danel-kidger>
>> >>> 
>> >>>                
>> >>> *Dr Daniel Kidger*
>> >>> IBM Technical Sales Specialist
>> >>> Software Defined Solution Sales
>> >>>
>> >>> +44-(0)7818 522 266
>> >>> daniel.kidger at uk.ibm.com
>> >>>
>> >>> 
>> >>> 
>> >>> 
>> >>>
>> >>>    ----- Original message -----
>> >>>    From: Jan-Frode Myklebust <janfrode at tanso.net>
>> >>>    Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> >>>    To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> >>>    Cc:
>> >>>    Subject: Re: [gpfsug-discuss] pool block allocation algorithm
>> >>>    Date: Sat, Jan 13, 2018 9:25 AM
>> >>>    
>> >>>    Don’t have documentation/whitepaper, but as I recall, it will first
>> >>>    allocate round-robin over failureGroup, then round-robin over
>> >>>    nsdServers, and then round-robin over volumes. So if these new NSDs
>> >>>    are defined as a different failureGroup from the old disks, that
>> >>>    might explain it..
>> >>>
>> >>>
>> >>>    -jf
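
Purely as an illustration of the ordering Jan-Frode recalls (failure group, then NSD
server, then volume), a small Python sketch follows; the topology structure and names
are invented for the example and are not how GPFS actually represents any of this.

    from itertools import cycle

    def round_robin_order(topology):
        """topology: {failure_group: {nsd_server: [volume, ...]}}
        Yields volumes in FG -> server -> volume round-robin order, forever."""
        fg_iter = cycle(sorted(topology))
        server_iters = {fg: cycle(sorted(servers)) for fg, servers in topology.items()}
        volume_iters = {(fg, srv): cycle(vols)
                        for fg, servers in topology.items()
                        for srv, vols in servers.items()}
        while True:
            fg = next(fg_iter)
            srv = next(server_iters[fg])
            yield next(volume_iters[(fg, srv)])

    # First few picks for a toy two-failure-group layout:
    topo = {1011: {"nsd1": ["d10_41_025"], "nsd2": ["d11_42_026"]},
            1213: {"nsd3": ["d13_06_006"], "nsd4": ["d13_42_026"]}}
    picks = round_robin_order(topo)
    print([next(picks) for _ in range(4)])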
>> >>>    On Sat, 13 Jan 2018 at 00:15, Aaron Knister
>> >>>    <aaron.s.knister at nasa.gov <mailto:aaron.s.knister at nasa.gov>> wrote:
>> >>>
>> >>>        Apologies if this has been covered elsewhere (I couldn't find it
>> >>>        if it has). I'm curious how GPFS decides where to allocate new
>> >>>        blocks.
>> >>>
>> >>>        We've got a filesystem that we added some NSDs to a while back,
>> >>>        and it hurt there for a little while because it appeared as
>> >>>        though GPFS was choosing to allocate new blocks much more
>> >>>        frequently on the ~100% free LUNs than on the existing LUNs (I
>> >>>        can't recall how free they were at the time). Looking at it now,
>> >>>        though, it seems GPFS is doing the opposite. There's now a ~10%
>> >>>        difference between the LUNs added and the existing LUNs (20%
>> >>>        free vs 30% free), and GPFS is choosing to allocate new writes
>> >>>        at a ratio of about 3:1 on the disks with *fewer* free blocks
>> >>>        than on the disks with more free blocks. That's completely
>> >>>        inconsistent with what we saw when we initially added the disks,
>> >>>        which makes me wonder how GPFS is choosing to allocate new
>> >>>        blocks (other than the obvious bits about failure group and
>> >>>        replication factor). Could someone explain (or point me at a
>> >>>        whitepaper) what factors GPFS uses when allocating blocks,
>> >>>        particularly as it pertains to choosing one NSD over another
>> >>>        within the same failure group?
>> >>>
>> >>>        Thanks!
>> >>>
>> >>>        -Aaron
>> >>>
>> >>>        --
>> >>>        Aaron Knister
>> >>>        NASA Center for Climate Simulation (Code 606.2)
>> >>>        Goddard Space Flight Center
>> >>>        (301) 286-2776
>> >>>        _______________________________________________
>> >>>        gpfsug-discuss mailing list
>> >>>        gpfsug-discuss at spectrumscale.org
>> >>>        http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>> >>>
>> >>>    _______________________________________________
>> >>>    gpfsug-discuss mailing list
>> >>>    gpfsug-discuss at spectrumscale.org
>> >>>    http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>> >>>
>> >>> 
>> >>> Unless stated otherwise above:
>> >>> IBM United Kingdom Limited - Registered in England and Wales with number 741598.
>> >>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>> >>>
>> >>>
>> >>>
>> >>> _______________________________________________
>> >>> gpfsug-discuss mailing list
>> >>> gpfsug-discuss at spectrumscale.org
>> >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>> >>>
>> >>
>> >> --
>> >> Aaron Knister
>> >> NASA Center for Climate Simulation (Code 606.2)
>> >> Goddard Space Flight Center
>> >> (301) 286-2776
>> >> _______________________________________________
>> >> gpfsug-discuss mailing list
>> >> gpfsug-discuss at spectrumscale.org
>> >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>> >
>> > _______________________________________________
>> > gpfsug-discuss mailing list
>> > gpfsug-discuss at spectrumscale.org
>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>> --
>> Aaron Knister
>> NASA Center for Climate Simulation (Code 606.2)
>> Goddard Space Flight Center
>> (301) 286-2776
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
> 
> 
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> 

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


