[gpfsug-discuss] pool block allocation algorithm

Wayne Sawdon wsawdon at us.ibm.com
Sat Jan 13 19:43:22 GMT 2018


Originally, GPFS used a strict round robin: first over failure groups, then
over volumes within each failure group. That had performance issues when one
or more volumes were low on space. Then, for a while, there were a variety of
weighted stripe methods, including by free space and by capacity, and the
file system had an option allowing the user to change the stripe method. That
option was removed when we switched to a "best effort" round robin, which
does a round robin over the failure groups, then over the volumes, based on
the allocation regions that a node owns. When the stripe width at a node
drops below half of the failure groups or half of the volumes, that node will
acquire new allocation regions.

Basically, we vary the stripe width to avoid searching for free space on
specific volumes. Allocation will eventually even itself out, or you can
restripe the file system to even it out immediately.
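
To make that concrete, here is a toy sketch in Python of what "best effort"
round robin looks like from one node's point of view. It is only an
illustration of the scheme described above, not GPFS source; the class, the
names and the region-refill step are invented:

from itertools import cycle

class NodeAllocator:
    """Toy model of one node striping over the allocation regions it owns."""

    def __init__(self, regions, total_failure_groups, total_volumes):
        # regions: {failure_group: [volumes this node holds regions on]}
        self.regions = regions
        self.total_failure_groups = total_failure_groups
        self.total_volumes = total_volumes
        self._rebuild_cycles()

    def _rebuild_cycles(self):
        self._fg_cycle = cycle(sorted(self.regions))
        self._vol_cycles = {fg: cycle(vols) for fg, vols in self.regions.items()}

    def _stripe_too_narrow(self):
        # "below half of the failure groups or half of the volumes"
        fgs = len(self.regions)
        vols = sum(len(v) for v in self.regions.values())
        return (fgs < self.total_failure_groups / 2.0 or
                vols < self.total_volumes / 2.0)

    def _acquire_more_regions(self):
        # Placeholder: a real node would ask the allocation manager for
        # regions on additional volumes here, widening its stripe.
        self._rebuild_cycles()

    def next_volume(self):
        # Round robin over failure groups, then over the volumes within
        # each group, restricted to the regions this node already owns.
        if self._stripe_too_narrow():
            self._acquire_more_regions()
        fg = next(self._fg_cycle)
        return fg, next(self._vol_cycles[fg])

# Example: a node holding regions on one volume in each of two failure groups
alloc = NodeAllocator({1011: ["d10_41_025"], 1213: ["d13_06_006"]},
                      total_failure_groups=2, total_volumes=4)
print([alloc.next_volume() for _ in range(4)])

The point is that a node stripes only over the regions it currently owns, so
a short-term imbalance across volumes is normal; running mmrestripefs with
the -b (rebalance) option is how you force it even immediately.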

-Wayne

ps. And of course, allocation in FPO is completely different.


gpfsug-discuss-bounces at spectrumscale.org wrote on 01/13/2018 09:26:51 AM:

> From: Aaron Knister <aaron.s.knister at nasa.gov>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date: 01/13/2018 09:27 AM
> Subject: Re: [gpfsug-discuss] pool block allocation algorithm
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> Thanks, Peter. That definitely makes sense and I was actually wondering
> if performance was a factor. Do you know where to look to see what GPFS'
> perception of "performance" is for a given NSD?
>
> -Aaron
>
> On 1/13/18 12:00 PM, Peter Serocka wrote:
> > Within reasonable capacity limits, one would also expect it
> > to direct incoming data to disks that are best “available”
> > from a current performance perspective — like doing the least
> > IOPS, having the lowest latency and the shortest queue.
> >
> > Your new NSDs, filled only with recent data, might quickly have
> > become the most busy units before reaching capacity balance,
> > simply because recent data tends to be more active than older stuff.
> >
> > Makes sense?
> >
> > — Peter
> >
> >> On 2018 Jan 13 Sat, at 17:18, Aaron Knister <aaron.s.knister at nasa.gov> wrote:
> >>
> >> Thanks Everyone! I whipped up a script to dump the block layout of a
> >> file and then join that with mmdf information. As part of my exploration
> >> I wrote one 2GB file to each of this particular filesystem's 4 data
> >> pools last night (using "touch $file; mmchattr $file -P $pool; dd
> >> of=$file") and have attached a dump of the layout/nsd information for
> >> each file/pool. The fields for the output are:
> >>
> >> diskId, numBlocksOnDisk, diskName, diskSize, failureGroup, freeBlocks,
> >> freePct, freeKbFragments, freeKbFragmentsPct
> >>
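> >> (If anyone wants to redo the comparison, here is a stripped-down sketch of
> >> the post-processing step in Python. It is not the actual script; the numbers
> >> are just the two pool1 rows below, and it simply compares each NSD's share
> >> of the file's blocks against its share of the pool's free space.)
> >>
> >> # diskId -> numBlocksOnDisk for the 2GB test file
> >> blocks = {36: 264, 59: 74}
> >> # diskId -> (diskName, freeBlocks in KB, from mmdf)
> >> free_kb = {36: ("d13_06_006", 4548935680),
> >>            59: ("d10_41_025", 6993759232)}
> >>
> >> total_blocks = sum(blocks.values())
> >> total_free = sum(free for _, free in free_kb.values())
> >>
> >> for disk_id in sorted(blocks):
> >>     name, free = free_kb[disk_id]
> >>     print("%-12s %5.1f%% of file blocks  %5.1f%% of pool free space" % (
> >>         name,
> >>         100.0 * blocks[disk_id] / total_blocks,
> >>         100.0 * free / total_free))
> >>
> >> If allocation were weighted purely by free space, those two percentages
> >> would roughly track each other; as you'll see below, they don't.
> >>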
> >>
> >> Here's the highlight from pool1:
> >>
> >> 36   264  d13_06_006   23437934592  1213   4548935680 (19%)   83304320 (0%)
> >> 59    74  d10_41_025   23437934592  1011   6993759232 (30%)   58642816 (0%)
> >>
> >> For this file (and anecdotally what I've seen looking at NSD I/O data
> >> for other files written to this pool) the pattern of more blocks being
> >> allocated to the NSDs that are ~20% free vs the NSDs that are 30% free
> >> seems to be fairly consistent.
> >>
> >>
> >> Looking at a snippet of pool2:
> >> 101  238  d15_15_011   23437934592  1415   2008394752  (9%)  181699328 (1%)
> >> 102  235  d15_16_012   23437934592  1415   2009153536  (9%)  182165312 (1%)
> >> 116  248  d11_42_026   23437934592  1011   4146111488 (18%)  134941504 (1%)
> >> 117  249  d13_42_026   23437934592  1213   4147710976 (18%)  135203776 (1%)
> >>
> >> There are slightly more blocks allocated in general on the NSDs with
> >> twice the amount of free space, but it doesn't seem to be a significant
> >> amount relative to the delta in free space. The pattern from pool1
> >> certainly doesn't hold true here.
> >>
> >> Pool4 isn't very interesting because all of the NSDs are well balanced
> >> in terms of free space (all 16% free).
> >>
> >> Pool3, however, *is* particularly interesting. Here's a snippet:
> >>
> >> 106  222  d15_24_016   23437934592  1415    2957561856 (13%)  37436768 (0%)
> >> 107  222  d15_25_017   23437934592  1415    2957537280 (13%)  37353984 (0%)
> >> 108  222  d15_26_018   23437934592  1415    2957539328 (13%)  37335872 (0%)
> >> 125  222  d11_44_028   23437934592  1011   13297235968 (57%)  20505568 (0%)
> >> 126  222  d12_44_028   23437934592  1213   13296712704 (57%)  20632768 (0%)
> >> 127  222  d12_45_029   23437934592  1213   13297404928 (57%)  20557408 (0%)
> >>
> >> GPFS consistently allocated the same number of blocks to disks with ~4x
> >> the free space as it did to the other disks in the pool.
> >>
> >> Suffice it to say-- I'm *very* confused :)
> >>
> >> -Aaron
> >>
> >> On 1/13/18 8:18 AM, Daniel Kidger wrote:
> >>> Aaron,
> >>>
> >>> Also, are your new NSDs the same size as your existing ones?
> >>> i.e. the NSDs that are at a higher %age full might have more free blocks
> >>> than the other NSDs?
> >>> Daniel
> >>>
> >>>
> >>> IBM Storage Professional Badge
> >>> <https://www.youracclaim.com/user/danel-kidger>
> >>>
> >>>
> >>> *Dr Daniel Kidger*
> >>> IBM Technical Sales Specialist
> >>> Software Defined Solution Sales
> >>>
> >>> +44-(0)7818 522 266
> >>> daniel.kidger at uk.ibm.com
> >>>
> >>>
> >>>
> >>>
> >>>
> >>>    ----- Original message -----
> >>>    From: Jan-Frode Myklebust <janfrode at tanso.net>
> >>>    Sent by: gpfsug-discuss-bounces at spectrumscale.org
> >>>    To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> >>>    Cc:
> >>>    Subject: Re: [gpfsug-discuss] pool block allocation algorithm
> >>>    Date: Sat, Jan 13, 2018 9:25 AM
> >>>
> >>>    Don’t have documentation/whitepaper, but as I recall, it will first
> >>>    allocate round-robin over failureGroup, then round-robin over
> >>>    nsdServers, and then round-robin over volumes. So if these new NSDs
> >>>    are defined as a different failureGroup from the old disks, that might
> >>>    explain it..
> >>>
> >>>
> >>>    -jf
> >>>    On Sat 13 Jan 2018 at 00:15, Aaron Knister
> >>>    <aaron.s.knister at nasa.gov <mailto:aaron.s.knister at nasa.gov>> wrote:
> >>>
> >>>        Apologies if this has been covered elsewhere (I couldn't find it
> >>>        if it has). I'm curious how GPFS decides where to allocate new
> >>>        blocks.
> >>>
> >>>        We've got a filesystem that we added some NSDs to a while back
> >>>        and it hurt there for a little while because it appeared as
> >>>        though GPFS was choosing to allocate new blocks much more
> >>>        frequently on the ~100% free LUNs than the existing LUNs (I
> >>>        can't recall how free they were at the time). Looking at it now,
> >>>        though, it seems GPFS is doing the opposite. There's now a ~10%
> >>>        difference between the LUNs added and the existing LUNs (20%
> >>>        free vs 30% free) and GPFS is choosing to allocate new writes at
> >>>        a ratio of about 3:1 on the disks with *fewer* free blocks than
> >>>        on the disks with more free blocks. That's completely
> >>>        inconsistent with what we saw when we initially added the disks,
> >>>        which makes me wonder how GPFS is choosing to allocate new
> >>>        blocks (other than the obvious bits about failure group and
> >>>        replication factor). Could someone explain (or point me at a
> >>>        whitepaper) what factors GPFS uses when allocating blocks,
> >>>        particularly as it pertains to choosing one NSD over another
> >>>        within the same failure group.
> >>>
> >>>        Thanks!
> >>>
> >>>        -Aaron
> >>>
> >>>        --
> >>>        Aaron Knister
> >>>        NASA Center for Climate Simulation (Code 606.2)
> >>>        Goddard Space Flight Center
> >>>        (301) 286-2776
> >>>        _______________________________________________
> >>>        gpfsug-discuss mailing list
> >>>        gpfsug-discuss at spectrumscale.org
> >>>        <http://spectrumscale.org>
> >>>        http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >>>
> >>>    _______________________________________________
> >>>    gpfsug-discuss mailing list
> >>>    gpfsug-discuss at spectrumscale.org
> >>>    http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >>>
> >>>
> >>> Unless stated otherwise above:
> >>> IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> >>> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> gpfsug-discuss mailing list
> >>> gpfsug-discuss at spectrumscale.org
> >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >>>
> >>
> >> --
> >> Aaron Knister
> >> NASA Center for Climate Simulation (Code 606.2)
> >> Goddard Space Flight Center
> >> (301) 286-2776
> >> _______________________________________________
> >> gpfsug-discuss mailing list
> >> gpfsug-discuss at spectrumscale.org
> >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> >
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20180113/69e68a3e/attachment-0002.htm>


More information about the gpfsug-discuss mailing list