[gpfsug-discuss] Blocksize

Sven Oehme oehmes at gmail.com
Sun Sep 25 18:11:12 BST 2016


Well, it's not that easy and there is no perfect answer here. So let's start
with some data points that might help you decide:

Inodes, directory blocks, and allocation maps (for data as well as metadata)
don't follow the same restrictions as data 'fragments' or subblocks, meaning
they are not bound to 1/32 of the blocksize. Instead they are organized in
blocks of calculated size, which for a single object can be very small
(significantly smaller than 1/32nd) or close to the full blocksize.
Therefore the space-waste concern doesn't really apply here.
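
To make the subblock arithmetic concrete (rough numbers, using the 1/32 rule
that applies to data allocations):

   blocksize 128K -> data subblock   4K
   blocksize   1M -> data subblock  32K
   blocksize   4M -> data subblock 128K

Metadata objects are not allocated in those fixed subblock units, which is why
the small-allocation waste you would expect from the table above does not
carry over to the metadata blocksize choice.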

Policy scans love larger blocks: the metadata blocks are randomly scattered
across the NSDs, so an inode scan that can read larger contiguous blocks will
perform significantly faster with a larger metadata blocksize than with a
smaller one (assuming spinning disk; with SSDs this doesn't matter that much).

So for disk-based systems it is advantageous to use larger blocks; for
SSD-based systems it is less of an issue. On the other hand, you shouldn't
choose too large a block even for disk-based systems, because there is one
catch to all this: small updates to metadata typically end up writing the
whole metadata block, e.g. 256K for a directory block, which then has to be
destaged and read back by another node changing the same block.
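
If you want to sanity-check what an existing filesystem uses today, mmlsfs
shows the relevant values (the device name below is just a placeholder):

   mmlsfs gpfs0 -B     # block size
   mmlsfs gpfs0 -f     # minimum fragment (subblock) size
   mmlsfs gpfs0 -i     # inode size in bytes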

hope this helps. Sven





On Sat, Sep 24, 2016 at 7:18 AM Buterbaugh, Kevin L <
Kevin.Buterbaugh at vanderbilt.edu> wrote:

> Hi Sven,
>
> I am confused by your statement that the metadata block size should be 1
> MB and am very interested in learning the rationale behind this as I am
> currently looking at all aspects of our current GPFS configuration and the
> possibility of making major changes.
>
> If you have a filesystem with only metadataOnly disks in the system pool
> and the default size of an inode is 4K (which we would do, since we have
> recently discovered that even on our scratch filesystem we have a bazillion
> files that are 4K or smaller and could therefore have their data stored in
> the inode, right?), then why would you set the metadata block size to
> anything larger than 128K when a sub-block is 1/32nd of a block?  I.e.,
> with a 1 MB block size for metadata wouldn’t you be wasting  a massive
> amount of space?
>
> What am I missing / confused about there?
>
> Oh, and here’s a related question … let’s just say I have the above
> configuration … my system pool is metadata only and is on SSD’s.  Then I
> have two other dataOnly pools that are spinning disk.  One is for “regular”
> access and the other is the “capacity” pool … i.e. a pool of slower storage
> where we move files with large access times.  I have a policy that says
> something like “move all files with an access time > 6 months to the
> capacity pool.”  Of those bazillion files less than 4K in size that are
> fitting in the inode currently, probably half a bazillion (<grin>) of them
> would be subject to that rule.  Will they get moved to the spinning disk
> capacity pool or will they stay in the inode??
>
> Thanks!  This is a very timely and interesting discussion for me as well...
>
> Kevin
>
> On Sep 23, 2016, at 4:35 PM, Sven Oehme <oehmes at us.ibm.com> wrote:
>
> Your metadata block size these days should be 1 MB, and there are only very
> few workloads for which you should run with a filesystem blocksize below 1
> MB. So if you don't know exactly what to pick, 1 MB is a good starting
> point.
> The general rule still applies: the filesystem blocksize of a pool (metadata
> or data) should match the RAID controller (or GNR vdisk) stripe size of that
> particular pool.
>
> So if you use a 128K strip size (the default in many midrange storage
> controllers) in an 8+2P RAID array, your stripe or track size is 1 MB and
> therefore the blocksize of this pool should be 1 MB. I see many customers
> in the field using a 1 MB or even smaller blocksize on RAID stripes of 2 MB
> or more, and performance is significantly impacted by that.
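>
> To make that arithmetic explicit (rough numbers, assuming the usual one
> strip per data disk):
>
>    8 data disks x 128K strip = 1 MB full stripe  -> blocksize (-B) 1M
>    8 data disks x 256K strip = 2 MB full stripe  -> blocksize (-B) 2M
>
> In other words, match the pool's blocksize to the full-stripe write size of
> the array behind it.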
>
> Sven
>
> ------------------------------------------
> Sven Oehme
> Scalable Storage Research
> email: oehmes at us.ibm.com
> Phone: +1 (408) 824-8904
> IBM Almaden Research Lab
> ------------------------------------------
>
>
>
>
> From: Stephen Ulmer <ulmer at ulmer.org>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date: 09/23/2016 12:16 PM
> Subject: Re: [gpfsug-discuss] Blocksize
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> ------------------------------
>
>
>
> Not to be too pedantic, but I believe the subblock size is 1/32 of the
> block size (which strengthens Luis’s arguments below).
>
> I thought the original question was NOT about inode size, but about
> metadata block size. You can specify that the system pool have a different
> block size from the rest of the filesystem, provided that it ONLY holds
> metadata (the --metadata-block-size option to mmcrfs).
>
> So with 4K inodes (which should be used for all new filesystems without
> some counter-indication), I would think that we’d want to use a metadata
> block size of 4K*32=128K. This is independent of the regular block size,
> which you can calculate based on the workload if you’re lucky.
>
> There could be a great reason NOT to use 128K metadata block size, but I
> don’t know what it is. I’d be happy to be corrected about this if it’s out
> of whack.
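>
> For reference, the knob in question is used roughly like this (illustrative
> device, stanza file, and sizes only, not a recommendation):
>
>    mmcrfs gpfs0 -F nsd.stanza -B 1M --metadata-block-size 128K -i 4096
>
> where -B sets the data blocksize, --metadata-block-size sets the blocksize
> of the metadata-only system pool, and -i sets the inode size in bytes.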
>
> --
> Stephen
>
>
>
>    On Sep 22, 2016, at 3:37 PM, Luis Bolinches <luis.bolinches at fi.ibm.com> wrote:
>
>       Hi
>
>       My 2 cents.
>
>       Use at least 4K inodes; then you get a massive improvement for small
>       files (those under roughly 3.5K, minus whatever you use for xattrs).
>
>       About the blocksize for data: unless you have actual data suggesting
>       that you will actually benefit from a block smaller than 1 MB, leave it
>       there. GPFS uses subblocks, where 1/16th of the BS can be allocated to
>       different files, so the "waste" with 1 MB is much less than you might
>       think, and you get the throughput and fewer structures than you would
>       with many more, smaller data blocks.
>
>       No *warranty at all*, but this is what I try to do when the blocksize
>       talk comes up (might need some cleanup, it may not be my latest note,
>       but you get the idea):
>
>       POSIX
>       find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
>       GPFS
>       cd /usr/lpp/mmfs/samples/ilm
>       gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
>       ./mmfind /gpfs/shared -ls -type f > find_ls_files.out
>       CONVERT to CSV
>
>       POSIX
>       cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv
>       GPFS
>       cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv
>       LOAD in octave
>
>       FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
>       Clean the second column (OPTIONAL as the next clean up will do the
>       same)
>
>       FILESIZE(:,[2]) = [];
>       If we are on 4K inodes we need to remove the files that go into the
>       inode (WELL, not exactly ... extended attributes! so maybe use a lower
>       number!)
>
>       FILESIZE(FILESIZE<=3584) =[];
>       If we are not, we need to remove the zero-size files
>
>       FILESIZE(FILESIZE==0) =[];
>       Median
>
>       FILESIZEMEDIAN = int32 (median (FILESIZE))
>       Mean
>
>       FILESIZEMEAN = int32 (mean (FILESIZE))
>       Variance
>
>       int32 (var (FILESIZE))
>       iqr interquartile range, i.e., the difference between the upper and
>       lower quartile, of the input data.
>
>       int32 (iqr (FILESIZE))
>       Standard deviation
>
>       int32 (std (FILESIZE))
>
>       For a filesystem with lots of files you might need a rather powerful
>       machine to run the calculations in octave; I never hit anything I could
>       not manage on a 64GB RAM Power box. Most of the time my laptop is
>       enough.
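>
>       For what it's worth, the octave steps above can also be collected into
>       a single script, something along these lines (same assumptions as
>       above: the CSV comes from one of the awk lines, and 3584 bytes is the
>       in-inode cutoff for 4K inodes):
>
>       % filesize_stats.m -- rough sketch; run with: octave filesize_stats.m
>       FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
>       FILESIZE(:,[2]) = [];             % drop the empty column from the trailing comma
>       FILESIZE(FILESIZE<=3584) = [];    % drop files that fit in the inode
>       FILESIZEMEDIAN = int32 (median (FILESIZE))
>       FILESIZEMEAN   = int32 (mean (FILESIZE))
>       FILESIZEVAR    = int32 (var (FILESIZE))
>       FILESIZEIQR    = int32 (iqr (FILESIZE))
>       FILESIZESTD    = int32 (std (FILESIZE))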
>
>
>
>       --
>       Ystävällisin terveisin / Kind regards / Saludos cordiales /
>       Salutations
>
>       Luis Bolinches
>       Lab Services
>       http://www-03.ibm.com/systems/services/labservices/
>
>       IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
>       Phone: +358 503112585
>
>       "If you continually give you will continually have." Anonymous
>
>
>       ----- Original message -----
>       From: Stef Coene <stef.coene at docum.org>
>       Sent by: gpfsug-discuss-bounces at spectrumscale.org
>       To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>       Cc:
>       Subject: Re: [gpfsug-discuss] Blocksize
>       Date: Thu, Sep 22, 2016 10:30 PM
>
>       On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
>       > It defaults to 4k:
>       > mmlsfs testbs8M -i
>       > flag                value                    description
>       > ------------------- ------------------------ -----------------------------------
>       >  -i                 4096                     Inode size in bytes
>       >
>       > I think you can make it as small as 512 bytes. GPFS will store very
>       > small files in the inode.
>       >
>       > Typically you want your average file size to be your blocksize, and
>       > your filesystem has one blocksize and one inode size.
>
>       The files are not small, but around 20 MB on average.
>       So together with IBM I calculated that a 1 MB or 2 MB block size is best.
>
>       But I'm not sure if it's better to use a smaller block size for the
>       metadata.
>
>       The file system is not that large (400 TB) and will hold backup data
>       from CommVault.
>
>
>       Stef
>       _______________________________________________
>       gpfsug-discuss mailing list
>       gpfsug-discuss at spectrumscale.org
>       http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>       Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
>       Oy IBM Finland Ab
>       PL 265, 00101 Helsinki, Finland
>       Business ID, Y-tunnus: 0195876-3
>       Registered in Finland
>
>       _______________________________________________
>       gpfsug-discuss mailing list
>       gpfsug-discuss at spectrumscale.org
>       http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://gpfsug.org/pipermail/gpfsug-discuss_gpfsug.org/attachments/20160925/f8435f3d/attachment-0002.htm>


More information about the gpfsug-discuss mailing list