[gpfsug-discuss] Blocksize

Yuri L Volobuev volobuev at us.ibm.com
Mon Sep 26 19:18:15 BST 2016


It's important to understand the differences between different metadata
types, in particular when it comes to space allocation.

System metadata files (the inode file, inode and block allocation maps, ACL
file, fileset metadata file, EA file in older versions) are allocated at
well-defined moments (file system format, new storage pool creation in the
case of the block allocation map, etc.), and they contain multiple records
packed into a single block.  From the block allocator's point of view, the
individual metadata record size is invisible; only whole blocks actually get
allocated, so space usage efficiency generally isn't an issue.

For user metadata (indirect blocks, directory blocks, EA overflow blocks)
the situation is different.  Those get allocated as the need arises,
generally one at a time.  So the size of an individual metadata structure
matters, a lot.  The smallest unit of allocation in GPFS is a subblock
(1/32nd of a block).  If an IB or a directory block is smaller than a
subblock, the unused space in the subblock is wasted.  So if one chooses to
use, say, 16 MiB block size for metadata, the smallest unit of space that
can be allocated is 512 KiB.  If one chooses 1 MiB block size, the smallest
allocation unit is 32 KiB.  IBs are generally 16 KiB or 32 KiB in size (32
KiB with any reasonable data block size); directory blocks used to be
limited to 32 KiB, but in the current code can be as large as 256 KiB.  As
one can observe, using 16 MiB metadata block size would lead to a
considerable amount of wasted space for IBs and large directories (small
directories can live in inodes).  On the other hand, with 1 MiB block size,
there'll be no wasted metadata space.  Does any of this actually make a
practical difference?  That depends on the file system composition, namely
the number of IBs (which is a function of the number of large files) and
larger directories.  Calculating this scientifically can be pretty
involved, and really should be the job of a tool that ought to exist but
doesn't (yet).  A more practical approach is a ballpark estimate based on
local file counts and on the typical fractions of large files and large
directories reported in published papers.
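
As a rough back-of-the-envelope version of that ballpark estimate, here is a
small Octave sketch.  Only the arithmetic comes from the numbers above
(subblock = blocksize/32, IB and directory blocks of roughly 32 KiB); the
object counts are made-up placeholders to be replaced with figures from your
own file system.

blocksize = 16 * 2^20;          % candidate metadata block size: 16 MiB
subblock  = blocksize / 32;     % smallest allocatable unit: 512 KiB
ibsize    = 32 * 2^10;          % typical indirect block: 32 KiB
dirsize   = 32 * 2^10;          % assumed average large-directory block
numibs    = 5e6;                % ASSUMED number of indirect blocks
numdirs   = 1e5;                % ASSUMED number of out-of-inode directories
wasted    = numibs * (subblock - ibsize) + numdirs * (subblock - dirsize);
wasted / 2^30                   % ~2300 GiB wasted with these made-up counts
% with a 1 MiB metadata block size the subblock is 32 KiB and the waste is ~0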

The performance implications of a given metadata block size choice are a
subject of nearly infinite depth, and that question can ultimately only be
answered by doing experiments with a specific workload on specific
hardware.  The metadata space utilization efficiency, however, is something
that can be answered conclusively.

yuri



From:	"Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date:	09/24/2016 07:19 AM
Subject:	Re: [gpfsug-discuss] Blocksize
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Hi Sven,

I am confused by your statement that the metadata block size should be 1 MB
and am very interested in learning the rationale behind this as I am
currently looking at all aspects of our current GPFS configuration and the
possibility of making major changes.

If you have a filesystem with only metadataOnly disks in the system pool
and the default size of an inode is 4K (which we would do, since we have
recently discovered that even on our scratch filesystem we have a bazillion
files that are 4K or smaller and could therefore have their data stored in
the inode, right?), then why would you set the metadata block size to
anything larger than 128K when a sub-block is 1/32nd of a block?  I.e.,
with a 1 MB block size for metadata wouldn’t you be wasting a massive
amount of space?

What am I missing / confused about there?

Oh, and here’s a related question … let’s just say I have the above
configuration … my system pool is metadata only and is on SSDs.  Then I
have two other dataOnly pools on spinning disk.  One is for “regular”
access and the other is the “capacity” pool … i.e. a pool of slower storage
where we move files that haven’t been accessed in a long time.  I have a
policy that says something like “move all files with an access time > 6
months to the capacity pool.”  Of those bazillion files less than 4K in
size that currently fit in the inode, probably half a bazillion (<grin>)
would be subject to that rule.  Will they get moved to the spinning-disk
capacity pool, or will they stay in the inode?
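
For concreteness, the rule I have in mind is something like the following
(paraphrased from memory, so treat it as a sketch of the GPFS policy
language rather than our exact policy file):

RULE 'old_to_capacity'
  MIGRATE FROM POOL 'regular'
    TO POOL 'capacity'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180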

Thanks!  This is a very timely and interesting discussion for me as well...

Kevin

      On Sep 23, 2016, at 4:35 PM, Sven Oehme <oehmes at us.ibm.com> wrote:



      your metadata block size these days should be 1 MB, and there are only
      very few workloads for which you should run with a filesystem
      blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB
      is a good starting point.
      the general rule still applies that your filesystem blocksize
      (metadata or data pool) should match your raid controller (or GNR
      vdisk) stripe size for the particular pool.

      so if you use a 128k strip size (default in many midrange storage
      controllers) in an 8+2p raid array, your stripe or track size is 1 MB
      and therefore the blocksize of this pool should be 1 MB. i see many
      customers in the field using a 1 MB or even smaller blocksize on RAID
      stripes of 2 MB or above, and their performance is significantly
      impacted by that.
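
      spelled out, the arithmetic in that example is simply:

      strip size per data disk:  128 KiB
      data disks in 8+2p:        8
      full stripe (track) size:  8 x 128 KiB = 1 MiB  -> pool blocksize 1 MiB

      (only the 128 KiB strip size and the 8+2p layout from above go into
      this; substitute your own controller's numbers.)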

      Sven

      ------------------------------------------
      Sven Oehme
      Scalable Storage Research
      email: oehmes at us.ibm.com
      Phone: +1 (408) 824-8904
      IBM Almaden Research Lab
      ------------------------------------------


      From: Stephen Ulmer <ulmer at ulmer.org>
      To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
      Date: 09/23/2016 12:16 PM
      Subject: Re: [gpfsug-discuss] Blocksize
      Sent by: gpfsug-discuss-bounces at spectrumscale.org





      Not to be too pedantic, but I believe the subblock size is 1/32
      of the block size (which strengthens Luis’s arguments below).

      I thought the original question was NOT about inode size, but
      about metadata block size. You can specify that the system pool have
      a different block size from the rest of the filesystem, provided
      that it ONLY holds metadata (the --metadata-block-size option to
      mmcrfs).

      So with 4K inodes (which should be used for all new filesystems
      without some counter-indication), I would think that we’d want to use
      a metadata block size of 4K*32=128K. This is independent of the
      regular block size, which you can calculate based on the workload if
      you’re lucky.
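
      Something along these lines at file system creation time, I believe
      (a sketch only -- the device name and stanza file are placeholders,
      and the flags should be checked against the mmcrfs man page for your
      Scale level):

      mmcrfs gpfs1 -F nsd.stanza -B 1M -i 4096 --metadata-block-size 128K

      where nsd.stanza marks the system-pool NSDs as metadataOnly and the
      data NSDs as dataOnly.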

      There could be a great reason NOT to use 128K metadata block size,
      but I don’t know what it is. I’d be happy to be corrected about this
      if it’s out of whack.

      --
      Stephen


                  On Sep 22, 2016, at 3:37 PM, Luis Bolinches <
                  luis.bolinches at fi.ibm.com> wrote:

                  Hi

                  My 2 cents.

                   Use at least 4K inodes; then you get a massive
                   improvement for small files (anything less than
                   roughly 3.5K, minus whatever you use on xattrs, fits
                   in the inode).

                   About the blocksize for data: unless you have actual
                   data suggesting that you will really benefit from a
                   block smaller than 1MB, leave it there. GPFS uses
                   subblocks, where 1/16th of the BS can be allocated to
                   different files, so the "waste" is much less than you
                   might think with 1MB, and you get the throughput and
                   fewer structures than with many more, smaller data
                   blocks.

                   No warranty at all, but I try to do this when the
                   blocksize talk comes up (this might need some cleanup
                   and may not be my latest note, but you get the idea):

                   POSIX
                   find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out

                   GPFS
                   cd /usr/lpp/mmfs/samples/ilm
                   gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
                   ./mmfind /gpfs/shared -ls -type f > find_ls_files.out

                   CONVERT to CSV

                   POSIX
                   cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv

                   GPFS
                   cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv

                   LOAD in octave

                   FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
                   Clean the second column (OPTIONAL, as the next cleanup
                   will do the same)

                   FILESIZE(:,[2]) = [];

                   If we are on 4K inode alignment, we need to drop the
                   files that go into inodes (well, not exactly 4K ...
                   extended attributes! so maybe use a lower number)

                   FILESIZE(FILESIZE<=3584) =[];

                   If we are not, we need to drop the 0-size files

                   FILESIZE(FILESIZE==0) =[];

                   Median

                   FILESIZEMEDIAN = int32 (median (FILESIZE))

                   Mean

                   FILESIZEMEAN = int32 (mean (FILESIZE))

                   Variance

                   int32 (var (FILESIZE))

                   iqr, the interquartile range, i.e. the difference
                   between the upper and lower quartiles of the input data

                   int32 (iqr (FILESIZE))

                   Standard deviation

                   int32 (std (FILESIZE))

                   For some filesystems with lots of files you might need
                   a rather powerful machine to run the calculations in
                   Octave.  I never hit anything I could not manage on a
                   64 GB RAM Power box, and most of the time my laptop is
                   enough.
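
                   No warranty on this bit either, but if you also want to
                   know what fraction of the files would fit inside a 4K
                   inode (same 3584-byte cutoff as above), read the CSV
                   into a fresh vector before any of the cleanup steps and
                   take the ratio:

                   FILESIZEALL = int32 (dlmread ("find_ls_files.out.csv", ","));
                   FILESIZEALL(:,[2]) = [];
                   FRACINODE = sum (FILESIZEALL <= 3584) / numel (FILESIZEALL)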



                  --
                  Ystävällisin terveisin / Kind regards / Saludos
                  cordiales / Salutations

                  Luis Bolinches
                  Lab Services
                  http://www-03.ibm.com/systems/services/labservices/

                  IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330
                  Finland
                  Phone: +358 503112585

                  "If you continually give you will continually have."
                  Anonymous


                  ----- Original message -----
                  From: Stef Coene <stef.coene at docum.org>
                  Sent by: gpfsug-discuss-bounces at spectrumscale.org
                  To: gpfsug main discussion list <
                  gpfsug-discuss at spectrumscale.org>
                  Cc:
                  Subject: Re: [gpfsug-discuss] Blocksize
                  Date: Thu, Sep 22, 2016 10:30 PM

                   On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
                   > It defaults to 4k:
                   > mmlsfs testbs8M -i
                   > flag                value                    description
                   > ------------------- ------------------------ -----------------------------------
                   >  -i                 4096                     Inode size in bytes
                   >
                   > I think you can make it as small as 512b.  GPFS will store very small
                   > files in the inode.
                   >
                   > Typically you want your average file size to be your blocksize and your
                   > filesystem has one blocksize and one inodesize.

                  The files are not small, but around 20 MB on average.
                  So I calculated with IBM that a 1 MB or 2 MB block size
                  is best.

                  But I'm not sure if it's better to use a smaller block
                  size for the
                  metadata.

                  The file system is not that large (400 TB) and will hold
                  backup data
                  from CommVault.


                  Stef



                  Ellei edellä ole toisin mainittu: / Unless stated
                  otherwise above:
                  Oy IBM Finland Ab
                  PL 265, 00101 Helsinki, Finland
                  Business ID, Y-tunnus: 0195876-3
                  Registered in Finland




_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
