[gpfsug-discuss] Blocksize

Yuri L Volobuev volobuev at us.ibm.com
Mon Sep 26 20:29:18 BST 2016


I would put the net summary this way: in GPFS, the "Goldilocks zone" for
metadata block size is 256K - 1M.  If one plans to create a new file system
using GPFS V4.2+, 1M is a sound choice.
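
For what it's worth, a minimal sketch of what that could look like at file
system creation time (the device name and NSD stanza file below are
placeholders, and the exact option spellings should be checked against the
mmcrfs man page for your release):

# hypothetical example: metadata-only system pool with a 1M metadata
# block size, 1M data block size, and 4K inodes
mmcrfs gpfs1 -F nsd.stanza -B 1M -i 4096 --metadata-block-size 1M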

In an ideal world, block size choice shouldn't really be a choice.  It's a
low-level implementation detail that one day should go the way of the
manual ignition timing adjustment -- something that used to be necessary in
the olden days, and something that select enthusiasts like to tweak to this
day, but something that's irrelevant for the overwhelming majority of the
folks who just want the engine to run.  There's work being done in that
general direction in GPFS, but we aren't there yet.

yuri



From:	Stephen Ulmer <ulmer at ulmer.org>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date:	09/26/2016 12:02 PM
Subject:	Re: [gpfsug-discuss] Blocksize
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Now I’ve got another question… which I’ll let bake for a while.

Okay, to (poorly) summarize:
      There are items OTHER THAN INODES stored as metadata in GPFS.
      These items have a VARIETY OF SIZES, but are packed in such a way
      that we should just not worry about wasted space unless we pick a
      LARGE metadata block size — or if we don’t pick a “reasonable”
      metadata block size after picking a “large” file system block size
      that applies to both.
      Performance is hard, and the gain from calculating exactly the best
      metadata block size is much smaller than performance gains attained
      through code optimization.
      If we were to try and calculate the appropriate metadata block size
      we would likely be wrong anyway, since none of us get our data at the
      idealized physics shop that sells massless rulers and frictionless
      pulleys.
      We should probably all use a metadata block size around 1MB. Nobody
      has said this outright, but it’s been given as the example of the
      “good” size at least three times in this thread.
      Under no circumstances should we do what many of us would have done
      and pick 128K, which made sense based on all of our previous
      education that is no longer applicable.

Did I miss anything? :)

Liberty,

--
Stephen



      On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev <volobuev at us.ibm.com>
      wrote:



      It's important to understand the differences between the various
      metadata types, in particular when it comes to space allocation.

      System metadata files (inode file, inode and block allocation maps,
      ACL file, fileset metadata file, EA file in older versions) are
      allocated at well-defined moments (file system format, new storage
      pool creation in the case of the block allocation map, etc.), and they
      contain multiple records packed into a single block. From the block
      allocator's point of view, the individual metadata record size is
      invisible: only whole blocks actually get allocated, so space usage
      efficiency generally isn't an issue.

      For user metadata (indirect blocks (IBs), directory blocks, EA overflow
      blocks) the situation is different. Those get allocated as the need
      arises, generally one at a time, so the size of an individual
      metadata structure matters a lot. The smallest unit of allocation in
      GPFS is a subblock (1/32nd of a block). If an IB or a directory block
      is smaller than a subblock, the unused space in the subblock is
      wasted. So if one chooses to use, say, 16 MiB block size for
      metadata, the smallest unit of space that can be allocated is 512
      KiB. If one chooses 1 MiB block size, the smallest allocation unit is
      32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any
      reasonable data block size); directory blocks used to be limited to
      32 KiB, but in the current code can be as large as 256 KiB. As one
      can observe, using 16 MiB metadata block size would lead to a
      considerable amount of wasted space for IBs and large directories
      (small directories can live in inodes). On the other hand, with 1 MiB
      block size, there'll be no wasted metadata space. Does any of this
      actually make a practical difference? That depends on the file system
      composition, namely the number of IBs (which is a function of the
      number of large files) and larger directories. Calculating this
      scientifically can be pretty involved, and really should be the job
      of a tool that ought to exist, but doesn't (yet). A more practical
      approach is a ballpark estimate based on local file counts and
      typical fractions of large files and directories, drawing on
      statistics available from published papers.
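
      As a very rough illustration of that kind of ballpark estimate (all of
      the counts below are made-up placeholders, and the 32 KiB IB size is
      just the typical figure mentioned above, not a measurement):

      # hypothetical numbers -- substitute your own file/directory counts
      meta_block_kib=16384                      # candidate metadata block size (16 MiB)
      subblock_kib=$(( meta_block_kib / 32 ))   # smallest allocatable unit
      ib_kib=32                                 # typical indirect block size
      n_large_files=1000000                     # files big enough to need an IB
      n_large_dirs=50000                        # directories too big to live in an inode
      waste_per_ib_kib=$(( subblock_kib - ib_kib ))
      echo "subblock = ${subblock_kib} KiB, waste per IB = ${waste_per_ib_kib} KiB"
      echo "approx IB waste: $(( n_large_files * waste_per_ib_kib / 1048576 )) GiB"
      echo "approx directory waste (assuming ~32 KiB directories): $(( n_large_dirs * waste_per_ib_kib / 1048576 )) GiB"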

      The performance implications of a given metadata block size choice are
      a subject of nearly infinite depth, and that question ultimately can
      only be answered by doing experiments with a specific workload on
      specific hardware. Metadata space utilization efficiency, though, is
      something that can be answered conclusively.

      yuri

      <graycol.gif>"Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi
      Sven, I am confused by your statement that the metadata block size
      should be 1 MB and am very int

      From: "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
      To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
      Date: 09/24/2016 07:19 AM
      Subject: Re: [gpfsug-discuss] Blocksize
      Sent by: gpfsug-discuss-bounces at spectrumscale.org





      Hi Sven,

      I am confused by your statement that the metadata block size should
      be 1 MB and am very interested in learning the rationale behind this
      as I am currently looking at all aspects of our current GPFS
      configuration and the possibility of making major changes.

      If you have a filesystem with only metadataOnly disks in the system
      pool and the default inode size of 4K (which is what we would use, since
      we have recently discovered that even on our scratch filesystem we
      have a bazillion files that are 4K or smaller and could therefore
      have their data stored in the inode, right?), then why would you set
      the metadata block size to anything larger than 128K when a sub-block
      is 1/32nd of a block? I.e., with a 1 MB block size for metadata
      wouldn’t you be wasting a massive amount of space?

      What am I missing / confused about there?

      Oh, and here’s a related question … let’s just say I have the above
      configuration … my system pool is metadata only and is on SSDs. Then
      I have two other dataOnly pools that are spinning disk. One is for
      “regular” access and the other is the “capacity” pool … i.e. a pool
      of slower storage where we move files that haven’t been accessed in a
      long time. I have a policy that says something like “move all files
      with an access time > 6 months to the capacity pool.” Of those
      bazillion files less than 4K in size that currently fit in the inode,
      probably half a bazillion (<grin>) of them would be subject to that
      rule. Will they get moved to the spinning disk capacity pool, or will
      they stay in the inode?
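
      For reference, a rule along those lines might look roughly like this
      in the policy language (the pool names, policy file name, and 180-day
      threshold are just placeholders based on the description above, and
      this says nothing about whether in-inode data actually moves):

      cat > migrate_old.pol <<'EOF'
      RULE 'old_to_capacity'
        MIGRATE FROM POOL 'regular'
        TO POOL 'capacity'
        WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180
      EOF
      # dry run against a placeholder device name before doing it for real
      mmapplypolicy gpfs1 -P migrate_old.pol -I test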

      Thanks! This is a very timely and interesting discussion for me as
      well...

      Kevin
                  On Sep 23, 2016, at 4:35 PM, Sven Oehme <
                  oehmes at us.ibm.com> wrote:


                   Your metadata block size these days should be 1 MB, and
                   there are only very few workloads for which you should
                   run with a filesystem blocksize below 1 MB. So if you
                   don't know exactly what to pick, 1 MB is a good starting
                   point.
                   The general rule still applies that your filesystem
                   blocksize (metadata or data pool) should match the stripe
                   size of your RAID controller (or GNR vdisk) for the
                   particular pool.

                   So if you use a 128k strip size (the default in many
                   midrange storage controllers) in an 8+2P RAID array, your
                   stripe or track size is 1 MB, and therefore the blocksize
                   of this pool should be 1 MB. I see many customers in the
                   field using a 1 MB or even smaller blocksize on RAID
                   stripes of 2 MB or above, and performance is significantly
                   impacted by that.
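
                   Just to spell out that arithmetic (the numbers are the
                   ones from the example above, not a recommendation for any
                   particular controller):

                   # 8 data disks + 2 parity (8+2P), 128 KiB strip per data disk
                   strip_kib=128; data_disks=8
                   echo "full stripe = $(( strip_kib * data_disks )) KiB"   # 1024 KiB = 1 MiB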

                  Sven

                  ------------------------------------------
                  Sven Oehme
                  Scalable Storage Research
                  email: oehmes at us.ibm.com
                  Phone: +1 (408) 824-8904
                  IBM Almaden Research Lab
                  ------------------------------------------


                  From: Stephen Ulmer <ulmer at ulmer.org>
                  To: gpfsug main discussion list <
                  gpfsug-discuss at spectrumscale.org>
                  Date: 09/23/2016 12:16 PM
                  Subject: Re: [gpfsug-discuss] Blocksize
                  Sent by: gpfsug-discuss-bounces at spectrumscale.org






                   Not to be too pedantic, but I believe the subblock
                   size is 1/32 of the block size (which strengthens Luis’s
                   arguments below).

                   I thought the original question was NOT about inode
                   size, but about metadata block size. You can specify that
                   the system pool have a different block size from the rest
                   of the filesystem, provided that it ONLY holds metadata
                   (the --metadata-block-size option to mmcrfs).

                  So with 4K inodes (which should be used for all new
                  filesystems without some counter-indication), I would
                  think that we’d want to use a metadata block size of
                  4K*32=128K. This is independent of the regular block
                  size, which you can calculate based on the workload if
                  you’re lucky.

                  There could be a great reason NOT to use 128K metadata
                  block size, but I don’t know what it is. I’d be happy to
                  be corrected about this if it’s out of whack.

                  --
                  Stephen

                                          On Sep 22, 2016, at 3:37 PM, Luis
                                          Bolinches <
                                          luis.bolinches at fi.ibm.com> wrote:

                                          Hi

                                          My 2 cents.

                                           Leave the inode size at 4K at
                                           least; then you get a massive
                                           improvement on small files
                                           (anything less than roughly 3.5K,
                                           minus whatever you use on xattrs,
                                           fits in the inode).

                                           About blocksize for data: unless
                                           you have actual data suggesting
                                           that you would benefit from a
                                           block smaller than 1MB, leave it
                                           there. GPFS uses subblocks, where
                                           1/16th of the BS can be allocated
                                           to different files, so the "waste"
                                           is much less than you think with
                                           1MB, and you get the throughput
                                           and fewer structures than with
                                           many more data blocks.

                                           No warranty at all, but I try to
                                           do this when the blocksize talk
                                           comes up (it might need some
                                           cleanup, this may not be my latest
                                           note, but you get the idea):

                                           POSIX:
                                           find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out

                                           GPFS:
                                           cd /usr/lpp/mmfs/samples/ilm
                                           gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
                                           ./mmfind /gpfs/shared -ls -type f > find_ls_files.out

                                           CONVERT to CSV

                                           POSIX:
                                           cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv

                                           GPFS:
                                           cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv

                                           LOAD in octave

                                           FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));

                                           Clean the second column (OPTIONAL,
                                           as the next cleanup will do the
                                           same)

                                           FILESIZE(:,[2]) = [];

                                           If we are on 4K alignment we need
                                           to drop the files that go into
                                           inodes (well, not exactly 4K ...
                                           extended attributes! so maybe use
                                           a lower number!)

                                           FILESIZE(FILESIZE<=3584) = [];

                                           If we are not, we only need to
                                           drop the zero-size files

                                           FILESIZE(FILESIZE==0) = [];

                                           Median

                                           FILESIZEMEDIAN = int32 (median (FILESIZE))

                                           Mean

                                           FILESIZEMEAN = int32 (mean (FILESIZE))

                                           Variance

                                           int32 (var (FILESIZE))

                                           iqr, the interquartile range, i.e.
                                           the difference between the upper
                                           and lower quartiles of the input
                                           data

                                           int32 (iqr (FILESIZE))

                                           Standard deviation

                                           int32 (std (FILESIZE))

                                           For some filesystems with lots of
                                           files you might need a rather
                                           powerful machine to run the
                                           calculations in octave; I never
                                           hit anything I could not manage
                                           on a 64GB RAM Power box, and most
                                           of the time my laptop is enough.



                                          --
                                          Ystävällisin terveisin / Kind
                                          regards / Saludos cordiales /
                                          Salutations

                                          Luis Bolinches
                                          Lab Services
                                          http://www-03.ibm.com/systems/services/labservices/


                                          IBM Laajalahdentie 23 (main
                                          Entrance) Helsinki, 00330 Finland
                                          Phone: +358 503112585

                                          "If you continually give you will
                                          continually have." Anonymous


                                          ----- Original message -----
                                          From: Stef Coene <
                                          stef.coene at docum.org>
                                          Sent by:
                                          gpfsug-discuss-bounces at spectrumscale.org

                                          To: gpfsug main discussion list <
                                          gpfsug-discuss at spectrumscale.org>
                                          Cc:
                                          Subject: Re: [gpfsug-discuss]
                                          Blocksize
                                          Date: Thu, Sep 22, 2016 10:30 PM

                                           On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
                                           > It defaults to 4k:
                                           > mmlsfs testbs8M -i
                                           > flag                value                    description
                                           > ------------------- ------------------------ -----------------------------------
                                           >  -i                 4096                     Inode size in bytes
                                           >
                                           > I think you can make it as small as 512b.   GPFS will store very
                                           > small files in the inode.
                                           >
                                           > Typically you want your average file size to be your blocksize
                                           > and your filesystem has one blocksize and one inodesize.

                                          The files are not small, but
                                          around 20 MB on average.
                                          So I calculated with IBM that a 1
                                          MB or 2 MB block size is best.

                                          But I'm not sure if it's better
                                          to use a smaller block size for
                                          the
                                          metadata.

                                          The file system is not that large
                                          (400 TB) and will hold backup
                                          data
                                          from CommVault.


                                          Stef




                                          Ellei edellä ole toisin
                                          mainittu: / Unless stated
                                          otherwise above:
                                          Oy IBM Finland Ab
                                          PL 265, 00101 Helsinki, Finland
                                          Business ID, Y-tunnus: 0195876-3
                                          Registered in Finland






_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
