[gpfsug-discuss] Blocksize

Buterbaugh, Kevin L Kevin.Buterbaugh at Vanderbilt.Edu
Tue Sep 27 18:02:45 BST 2016


Yuri / Sven / anyone else who wants to jump in,

First off, thank you very much for your answers.  I’d like to follow up with a couple more questions.

1) Let’s assume that our overarching goal in configuring the block size for metadata is performance from the user perspective … i.e. how fast is an “ls -l” on my directory?  Space savings aren’t important, and how long policy scans or other “administrative” type tasks take is not nearly as important as that directory listing.  Does that change the recommended metadata block size?

2)  Let’s assume we have 3 filesystems, /home, /scratch (traditional HPC use for those two) and /data (project space).  Our storage arrays are 24-bay units with two 8+2P RAID 6 LUNs, one RAID 1 mirror, and two hot spare drives.  The RAID 1 mirrors are for /home, the RAID 6 LUNs are for /scratch or /data.  /home has tons of small files - so small that a 64K block size is currently used.  /scratch and /data have a mixture, but a 1 MB block size is the “sweet spot” there.

If you could “start all over” with the same hardware being the only restriction, would you:

a) merge /scratch and /data into one filesystem but keep /home separate since the LUN sizes are so very different, or
b) merge all three into one filesystem and use storage pools so that /home is just a separate pool within the one filesystem?  And if you chose this option would you assign different block sizes to the pools?

Again, I’m asking these questions because I may have the opportunity to effectively “start all over” and want to make sure I’m doing things as optimally as possible.  Thanks…

Kevin

On Sep 26, 2016, at 2:29 PM, Yuri L Volobuev <volobuev at us.ibm.com> wrote:


I would put the net summary this way: in GPFS, the "Goldilocks zone" for metadata block size is 256K - 1M. If one plans to create a new file system using GPFS V4.2+, 1M is a sound choice.

In an ideal world, block size choice shouldn't really be a choice. It's a low-level implementation detail that one day should go the way of the manual ignition timing adjustment -- something that used to be necessary in the olden days, and something that select enthusiasts like to tweak to this day, but something that's irrelevant for the overwhelming majority of the folks who just want the engine to run. There's work being done in that general direction in GPFS, but we aren't there yet.

yuri

Stephen Ulmer ---09/26/2016 12:02:25 PM---Now I’ve got another question… which I’ll let bake for a while. Okay, to (poorly) summarize:

From: Stephen Ulmer <ulmer at ulmer.org>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 09/26/2016 12:02 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org

________________________________



Now I’ve got another question… which I’ll let bake for a while.

Okay, to (poorly) summarize:

     *   There are items OTHER THAN INODES stored as metadata in GPFS.
     *   These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size — or if we don’t pick a “reasonable” metadata block size after picking a “large” file system block size that applies to both.
     *   Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization.
     *   If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys.
     *   We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it’s been given as the “good” size at least three times in this thread.
     *   Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable.

Did I miss anything? :)

Liberty,

--
Stephen



On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev <volobuev at us.ibm.com> wrote:

It's important to understand the differences between different metadata types, in particular where it comes to space allocation.

System metadata files (inode file, inode and block allocation maps, ACL file, fileset metadata file, EA file in older versions) are allocated at well-defined moments (file system format, new storage pool creation in the case of block allocation map, etc), and those contain multiple records packed into a single block. From the block allocator point of view, the individual metadata record size is invisible, only larger blocks get actually allocated, and space usage efficiency generally isn't an issue.

For user metadata (indirect blocks, directory blocks, EA overflow blocks) the situation is different. Those get allocated as the need arises, generally one at a time. So the size of an individual metadata structure matters, a lot. The smallest unit of allocation in GPFS is a subblock (1/32nd of a block). If an IB or a directory block is smaller than a subblock, the unused space in the subblock is wasted. So if one chooses to use, say, 16 MiB block size for metadata, the smallest unit of space that can be allocated is 512 KiB. If one chooses 1 MiB block size, the smallest allocation unit is 32 KiB.

IBs are generally 16 KiB or 32 KiB in size (32 KiB with any reasonable data block size); directory blocks used to be limited to 32 KiB, but in the current code can be as large as 256 KiB. As one can observe, using 16 MiB metadata block size would lead to a considerable amount of wasted space for IBs and large directories (small directories can live in inodes). On the other hand, with 1 MiB block size, there'll be no wasted metadata space.

Does any of this actually make a practical difference? That depends on the file system composition, namely the number of IBs (which is a function of the number of large files) and larger directories. Calculating this scientifically can be pretty involved, and really should be the job of a tool that ought to exist, but doesn't (yet). A more practical approach is doing a ballpark estimate using local file counts and typical fractions of large files and directories, using statistics available from published papers.
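
As a rough back-of-the-envelope version of the above (every number here is an assumption rather than a measurement; plug in counts from your own file system), a quick shell calculation:

# Ballpark of space wasted on indirect blocks for a candidate metadata block size.
# All inputs are hypothetical; substitute counts from your own file system.
BLOCK_KIB=16384                        # candidate metadata block size: 16 MiB
SUBBLOCK_KIB=$((BLOCK_KIB / 32))       # smallest allocatable unit: 512 KiB
IB_KIB=32                              # typical indirect block size
LARGE_FILES=5000000                    # guess: files big enough to need an IB
WASTE_GIB=$(( (SUBBLOCK_KIB - IB_KIB) * LARGE_FILES / 1024 / 1024 ))
echo "roughly ${WASTE_GIB} GiB lost to subblock padding for IBs"

Running the same arithmetic with BLOCK_KIB=1024 gives zero wasted space, which is the point about 1 MiB above.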

The performance implications of a given metadata block size choice are a subject of nearly infinite depth, and this question ultimately can only be answered by doing experiments with a specific workload on specific hardware. The metadata space utilization efficiency is something that can be answered conclusively, though.

yuri

<graycol.gif>"Buterbaugh, Kevin L" ---09/24/2016 07:19:09 AM---Hi Sven, I am confused by your statement that the metadata block size should be 1 MB and am very int

From: "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu<mailto:Kevin.Buterbaugh at vanderbilt.edu>>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org<mailto:gpfsug-discuss at spectrumscale.org>>,
Date: 09/24/2016 07:19 AM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org<mailto:gpfsug-discuss-bounces at spectrumscale.org>

________________________________



Hi Sven,

I am confused by your statement that the metadata block size should be 1 MB and am very interested in learning the rationale behind this as I am currently looking at all aspects of our current GPFS configuration and the possibility of making major changes.

If you have a filesystem with only metadataOnly disks in the system pool and the default size of an inode is 4K (which we would do, since we have recently discovered that even on our scratch filesystem we have a bazillion files that are 4K or smaller and could therefore have their data stored in the inode, right?), then why would you set the metadata block size to anything larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB block size for metadata wouldn’t you be wasting a massive amount of space?

What am I missing / confused about there?

Oh, and here’s a related question … let’s just say I have the above configuration … my system pool is metadata only and is on SSD’s. Then I have two other dataOnly pools that are spinning disk. One is for “regular” access and the other is the “capacity” pool … i.e. a pool of slower storage where we move files with large access times. I have a policy that says something like “move all files with an access time > 6 months to the capacity pool.” Of those bazillion files less than 4K in size that are fitting in the inode currently, probably half a bazillion (<grin>) of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode??
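
For reference, the kind of rule I mean looks roughly like this (the pool names and the 180-day cutoff are placeholders, not our actual policy):

RULE 'age_out' MIGRATE FROM POOL 'regular' TO POOL 'capacity'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180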

Thanks! This is a very timely and interesting discussion for me as well...

Kevin
On Sep 23, 2016, at 4:35 PM, Sven Oehme <oehmes at us.ibm.com> wrote:

your metadata block size these days should be 1 MB and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. so if you don't know exactly what to pick, 1 MB is a good starting point.
the general rule still applies that your filesystem blocksize (metadata or data pool) should match your raid controller (or GNR vdisk) stripe size of the particular pool.

So if you use a 128k strip size (the default in many midrange storage controllers) in an 8+2P RAID array, your stripe or track size is 1 MB and therefore the blocksize of this pool should be 1 MB. I see many customers in the field using a 1 MB or even smaller blocksize on RAID stripes of 2 MB or above, and your performance will be significantly impacted by that.
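
As a sanity check, that arithmetic can be scripted in a couple of lines (the values below are just the 128k / 8+2P example from above, not a recommendation):

STRIP_KIB=128        # per-disk strip size configured on the controller
DATA_DISKS=8         # an 8+2P RAID 6 array has 8 data disks
echo "full stripe = $((STRIP_KIB * DATA_DISKS)) KiB"   # 1024 KiB = 1 MiB, so a 1 MB blocksize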

Sven

------------------------------------------
Sven Oehme
Scalable Storage Research
email: oehmes at us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
------------------------------------------

Stephen Ulmer ---09/23/2016 12:16:34 PM---Not to be too pedantic, but I believe the subblock size is 1/32 of the block size (which strengt

From: Stephen Ulmer <ulmer at ulmer.org>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 09/23/2016 12:16 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org


________________________________



Not to be too pedantic, but I believe the subblock size is 1/32 of the block size (which strengthens Luis’s arguments below).

I thought the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (the --metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we’d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you’re lucky.
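
As a sketch of what I mean (the device name and stanza file are placeholders, and the exact options are worth double-checking against the mmcrfs man page):

# system-pool NSDs in the stanza file are usage=metadataOnly, so the 128K applies
# only to metadata; the data pools get the 1M block size from -B
mmcrfs fs1 -F nsd_stanzas.txt -i 4096 -B 1M --metadata-block-size 128K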

There could be a great reason NOT to use 128K metadata block size, but I don’t know what it is. I’d be happy to be corrected about this if it’s out of whack.

--
Stephen
On Sep 22, 2016, at 3:37 PM, Luis Bolinches <luis.bolinches at fi.ibm.com> wrote:

Hi

My 2 cents.

Leave inodes at 4K at least; then you get a massive improvement for small files (anything less than roughly 3.5K, minus whatever you use on xattrs, fits in the inode).

About blocksize for data: unless you have actual data suggesting that you will actually benefit from a block smaller than 1MB, leave it there. GPFS uses subblocks, where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think at 1MB, and you get the throughput and fewer structures than with many more data blocks.

No warranty at all, but I try to do this when the BS talk comes up (it might need some cleanup, this may not be my latest note, but you get the idea):

POSIX
find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out
GPFS
cd /usr/lpp/mmfs/samples/ilm
gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
./mmfind /gpfs/shared -ls -type f > find_ls_files.out
CONVERT to CSV

POSIX
cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv
GPFS
cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv
LOAD in octave

FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));
Clean the second column (OPTIONAL as the next clean up will do the same)

FILESIZE(:,[2]) = [];
If we are on 4K alignment we need to clean the files that go to inodes (WELL not exactly ... extended attributes! so maybe use a lower number!)

FILESIZE(FILESIZE<=3584) =[];
If we are not, we need to clean the 0-size files

FILESIZE(FILESIZE==0) =[];
Median

FILESIZEMEDIAN = int32 (median (FILESIZE))
Mean

FILESIZEMEAN = int32 (mean (FILESIZE))
Variance

int32 (var (FILESIZE))
IQR (interquartile range), i.e. the difference between the upper and lower quartiles of the input data.

int32 (iqr (FILESIZE))
Standard deviation

int32 (std (FILESIZE))
For some FS with lots of files you might need a rather powerful machine to run the calculations in Octave; I never hit anything I could not manage on a 64 GB RAM Power box. Most of the time it is enough with my laptop.



--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Lab Services
http://www-03.ibm.com/systems/services/labservices/

IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
Phone: +358 503112585

"If you continually give you will continually have." Anonymous


----- Original message -----
From: Stef Coene <stef.coene at docum.org>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: Re: [gpfsug-discuss] Blocksize
Date: Thu, Sep 22, 2016 10:30 PM

On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
> It defaults to 4k:
> mmlsfs testbs8M -i
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  -i                 4096                     Inode size in bytes
>
> I think you can make it as small as 512 bytes.   GPFS will store very small
> files in the inode.
>
> Typically you want your average file size to be your blocksize and your
> filesystem has one blocksize and one inodesize.

The files are not small, but around 20 MB on average.
So I calculated with IBM that a 1 MB or 2 MB block size is best.

But I'm not sure if it's better to use a smaller block size for the
metadata.

The file system is not that large (400 TB) and will hold backup data
from CommVault.


Stef



Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3
Registered in Finland

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
