[gpfsug-discuss] Blocksize
Yuri L Volobuev
volobuev at us.ibm.com
Mon Sep 26 20:29:18 BST 2016
I would put the net summary this way: in GPFS, the "Goldilocks zone" for
metadata block size is 256K - 1M. If one plans to create a new file system
using GPFS V4.2+, 1M is a sound choice.
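For a new file system, that choice is expressed at mmcrfs time; a minimal sketch only (the device name, NSD stanza file, and the 4M data block size are made-up illustration values, not recommendations):

    # 4M data blocks, a separate 1M block size for a metadata-only system pool, 4K inodes
    mmcrfs gpfs1 -F nsd_stanzas.txt -B 4M --metadata-block-size 1M -i 4096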
In an ideal world, block size choice shouldn't really be a choice. It's a
low-level implementation detail that one day should go the way of the
manual ignition timing adjustment -- something that used to be necessary in
the olden days, and something that select enthusiasts like to tweak to this
day, but something that's irrelevant for the overwhelming majority of the
folks who just want the engine to run. There's work being done in that
general direction in GPFS, but we aren't there yet.
yuri
From: Stephen Ulmer <ulmer at ulmer.org>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date: 09/26/2016 12:02 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Now I’ve got another question… which I’ll let bake for a while.
Okay, to (poorly) summarize:
- There are items OTHER THAN INODES stored as metadata in GPFS.
- These items have a VARIETY OF SIZES, but are packed in such a way that we should just not worry about wasted space unless we pick a LARGE metadata block size — or if we don’t pick a “reasonable” metadata block size after picking a “large” file system block size that applies to both.
- Performance is hard, and the gain from calculating exactly the best metadata block size is much smaller than performance gains attained through code optimization.
- If we were to try and calculate the appropriate metadata block size we would likely be wrong anyway, since none of us get our data at the idealized physics shop that sells massless rulers and frictionless pulleys.
- We should probably all use a metadata block size around 1MB. Nobody has said this outright, but it’s been given as the example of a “good” size at least three times in this thread.
- Under no circumstances should we do what many of us would have done and pick 128K, which made sense based on all of our previous education that is no longer applicable.
Did I miss anything? :)
Liberty,
--
Stephen
On Sep 26, 2016, at 2:18 PM, Yuri L Volobuev <volobuev at us.ibm.com>
wrote:
It's important to understand the differences between different
metadata types, in particular where it comes to space allocation.
System metadata files (inode file, inode and block allocation maps,
ACL file, fileset metadata file, EA file in older versions) are
allocated at well-defined moments (file system format, new storage
pool creation in the case of block allocation map, etc), and those
contain multiple records packed into a single block. From the block
allocator point of view, the individual metadata record size is
invisible, only larger blocks get actually allocated, and space usage
efficiency generally isn't an issue.
For user metadata (indirect blocks, directory blocks, EA overflow
blocks) the situation is different. Those get allocated as the need
arises, generally one at a time. So the size of an individual
metadata structure matters, a lot. The smallest unit of allocation in
GPFS is a subblock (1/32nd of a block). If an IB or a directory block
is smaller than a subblock, the unused space in the subblock is
wasted. So if one chooses to use, say, 16 MiB block size for
metadata, the smallest unit of space that can be allocated is 512
KiB. If one chooses 1 MiB block size, the smallest allocation unit is
32 KiB. IBs are generally 16 KiB or 32 KiB in size (32 KiB with any
reasonable data block size); directory blocks used to be limited to
32 KiB, but in the current code can be as large as 256 KiB. As one
can observe, using 16 MiB metadata block size would lead to a
considerable amount of wasted space for IBs and large directories
(small directories can live in inodes). On the other hand, with 1 MiB
block size, there'll be no wasted metadata space. Does any of this
actually make a practical difference? That depends on the file system
composition, namely the number of IBs (which is a function of the
number of large files) and larger directories. Calculating this
scientifically can be pretty involved, and really should be the job
of a tool that ought to exist, but doesn't (yet). A more practical
approach is doing a ballpark estimate based on local file counts and typical fractions of large files and directories, drawing on statistics available from published papers.
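A quick back-of-the-envelope version of that comparison, assuming the worst case of a 32 KiB indirect block occupying one whole subblock (illustrative numbers, not measurements):

    # subblock = metadata block size / 32; a 32 KiB IB cannot use less than one subblock
    for bs_kib in 1024 4096 16384; do
        sub_kib=$((bs_kib / 32))
        echo "${bs_kib} KiB block -> ${sub_kib} KiB subblock -> $((sub_kib - 32)) KiB wasted per 32 KiB IB"
    done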
The performance implications of a given metadata block size choice are a subject of nearly infinite depth, and this question can ultimately only be answered by doing experiments with a specific workload on specific hardware. Metadata space utilization efficiency, though, is something that can be answered conclusively.
yuri
From: "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
Date: 09/24/2016 07:19 AM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Hi Sven,
I am confused by your statement that the metadata block size should
be 1 MB and am very interested in learning the rationale behind this
as I am currently looking at all aspects of our current GPFS
configuration and the possibility of making major changes.
If you have a filesystem with only metadataOnly disks in the system
pool and the default size of an inode is 4K (which we would do, since
we have recently discovered that even on our scratch filesystem we
have a bazillion files that are 4K or smaller and could therefore
have their data stored in the inode, right?), then why would you set
the metadata block size to anything larger than 128K when a sub-block
is 1/32nd of a block? I.e., with a 1 MB block size for metadata
wouldn’t you be wasting a massive amount of space?
What am I missing / confused about there?
Oh, and here’s a related question … let’s just say I have the above configuration … my system pool is metadata only and is on SSDs. Then I have two other dataOnly pools that are spinning disk. One is for “regular” access and the other is the “capacity” pool … i.e. a pool of slower storage where we move files with old access times. I have a policy that says something like “move all files with an access time > 6 months to the capacity pool.” Of those bazillion files less than 4K in size that currently fit in the inode, probably half a bazillion (<grin>) of them would be subject to that rule. Will they get moved to the spinning disk capacity pool or will they stay in the inode??
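For context, a migration rule along those lines might look roughly like this (a sketch only; the pool names, policy file name, file system name, and the 180-day threshold are illustrative, not a tested policy):

    # write a hypothetical policy file with a single migration rule
    printf "%s\n" \
      "RULE 'move_old' MIGRATE FROM POOL 'regular' TO POOL 'capacity'" \
      "  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180" > migrate_old.pol
    # dry run first: report what would move without migrating anything
    mmapplypolicy gpfs0 -P migrate_old.pol -I test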
Thanks! This is a very timely and interesting discussion for me as
well...
Kevin
On Sep 23, 2016, at 4:35 PM, Sven Oehme <oehmes at us.ibm.com> wrote:
Your metadata block size these days should be 1 MB, and there are only very few workloads for which you should run with a filesystem blocksize below 1 MB. So if you don't know exactly what to pick, 1 MB is a good starting point.

The general rule still applies that your filesystem blocksize (metadata or data pool) should match your RAID controller (or GNR vdisk) stripe size for the particular pool.

So if you use a 128k strip size (default in many midrange storage controllers) in an 8+2p RAID array, your stripe or track size is 1 MB, and therefore the blocksize of this pool should be 1 MB. I see many customers in the field using a 1 MB or even smaller blocksize on RAID stripes of 2 MB or above, and performance will be significantly impacted by that.
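Making that arithmetic explicit (the numbers are just the example above):

    # full RAID stripe (track) = per-drive strip size x number of data drives
    strip_kib=128     # per-drive strip/chunk size
    data_drives=8     # an 8+2p RAID 6 array has 8 data drives
    echo "$((strip_kib * data_drives)) KiB stripe"   # 1024 KiB = 1 MiB -> 1 MiB blocksize for this pool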
Sven
------------------------------------------
Sven Oehme
Scalable Storage Research
email: oehmes at us.ibm.com
Phone: +1 (408) 824-8904
IBM Almaden Research Lab
------------------------------------------
From: Stephen Ulmer <ulmer at ulmer.org>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 09/23/2016 12:16 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Not to be too pedantic, but I believe the subblock size is 1/32 of the block size (which strengthens Luis’s arguments below).

I thought the original question was NOT about inode size, but about metadata block size. You can specify that the system pool have a different block size from the rest of the filesystem, providing that it ONLY holds metadata (--metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without some counter-indication), I would think that we’d want to use a metadata block size of 4K*32=128K. This is independent of the regular block size, which you can calculate based on the workload if you’re lucky.

There could be a great reason NOT to use a 128K metadata block size, but I don’t know what it is. I’d be happy to be corrected about this if it’s out of whack.
--
Stephen
On Sep 22, 2016, at 3:37 PM, Luis Bolinches <luis.bolinches at fi.ibm.com> wrote:
Hi
My 2 cents.
Leave at least 4K inodes; then you get a massive improvement for small files (those under roughly 3.5K, minus whatever you use for xattrs, fit in the inode).

About blocksize for data: unless you have actual data suggesting that you would actually benefit from a block smaller than 1MB, leave it there. GPFS uses subblocks, where 1/16th of the BS can be allocated to different files, so the "waste" is much less than you think at 1MB, and you get the throughput with far fewer block structures to manage.

No warranty at all, but I try to do this when the BS talk comes up (it might need some cleanup, as this may not be my latest note, but you get the idea):
POSIX
    find . -type f -name '*' -exec ls -l {} \; > find_ls_files.out

GPFS
    cd /usr/lpp/mmfs/samples/ilm
    gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
    ./mmfind /gpfs/shared -ls -type f > find_ls_files.out

CONVERT to CSV

POSIX
    cat find_ls_files.out | awk '{print $5","}' > find_ls_files.out.csv

GPFS
    cat find_ls_files.out | awk '{print $7","}' > find_ls_files.out.csv

LOAD in octave
    FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));

Clean the second column (OPTIONAL, as the next cleanup will do the same)
    FILESIZE(:,[2]) = [];

If we are on 4K alignment, we need to clean out the files that go into inodes (WELL, not exactly ... extended attributes! so maybe use a lower number!)
    FILESIZE(FILESIZE<=3584) = [];

If we are not, we need to clean out the 0-size files
    FILESIZE(FILESIZE==0) = [];

Median
    FILESIZEMEDIAN = int32 (median (FILESIZE))

Mean
    FILESIZEMEAN = int32 (mean (FILESIZE))

Variance
    int32 (var (FILESIZE))

iqr, the interquartile range, i.e. the difference between the upper and lower quartile of the input data
    int32 (iqr (FILESIZE))
Standard deviation
    int32 (std (FILESIZE))

For some FS with lots of files you might need a rather powerful machine to run the calculations in Octave; I never hit anything I could not manage on a 64GB RAM Power box. Most of the time my laptop is enough.
--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
Luis Bolinches
Lab Services
http://www-03.ibm.com/systems/services/labservices/
IBM Laajalahdentie 23 (main entrance), Helsinki, 00330 Finland
Phone: +358 503112585
"If you continually give you will
continually have." Anonymous
----- Original message -----
From: Stef Coene <stef.coene at docum.org>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: Re: [gpfsug-discuss] Blocksize
Date: Thu, Sep 22, 2016 10:30 PM
On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
> It defaults to 4k:
> mmlsfs testbs8M -i
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  -i                 4096                     Inode size in bytes
>
> I think you can make it as small as 512b. GPFS will store very small files in the inode.
>
> Typically you want your average file size to be your blocksize, and your filesystem has one blocksize and one inodesize.
The files are not small, but around 20 MB on average. So I calculated with IBM that a 1 MB or 2 MB block size is best.
But I'm not sure if it's better to use a smaller block size for the metadata.
The file system is not that large (400 TB) and will hold backup data from CommVault.
Stef
Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3
Registered in Finland