[gpfsug-discuss] Blocksize - consider IO transfer efficiency above your other prejudices

Marc A Kaplan makaplan at us.ibm.com
Sat Sep 24 18:31:37 BST 2016


(I can answer your basic questions; Sven has more experience with tuning 
very large file systems, so perhaps he will have more to say...)

1. Inodes are packed into the file of inodes. (There is one file of all 
the inodes in a filesystem.) 

If you have a metadata blocksize of 1MB, you will have 256 4KB inodes per 
block.   Forget about sub-blocks when it comes to the file of inodes.
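To make the arithmetic concrete, you can check both numbers yourself 
(the filesystem name "gpfs1" here is just a placeholder):

mmlsfs gpfs1 -i    # inode size in bytes, e.g. 4096
mmlsfs gpfs1 -B    # block size in bytes
# 1MB metadata block / 4KB inode = 1048576 / 4096 = 256 inodes per block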

2. IF a file's data fits in its inode, then migrating that file from one 
pool to another just changes the preferred pool name in the inode.  No 
data movement.  Should the file later "grow" to require a data block, that 
data block will be allocated from whatever pool is named in the inode at 
that time.
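If you want to see this for a given file, mmlsattr shows the pool 
recorded in the inode (the path below is just an example):

mmlsattr -L /gpfs/shared/somefile
# the "storage pool name:" line reports the pool the inode points at,
# even while the file's data still lives entirely in the inode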

See the email I posted earlier today.  Basically: FORGET what you thought 
you knew about optimal metadata blocksize (perhaps based on how you 
thought metadata was laid out on disk) and just stick to optimal IO 
transfer blocksizes. 

Yes, there may be contrived scenarios or even a few real live special 
cases, but those would be few and far between. 
Try following the newer, simpler general rule and see how well it works.




From:   "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   09/24/2016 10:19 AM
Subject:        Re: [gpfsug-discuss] Blocksize
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi Sven, 

I am confused by your statement that the metadata block size should be 1 
MB and am very interested in learning the rationale behind this, as I am 
currently looking at all aspects of our current GPFS configuration and the 
possibility of making major changes.

If you have a filesystem with only metadataOnly disks in the system pool 
and the default 4K inode size (which we would use, since we have recently 
discovered that even on our scratch filesystem we have a bazillion files 
that are 4K or smaller and could therefore have their data stored in the 
inode, right?), then why would you set the metadata block size to anything 
larger than 128K when a sub-block is 1/32nd of a block? I.e., with a 1 MB 
block size for metadata, wouldn’t you be wasting a massive amount of 
space?

What am I missing / confused about there?
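(For reference, the arithmetic behind my question: a sub-block is 1/32nd 
of a block, so a 128K metadata block gives 4K sub-blocks, exactly matching 
the 4K inode size, while a 1 MB metadata block gives 32K sub-blocks.)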

Oh, and here’s a related question … let’s just say I have the above 
configuration … my system pool is metadata-only and is on SSDs.  Then I 
have two other dataOnly pools that are spinning disk.  One is for 
“regular” access and the other is the “capacity” pool … i.e. a pool of 
slower storage where we move files that have not been accessed in a long 
time.  I have a policy that says something like “move all files with an 
access time > 6 months to the capacity pool.”  Of those bazillion files 
less than 4K in size that currently fit in the inode, probably half a 
bazillion (<grin>) would be subject to that rule.  Will they get moved to 
the spinning disk capacity pool or will they stay in the inode??
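For concreteness, the rule I have in mind looks roughly like this (the 
pool names are ours and the syntax is from memory, so treat it as a 
sketch):

RULE 'old_to_capacity'
  MIGRATE FROM POOL 'regular' TO POOL 'capacity'
  WHERE (DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 180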

Thanks!  This is a very timely and interesting discussion for me as 
well...

Kevin

On Sep 23, 2016, at 4:35 PM, Sven Oehme <oehmes at us.ibm.com> wrote:

Your metadata block size these days should be 1 MB, and there are only 
very few workloads for which you should run with a filesystem blocksize 
below 1 MB. So if you don't know exactly what to pick, 1 MB is a good 
starting point. 
The general rule still applies: your filesystem blocksize (metadata or 
data pool) should match your RAID controller (or GNR vdisk) stripe size 
for the particular pool.

So if you use a 128k strip size (the default in many midrange storage 
controllers) in an 8+2P RAID array, your stripe or track size is 1 MB, 
and therefore the blocksize of this pool should be 1 MB. I see many 
customers in the field using a 1MB or even smaller blocksize on RAID 
stripes of 2 MB or above, and performance will be significantly impacted 
by that. 
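To spell out the arithmetic: 8 data disks x 128KiB strips = 1MiB full 
stripe. A matching filesystem creation could look like this (device and 
stanza file names are placeholders, so treat it as a sketch):

mmcrfs gpfs1 -F nsd.stanza -B 1M --metadata-block-size 1M -i 4096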

Sven

------------------------------------------
Sven Oehme 
Scalable Storage Research 
email: oehmes at us.ibm.com 
Phone: +1 (408) 824-8904 
IBM Almaden Research Lab 
------------------------------------------


From: Stephen Ulmer <ulmer at ulmer.org>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 09/23/2016 12:16 PM
Subject: Re: [gpfsug-discuss] Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org



Not to be too pedantic, but I believe the sub-block size is 1/32 of the 
block size (which strengthens Luis’s arguments below).

I thought the original question was NOT about inode size, but about 
metadata block size. You can specify that the system pool have a different 
block size from the rest of the filesystem, provided that it ONLY holds 
metadata (the --metadata-block-size option to mmcrfs).

So with 4K inodes (which should be used for all new filesystems without 
some counter-indication), I would think that we’d want to use a metadata 
block size of 4K*32=128K. This is independent of the regular block size, 
which you can calculate based on the workload if you’re lucky.
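Concretely, that would be something like this at creation time (the names 
are placeholders and I haven’t tested this exact invocation):

mmcrfs gpfs1 -F nsd.stanza -B 1M --metadata-block-size 128K -i 4096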

There could be a great reason NOT to use 128K metadata block size, but I 
don’t know what it is. I’d be happy to be corrected about this if it’s out 
of whack.

-- 
Stephen


On Sep 22, 2016, at 3:37 PM, Luis Bolinches <luis.bolinches at fi.ibm.com> 
wrote:

Hi

My 2 cents.

Leave at least 4K inodes; then you get a massive improvement for small 
files (anything less than about 3.5K, minus whatever you use for extended 
attributes, fits in the inode).

About blocksize for data: unless you have actual data suggesting that you 
will benefit from a block smaller than 1MB, leave it there. GPFS uses 
sub-blocks, where 1/16th of the blocksize can be allocated to different 
files, so the "waste" is much less than you might think at 1MB, and you 
get the throughput with far fewer data-block structures to manage.
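Rough numbers: with a 1MB block and 1/16th sub-blocks (64K), the expected 
tail waste per file is about half a sub-block, roughly 32K; if sub-blocks 
are really 1/32nd (32K), it is roughly 16K. Either way it is small next 
to the throughput you gain.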

No warranty at all, but this is what I try to do when the blocksize talk 
comes up (it might need some cleanup, and it may not be my latest note, 
but you get the idea):

POSIX:
find . -type f -exec ls -l {} \; > find_ls_files.out

GPFS:
cd /usr/lpp/mmfs/samples/ilm
gcc mmfindUtil_processOutputFile.c -o mmfindUtil_processOutputFile
./mmfind /gpfs/shared -ls -type f > find_ls_files.out

Convert to CSV (the file size is field 5 in ls -l output, field 7 in 
mmfind output):

POSIX:
awk '{print $5","}' find_ls_files.out > find_ls_files.out.csv
GPFS:
awk '{print $7","}' find_ls_files.out > find_ls_files.out.csv

Load in Octave:

FILESIZE = int32 (dlmread ("find_ls_files.out.csv", ","));

Clean the second column (OPTIONAL, as the next cleanup will do the same):

FILESIZE(:,[2]) = [];

If we are on 4K inodes, drop the files that go into inodes (well, not 
exactly: extended attributes take space too, so maybe use a lower number!):

FILESIZE(FILESIZE<=3584) = [];

If we are not, drop the zero-size files:

FILESIZE(FILESIZE==0) = [];

Median:

FILESIZEMEDIAN = int32 (median (FILESIZE))

Mean:

FILESIZEMEAN = int32 (mean (FILESIZE))

Variance:

int32 (var (FILESIZE))

Interquartile range (the difference between the upper and lower quartiles 
of the input data):

int32 (iqr (FILESIZE))

Standard deviation:

int32 (std (FILESIZE))

For some filesystems with lots of files you might need a rather powerful 
machine to run the calculations in Octave; I have never hit anything I 
could not manage on a 64GB RAM Power box. Most of the time my laptop is 
enough.



--
Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations

Luis Bolinches
Lab Services
http://www-03.ibm.com/systems/services/labservices/

IBM Laajalahdentie 23 (main Entrance) Helsinki, 00330 Finland
Phone: +358 503112585

"If you continually give you will continually have." Anonymous


----- Original message -----
From: Stef Coene <stef.coene at docum.org>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: Re: [gpfsug-discuss] Blocksize
Date: Thu, Sep 22, 2016 10:30 PM

On 09/22/2016 09:07 PM, J. Eric Wonderley wrote:
> It defaults to 4k:
> mmlsfs testbs8M -i
> flag                value                    description
> ------------------- ------------------------ -----------------------------------
>  -i                 4096                     Inode size in bytes
>
> I think you can make it as small as 512b.   GPFS will store very small
> files in the inode.
>
> Typically you want your average file size to be your blocksize; your
> filesystem has one blocksize and one inodesize.

The files are not small, but around 20 MB on average.
So I calculated with IBM that a 1 MB or 2 MB block size is best.
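(Back of the envelope: with ~20 MB files and a 1 MB block size, each file 
spans about 20 full blocks, and the tail waste is less than one sub-block, 
so well under 1% of the file.)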

But I'm not sure if it's better to use a smaller block size for the
metadata.

The file system is not that large (400 TB) and will hold backup data
from CommVault.


Stef
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3 
Registered in Finland








