[gpfsug-discuss] Inode size, and system pool subblock

Alec anacreo at gmail.com
Wed Aug 2 17:07:17 BST 2023


I think things are conflated here...

The inode size is really just a call on how much functionality you need in
an inode.  I wouldn't even think about disk block size when setting it.
Essentially, the smaller the inode, the less space you need for metadata,
but also the less capacity you have inside each inode.

The default is 4k, and if you don't change it then GPFS will store up to a
~3.8k file in the inode itself instead of going to an indirect disk
allocation.  Someone mentioned that encryption bypasses this feature, but
it's actually encryption that may require larger inode sizes to store all
the key metadata (you can have up to 8 keys per inode, I believe).
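As a rough illustration of the data-in-inode rule, here is a sketch in Python.  The header overhead used below is an assumption reverse-engineered from the figures in this thread (4k inode holding ~3.8k of data, 512-byte inode holding ~380 bytes), not an official GPFS constant:

```python
# Sketch of data-in-inode capacity; the header overhead is an ASSUMPTION
# derived from the approximate numbers quoted in this thread, not a
# documented GPFS value.

def max_in_inode_file(inode_size: int) -> int:
    """Approximate largest file that fits entirely inside the inode."""
    header_overhead = max(128, inode_size // 10)  # assumed metadata header space
    return inode_size - header_overhead

for size in (512, 1024, 4096):
    print(f"{size:>5} B inode -> ~{max_in_inode_file(size)} B of file data")
```

The point is only that the usable payload shrinks faster than the inode does: a 512-byte inode keeps a noticeably smaller fraction of itself available for file data than a 4k inode.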

So essentially, if you've got a smaller inode size, your directories will
hit their maximum size sooner, your ACLs could be constrained, long file
names can exhaust the available space, and you may not have enough room for
encryption details.  But the upshot is you need to dedicate less space to
metadata and can handle more file entries.  So if you've got billions of
files and are managing replicas, then you should consider tuning inode size
down.

You can go from 3.5% of space going to inodes down to 1% by moving from 4k
to 512-byte inodes... but there is a reason GPFS defaults to 4k and doesn't
expand on it much.  If you've guessed wrong, you're kind of hosed.

None of this has to do with hardware block sizes, subblock allocation, or
fragment sizes.  It's further compounded by 4k-native block sizes vs the
emulated 512-byte block size some disk hardware presents.

For GPFS you will generally have a very large block size, 256 KiB or 1 MiB,
and GPFS will divide those blocks into 32 fragments.  So your smallest unit
may be an 8 KiB or 32 KiB fragment.  If you have a dedicated MD pool
(highly recommended), you'd definitely specify a smaller block size than
1 MiB (128 KiB blocks = 4 KiB fragments).
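The 1/32 fragment model described above can be written down directly.  Note this is the traditional rule of thumb used in this thread; newer Scale releases use variable subblock counts per block, so treat it as a sketch:

```python
KiB = 1024

def fragment_size(block_size: int, fragments_per_block: int = 32) -> int:
    """Smallest allocatable unit under the classic 32-fragments-per-block model."""
    return block_size // fragments_per_block

assert fragment_size(256 * KiB) == 8 * KiB    # 256 KiB block  -> 8 KiB fragment
assert fragment_size(1024 * KiB) == 32 * KiB  # 1 MiB block    -> 32 KiB fragment
assert fragment_size(128 * KiB) == 4 * KiB    # 128 KiB MD block -> 4 KiB fragment
```

This is why a dedicated metadata pool with a 128 KiB block size lines its 4 KiB fragments up neatly with 4k inodes and 4k hardware sectors.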

The balance you're trying to strike here is the fewest I/O operations
needed to retrieve your data efficiently.  The round trip on the bus is
roughly the same for a 4 KiB read as for a 1 MiB read, so try to maximize
what each read returns.

Generally, the goal of the file system layout is to ensure that as little
excess data as possible is read when pulling fragments.

I may also be confused, but I wouldn't worry so much about matching inode
size to block size.  Just worry about getting large blocks working well for
the regular storage pool if your data is huge, and use a smaller block size
in the MD pool if it's a dedicated pool, which is almost always recommended.

Be very careful about specifying a small inode size, because it's not just
max file names and max file counts in a directory; it's much more.  And if
you have a lot of small files, don't underestimate the advantage of those
files being stored directly in the inode.  A 512-byte inode can only store
about a 380-byte file, vs a 4k inode storing a ~3800-byte file.  These
files tend to be shell scripts and config files, which you really don't
want to be waiting around for, occupying a huge 1 MiB read, or wasting a
potentially large 32 KiB fragment allocation on.
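To put a number on that last point: a small script that would fit inside a 4k inode instead occupies a whole fragment when pushed out to the data pool.  The fragment size below follows the 1/32 rule used earlier in this thread:

```python
import math

def allocation_waste(file_size: int, fragment_size: int) -> float:
    """Fraction of the allocated space left unused by a small file
    (fragments are the smallest allocatable unit in this model)."""
    allocated = math.ceil(file_size / fragment_size) * fragment_size
    return 1 - file_size / allocated

# A ~380-byte config file landing in a 32 KiB fragment wastes almost all of it.
waste = allocation_waste(380, 32 * 1024)
print(f"~{waste:.1%} of the fragment is wasted")
```

The same 380-byte file stored data-in-inode wastes nothing beyond the inode it already needed, which is the whole argument for not shrinking inodes reflexively.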

Alec



On Wed, Aug 2, 2023, 4:47 AM Olaf Weiser <olaf.weiser at de.ibm.com> wrote:

> Hello Peter,
>
> [1] *[...] having a smaller inode size than the subblock size means*
> * there's a big wastage on disk usage, with no performance benefit to
> doing so[...] *
> in short - yes 😉
>
>
>
> [2]
> *[...]  I believe I'm correct in saying that inodes are not the only
> things to live on the metadata pool, so I assume that some other metadata
> might benefit from the larger block/subblock size. But looking at the
> number of inodes, the inode size, and the space consumed in the system
> pool, it really looks like the majority of space consumed is by
> inodes.[...] *
> you may need to consider snapshots and directories, which all contribute
> to MD space
>
> predicting the space requirements for MD for directories is always hard,
> because the size of a directory depends on the lengths of the file names
> the users will create...
>
>
> furthermore, using a less-than-4k inode size also makes little sense when
> you take into account that NVMe drives and other modern block storage
> devices come with a hardware block size of 4k (even though GPFS can still
> deal with 512 bytes per sector)
>
>
> hope this helps ..
>
>
>
>
>
> ------------------------------
> *From:* gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> on behalf of
> Peter Chase <peter.chase at metoffice.gov.uk>
> *Sent:* Wednesday, 2 August 2023 11:09
> *To:* gpfsug-discuss at gpfsug.org <gpfsug-discuss at gpfsug.org>
> *Subject:* [EXTERNAL] [gpfsug-discuss] Inode size, and system pool
> subblock
>
> Good Morning,
>
> I have a question about inode size vs subblock size. Can anyone think of a
> reason that the chosen inode size of a scale filesystem should be smaller
> than the subblock size for the metadata pool?
> I'm looking at an existing filesystem, the inode size is 2KiB, and the
> subblock is 4KiB.
> It feels like I'm missing something. If I've understood the docs on
> blocks and subblocks correctly, it sounds like the subblock is the smallest
> atomic access size. Meaning with a 4K subblock, and a 2K inode, reading the
> inode would return its contents and 2K of empty subblock every time. So, in
> my head (and maybe only there), having a smaller inode size than the
> subblock size means there's a big wastage on disk usage, with no
> performance benefit to doing so.
> I believe I'm correct in saying that inodes are not the only things to
> live on the metadata pool, so I assume that some other metadata might
> benefit from the larger block/subblock size. But looking at the number of
> inodes, the inode size, and the space consumed in the system pool, it
> really looks like the majority of space consumed is by inodes.
>
> As I said, I feel like I'm missing something, so if anyone can tell me
> where I'm wrong it would be greatly appreciated!
>
> Sincerely,
>
> Pete Chase
>
> UKMO
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at gpfsug.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org
>

