[gpfsug-discuss] internal details on GPFS inode expansion

Wayne Sawdon wsawdon at us.ibm.com
Tue Dec 1 22:41:49 GMT 2020


Dave Johnson (ddj at brown.edu) asks:

   When GPFS needs to add inodes to the filesystem, it seems to pre-create
   about 4 million of them. Judging by the logs, it seems it only takes a
   few (13 maybe) seconds to do this. However we are suspecting that this
   might only be to request the additional inodes and that there is some
   background activity for some time afterwards. Would someone who has
   knowledge of the actual internals be willing to confirm or deny this,
   and if there is background activity, is it on all nodes in the cluster,
   NSD nodes, "default worker nodes"?

Inodes are typically 4 KB and reside on disk in full blocks in the "inode 0
file". For every inode there is also an entry in the "inode allocation map"
which indicates the inode's status (e.g. free, in use). To add inodes we have
to add data to both. First we determine how many inodes to add (e.g. always
add full blocks of inodes), then how many "passes" it will take to add
them (the "passes" are an artifact of the inode map layout). Adding the
inodes themselves involves writing blocks of free inodes. This is
multi-threaded on a single node. Adding to the inode map may involve
adding more inode map "segments" or just using free space in the current
segments. If segments are added, they are formatted and written by multiple
threads on a single node. Once the on-disk data structures are complete we
update the in-memory structures to reflect that all of the new inodes are
free, then we update the "stripe group descriptor" and broadcast it to all
the nodes that have the file system mounted.
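
To illustrate the "always add full blocks of inodes" sizing step, here is a
minimal Python sketch. It is not GPFS code. The function name and the 4 MiB
block size are assumptions chosen for the example; only the 4 KB inode size
comes from the description above.

```python
def round_up_to_full_blocks(requested_inodes, inode_size=4096,
                            block_size=4 * 1024 * 1024):
    """Round an inode request up to whole blocks of inodes.

    block_size is a hypothetical value for illustration; a real file
    system may use a different metadata block size.
    """
    inodes_per_block = block_size // inode_size
    # Ceiling division: a partially filled block still costs a full block.
    blocks = -(-requested_inodes // inodes_per_block)
    return blocks, blocks * inodes_per_block


blocks, inodes = round_up_to_full_blocks(4_000_000)
print(f"{blocks} blocks hold {inodes} inodes")
```

With these assumed sizes each block holds 1024 inodes, so a request for 4
million inodes gets rounded up to the next multiple of 1024.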

In old code (say, before 4.1 or 4.2) we went through another step: rereading
all of the inode allocation map back into memory to recompute the number of
free inodes. That would have happened in parallel on all the nodes that had
the file system mounted. Around 4.2 or so this was changed to simply update
the in-memory counters (since we know how many inodes were added, there is
no need to recount them).
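
The difference between the two approaches can be shown with a toy model in
Python. This is only a sketch: the list-based map, the FREE/IN_USE values
and the function names are all made up for illustration and say nothing
about the real on-disk or in-memory layout.

```python
FREE, IN_USE = 0, 1  # hypothetical status values for this toy model


def recount_free(alloc_map):
    """Old-style approach: scan the whole map to count free inodes."""
    return sum(1 for status in alloc_map if status == FREE)


def extend_and_update(alloc_map, free_count, added):
    """New-style approach: append free entries and bump the counter.

    Every added inode is known to be free, so no rescan is needed.
    """
    alloc_map.extend([FREE] * added)
    return free_count + added


alloc_map = [IN_USE, FREE, IN_USE, FREE]
free_count = recount_free(alloc_map)           # initial full scan
free_count = extend_and_update(alloc_map, free_count, 8)
print(free_count, recount_free(alloc_map))     # both counts agree
```

The counter update costs the same no matter how large the map is, while the
rescan grows with the total number of inodes.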

So, adding 4M inodes involves writing a little more than 16 GB of metadata
to disk, then cycling through the in-memory data structures. Writing 16 GB
in 13 seconds works out to a little more than 1 GB/s (about 1.2 GB/s).
Sounds reasonable.
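
The arithmetic can be checked with a few lines of Python. The sketch assumes
"4M" means 4 * 1024 * 1024 inodes of 4096 bytes each and ignores the extra
inode map data (the "little more" above); the function name is made up for
the example.

```python
def expansion_throughput(inodes, inode_size_bytes, seconds):
    """Return (total bytes written, bytes per second) for an expansion."""
    total_bytes = inodes * inode_size_bytes
    return total_bytes, total_bytes / seconds


GIB = 1024 ** 3
# Assumed inputs: 4 Mi inodes, 4 KiB each, 13 seconds from the logs.
total, rate = expansion_throughput(4 * 1024 * 1024, 4096, 13)
print(f"{total / GIB:.0f} GiB written, {rate / GIB:.2f} GiB/s")
```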

-Wayne
