[gpfsug-discuss] Fwd: Blocksize
Marc A Kaplan
makaplan at us.ibm.com
Thu Sep 29 16:32:47 BST 2016
Frankly, I just don't "get" what it is you seem not to be "getting" -
perhaps someone else who does "get" it can rephrase: FORGET about
Subblocks when thinking about inodes being packed into the file of all
inodes.
Additional facts that may address some of the other concerns:
I started working on GPFS at version 3.1 or so. AFAIK GPFS has always had, and
still has, one file of inodes, "packed", with no wasted space between inodes.
Period. Full stop.
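Back-of-the-envelope, just to illustrate (the inode count below is made up;
assumes 4 KB inodes and no metadata replication):

inodes=10000000                           # illustrative allocated-inode count
echo $(( inodes * 4096 / 1024**3 )) GiB   # ~38 GiB of inode file, whether the
                                          # metadata blocksize is 256 KB or 1 MB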
RAID! Now we come to a mistake that I've seen made by more than a handful
of customers!
It is generally a mistake to use RAID with parity (such as classic RAID5)
to store metadata.
Why? Because metadata is often updated with "small writes" - for example
suppose we have to update some fields in an inode, or an indirect block,
or append a log record...
With parity RAID and large stripe sizes, that means updating just one disk
sector can cost a full stripe read plus writing back the changed data and
parity sectors.
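To put a rough number on it (made-up geometry, just to show the order of
magnitude: 8+1 RAID5 with 256 KB strips):

stripe_read=$(( 8 * 256 * 1024 ))   # read the full 2 MB data stripe to recompute parity
writes=$(( 2 * 256 * 1024 ))        # write back the changed strip plus the parity strip
echo $(( (stripe_read + writes) / 4096 ))x   # ~640x the 4 KB inode actually being updated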
SO, if you want protection against storage failures for your metadata, use
RAID mirroring/replication, GPFS metadata replication, or both (belt and/or
suspenders).
(Arguments against relying solely on RAID mirroring: single enclosure/box
failure (fire!), single hardware design (bugs or defects), single
firmware/microcode (bugs).)
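For example, something like this at create time (a sketch only, with
placeholder device and stanza names; flags as documented for mmcrfs/mmchfs):

# new file system: two-way GPFS metadata replication on top of mirrored LUNs,
# data left at one copy in this sketch
mmcrfs gpfsdemo -F nsd.stanza -i 4096 -m 2 -M 2 -r 1 -R 2

# or on an existing file system, followed by a restripe to apply the change
mmchfs gpfsdemo -m 2
mmrestripefs gpfsdemo -R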
Yes, GPFS is part of "the cyber." We're making it stronger every day. But
it already is great.
--marc
From: "Buterbaugh, Kevin L" <Kevin.Buterbaugh at Vanderbilt.Edu>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 09/29/2016 11:03 AM
Subject: [gpfsug-discuss] Fwd: Blocksize
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Resending from the right e-mail address...
Begin forwarded message:
From: gpfsug-discuss-owner at spectrumscale.org
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:36 AM CDT
To: klb at accre.vanderbilt.edu
You are not allowed to post to this mailing list, and your message has
been automatically rejected. If you think that your messages are
being rejected in error, contact the mailing list owner at
gpfsug-discuss-owner at spectrumscale.org.
From: "Kevin L. Buterbaugh" <klb at accre.vanderbilt.edu>
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:29 AM CDT
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Hi Marc and others,
I understand … I guess I did a poor job of wording my question, so I’ll
try again. The IBM recommendation for metadata block size seems to be
somewhere between 256 KB and 1 MB, depending on who responds to the question.
If I were to hypothetically use a 256 KB metadata block size, does the
"1/32nd of a block" rule come into play like it does for "not metadata"? I.e.
256 KB / 32 = 8 KB, so am I reading / writing *2* inodes (assuming a 4 KB
inode size) minimum?
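Just to spell out the arithmetic I'm assuming here (256 KB metadata blocks,
1/32 subblocks, 4 KB inodes):

echo $(( 256 * 1024 / 32 ))          # subblock size: 8192 bytes
echo $(( 256 * 1024 / 32 / 4096 ))   # 4 KB inodes that would fit in one such subblock: 2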
And here’s a really off the wall question … yesterday we were discussing
the fact that there is now a single inode file. Historically, we have
always used RAID 1 mirrors (first with spinning disk, as of last fall now
on SSD) for metadata and then use GPFS replication on top of that. But
given that there is a single inode file is that “old way” of doing things
still the right way? In other words, could we potentially be better off
by using a couple of 8+2P RAID 6 LUNs?
One potential downside of that would be that we would then only have two
NSD servers serving up metadata, so we discussed the idea of taking each
RAID 6 LUN and splitting it up into multiple logical volumes (all that
done on the storage array, of course) and then presenting those to GPFS as
NSDs???
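Roughly what I have in mind, as an NSD stanza sketch (all device paths, NSD
names, and server names below are invented for illustration):

# each RAID 6 LUN carved into logical volumes on the array, each presented to
# GPFS as a metadataOnly NSD served by both NSD servers
cat > meta_nsd.stanza <<'EOF'
%nsd: device=/dev/mapper/meta_lv01
  nsd=meta01
  servers=nsdserver1,nsdserver2
  usage=metadataOnly
  failureGroup=10
  pool=system
%nsd: device=/dev/mapper/meta_lv02
  nsd=meta02
  servers=nsdserver1,nsdserver2
  usage=metadataOnly
  failureGroup=20
  pool=system
EOF
mmcrnsd -F meta_nsd.stanza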
Or have I gone from merely asking stupid questions to Trump-level
craziness???? ;-)
Kevin
On Sep 28, 2016, at 10:23 AM, Marc A Kaplan <makaplan at us.ibm.com> wrote:
OKAY, I'll say it again. inodes are PACKED into a single inode file. So
a 4KB inode takes 4KB, REGARDLESS of metadata blocksize. There is no
wasted space.
(Of course if you have metadata replication = 2, then yes, double that.
And yes, there is overhead for indirect blocks (indices), allocation maps,
etc., etc.)
And your choice is not just 512 or 4096. Maybe 1KB or 2KB is a good
choice for your data distribution, to optimize packing of data and/or
directories into inodes...
Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...
mmcrfs x2K -i 2048
[root@n2 charts]# mmlsfs x2K -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 2048                     Inode size in bytes
Works for me!
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss