<font size=2 face="sans-serif">Hi - let me try to explain....</font><br><br><font size=2 face="sans-serif"><i>[...]  I.e. 256 / 32 = 8K, so

am I reading / writing *2* inodes (assuming 4K inode size) minimum? [...]

</i></font><br><font size=2 face="sans-serif">Answer: your inodes are written to

a separate (hidden) file. With an MD blocksize of 256K and a 4K inode size, you can access

64 inodes with one IO to the file system, </font><br><font size=2 face="sans-serif">so e.g. a policy run needs to initiate

1 IO to read 64 inodes. <font size=2 face="sans-serif">If your MD blocksize were 1MB,

you could access 256 inodes with one IO to the file system metadata (policy

runs).</font><br><br><font size=2 face="sans-serif">If you write a new regular

file, an inode gets created for it and gets written into your

inode file. Forget about MD blocksize here:</font><br><font size=2 face="sans-serif"> the inode gets written directly, so you

will see an IO of 8 segments (512-byte segment size) to the MD (in case

your inode size is 4K). </font><br><br><font size=2 face="sans-serif">In addition, other metadata

is stored in the system pool, like directory blocks or indirect blocks.

These blocks are 32K, so if you chose a blocksize for MD

larger than 1 MB, you would waste some space, because 1/32 of the blocksize

 is the smallest allocatable space (1 MB / 32 = 32K). </font><br><font size=2 face="sans-serif">In one line, my advice: select a 1MB

blocksize for MD. </font><br><br><font size=2 face="sans-serif"><i>[..] disk layout.. </i></font><br><font size=2 face="sans-serif">Keep in mind that your IOPS rate is most likely

limited by your storage backend. With spinning drives you can estimate

around 100 IOPS per drive. </font><br><font size=2 face="sans-serif">Even though the metadata is stored in

a hidden file, inodes are accessed directly from/to disk during normal operation.

Your backend should be able to cache these IOs accordingly, but you

won't be able to avoid that inodes have to be flushed to disk and, the other

way round, read from disk without accessing a full stripe of your

RAID. So, depending on the backend, an N-way replication is more efficient

here than a RAID 6 or 8+2P.</font><br><br><font size=2 face="sans-serif">In addition, keep in mind that if you divide

1 MB (the blocksize from the FS) across a RAID 6 or 8+2P, the data transfer size

to each physical disk is rather small and will hurt your performance. </font><br><br><font size=2 face="sans-serif">Last but not least, in terms of IO, you can

save a lot of IOPS to the physical disk layer if you go with an N-way

replication in comparison to RAID 6, because every physical disk these

days can satisfy a 1MB IO request. So if you initiate 1 IO with

1MB size from GPFS, it can be answered with exactly 1 IO from a physical

disk </font><br><font size=2 face="sans-serif">(compared to RAID 6, where your storage

backend would have to satisfy this single 1MB IO with at least 4 or

8 IOs).</font><br><br><font size=2 face="sans-serif">MD is rather small, so the trade-off

(waste of space) can be ignored, so go with RAID 1 or N-way replication for

MD </font><br><br><br><font size=2 face="sans-serif">Hope this helps.</font><br><br><br><br><br><div><font size=2 face="sans-serif">Kind regards</font><br><br><font size=2 face="sans-serif"> <br>Olaf Weiser<br> <br>EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform<br>IBM Deutschland<br>IBM Allee 1<br>71139 Ehningen<br>Phone: +49-170-579-44-66<br>E-Mail: olaf.weiser@de.ibm.com</font><br><br><br><br><font size=1 color=#5f5f5f face="sans-serif">From:      

 </font><font size=1 face="sans-serif">"Buterbaugh, Kevin

L" <Kevin.Buterbaugh@Vanderbilt.Edu></font><br><font size=1 color=#5f5f5f face="sans-serif">To:      

 </font><font size=1 face="sans-serif">gpfsug main discussion

list <gpfsug-discuss@spectrumscale.org></font><br><font size=1 color=#5f5f5f face="sans-serif">Date:      

 </font><font size=1 face="sans-serif">09/29/2016 05:03 PM</font><br><font size=1 color=#5f5f5f face="sans-serif">Subject:    

   </font><font size=1 face="sans-serif">[gpfsug-discuss]

Fwd:  Blocksize</font><br><font size=1 color=#5f5f5f face="sans-serif">Sent by:    

   </font><font size=1 face="sans-serif">gpfsug-discuss-bounces@spectrumscale.org</font><br><hr noshade><br><br><br><font size=3>Resending from the right e-mail address...</font><br><br><font size=3>Begin forwarded message:</font><br><br><font size=3 face="sans-serif"><b>From: </b></font><a href="mailto:gpfsug-discuss-owner@spectrumscale.org"><font size=3 color=blue face="sans-serif"><u>gpfsug-discuss-owner@spectrumscale.org</u></font></a><br><font size=3 face="sans-serif"><b>Subject: Re: [gpfsug-discuss] Blocksize</b></font><br><font size=3 face="sans-serif"><b>Date: </b>September 29, 2016 at 10:00:36

AM CDT</font><br><font size=3 face="sans-serif"><b>To: </b></font><a href=mailto:klb@accre.vanderbilt.edu><font size=3 color=blue face="sans-serif"><u>klb@accre.vanderbilt.edu</u></font></a><br><br><font size=3>You are not allowed to post to this mailing list, and

your message has<br>been automatically rejected.  If you think that your messages are<br>being rejected in error, contact the mailing list owner at</font><font size=3 color=blue><u><br></u></font><a href="mailto:gpfsug-discuss-owner@spectrumscale.org"><font size=3 color=blue><u>gpfsug-discuss-owner@spectrumscale.org</u></font></a><font size=3>.<br><br></font><br><font size=3 face="sans-serif"><b>From: </b>"Kevin L. Buterbaugh"

<</font><a href=mailto:klb@accre.vanderbilt.edu><font size=3 color=blue face="sans-serif"><u>klb@accre.vanderbilt.edu</u></font></a><font size=3 face="sans-serif">></font><br><font size=3 face="sans-serif"><b>Subject: Re: [gpfsug-discuss] Blocksize</b></font><br><font size=3 face="sans-serif"><b>Date: </b>September 29, 2016 at 10:00:29

AM CDT</font><br><font size=3 face="sans-serif"><b>To: </b>gpfsug main discussion list

<</font><a href="mailto:gpfsug-discuss@spectrumscale.org"><font size=3 color=blue face="sans-serif"><u>gpfsug-discuss@spectrumscale.org</u></font></a><font size=3 face="sans-serif">></font><br><font size=3><br></font><br><font size=3>Hi Marc and others, </font><br><br><font size=3>I understand … I guess I did a poor job of wording my

question, so I’ll try again.  The IBM recommendation for metadata

block size seems to be somewhere between 256K - 1 MB, depending on who

responds to the question.  If I were to hypothetically use a 256K

metadata block size, does the “1/32nd of a block” come into play like

it does for “not metadata”?  I.e. 256 / 32 = 8K, so am I reading

/ writing *2* inodes (assuming 4K inode size) minimum?</font><br><br><font size=3>And here’s a really off the wall question … yesterday

we were discussing the fact that there is now a single inode file.  Historically,

we have always used RAID 1 mirrors (first with spinning disk, as of last

fall now on SSD) for metadata and then use GPFS replication on top of that.

 But given that there is a single inode file, is that “old way” of

doing things still the right way?  In other words, could we potentially

be better off by using a couple of 8+2P RAID 6 LUNs?</font><br><br><font size=3>One potential downside of that would be that we would

then only have two NSD servers serving up metadata, so we discussed the

idea of taking each RAID 6 LUN and splitting it up into multiple logical

volumes (all that done on the storage array, of course) and then presenting

those to GPFS as NSDs???</font><br><br><font size=3>Or have I gone from merely asking stupid questions to

Trump-level craziness????  ;-)</font><br><br><font size=3>Kevin</font><br><br><font size=3>On Sep 28, 2016, at 10:23 AM, Marc A Kaplan <</font><a href=mailto:makaplan@us.ibm.com><font size=3 color=blue><u>makaplan@us.ibm.com</u></font></a><font size=3>>

wrote:</font><br><br><font size=2 face="sans-serif">OKAY, I'll say it again.  inodes

are PACKED into a single inode file.  So a 4KB inode takes 4KB, REGARDLESS

of metadata blocksize.  There is no wasted space.<font size=3> <font size=2 face="sans-serif"> (Of course if you have metadata replication = 2, then yes, double that.

 And yes, there is overhead for indirect blocks (indices), allocation

maps, etc, etc.)<br><br>And your choice is not just 512 or 4096.  Maybe 1KB or 2KB is a good

choice for your data distribution, to optimize packing of data and/or directories

into inodes...</font><font size=3><br></font><font size=2 face="sans-serif"><br>Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...</font><font size=3><br></font><font size=2 face="sans-serif"><br>mmcrfs x2K -i 2048</font><font size=3><br></font><font size=2 face="sans-serif"><br>[root@n2 charts]# mmlsfs x2K -i<br>flag                value  

                 description<br>------------------- ------------------------ -----------------------------------<br> -i                 2048  

                  Inode size

in bytes</font><font size=3><br></font><font size=2 face="sans-serif"><br>Works for me!</font><font size=3><br>_______________________________________________<br>gpfsug-discuss mailing list<br>gpfsug-discuss at </font><a href=http://spectrumscale.org/><font size=3 color=blue><u>spectrumscale.org</u></font></a><font size=3 color=blue><u><br></u></font><a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss"><font size=3 color=blue><u>http://gpfsug.org/mailman/listinfo/gpfsug-discuss</u></font></a><br><br><font size=3><br></font><br><br><br></div><BR>
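<br><font size=2 face="sans-serif">The block-size arithmetic in the reply at the top of this thread can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration, not anything GPFS itself provides: the function names are made up for this example, the 4K inode size and 32K directory/indirect block size are the values quoted in the thread, and the sketch assumes that an allocation is rounded up to whole subblocks of 1/32 of the blocksize and that an 8+2P RAID 6 array splits a full-block IO evenly over its 8 data disks.</font><br>

```python
# Back-of-the-envelope metadata arithmetic; all sizes are in bytes.
# Function names are illustrative and not part of any GPFS tooling.

KIB = 1024
MIB = 1024 * KIB


def inodes_per_block(md_blocksize, inode_size=4 * KIB):
    """Packed inodes covered by one full metadata-block IO."""
    return md_blocksize // inode_size


def subblock_size(blocksize, subblocks_per_block=32):
    """Smallest allocatable unit under the 1/32-of-a-block rule."""
    return blocksize // subblocks_per_block


def wasted_per_small_block(md_blocksize, small_block=32 * KIB):
    """Bytes lost when a 32K directory or indirect block is rounded up
    to a whole number of subblocks (an assumption of this sketch)."""
    sub = subblock_size(md_blocksize)
    subblocks_needed = -(-small_block // sub)  # ceiling division
    return subblocks_needed * sub - small_block


def per_disk_transfer(io_size, data_disks):
    """Transfer size per data disk when one IO is striped evenly."""
    return io_size // data_disks


if __name__ == "__main__":
    for bs in (256 * KIB, 1 * MIB, 4 * MIB):
        print(
            f"MD blocksize {bs // KIB}K: "
            f"{inodes_per_block(bs)} inodes per IO, "
            f"subblock {subblock_size(bs) // KIB}K, "
            f"waste per 32K block {wasted_per_small_block(bs) // KIB}K"
        )
    # One 1MB IO: a single disk IO with replication, 8 smaller ones on 8+2P.
    print(f"8+2P per-disk transfer: {per_disk_transfer(MIB, 8) // KIB}K")
```

<font size=2 face="sans-serif">Under these assumptions, metadata blocksizes up to 1MB lose no space on 32K blocks, while larger blocksizes do, which is consistent with the 1MB recommendation above.</font><br>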