[gpfsug-discuss] RAID type for system pool

Aaron Knister aaron.s.knister at nasa.gov
Wed Sep 12 23:23:58 BST 2018


It's a good question, Simon. I don't know the answer. At least, when I
started composing this e-mail (what, 5 days ago now?), I didn't.

I did a little test using dd to write directly to the NSD (not in
production, just to be clear... I've got co-workers on this list ;-) ).

Here's a partial dump of the inode prior:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
   Inode address: 1:4207872 2:4207872 size 512 nAddrs 25
   indirectionLevel=INDIRECT status=USERFILE
   objectVersion=103 generation=0x6E256E16 nlink=1
   owner uid=0 gid=0 mode=0200100644: -rw-r--r--
   blocksize code=5 (32 subblocks)
   lastBlockSubblocks=32
   checksum=0xF74A31AA is Valid

This is me writing junk to that sector of the NSD:
# dd if=/dev/urandom bs=512 of=/dev/sda seek=4207872 count=1
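
(To double-check the write landed where I intended, here's a read-back of the
same sector with plain coreutils -- the skip value is just the sector number
from the 1:4207872 disk address above, assuming 512-byte sectors; adjust the
device path for your own NSD.)
# dd if=/dev/sda bs=512 skip=4207872 count=1 2>/dev/null | od -A x -t x4 | head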

Post-junkifying:
# /usr/lpp/mmfs/bin/tsdbfs fs1 sector 1:4207872
Contents of 1 sector(s) from 1:4207872 = 0x1:403500, width1
0000000000000000: 4FA27C86 5D2076BB 6CD011DE D582F7CE  *O.|.].v.l.......*
0000000000000010: 60A708F1 A3C60FCD 7D796E3D CC97F586  *`.......}yn=....*
0000000000000020: 57B643A7 FABD7235 A2BD9B75 6DDA0771  *W.C...r5...um..q*
0000000000000030: 6A818411 0D59D1D3 2C4C7F39 2B2B529D  *j....Y..,L.9++R.*
0000000000000040: 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4  *..l}....~x=.....*
0000000000000050: B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0  *..........=.....*
0000000000000060: DA9C817C D20C0FB2 F30AAF55 C86D4155  *...|.......U.mAU*

Dump of the inode post-junkifying:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
   Inode address: 1:4207872 2:4207872 size 512 nAddrs 0
   indirectionLevel=13 status=4
   objectVersion=5738285791753303739 generation=0x9AE06C7D nlink=3955281023
   owner uid=2121809332 gid=-1867912732 mode=025076616711: prws--s--x
   flags set: exposed illCompressed dataUpdateMissRRPlus metaUpdateMiss
   blocksize code=8 (256 subblocks)
   lastBlockSubblocks=15582
   checksum=0xD582F7CE is INVALID (computed checksum=0x2A2FA283)

Attempts to access the file succeed, but I get an fsstruct error:

# /usr/lpp/mmfs/samples/debugtools/fsstructlx.awk /var/log/messages
09/12 at 17:38:03 gpfs-adm1 FSSTRUCT fs1 108 FSErrValidate 
type=inode da=00000001:0000000000403500(1:4207872) sectors=0001 
repda=[nVal=2 00000001:0000000000403500(1:4207872) 
00000002:0000000000403500(2:4207872)] data=(len=00000200) 4FA27C86 
5D2076BB 6CD011DE D582F7CE 60A708F1 A3C60FCD 7D796E3D CC97F586 57B643A7 
FABD7235 A2BD9B75 6DDA0771 6A818411 0D59D1D3 2C4C7F39 2B2B529D 9AE06C7D 
A8FB1DC9 7E783DB4 90A9E9E4 B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 DA9C817C 
D20C0FB2 F30AAF55
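
(If you're not in the habit of running the awk decoder, the raw entries are
easy enough to watch for -- a trivial sketch, assuming your mmfs log messages
land in /var/log/messages as they do here:)
# grep -i fsstruct /var/log/messages | tail -5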

It seems it *didn't* automatically repair it, but a restripe did pick it up:
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only
Scanning file system metadata, phase 1 ...
Inode 0 [fileset 0, snapshot 0 ] has mismatch in replicated disk address 
1:4206592 2:4206592
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning file system metadata, phase 4 ...
Scan completed successfully.
Scanning user file metadata ...
  100.00 % complete on Sun Aug 26 18:10:36 2018  (     69632 inodes with 
total        406 MB data processed)
Scan completed successfully.

I ran this to fix it:
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --metadata-only
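
(For peace of mind I'd also re-run the read-only check afterwards to confirm
nothing else is mismatched -- same command as before:)
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only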

And things appear better afterwards:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
   Inode address: 1:4207872 2:4207872 size 512 nAddrs 25
   indirectionLevel=INDIRECT status=USERFILE
   objectVersion=103 generation=0x6E256E16 nlink=1
   owner uid=0 gid=0 mode=0200100644: -rw-r--r--
   blocksize code=5 (32 subblocks)
   lastBlockSubblocks=32
   checksum=0xF74A31AA is Valid

This is with 4.2.3-10.

-Aaron

On 9/6/18 1:49 PM, Simon Thompson wrote:
> I thought reads were always round-robin (in some form) unless you set readReplicaPolicy.
> 
> And I thought with fsstruct errors you had to run mmfsck offline to fix them.
> 
> Simon
> ________________________________________
> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov]
> Sent: 06 September 2018 18:06
> To: gpfsug-discuss at spectrumscale.org
> Subject: Re: [gpfsug-discuss] RAID type for system pool
> 
> Answers inline based on my recollection of experiences we've had here:
> 
> On 9/6/18 12:19 PM, Bryan Banister wrote:
>> I have questions about how the GPFS metadata replication of 3 works.
>>
>>   1. Is it basically the same as replication of 2 but just have one more
>>      copy, making recovery much more likely?
> 
> That's my understanding.
> 
>>   2. If there is nothing that is checking that the data was correctly
>>      read off of the device (e.g. CRC checking ON READS like the DDNs do,
>>      T10PI or Data Integrity Field) then how does GPFS handle a corrupted
>>      read of the data?
>>      - unlikely with SSD but head could be off on a NLSAS read, no
>>      errors, but you get some garbage instead, plus no auto retries
> 
> The inode itself is checksummed:
> 
> # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs
> Enter command or null to read next sector.  Type ? for help.
> inode 20087366
> Inode 20087366 [20087366] snap 0 (index 582 in block 9808):
>     Inode address: 30:263275078 32:263264838 size 512 nAddrs 32
>     indirectionLevel=3 status=USERFILE
>     objectVersion=49352 generation=0x2B519B3 nlink=1
>     owner uid=8675309 gid=999 mode=0200100600: -rw-------
>     blocksize code=5 (32 subblocks)
>     lastBlockSubblocks=1
>     checksum=0xF2EF3427 is Valid
> ...
>     Disk pointers [32]:
>       0:  31:217629376    1:  30:217632960    2: (null)         ...
>      31: (null)
> 
> as are indirect blocks (I'm sure that's not an exhaustive list of
> checksummed metadata structures):
> 
> ind 31:217629376
> Indirect block starting in sector 31:217629376:
>     magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366
>     indirection level=2
>     checksum=0x6BDAA92A
>     CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A
>     Data pointers:
> 
>>   3. Does GPFS read at least two of the three replicas and compares them
>>      to ensure the data is correct?
>>      - expensive operation, so very unlikely
> 
> I don't know, but I do know it verifies the checksum and I believe if
> that's wrong it will try another replica.
> 
>>   4. If not reading multiple replicas for comparison, are reads round
>>      robin across all three copies?
> 
> I feel like we see pretty even distribution of reads across all replicas
> of our metadata LUNs, although this is looking overall at the array
> level so it may be a red herring.
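> 
> (If you'd rather steer that than guess at it, readReplicaPolicy is the
> tunable that controls which replica gets read -- a sketch only; I haven't
> verified what 4.2.3 defaults to, so treat the value as an assumption:)
> # /usr/lpp/mmfs/bin/mmchconfig readReplicaPolicy=local
> # /usr/lpp/mmfs/bin/mmlsconfig readReplicaPolicy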
> 
>>   5. If one replica is corrupted (bad blocks) what does GPFS do to
>>      recover this metadata copy?  Is this automatic or does this require
>>      a manual `mmrestripefs -c` operation or something?
>>      - If not, seems like a pretty simple idea and maybe an RFE worthy
>>      submission
> 
> My experience has been it will attempt to correct it (and maybe log an
> fsstruct error?). This was in the 3.5 days, though.
> 
>>   6. Would the idea of an option to run “background scrub/verifies” of
>>      the data/metadata be worthwhile to ensure no hidden bad blocks?
>>      - Using QoS this should be relatively painless
> 
> If you don't have array-level background scrubbing, this is what I'd
> suggest (e.g. a periodic mmrestripefs -c --metadata-only).
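> 
> (A sketch of what I have in mind, assuming you're on a level with the QoS
> support that appeared around 4.2.1 -- the mmchqos numbers and the --qos flag
> are placeholders/assumptions, so check the docs for your exact release:)
> # /usr/lpp/mmfs/bin/mmchqos fs1 --enable pool=system,other=unlimited,maintenance=300IOPS
> # /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --metadata-only --qos maintenance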
> 
>>   7. With a drive failure do you have to delete the NSD from the file
>>      system and cluster, recreate the NSD, add it back to the FS, then
>>      again run the `mmrestripefs -c` operation to restore the replication?
>>      - As Kevin mentions this will end up being a FULL file system scan
>>      vs. a block-based scan and replication.  That could take a long time
>>      depending on number of inodes and type of storage!
>>
>> Thanks for any insight,
>>
>> -Bryan
>>
>> *From:* gpfsug-discuss-bounces at spectrumscale.org
>> <gpfsug-discuss-bounces at spectrumscale.org> *On Behalf Of *Buterbaugh,
>> Kevin L
>> *Sent:* Thursday, September 6, 2018 9:59 AM
>> *To:* gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> *Subject:* Re: [gpfsug-discuss] RAID type for system pool
>>
>> Hi All,
>>
>> Wow - my query got more responses than I expected and my sincere thanks
>> to all who took the time to respond!
>>
>> At this point in time we do have two GPFS filesystems … one which is
>> basically “/home” and some software installations and the other which is
>> “/scratch” and “/data” (former backed up, latter not).  Both of them
>> have their metadata on SSDs set up as RAID 1 mirrors and replication set
>> to two.  But at this point in time all of the SSDs are in a single
>> storage array (albeit with dual redundant controllers) … so the storage
>> array itself is my only SPOF.
>>
>> As part of the hardware purchase we are in the process of making we will
>> be buying a 2nd storage array that can house 2.5” SSDs.  Therefore, we
>> will be splitting our SSDs between chassis and eliminating that last
>> SPOF.  Of course, this includes the new SSDs we are getting for our new
>> /home filesystem.
>>
>> Our plan right now is to buy 10 SSDs, which will allow us to test 3
>> configurations:
>>
>> 1) two 4+1P RAID 5 LUNs split up into a total of 8 LV’s (with each of my
>> 8 NSD servers as primary for one of those LV’s and the other 7 as
>> backups) and GPFS metadata replication set to 2.
>>
>> 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS
>> metadata replication set to 2.  This would mean that only 4 of my 8 NSD
>> servers would be a primary.
>>
>> 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3
>> (which leaves 1 SSD unused).  All 8 NSD servers primary for one SSD and
>> 1 serving up two.
>>
>> The responses I received concerning RAID 5 and performance were not a
>> surprise to me.  The main advantage that option gives is the most usable
>> storage space for the money (in fact, it gives us way more storage space
>> than we currently need) … but if it tanks performance, then that’s a
>> deal breaker.
>>
>> Personally, I like the four RAID 1 mirrors config like we’ve been using
>> for years, but it has the disadvantage of giving us the least usable
>> storage space … that config would give us the minimum we need for right
>> now, but doesn’t really allow for much future growth.
>>
>> I have no experience with metadata replication of 3 (but had actually
>> thought of that option, so I feel good that others suggested it), so option
>> 3 will be a brand new experience for us.  It is the best fit in
>> terms of meeting current needs while allowing for future growth without
>> giving us way more space than we are likely to need.  I will be curious
>> to see how long it takes GPFS to re-replicate the data when we simulate
>> a drive failure as opposed to how long a RAID rebuild takes.
>>
>> I am a big believer in Murphy’s Law (Sunday I paid off a bill, Wednesday
>> my refrigerator died!) … and also believe that the definition of a
>> pessimist is “someone with experience” <grin> … so we will definitely
>> not set GPFS metadata replication to less than two, nor will we use
>> non-Enterprise class SSDs for metadata … but I do still appreciate the
>> suggestions.
>>
>> If there is interest, I will report back on our findings.  If anyone has
>> any additional thoughts or suggestions, I’d also appreciate hearing
>> them.  Again, thank you!
>>
>> Kevin
>>
>>>>
>> Kevin Buterbaugh - Senior System Administrator
>>
>> Vanderbilt University - Advanced Computing Center for Research and Education
>>
>> Kevin.Buterbaugh at vanderbilt.edu
>> <mailto:Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633
>>
> 

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


