[gpfsug-discuss] RAID type for system pool
Aaron Knister
aaron.s.knister at nasa.gov
Wed Sep 12 23:23:58 BST 2018
It's a good question, Simon. I don't know the answer. At least, when I
started composing this e-mail (what, five days ago now?) I didn't.
I did a little test using dd to write directly to the NSD (not in
production, just to be clear... I've got co-workers on this list ;-) ).
Here's a partial dump of the inode prior:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
Inode address: 1:4207872 2:4207872 size 512 nAddrs 25
indirectionLevel=INDIRECT status=USERFILE
objectVersion=103 generation=0x6E256E16 nlink=1
owner uid=0 gid=0 mode=0200100644: -rw-r--r--
blocksize code=5 (32 subblocks)
lastBlockSubblocks=32
checksum=0xF74A31AA is Valid
This is me writing junk to that sector of the NSD:
# dd if=/dev/urandom bs=512 of=/dev/sda seek=4207872 count=1
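(If you'd like to try the overwrite step without a sacrificial NSD, the
same dd arithmetic works against a scratch file; the path, size, and
offset below are made up for illustration.)

```shell
# Stand-in for an NSD: a 16 MiB scratch file (illustrative only).
truncate -s 16M /tmp/fake-nsd.img

# Clobber one 512-byte sector at an arbitrary offset, mirroring the dd
# invocation above; conv=notrunc keeps the rest of the file intact.
dd if=/dev/urandom of=/tmp/fake-nsd.img bs=512 seek=8192 count=1 conv=notrunc

# Read the sector back to confirm the junk landed where we aimed.
dd if=/tmp/fake-nsd.img bs=512 skip=8192 count=1 2>/dev/null | od -A x -t x4 | head -4
```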
Post-junkifying:
# /usr/lpp/mmfs/bin/tsdbfs fs1 sector 1:4207872
Contents of 1 sector(s) from 1:4207872 = 0x1:403500, width1
0000000000000000: 4FA27C86 5D2076BB 6CD011DE D582F7CE *O.|.].v.l.......*
0000000000000010: 60A708F1 A3C60FCD 7D796E3D CC97F586 *`.......}yn=....*
0000000000000020: 57B643A7 FABD7235 A2BD9B75 6DDA0771 *W.C...r5...um..q*
0000000000000030: 6A818411 0D59D1D3 2C4C7F39 2B2B529D *j....Y..,L.9++R.*
0000000000000040: 9AE06C7D A8FB1DC9 7E783DB4 90A9E9E4 *..l}....~x=.....*
0000000000000050: B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 *..........=.....*
0000000000000060: DA9C817C D20C0FB2 F30AAF55 C86D4155 *...|.......U.mAU*
Dump of the inode post-junkifying:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
Inode address: 1:4207872 2:4207872 size 512 nAddrs 0
indirectionLevel=13 status=4
objectVersion=5738285791753303739 generation=0x9AE06C7D nlink=3955281023
owner uid=2121809332 gid=-1867912732 mode=025076616711: prws--s--x
flags set: exposed illCompressed dataUpdateMissRRPlus metaUpdateMiss
blocksize code=8 (256 subblocks)
lastBlockSubblocks=15582
checksum=0xD582F7CE is INVALID (computed checksum=0x2A2FA283)
Attempts to access the file succeed but I get an fsstruct error:
# /usr/lpp/mmfs/samples/debugtools/fsstructlx.awk /var/log/messages
09/12 at 17:38:03 gpfs-adm1 FSSTRUCT fs1 108 FSErrValidate
type=inode da=00000001:0000000000403500(1:4207872) sectors=0001
repda=[nVal=2 00000001:0000000000403500(1:4207872)
00000002:0000000000403500(2:4207872)] data=(len=00000200) 4FA27C86
5D2076BB 6CD011DE D582F7CE 60A708F1 A3C60FCD 7D796E3D CC97F586 57B643A7
FABD7235 A2BD9B75 6DDA0771 6A818411 0D59D1D3 2C4C7F39 2B2B529D 9AE06C7D
A8FB1DC9 7E783DB4 90A9E9E4 B2D0E9C9 CC7FEBC0 85F23DF8 F18D19C0 DA9C817C
D20C0FB2 F30AAF55
It *didn't* automatically repair it, it seems. The restripe did pick it up:
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only
Scanning file system metadata, phase 1 ...
Inode 0 [fileset 0, snapshot 0 ] has mismatch in replicated disk address
1:4206592 2:4206592
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning file system metadata, phase 4 ...
Scan completed successfully.
Scanning user file metadata ...
100.00 % complete on Sun Aug 26 18:10:36 2018 ( 69632 inodes with
total 406 MB data processed)
Scan completed successfully.
I ran this to fix it:
# /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --metadata-only
And things appear better afterwards:
# /usr/lpp/mmfs/bin/tsdbfs fs1 inode 23808
Inode 23808 [23808] snap 0 (index 1280 in block 11):
Inode address: 1:4207872 2:4207872 size 512 nAddrs 25
indirectionLevel=INDIRECT status=USERFILE
objectVersion=103 generation=0x6E256E16 nlink=1
owner uid=0 gid=0 mode=0200100644: -rw-r--r--
blocksize code=5 (32 subblocks)
lastBlockSubblocks=32
checksum=0xF74A31AA is Valid
This is with 4.2.3-10.
-Aaron
On 9/6/18 1:49 PM, Simon Thompson wrote:
> I thought reads were always round-robined (in some form) unless you set readReplicaPolicy.
>
> And I thought with fsstruct you had to use mmfsck offline to fix.
>
> Simon
> ________________________________________
> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov]
> Sent: 06 September 2018 18:06
> To: gpfsug-discuss at spectrumscale.org
> Subject: Re: [gpfsug-discuss] RAID type for system pool
>
> Answers inline based on my recollection of experiences we've had here:
>
> On 9/6/18 12:19 PM, Bryan Banister wrote:
>> I have questions about how the GPFS metadata replication of 3 works.
>>
>> 1. Is it basically the same as replication of 2 but just have one more
>> copy, making recovery much more likely?
>
> That's my understanding.
>
>> 2. If there is nothing that is checking that the data was correctly
>> read off of the device (e.g. CRC checking ON READS like the DDNs do,
>> T10PI or Data Integrity Field) then how does GPFS handle a corrupted
>> read of the data?
>> - unlikely with SSD, but a head could be off on an NL-SAS read: no
>> errors, but you get some garbage back instead, plus no auto retries
>
> The inode itself is checksummed:
>
> # /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs
> Enter command or null to read next sector. Type ? for help.
> inode 20087366
> Inode 20087366 [20087366] snap 0 (index 582 in block 9808):
> Inode address: 30:263275078 32:263264838 size 512 nAddrs 32
> indirectionLevel=3 status=USERFILE
> objectVersion=49352 generation=0x2B519B3 nlink=1
> owner uid=8675309 gid=999 mode=0200100600: -rw-------
> blocksize code=5 (32 subblocks)
> lastBlockSubblocks=1
> checksum=0xF2EF3427 is Valid
> ...
> Disk pointers [32]:
> 0: 31:217629376 1: 30:217632960 2: (null) ...
> 31: (null)
>
> as are indirect blocks (I'm sure that's not an exhaustive list of
> checksummed metadata structures):
>
> ind 31:217629376
> Indirect block starting in sector 31:217629376:
> magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366
> indirection level=2
> checksum=0x6BDAA92A
> CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A
> Data pointers:
>
>> 3. Does GPFS read at least two of the three replicas and compares them
>> to ensure the data is correct?
>> - expensive operation, so very unlikely
>
> I don't know, but I do know it verifies the checksum and I believe if
> that's wrong it will try another replica.
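> To be clear, that's a guess at the behavior, not GPFS source. The
> shape of the read path I'm imagining is roughly this (cksum's CRC is
> just a stand-in for whatever checksum GPFS actually uses internally,
> and the file names are made up):

```shell
# Illustrative replica-fallback read: return the first 512-byte
# "replica" whose CRC matches the expected value.
good=/tmp/replica_good
bad=/tmp/replica_bad
head -c 512 /dev/urandom > "$good"
head -c 512 /dev/zero   > "$bad"
expected=$(cksum < "$good" | awk '{print $1}')

read_with_fallback() {
    for replica in "$@"; do
        # Verify the stored checksum before trusting this copy.
        if [ "$(cksum < "$replica" | awk '{print $1}')" = "$expected" ]; then
            echo "$replica"   # first replica that verifies wins
            return 0
        fi
    done
    echo "all replicas failed checksum validation" >&2
    return 1
}

read_with_fallback "$bad" "$good"   # skips the corrupt copy, selects /tmp/replica_good
```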
>
>> 4. If not reading multiple replicas for comparison, are reads round
>> robin across all three copies?
>
> I feel like we see pretty even distribution of reads across all replicas
> of our metadata LUNs, although this is looking overall at the array
> level so it may be a red herring.
>
>> 5. If one replica is corrupted (bad blocks) what does GPFS do to
>> recover this metadata copy? Is this automatic or does this require
>> a manual `mmrestripefs -c` operation or something?
>> - If not, seems like a pretty simple idea and maybe an RFE worthy
>> submission
>
> My experience has been it will attempt to correct it (and maybe log an
> fsstruct error?). This was in the 3.5 days, though.
>
>> 6. Would the idea of an option to run “background scrub/verifies” of
>> the data/metadata be worthwhile to ensure no hidden bad blocks?
>> - Using QoS this should be relatively painless
>
> If you don't have array-level background scrubbing, this is what I'd
> suggest (e.g. mmrestripefs -c --metadata-only).
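> Concretely, something like a weekly cron entry, with mmchqos capping
> the maintenance class first so the scan can't starve production I/O
> (the schedule and the IOPS figure below are placeholders):

```
# /etc/cron.d/gpfs-metadata-scrub -- illustrative sketch only
# Cap maintenance-class I/O, then run a read-only metadata scrub.
0 2 * * 0  root  /usr/lpp/mmfs/bin/mmchqos fs1 --enable pool=system,maintenance=1000IOPS && /usr/lpp/mmfs/bin/mmrestripefs fs1 -c --read-only --metadata-only
```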
>
>> 7. With a drive failure do you have to delete the NSD from the file
>> system and cluster, recreate the NSD, add it back to the FS, then
>> again run the `mmrestripefs -c` operation to restore the replication?
>> - As Kevin mentions this will end up being a FULL file system scan
>> vs. a block-based scan and replication. That could take a long time
>> depending on number of inodes and type of storage!
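> My recollection of the sequence Bryan describes is roughly the
> following (disk and stanza file names are placeholders; check the
> mmdeldisk/mmadddisk docs for your release):

```
# Sketch of the drive-replacement sequence from the question above.
mmdeldisk fs1 meta_nsd_07            # drain and drop the failed disk from the FS
mmdelnsd meta_nsd_07                 # remove the NSD definition from the cluster
mmcrnsd -F /tmp/replacement.stanza   # recreate the NSD on the new drive
mmadddisk fs1 -F /tmp/replacement.stanza
mmrestripefs fs1 -c --metadata-only  # restore replication (full metadata scan)
```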
>>
>> Thanks for any insight,
>>
>> -Bryan
>>
>> *From:* gpfsug-discuss-bounces at spectrumscale.org
>> <gpfsug-discuss-bounces at spectrumscale.org> *On Behalf Of *Buterbaugh,
>> Kevin L
>> *Sent:* Thursday, September 6, 2018 9:59 AM
>> *To:* gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
>> *Subject:* Re: [gpfsug-discuss] RAID type for system pool
>>
>>
>> Hi All,
>>
>> Wow - my query got more responses than I expected and my sincere thanks
>> to all who took the time to respond!
>>
>> At this point in time we do have two GPFS filesystems … one which is
>> basically “/home” and some software installations and the other which is
>> “/scratch” and “/data” (the former backed up, the latter not). Both of them
>> have their metadata on SSDs set up as RAID 1 mirrors and replication set
>> to two. But at this point in time all of the SSDs are in a single
>> storage array (albeit with dual redundant controllers) … so the storage
>> array itself is my only SPOF.
>>
>> As part of the hardware purchase we are in the process of making we will
>> be buying a 2nd storage array that can house 2.5” SSDs. Therefore, we
>> will be splitting our SSDs between chassis and eliminating that last
>> SPOF. Of course, this includes the new SSDs we are getting for our new
>> /home filesystem.
>>
>> Our plan right now is to buy 10 SSDs, which will allow us to test 3
>> configurations:
>>
>> 1) two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of my
>> 8 NSD servers as primary for one of those LVs and the other 7 as
>> backups) and GPFS metadata replication set to 2.
>>
>> 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS
>> metadata replication set to 2. This would mean that only 4 of my 8 NSD
>> servers would be a primary.
>>
>> 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3
>> (which leaves 1 SSD unused). All 8 NSD servers primary for one SSD and
>> 1 serving up two.
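>> (For reference, the metadata-only NSD stanzas for option 3 would look
>> something like the following; the device names, servers, and failure
>> groups are placeholders.)

```
# mmcrnsd stanza sketch for option 3: one metadata-only NSD per SSD,
# with at least three failure groups so replication of 3 has somewhere to go.
%nsd: device=/dev/sdb nsd=meta_ssd_01 servers=nsd1,nsd2 usage=metadataOnly failureGroup=1 pool=system
%nsd: device=/dev/sdb nsd=meta_ssd_02 servers=nsd2,nsd3 usage=metadataOnly failureGroup=2 pool=system
%nsd: device=/dev/sdb nsd=meta_ssd_03 servers=nsd3,nsd4 usage=metadataOnly failureGroup=3 pool=system
```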
>>
>> The responses I received concerning RAID 5 and performance were not a
>> surprise to me. The main advantage that option gives is the most usable
>> storage space for the money (in fact, it gives us way more storage space
>> than we currently need) … but if it tanks performance, then that’s a
>> deal breaker.
>>
>> Personally, I like the four RAID 1 mirrors config like we’ve been using
>> for years, but it has the disadvantage of giving us the least usable
>> storage space … that config would give us the minimum we need for right
>> now, but doesn’t really allow for much future growth.
>>
>> I have no experience with metadata replication of 3 (but had actually
>> thought of that option, so feel good that others suggested it) so option
>> 3 will be a brand new experience for us. It is the most optimal in
>> terms of meeting current needs plus allowing for future growth (without
>> giving us way more space than we are likely to need). I will be curious
>> to see how long it takes GPFS to re-replicate the data when we simulate
>> a drive failure as opposed to how long a RAID rebuild takes.
>>
>> I am a big believer in Murphy’s Law (Sunday I paid off a bill, Wednesday
>> my refrigerator died!) … and also believe that the definition of a
>> pessimist is “someone with experience” <grin> … so we will definitely
>> not set GPFS metadata replication to less than two, nor will we use
>> non-Enterprise class SSDs for metadata … but I do still appreciate the
>> suggestions.
>>
>> If there is interest, I will report back on our findings. If anyone has
>> any additional thoughts or suggestions, I’d also appreciate hearing
>> them. Again, thank you!
>>
>> Kevin
>>
>> —
>>
>> Kevin Buterbaugh - Senior System Administrator
>>
>> Vanderbilt University - Advanced Computing Center for Research and Education
>>
>> Kevin.Buterbaugh at vanderbilt.edu
>> <mailto:Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633
>>
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776