[gpfsug-discuss] RAID type for system pool

Simon Thompson S.J.Thompson at bham.ac.uk
Thu Sep 6 18:49:25 BST 2018


I thought reads were always round-robin (in some form) unless you set readReplicaPolicy.
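
For what it's worth, a quick way to check and change that knob if anyone
wants to experiment (the value below is just an example, not a
recommendation):

# /usr/lpp/mmfs/bin/mmlsconfig readReplicaPolicy
# /usr/lpp/mmfs/bin/mmchconfig readReplicaPolicy=local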

And I thought that with fsstruct errors you had to run mmfsck offline to fix them.
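
Roughly like this (the file system name is just a placeholder, and -n
only reports; rerun with -y to actually repair):

# /usr/lpp/mmfs/bin/mmumount gpfs0 -a
# /usr/lpp/mmfs/bin/mmfsck gpfs0 -n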

Simon
________________________________________
From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov]
Sent: 06 September 2018 18:06
To: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] RAID type for system pool

Answers inline based on my recollection of experiences we've had here:

On 9/6/18 12:19 PM, Bryan Banister wrote:
> I have questions about how the GPFS metadata replication of 3 works.
>
>  1. Is it basically the same as replication of 2, just with one more
>     copy, making recovery much more likely?

That's my understanding.
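
If you want to check or change it, the knobs are the file system's
default and maximum metadata replicas. Something like the following,
assuming the file system was created with -M 3 ("gpfs0" is a
placeholder):

# /usr/lpp/mmfs/bin/mmlsfs gpfs0 -m -M
# /usr/lpp/mmfs/bin/mmchfs gpfs0 -m 3
# /usr/lpp/mmfs/bin/mmrestripefs gpfs0 -R

(-R re-replicates existing files to match the new default.)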

>  2. If there is nothing that is checking that the data was correctly
>     read off of the device (e.g. CRC checking ON READS like the DDNs do,
>     T10PI or Data Integrity Field) then how does GPFS handle a corrupted
>     read of the data?
>     - unlikely with SSD, but the head could be off on an NL-SAS read:
>     no errors, just some garbage back instead, and no auto retries

The inode itself is checksummed:

# /usr/lpp/mmfs/bin/tsdbfs mysuperawesomespacefs
Enter command or null to read next sector.  Type ? for help.
inode 20087366
Inode 20087366 [20087366] snap 0 (index 582 in block 9808):
   Inode address: 30:263275078 32:263264838 size 512 nAddrs 32
   indirectionLevel=3 status=USERFILE
   objectVersion=49352 generation=0x2B519B3 nlink=1
   owner uid=8675309 gid=999 mode=0200100600: -rw-------
   blocksize code=5 (32 subblocks)
   lastBlockSubblocks=1
   checksum=0xF2EF3427 is Valid
...
   Disk pointers [32]:
     0:  31:217629376    1:  30:217632960    2: (null)         ...
    31: (null)

as are indirect blocks (I'm sure that's not an exhaustive list of
checksummed metadata structures):

ind 31:217629376
Indirect block starting in sector 31:217629376:
   magic=0x112DF307 generation=0x2B519B3 blockNum=0 inodeNum=20087366
   indirection level=2
   checksum=0x6BDAA92A
   CalcChecksum(0x5B6DC9FC000, 32768, 20)=0x6BDAA92A
   Data pointers:

>  3. Does GPFS read at least two of the three replicas and compares them
>     to ensure the data is correct?
>     - expensive operation, so very unlikely

I don't know, but I do know it verifies the checksum, and I believe that
if the checksum is wrong it will try another replica.

>  4. If not reading multiple replicas for comparison, are reads round
>     robin across all three copies?

I feel like we see a pretty even distribution of reads across all
replicas of our metadata LUNs, although that is measured at the array
level, so it may be a red herring.
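
One way to sanity-check that from the GPFS side rather than the array
is the recent I/O history on an NSD server, which tags each I/O with
read/write and the disk it went to:

# /usr/lpp/mmfs/bin/mmdiag --iohist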

>  5. If one replica is corrupted (bad blocks) what does GPFS do to
>     recover this metadata copy?  Is this automatic or does this require
>     a manual `mmrestripefs -c` operation or something?
>     - If not, it seems like a pretty simple idea and maybe an
>     RFE-worthy submission

My experience has been that it will attempt to correct it (and maybe
log an fsstruct error?). That was back in the 3.5 days, though.

>  6. Would the idea of an option to run “background scrub/verifies” of
>     the data/metadata be worthwhile to ensure no hidden bad blocks?
>     - Using QoS this should be relatively painless

If you don't have array-level background scrubbing, this is what I'd
suggest (e.g. mmrestripefs -c --metadata-only).
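
A rough sketch of what that could look like with QoS throttling (device
name, pool, and IOPS cap below are placeholders):

# /usr/lpp/mmfs/bin/mmchqos gpfs0 --enable pool=system,maintenance=1000IOPS,other=unlimited
# /usr/lpp/mmfs/bin/mmrestripefs gpfs0 -c --metadata-only --qos maintenance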

>  7. With a drive failure do you have to delete the NSD from the file
>     system and cluster, recreate the NSD, add it back to the FS, then
>     again run the `mmrestripefs -c` operation to restore the replication?
>     - As Kevin mentions, this will end up being a FULL file system scan
>     vs. a block-based scan and replication.  That could take a long time
>     depending on the number of inodes and the type of storage!
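
For what it's worth, the sequence I'd expect there is roughly the
following (all names are placeholders):

# /usr/lpp/mmfs/bin/mmdeldisk gpfs0 meta_nsd_07
# /usr/lpp/mmfs/bin/mmdelnsd meta_nsd_07
# /usr/lpp/mmfs/bin/mmcrnsd -F /tmp/replacement.stanza
# /usr/lpp/mmfs/bin/mmadddisk gpfs0 -F /tmp/replacement.stanza
# /usr/lpp/mmfs/bin/mmrestripefs gpfs0 -r

with -r restoring replication afterwards.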
>
> Thanks for any insight,
>
> -Bryan
>
> From: gpfsug-discuss-bounces at spectrumscale.org
> <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Buterbaugh,
> Kevin L
> Sent: Thursday, September 6, 2018 9:59 AM
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Subject: Re: [gpfsug-discuss] RAID type for system pool
>
> Note: External Email
>
> ------------------------------------------------------------------------
>
> Hi All,
>
> Wow - my query got more responses than I expected and my sincere thanks
> to all who took the time to respond!
>
> At this point in time we have two GPFS filesystems … one which is
> basically “/home” and some software installations, and the other which
> is “/scratch” and “/data” (the former backed up, the latter not).  Both
> of them have their metadata on SSDs set up as RAID 1 mirrors and
> replication set to two.  But right now all of the SSDs are in a single
> storage array (albeit with dual redundant controllers) … so the storage
> array itself is my only SPOF.
>
> As part of the hardware purchase we are in the process of making, we will
> be buying a 2nd storage array that can house 2.5” SSDs.  Therefore, we
> will be splitting our SSDs between chassis and eliminating that last
> SPOF.  Of course, this includes the new SSDs we are getting for our new
> /home filesystem.
>
> Our plan right now is to buy 10 SSDs, which will allow us to test 3
> configurations:
>
> 1) two 4+1P RAID 5 LUNs split up into a total of 8 LVs (with each of my
> 8 NSD servers as primary for one of those LVs and the other 7 as
> backups) and GPFS metadata replication set to 2.
>
> 2) four RAID 1 mirrors (which obviously leaves 2 SSDs unused) and GPFS
> metadata replication set to 2.  This would mean that only 4 of my 8 NSD
> servers would be a primary.
>
> 3) nine RAID 0 / bare drives with GPFS metadata replication set to 3
> (which leaves 1 SSD unused).  All 8 NSD servers would be primary for one
> SSD, with one of them serving up two.
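
One note on option 3: metadata replication of 3 needs the NSDs spread
across at least three failure groups, which is set in the NSD stanzas,
e.g. something like this sketch (all names are placeholders):

%nsd:
  nsd=meta_ssd_01
  device=/dev/mapper/meta_ssd_01
  servers=nsd1,nsd2
  usage=metadataOnly
  failureGroup=1
  pool=system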
>
> The responses I received concerning RAID 5 and performance were not a
> surprise to me.  The main advantage that option gives is the most usable
> storage space for the money (in fact, it gives us way more storage space
> than we currently need) … but if it tanks performance, then that’s a
> deal breaker.
>
> Personally, I like the four RAID 1 mirrors config like we’ve been using
> for years, but it has the disadvantage of giving us the least usable
> storage space … that config would give us the minimum we need for right
> now, but doesn’t really allow for much future growth.
>
> I have no experience with metadata replication of 3 (but had actually
> thought of that option, so feel good that others suggested it), so option
> 3 will be a brand new experience for us.  It is the best option in
> terms of meeting current needs plus allowing for future growth without
> giving us way more space than we are likely to need.  I will be curious
> to see how long it takes GPFS to re-replicate the data when we simulate
> a drive failure as opposed to how long a RAID rebuild takes.
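
One low-risk way to simulate that without pulling a drive (names below
are placeholders) is to suspend the NSD and time the re-protection:

# /usr/lpp/mmfs/bin/mmchdisk gpfs0 suspend -d meta_ssd_09
# time /usr/lpp/mmfs/bin/mmrestripefs gpfs0 -r --qos maintenance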
>
> I am a big believer in Murphy’s Law (Sunday I paid off a bill, Wednesday
> my refrigerator died!) … and also believe that the definition of a
> pessimist is “someone with experience” <grin> … so we will definitely
> not set GPFS metadata replication to less than two, nor will we use
> non-Enterprise class SSDs for metadata … but I do still appreciate the
> suggestions.
>
> If there is interest, I will report back on our findings.  If anyone has
> any additional thoughts or suggestions, I’d also appreciate hearing
> them.  Again, thank you!
>
> Kevin
>
> Kevin Buterbaugh - Senior System Administrator
>
> Vanderbilt University - Advanced Computing Center for Research and Education
>
> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


