[gpfsug-discuss] Migration to separate metadata and data disks

Miroslav Bauer bauer at cesnet.cz
Wed Sep 7 10:40:19 BST 2016


Hello Yuri,

Here is the actual mmdf output of the filesystem in question:
disk                disk size  failure holds    holds                 free                free
name                             group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 40 TB)
dcsh_10C                   5T        1 Yes      Yes          1.661T ( 33%)        68.48G ( 1%)
dcsh_10D               6.828T        1 Yes      Yes          2.809T ( 41%)        83.82G ( 1%)
dcsh_11C                   5T        1 Yes      Yes          1.659T ( 33%)        69.01G ( 1%)
dcsh_11D               6.828T        1 Yes      Yes           2.81T ( 41%)        83.33G ( 1%)
dcsh_12C                   5T        1 Yes      Yes          1.659T ( 33%)        69.48G ( 1%)
dcsh_12D               6.828T        1 Yes      Yes          2.807T ( 41%)        83.14G ( 1%)
dcsh_13C                   5T        1 Yes      Yes          1.659T ( 33%)        69.35G ( 1%)
dcsh_13D               6.828T        1 Yes      Yes           2.81T ( 41%)        82.97G ( 1%)
dcsh_14C                   5T        1 Yes      Yes           1.66T ( 33%)        69.06G ( 1%)
dcsh_14D               6.828T        1 Yes      Yes          2.811T ( 41%)        83.61G ( 1%)
dcsh_15C                   5T        1 Yes      Yes          1.658T ( 33%)        69.38G ( 1%)
dcsh_15D               6.828T        1 Yes      Yes          2.814T ( 41%)        83.69G ( 1%)
dcsd_15D               6.828T        1 Yes      Yes          2.811T ( 41%)        83.98G ( 1%)
dcsd_15C                   5T        1 Yes      Yes           1.66T ( 33%)        68.66G ( 1%)
dcsd_14D               6.828T        1 Yes      Yes           2.81T ( 41%)        84.18G ( 1%)
dcsd_14C                   5T        1 Yes      Yes          1.659T ( 33%)        69.43G ( 1%)
dcsd_13D               6.828T        1 Yes      Yes           2.81T ( 41%)        83.27G ( 1%)
dcsd_13C                   5T        1 Yes      Yes           1.66T ( 33%)         69.1G ( 1%)
dcsd_12D               6.828T        1 Yes      Yes           2.81T ( 41%)        83.61G ( 1%)
dcsd_12C                   5T        1 Yes      Yes           1.66T ( 33%)        69.42G ( 1%)
dcsd_11D               6.828T        1 Yes      Yes          2.811T ( 41%)        83.59G ( 1%)
dcsh_10B                   5T        1 Yes      Yes          1.633T ( 33%)        76.97G ( 2%)
dcsh_11A                   5T        1 Yes      Yes          1.632T ( 33%)        77.29G ( 2%)
dcsh_11B                   5T        1 Yes      Yes          1.633T ( 33%)        76.73G ( 1%)
dcsh_12A                   5T        1 Yes      Yes          1.634T ( 33%)        76.49G ( 1%)
dcsd_11C                   5T        1 Yes      Yes           1.66T ( 33%)        69.25G ( 1%)
dcsd_10D               6.828T        1 Yes      Yes          2.811T ( 41%)        83.39G ( 1%)
dcsh_10A                   5T        1 Yes      Yes          1.633T ( 33%)        77.06G ( 2%)
dcsd_10C                   5T        1 Yes      Yes           1.66T ( 33%)        69.83G ( 1%)
dcsd_15B                   5T        1 Yes      Yes          1.635T ( 33%)        76.52G ( 1%)
dcsd_15A                   5T        1 Yes      Yes          1.634T ( 33%)        76.24G ( 1%)
dcsd_14B                   5T        1 Yes      Yes          1.634T ( 33%)        76.31G ( 1%)
dcsd_14A                   5T        1 Yes      Yes          1.634T ( 33%)        76.23G ( 1%)
dcsd_13B                   5T        1 Yes      Yes          1.634T ( 33%)        76.13G ( 1%)
dcsd_13A                   5T        1 Yes      Yes          1.634T ( 33%)        76.22G ( 1%)
dcsd_12B                   5T        1 Yes      Yes          1.635T ( 33%)        77.49G ( 2%)
dcsd_12A                   5T        1 Yes      Yes          1.633T ( 33%)        77.13G ( 2%)
dcsd_11B                   5T        1 Yes      Yes          1.633T ( 33%)        76.86G ( 2%)
dcsd_11A                   5T        1 Yes      Yes          1.632T ( 33%)        76.22G ( 1%)
dcsd_10B                   5T        1 Yes      Yes          1.633T ( 33%)        76.79G ( 1%)
dcsd_10A                   5T        1 Yes      Yes          1.633T ( 33%)        77.21G ( 2%)
dcsh_15B                   5T        1 Yes      Yes          1.635T ( 33%)        76.04G ( 1%)
dcsh_15A                   5T        1 Yes      Yes          1.634T ( 33%)        76.84G ( 2%)
dcsh_14B                   5T        1 Yes      Yes          1.635T ( 33%)        76.75G ( 1%)
dcsh_14A                   5T        1 Yes      Yes          1.633T ( 33%)        76.05G ( 1%)
dcsh_13B                   5T        1 Yes      Yes          1.634T ( 33%)        76.35G ( 1%)
dcsh_13A                   5T        1 Yes      Yes          1.634T ( 33%)        76.68G ( 1%)
dcsh_12B                   5T        1 Yes      Yes          1.635T ( 33%)        76.74G ( 1%)
ssd_5_5                   80G        3 Yes      No           22.31G ( 28%)        7.155G ( 9%)
ssd_4_4                   80G        3 Yes      No           22.21G ( 28%)        7.196G ( 9%)
ssd_3_3                   80G        3 Yes      No            22.2G ( 28%)        7.239G ( 9%)
ssd_2_2                   80G        3 Yes      No           22.24G ( 28%)        7.146G ( 9%)
ssd_1_1                   80G        3 Yes      No           22.29G ( 28%)        7.134G ( 9%)
                 ------------- -------------------- -------------------
(pool total)           262.3T                                92.96T ( 35%)        3.621T ( 1%)

Disks in storage pool: maid4 (Maximum disk size allowed is 466 TB)
...<dataOnly disks>...
                 ------------- -------------------- -------------------
(pool total)             291T                                126.5T ( 43%)        562.6G ( 0%)

Disks in storage pool: maid5 (Maximum disk size allowed is 466 TB)
...<dataOnly disks>...
                 ------------- -------------------- -------------------
(pool total)           436.6T                                120.8T ( 28%)        25.23G ( 0%)

Disks in storage pool: maid6 (Maximum disk size allowed is 466 TB)
...<dataOnly disks>...
                 ------------- -------------------- -------------------
(pool total)           582.1T                                358.7T ( 62%)        9.458G ( 0%)

                 ============= ==================== ===================
(data)                 1.535P                                698.9T ( 44%)         4.17T ( 0%)
(metadata)             262.3T                                92.96T ( 35%)        3.621T ( 1%)
                 ============= ==================== ===================
(total)                1.535P                                  699T ( 44%)        4.205T ( 0%)

Inode Information
-----------------
Number of used inodes:        79607225
Number of free inodes:        82340423
Number of allocated inodes:  161947648
Maximum number of inodes:   1342177280
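
(As a rough sanity check -- assuming the default 512-byte inode size of GPFS 3.5,
which may not match this filesystem -- the inode file alone works out to about
161,947,648 allocated inodes x 512 B ≈ 77 GiB per metadata replica, so two
replicas of just the inode file would already exceed the ~111 GiB reported free
in full blocks on the five SSDs.)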

I have a smaller test FS with the same setup (and plenty of free space),
and the actual sequence of commands that worked for me was:
mmchfs fs1 -m 1
mmrestripefs fs1 -R
mmrestripefs fs1 -b
mmchdisk fs1 change -F ~/nsd_metadata_test   (stanza changes dataAndMetadata -> dataOnly)
mmrestripefs fs1 -r
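
For reference, ~/nsd_metadata_test is just a plain NSD stanza file listing each
SATA disk with its new usage. A minimal sketch of what I use -- the NSD names
below are illustrative, not the real ones, and as far as I understand only the
attribute being changed needs to be listed:

%nsd: nsd=test_sata_01 usage=dataOnly
%nsd: nsd=test_sata_02 usage=dataOnly

After the final mmrestripefs run, 'mmlsattr -L <file>' on a few existing files
should confirm whether their metadata replication really dropped to 1.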

Could you please elaborate on the performance overhead of having metadata
on both SSD and SATA disks? Are read operations automatically directed to
the faster disks by GPFS? And does each write operation have to wait for
the write to complete on the SATA disks?

Thank you,

--
Miroslav Bauer

On 09/06/2016 09:06 PM, Yuri L Volobuev wrote:
>
> The correct way to accomplish what you're looking for (in particular, 
> changing the fs-wide level of replication) is mmrestripefs -R. This 
> command also takes care of moving data off disks now marked metadataOnly.
>
> The restripe job hits an error trying to move blocks of the inode 
> file, i.e. before it gets to actual user data blocks. Note that at 
> this point the metadata replication factor is still 2. This suggests 
> one of two possibilities: (1) there isn't enough actual free space on 
> the remaining metadataOnly disks, (2) there isn't enough space in some 
> failure groups to allocate two replicas.
>
> All of this assumes you're operating within a single storage pool. If 
> multiple storage pools are in play, there are other possibilities.
>
> 'mmdf' output would be helpful in providing more specific advice. With 
> the information at hand, I can only suggest trying to accomplish the 
> task in two phases: (a) deallocate extra metadata replicas, by doing 
> mmchfs -m 1 + mmrestripefs -R; (b) move metadata off SATA disks. I do 
> want to point out that metadata replication is a highly recommended 
> insurance policy to have for your file system. As with other kinds of 
> insurance, you may or may not need it, but if you do end up needing 
> it, you'll be very glad you have it. The costs, in terms of extra 
> metadata space and performance overhead, are very reasonable.
>
> yuri
>
>
> From: Miroslav Bauer <bauer at cesnet.cz>
> To: gpfsug-discuss at spectrumscale.org,
> Date: 09/01/2016 07:29 AM
> Subject: Re: [gpfsug-discuss] Migration to separate metadata and data 
> disks
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> ------------------------------------------------------------------------
>
>
>
> Yes, failure group id is exactly what I meant :). Unfortunately,
> mmrestripefs with -R behaves the same as with -r. I also believed that
> mmrestripefs -R is the correct tool for fixing the replication settings
> on inodes (according to manpages), but I will try the possible solutions
> you and Marc suggested and let you know how it went.
>
> Thank you,
> --
> Miroslav Bauer
>
> On 09/01/2016 04:02 PM, Aaron Knister wrote:
> > Oh! I think you've already provided the info I was looking for :) I
> > thought that failGroup=3 meant there were 3 failure groups within the
> > SSDs. I suspect that's not at all what you meant and that actually is
> > the failure group of all of those disks. That I think explains what's
> > going on-- there's only one failure group's worth of metadata-capable
> > disks available and as such GPFS can't place the 2nd replica for
> > existing files.
> >
> > Here's what I would suggest:
> >
> > - Create at least 2 failure groups within the SSDs
> > - Put the default metadata replication factor back to 2
> > - Run a restripefs -R to shuffle files around and restore the metadata
> > replication factor of 2 to any files created while it was set to 1
> >
> > If you're not interested in replication for metadata then perhaps all
> > you need to do is the mmrestripefs -R. I think that should
> > un-replicate the file from the SATA disks leaving the copy on the SSDs.
> >
> > Hope that helps.
> >
> > -Aaron
> >
> > On 9/1/16 9:39 AM, Aaron Knister wrote:
> >> By the way, I suspect the no space on device errors are because GPFS
> >> believes for some reason that it is unable to maintain the metadata
> >> replication factor of 2 that's likely set on all previously created
> >> inodes.
> >>
> >> On 9/1/16 9:36 AM, Aaron Knister wrote:
> >>> I must admit, I'm curious as to the reason you're dropping the
> >>> replication factor from 2 down to 1. There are some serious advantages
> >>> we've seen to having multiple metadata replicas, as far as error
> >>> recovery is concerned.
> >>>
> >>> Could you paste an output of mmlsdisk for the filesystem?
> >>>
> >>> -Aaron
> >>>
> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote:
> >>>> Hello,
> >>>>
> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the
> >>>> filesystem metadata from state:
> >>>> -m = 2 (default metadata replicas)
> >>>> - SATA disks (dataAndMetadata, failGroup=1)
> >>>> - SSDs (metadataOnly, failGroup=3)
> >>>> to the desired state:
> >>>> -m = 1
> >>>> - SATA disks (dataOnly, failGroup=1)
> >>>> - SSDs (metadataOnly, failGroup=3)
> >>>>
> >>>> I have done the following steps in the following order:
> >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage'
> >>>> attribute only):
> >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza
> >>>> Attention: Disk parameters were changed.
> >>>>   Use the mmrestripefs command with the -r option to relocate
> >>>> data and metadata.
> >>>> Verifying file system configuration information ...
> >>>> mmchdisk: Propagating the cluster configuration data to all
> >>>>   affected nodes.  This is an asynchronous process.
> >>>>
> >>>> 2) change default metadata replicas number 2->1
> >>>> # mmchfs fs1 -m 1
> >>>>
> >>>> 3) run mmrestripefs as suggested by output of 1)
> >>>> # mmrestripefs fs1 -r
> >>>> Scanning file system metadata, phase 1 ...
> >>>> Error processing inodes.
> >>>> No space left on device
> >>>> mmrestripefs: Command failed.  Examine previous error messages to
> >>>> determine cause.
> >>>>
> >>>> It is, however, still possible to create new files on the filesystem.
> >>>> When I return one of the SATA disks as a dataAndMetadata disk, the
> >>>> mmrestripefs command stops complaining about No space left on device.
> >>>> Both df and mmdf say that there is enough space both for data (SATA)
> >>>> and metadata (SSDs).
> >>>> Does anyone have an idea why it is complaining?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> --
> >>>> Miroslav Bauer
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
