[gpfsug-discuss] Migration to separate metadata and data disks
Miroslav Bauer
bauer at cesnet.cz
Wed Sep 7 10:40:19 BST 2016
Hello Yuri,
here is the actual mmdf output of the filesystem in question:
disk                 disk size  failure holds    holds            free                 free
name                               group metadata data      in full blocks         in fragments
---------------  -------------  ------- -------- -----  --------------------  -------------------
Disks in storage pool: system (Maximum disk size allowed is 40 TB)
dcsh_10C            5T        1 Yes      Yes       1.661T ( 33%)      68.48G ( 1%)
dcsh_10D        6.828T        1 Yes      Yes       2.809T ( 41%)      83.82G ( 1%)
dcsh_11C            5T        1 Yes      Yes       1.659T ( 33%)      69.01G ( 1%)
dcsh_11D        6.828T        1 Yes      Yes        2.81T ( 41%)      83.33G ( 1%)
dcsh_12C            5T        1 Yes      Yes       1.659T ( 33%)      69.48G ( 1%)
dcsh_12D        6.828T        1 Yes      Yes       2.807T ( 41%)      83.14G ( 1%)
dcsh_13C            5T        1 Yes      Yes       1.659T ( 33%)      69.35G ( 1%)
dcsh_13D        6.828T        1 Yes      Yes        2.81T ( 41%)      82.97G ( 1%)
dcsh_14C            5T        1 Yes      Yes        1.66T ( 33%)      69.06G ( 1%)
dcsh_14D        6.828T        1 Yes      Yes       2.811T ( 41%)      83.61G ( 1%)
dcsh_15C            5T        1 Yes      Yes       1.658T ( 33%)      69.38G ( 1%)
dcsh_15D        6.828T        1 Yes      Yes       2.814T ( 41%)      83.69G ( 1%)
dcsd_15D        6.828T        1 Yes      Yes       2.811T ( 41%)      83.98G ( 1%)
dcsd_15C            5T        1 Yes      Yes        1.66T ( 33%)      68.66G ( 1%)
dcsd_14D        6.828T        1 Yes      Yes        2.81T ( 41%)      84.18G ( 1%)
dcsd_14C            5T        1 Yes      Yes       1.659T ( 33%)      69.43G ( 1%)
dcsd_13D        6.828T        1 Yes      Yes        2.81T ( 41%)      83.27G ( 1%)
dcsd_13C            5T        1 Yes      Yes        1.66T ( 33%)       69.1G ( 1%)
dcsd_12D        6.828T        1 Yes      Yes        2.81T ( 41%)      83.61G ( 1%)
dcsd_12C            5T        1 Yes      Yes        1.66T ( 33%)      69.42G ( 1%)
dcsd_11D        6.828T        1 Yes      Yes       2.811T ( 41%)      83.59G ( 1%)
dcsh_10B            5T        1 Yes      Yes       1.633T ( 33%)      76.97G ( 2%)
dcsh_11A            5T        1 Yes      Yes       1.632T ( 33%)      77.29G ( 2%)
dcsh_11B            5T        1 Yes      Yes       1.633T ( 33%)      76.73G ( 1%)
dcsh_12A            5T        1 Yes      Yes       1.634T ( 33%)      76.49G ( 1%)
dcsd_11C            5T        1 Yes      Yes        1.66T ( 33%)      69.25G ( 1%)
dcsd_10D        6.828T        1 Yes      Yes       2.811T ( 41%)      83.39G ( 1%)
dcsh_10A            5T        1 Yes      Yes       1.633T ( 33%)      77.06G ( 2%)
dcsd_10C            5T        1 Yes      Yes        1.66T ( 33%)      69.83G ( 1%)
dcsd_15B            5T        1 Yes      Yes       1.635T ( 33%)      76.52G ( 1%)
dcsd_15A            5T        1 Yes      Yes       1.634T ( 33%)      76.24G ( 1%)
dcsd_14B            5T        1 Yes      Yes       1.634T ( 33%)      76.31G ( 1%)
dcsd_14A            5T        1 Yes      Yes       1.634T ( 33%)      76.23G ( 1%)
dcsd_13B            5T        1 Yes      Yes       1.634T ( 33%)      76.13G ( 1%)
dcsd_13A            5T        1 Yes      Yes       1.634T ( 33%)      76.22G ( 1%)
dcsd_12B            5T        1 Yes      Yes       1.635T ( 33%)      77.49G ( 2%)
dcsd_12A            5T        1 Yes      Yes       1.633T ( 33%)      77.13G ( 2%)
dcsd_11B            5T        1 Yes      Yes       1.633T ( 33%)      76.86G ( 2%)
dcsd_11A            5T        1 Yes      Yes       1.632T ( 33%)      76.22G ( 1%)
dcsd_10B            5T        1 Yes      Yes       1.633T ( 33%)      76.79G ( 1%)
dcsd_10A            5T        1 Yes      Yes       1.633T ( 33%)      77.21G ( 2%)
dcsh_15B            5T        1 Yes      Yes       1.635T ( 33%)      76.04G ( 1%)
dcsh_15A            5T        1 Yes      Yes       1.634T ( 33%)      76.84G ( 2%)
dcsh_14B            5T        1 Yes      Yes       1.635T ( 33%)      76.75G ( 1%)
dcsh_14A            5T        1 Yes      Yes       1.633T ( 33%)      76.05G ( 1%)
dcsh_13B            5T        1 Yes      Yes       1.634T ( 33%)      76.35G ( 1%)
dcsh_13A            5T        1 Yes      Yes       1.634T ( 33%)      76.68G ( 1%)
dcsh_12B            5T        1 Yes      Yes       1.635T ( 33%)      76.74G ( 1%)
ssd_5_5            80G        3 Yes      No        22.31G ( 28%)      7.155G ( 9%)
ssd_4_4            80G        3 Yes      No        22.21G ( 28%)      7.196G ( 9%)
ssd_3_3            80G        3 Yes      No         22.2G ( 28%)      7.239G ( 9%)
ssd_2_2            80G        3 Yes      No        22.24G ( 28%)      7.146G ( 9%)
ssd_1_1            80G        3 Yes      No        22.29G ( 28%)      7.134G ( 9%)
                 -------------                     --------------------  -------------------
(pool total)            262.3T                           92.96T ( 35%)       3.621T ( 1%)

Disks in storage pool: maid4 (Maximum disk size allowed is 466 TB)
...<dataOnly disks>...
                 -------------                     --------------------  -------------------
(pool total)              291T                           126.5T ( 43%)       562.6G ( 0%)

Disks in storage pool: maid5 (Maximum disk size allowed is 466 TB)
...<dataOnly disks>...
                 -------------                     --------------------  -------------------
(pool total)            436.6T                           120.8T ( 28%)       25.23G ( 0%)

Disks in storage pool: maid6 (Maximum disk size allowed is 466 TB)
...<dataOnly disks>...
                 -------------                     --------------------  -------------------
(pool total)            582.1T                           358.7T ( 62%)       9.458G ( 0%)

                 =============                     ====================  ===================
(data)                  1.535P                           698.9T ( 44%)        4.17T ( 0%)
(metadata)              262.3T                           92.96T ( 35%)       3.621T ( 1%)
                 =============                     ====================  ===================
(total)                 1.535P                             699T ( 44%)       4.205T ( 0%)
Inode Information
-----------------
Number of used inodes: 79607225
Number of free inodes: 82340423
Number of allocated inodes: 161947648
Maximum number of inodes: 1342177280
I have a smaller test FS with the same setup (and plenty of free space),
and the sequence of commands that worked for me was:
mmchfs fs1 -m 1
mmrestripefs fs1 -R
mmrestripefs fs1 -b
mmchdisk fs1 change -F ~/nsd_metadata_test   # changes the SATA NSDs from dataAndMetadata to dataOnly
mmrestripefs fs1 -r
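For completeness, the ~/nsd_metadata_test file referenced above is just an NSD
stanza file that flips the 'usage' attribute; a minimal sketch of what such a
file looks like (the NSD names below are placeholders, not the real disks of
the test FS, and older GPFS levels may expect the traditional colon-separated
disk descriptors instead of %nsd stanzas):

# ~/nsd_metadata_test -- change usage only, leave everything else untouched
%nsd:
  nsd=test_nsd_01
  usage=dataOnly
%nsd:
  nsd=test_nsd_02
  usage=dataOnly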
Could you please elaborate on the performance overhead of keeping metadata
on both SSD and SATA disks? Are read operations automatically directed to
the faster disks by GPFS? Does each write operation have to wait for the
write to complete on the SATA disks?
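For the read side of that question, the only knob I know of is the
readReplicaPolicy configuration option, which influences which replica GPFS
reads from; I am not sure which values our 3.5 level supports, so the check
below is just what I would try, not a confirmed answer:

# show whether readReplicaPolicy is set at all (assumption: default if absent)
mmlsconfig | grep -i readReplicaPolicy
# example change -- verify the supported values for your code level first
# mmchconfig readReplicaPolicy=local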
Thank you,
--
Miroslav Bauer
On 09/06/2016 09:06 PM, Yuri L Volobuev wrote:
>
> The correct way to accomplish what you're looking for (in particular,
> changing the fs-wide level of replication) is mmrestripefs -R. This
> command also takes care of moving data off disks now marked metadataOnly.
>
> The restripe job hits an error trying to move blocks of the inode
> file, i.e. before it gets to actual user data blocks. Note that at
> this point the metadata replication factor is still 2. This suggests
> one of two possibilities: (1) there isn't enough actual free space on
> the remaining metadataOnly disks, (2) there isn't enough space in some
> failure groups to allocate two replicas.
>
> All of this assumes you're operating within a single storage pool. If
> multiple storage pools are in play, there are other possibilities.
>
> 'mmdf' output would be helpful in providing more specific advice. With
> the information at hand, I can only suggest trying to accomplish the
> task in two phases: (a) deallocate the extra metadata replicas, by doing
> mmchfs -m 1 + mmrestripefs -R; (b) move metadata off the SATA disks. I do
> want to point out that metadata replication is a highly recommended
> insurance policy to have for your file system. As with other kinds of
> insurance, you may or may not need it, but if you do end up needing
> it, you'll be very glad you have it. The costs, in terms of extra
> metadata space and performance overhead, are very reasonable.
>
> yuri
>
>
>
> From: Miroslav Bauer <bauer at cesnet.cz>
> To: gpfsug-discuss at spectrumscale.org,
> Date: 09/01/2016 07:29 AM
> Subject: Re: [gpfsug-discuss] Migration to separate metadata and data
> disks
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> ------------------------------------------------------------------------
>
>
>
> Yes, failure group id is exactly what I meant :). Unfortunately,
> mmrestripefs with -R behaves the same as with -r. I also believed that
> mmrestripefs -R is the correct tool for fixing the replication settings
> on inodes (according to the manpages), but I will try the possible
> solutions you and Marc suggested and let you know how it goes.
>
> Thank you,
> --
> Miroslav Bauer
>
> On 09/01/2016 04:02 PM, Aaron Knister wrote:
> > Oh! I think you've already provided the info I was looking for :) I
> > thought that failGroup=3 meant there were 3 failure groups within the
> > SSDs. I suspect that's not at all what you meant and that actually is
> > the failure group of all of those disks. That I think explains what's
> > going on-- there's only one failure group's worth of metadata-capable
> > disks available and as such GPFS can't place the 2nd replica for
> > existing files.
> >
> > Here's what I would suggest:
> >
> > - Create at least 2 failure groups within the SSDs
> > - Put the default metadata replication factor back to 2
> > - Run a restripefs -R to shuffle files around and restore the metadata
> > replication factor of 2 to any files created while it was set to 1
> >
> > If you're not interested in replication for metadata then perhaps all
> > you need to do is the mmrestripefs -R. I think that should
> > un-replicate the file from the SATA disks leaving the copy on the SSDs.
> >
> > Hope that helps.
> >
> > -Aaron
> >
> > On 9/1/16 9:39 AM, Aaron Knister wrote:
> >> By the way, I suspect the no space on device errors are because GPFS
> >> believes for some reason that it is unable to maintain the metadata
> >> replication factor of 2 that's likely set on all previously created
> >> inodes.
> >>
> >> On 9/1/16 9:36 AM, Aaron Knister wrote:
> >>> I must admit, I'm curious as to the reason you're dropping the
> >>> replication factor from 2 down to 1. There are some serious advantages
> >>> we've seen to having multiple metadata replicas, as far as error
> >>> recovery is concerned.
> >>>
> >>> Could you paste an output of mmlsdisk for the filesystem?
> >>>
> >>> -Aaron
> >>>
> >>> On 9/1/16 9:30 AM, Miroslav Bauer wrote:
> >>>> Hello,
> >>>>
> >>>> I have a GPFS 3.5 filesystem (fs1) and I'm trying to migrate the
> >>>> filesystem metadata from state:
> >>>> -m = 2 (default metadata replicas)
> >>>> - SATA disks (dataAndMetadata, failGroup=1)
> >>>> - SSDs (metadataOnly, failGroup=3)
> >>>> to the desired state:
> >>>> -m = 1
> >>>> - SATA disks (dataOnly, failGroup=1)
> >>>> - SSDs (metadataOnly, failGroup=3)
> >>>>
> >>>> I have done the following steps in the following order:
> >>>> 1) change SATA disks to dataOnly (stanza file modifies the 'usage'
> >>>> attribute only):
> >>>> # mmchdisk fs1 change -F dataOnly_disks.stanza
> >>>> Attention: Disk parameters were changed.
> >>>> Use the mmrestripefs command with the -r option to relocate data
> >>>> and metadata.
> >>>> Verifying file system configuration information ...
> >>>> mmchdisk: Propagating the cluster configuration data to all
> >>>> affected nodes. This is an asynchronous process.
> >>>>
> >>>> 2) change default metadata replicas number 2->1
> >>>> # mmchfs fs1 -m 1
> >>>>
> >>>> 3) run mmrestripefs as suggested by output of 1)
> >>>> # mmrestripefs fs1 -r
> >>>> Scanning file system metadata, phase 1 ...
> >>>> Error processing inodes.
> >>>> No space left on device
> >>>> mmrestripefs: Command failed. Examine previous error messages to
> >>>> determine cause.
> >>>>
> >>>> It is, however, still possible to create new files on the filesystem.
> >>>> When I return one of the SATA disks as a dataAndMetadata disk, the
> >>>> mmrestripefs
> >>>> command stops complaining about No space left on device. Both df and
> >>>> mmdf
> >>>> say that there is enough space both for data (SATA) and metadata
> >>>> (SSDs).
> >>>> Does anyone have an idea why it is complaining?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> --
> >>>> Miroslav Bauer
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
>
> [attachment "smime.p7s" deleted by Yuri L Volobuev/Austin/IBM]
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
>
>
>