[gpfsug-discuss] mmdf and maybe other commands long running // influence of n and B on number of regions

Walter Sklenka Walter.Sklenka at EDV-Design.at
Mon Feb 10 18:34:45 GMT 2020


Hello Nate!
Thank you very much for the response.
Do you know whether the rule of thumb of “enough regions = N*32 per pool” still applies?
And isn't there another way to increase the number of regions (maybe by reducing the block size)?
It's only that command execution times of a couple of minutes make me nervous. Or is the reason for the long-running commands rather poor metadata performance?

But if you say so, we will change it to N=5000.
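For reference, a minimal sketch of what that change could look like (assuming mmchfs accepts -n as implied by the mmcrfs documentation you quote below; per that text it only affects data structures created afterwards, such as a new storage pool):

# Sketch only: raise the node-count estimate on both filesystems.
# This does NOT resize the allocation regions of the existing pools.
mmchfs data -n 5000
mmchfs home -n 5000
# Verify the new setting
mmlsfs data -n
mmlsfs home -n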

Best regards
Walter


Kind regards
Walter Sklenka
Technical Consultant

EDV-Design Informationstechnologie GmbH
Giefinggasse 6/1/2, A-1210 Wien
Tel: +43 1 29 22 165-31
Fax: +43 1 29 22 165-90
E-Mail: sklenka at edv-design.at
Internet: www.edv-design.at


From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Nathan Falk
Sent: Monday, February 10, 2020 3:57 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] mmdf and maybe other commands long running // influence of n and B on number of regions

Hello Walter,

If you anticipate that the number of clients accessing this file system may grow as high as 5000, then that is probably the value you should use when creating the file system. The data structures (regions for example) are allocated at file system creation time (more precisely at storage pool creation time) and are not changed later.

The mmcrfs doc explains this:

https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.4/com.ibm.spectrum.scale.v5r04.doc/bl1adm_mmcrfs.htm

-n NumNodes

The estimated number of nodes that will mount the file system in the local cluster and all remote clusters. This is used as a best guess for the initial size of some file system data structures. The default is 32. This value can be changed after the file system has been created but it does not change the existing data structures. Only the newly created data structure is affected by the new value. For example, new storage pool.

When you create a GPFS file system, you might want to overestimate the number of nodes that will mount the file system. GPFS uses this information for creating data structures that are essential for achieving maximum parallelism in file system operations (For more information, see GPFS architecture ). If you are sure there will never be more than 64 nodes, allow the default value to be applied. If you are planning to add nodes to your system, you should specify a number larger than the default.
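As a purely illustrative sketch (the stanza file name and the option values below are placeholders, not values from this thread), the estimate would be set at creation time roughly like this, since the allocation regions are sized when each storage pool is created:

# Illustrative only: create the file system with an overestimated node count.
# data_nsd.stanza is a hypothetical NSD stanza file.
mmcrfs data -F data_nsd.stanza -n 5000 -B 4M -j scatter -T /gpfs/data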

Thanks,
Nate Falk
IBM Spectrum Scale Level 2 Support
Software Defined Infrastructure, IBM Systems


________________________________

Phone: 1-720-349-9538 | Mobile: 1-845-546-4930
E-mail: nfalk at us.ibm.com
Find me on: LinkedIn: https://www.linkedin.com/in/nathan-falk-078ba5125 | Twitter: https://twitter.com/natefalk922






From:        Walter Sklenka <Walter.Sklenka at EDV-Design.at>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        02/09/2020 04:59 AM
Subject:        [EXTERNAL] Re: [gpfsug-discuss] mmdf and maybe other commands long running // influence of n and B on number of regions
Sent by:        gpfsug-discuss-bounces at spectrumscale.org
________________________________


Hi!

At the time of writing we have set N to 1200, but we are not sure whether it would be better to overestimate and set it to 5000.

We use 6 backend nodes

The backend storage is a Flash9100 for metadata and 6x Lenovo DE6000H. We will finally use 2 filesystems: data and home.

Fs “data” consists of 12 metadata NSDs and 72 data-only NSDs.



We have enough space to add NSDs (finally the fs

[root at nsd75-01 ~]# mmlspool data

Storage pools in file system at '/gpfs/data':

Name                    Id   BlkSize Data Meta Total Data in (KB)   Free Data in (KB)   Total Meta in (KB)    Free Meta in (KB)

system                   0      4 MB   no  yes              0              0 (  0%)    12884901888    12800315392 ( 99%)

saspool              65537      4 MB  yes   no  1082331758592  1082326446080 (100%)              0              0 (  0%)



[root at nsd75-01 ~]# mmlsfs data

flag                value                    description

------------------- ------------------------ -----------------------------------

-f                 8192                     Minimum fragment (subblock) size in bytes

-i                 4096                     Inode size in bytes

-I                 32768                    Indirect block size in bytes

-m                 1                        Default number of metadata replicas

-M                 2                        Maximum number of metadata replicas

-r                 1                        Default number of data replicas

-R                 2                        Maximum number of data replicas

-j                 scatter                  Block allocation type

-D                 nfs4                     File locking semantics in effect

-k                 all                      ACL semantics in effect

-n                 1200                     Estimated number of nodes that will mount file system

-B                 4194304                  Block size

-Q                 user;group;fileset       Quotas accounting enabled

                    user;group;fileset       Quotas enforced

                    fileset                  Default quotas enabled

--perfileset-quota Yes                      Per-fileset quota enforcement

--filesetdf        Yes                      Fileset df enabled?

-V                 21.00 (5.0.3.0)          File system version

--create-time      Fri Feb  7 15:32:05 2020 File system creation time

-z                 No                       Is DMAPI enabled?

-L                 33554432                 Logfile size

-E                 Yes                      Exact mtime mount option

-S                 relatime                 Suppress atime mount option

-K                 whenpossible             Strict replica allocation option

--fastea           Yes                      Fast external attributes enabled?

--encryption       No                       Encryption enabled?

--inode-limit      1342177280               Maximum number of inodes

--log-replicas     0                        Number of log replicas

--is4KAligned      Yes                      is4KAligned?

--rapid-repair     Yes                      rapidRepair enabled?

--write-cache-threshold 0                   HAWC Threshold (max 65536)

--subblocks-per-full-block 512              Number of subblocks per full block

-P                 system;saspool           Disk storage pools in file system

--file-audit-log   No                       File Audit Logging enabled?

--maintenance-mode No                       Maintenance Mode enabled?

-d                 de750101vol01;de750101vol02;de750101vol03;de750101vol04;de750101vol05;de750101vol06;de750102vol01;de750102vol02;de750102vol03;de750102vol04;de750102vol05;de750102vol06;

-d                 de750201vol01;de750201vol02;de750201vol03;de750201vol04;de750201vol05;de750201vol06;de750202vol01;de750202vol02;de750202vol03;de750202vol04;de750202vol05;de750202vol06;

-d                 de760101vol01;de760101vol02;de760101vol03;de760101vol04;de760101vol05;de760101vol06;de760102vol01;de760102vol02;de760102vol03;de760102vol04;de760102vol05;de760102vol06;

-d                 de760201vol01;de760201vol02;de760201vol03;de760201vol04;de760201vol05;de760201vol06;de760202vol01;de760202vol02;de760202vol03;de760202vol04;de760202vol05;de760202vol06;

-d                 de770101vol01;de770101vol02;de770101vol03;de770101vol04;de770101vol05;de770101vol06;de770102vol01;de770102vol02;de770102vol03;de770102vol04;de770102vol05;de770102vol06;

-d                 de770201vol01;de770201vol02;de770201vol03;de770201vol04;de770201vol05;de770201vol06;de770202vol01;de770202vol02;de770202vol03;de770202vol04;de770202vol05;de770202vol06;

-d                 globalmeta0;globalmeta1;globalmeta2;globalmeta3;globalmeta4;globalmeta5;globalmeta6;globalmeta7;globalmeta8;globalmeta9;globalmeta10;globalmeta11  Disks in file system

-A                 yes                      Automatic mount option

-o                 none                     Additional mount options

-T                 /gpfs/data               Default mount point

--mount-priority   0                        Mount priority





##

For fs home we use 24 dataAndMetadata disks, only on flash.



[root at nsd75-01 ~]# mmlspool home

Storage pools in file system at '/gpfs/home':

Name                    Id   BlkSize Data Meta Total Data in (KB)   Free Data in (KB)   Total Meta in (KB)    Free Meta in (KB)

system                   0   1024 KB  yes  yes    25769803776    25722931200 (100%)    25769803776    25722981376 (100%)

[root at nsd75-01 ~]#



[root at nsd75-01 ~]# mmlsfs home

flag                value                    description

------------------- ------------------------ -----------------------------------

-f                 8192                     Minimum fragment (subblock) size in bytes

-i                 4096                     Inode size in bytes

-I                 32768                    Indirect block size in bytes

-m                 1                        Default number of metadata replicas

-M                 2                        Maximum number of metadata replicas

-r                 1                        Default number of data replicas

-R                 2                        Maximum number of data replicas

-j                 scatter                  Block allocation type

-D                 nfs4                     File locking semantics in effect

-k                 all                      ACL semantics in effect

-n                 1200                     Estimated number of nodes that will mount file system

-B                 1048576                  Block size

-Q                 user;group;fileset       Quotas accounting enabled

                    user;group;fileset       Quotas enforced

                    fileset                  Default quotas enabled

--perfileset-quota Yes                      Per-fileset quota enforcement

--filesetdf        Yes                      Fileset df enabled?

-V                 21.00 (5.0.3.0)          File system version

--create-time      Fri Feb  7 15:31:28 2020 File system creation time

-z                 No                       Is DMAPI enabled?

-L                 33554432                 Logfile size

-E                 Yes                      Exact mtime mount option

-S                 relatime                 Suppress atime mount option

-K                 whenpossible             Strict replica allocation option

--fastea           Yes                      Fast external attributes enabled?

--encryption       No                       Encryption enabled?

--inode-limit      25166080                 Maximum number of inodes

--log-replicas     0                        Number of log replicas

--is4KAligned      Yes                      is4KAligned?

--rapid-repair     Yes                      rapidRepair enabled?

--write-cache-threshold 0                   HAWC Threshold (max 65536)

--subblocks-per-full-block 128              Number of subblocks per full block

-P                 system                   Disk storage pools in file system

--file-audit-log   No                       File Audit Logging enabled?

--maintenance-mode No                       Maintenance Mode enabled?

-d                 home0;home10;home11;home12;home13;home14;home15;home16;home17;home18;home19;home1;home20;home21;home22;home23;home2;home3;home4;home5;home6;home7;home8;home9  Disks in file system

-A                 yes                      Automatic mount option

-o                 none                     Additional mount options

-T                 /gpfs/home               Default mount point

--mount-priority   0                        Mount priority

[root at nsd75-01 ~]#







Kind regards
Walter Sklenka
Technical Consultant



EDV-Design Informationstechnologie GmbH
Giefinggasse 6/1/2, A-1210 Wien
Tel: +43 1 29 22 165-31
Fax: +43 1 29 22 165-90
E-Mail: sklenka at edv-design.at
Internet: www.edv-design.at



From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of José Filipe Higino
Sent: Saturday, February 8, 2020 1:00 PM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] mmdf and maybe other commands long running // influence of n and B on number of regions



How many back-end nodes for that cluster? And how many filesystems for that same access... and how many pools for the same data access type? (12 nDisks sounds very LOW to me for that size of a cluster; probably no other filesystem can do more than that.) On GPFS there are so many different ways to access the data that it is sometimes hard to start a conversation. And you did a very good job of introducing it. =)



We (I am a customer too) do not have that many nodes, but from experience I know that some clusters (and also multicluster configs) depend mostly on how much metadata you can serve over the network and how fast (latency-wise) you can do it, to accommodate such an amount of nodes. There is never a design by the book that can safely tell you something will work 100% of the time. But the beauty of it is that GPFS allows lots of aspects to be resized at your convenience to facilitate what you most need the system to do.



Let us know more...



On Sun, 9 Feb 2020 at 00:40, Walter Sklenka <Walter.Sklenka at edv-design.at> wrote:

Hello!

We are designing two filesystems where we cannot anticipate whether there will be 3000, or maybe 5000 or more, nodes in total accessing these filesystems.

What we saw was that the execution time of mmdf can be 5-7 minutes.

We opened a case and were told that commands such as mmdf, and also mmfsck, mmdefragfs, and mmrestripefs, must scan all regions, and this is the reason why they take so long.

The technician also said that, as a rule of thumb, there should be

(-n)*32 regions; this would then be enough (N=5000 --> 160000 regions per pool?)

(Does the block size also have an influence on the number of regions?)



#mmfsadm saferdump stripe

Gives the regions number

 storage pools: max 8

     alloc map type 'scatter'

      0: name 'system' Valid nDisks 12 nInUse 12 id 0 poolFlags 0 thinProvision reserved inode -1, reserved nBlocks 0

          regns 170413 segs 1 size 4096 FBlks 0 MBlks 3145728 subblock size 8192
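A quick way to pull that number per pool and compare it with the rule of thumb (a sketch only; the grep pattern assumes the same dump format as shown above):

# List pool names and their allocated regions ("regns") from the dump
mmfsadm saferdump stripe | grep -E "name '|regns"
# Rule-of-thumb target for N=5000: 5000 * 32 = 160000 regions per pool;
# the system pool shown above reports regns 170413, which is above that target.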











We also saw that when we created the filesystem with a specific, very high (-n) of 5000 (where mmdf execution time was some minutes) and then changed (-n) to a lower value, this did not influence the behavior any more.



My question is: is the rule of (number of nodes)*32 for the number of regions in a pool a good estimation?

Is it better to overestimate the number of nodes (longer-running commands), or is it unrealistic to run into problems when not reaching the calculated number of regions?



Does anybody have experience with a high number of nodes (>>3000) and with how to design the filesystems for such large clusters?



Thank you very much in advance !







Kind regards
Walter Sklenka
Technical Consultant



EDV-Design Informationstechnologie GmbH
Giefinggasse 6/1/2, A-1210 Wien
Tel: +43 1 29 22 165-31
Fax: +43 1 29 22 165-90
E-Mail: sklenka at edv-design.at
Internet: www.edv-design.at




_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



