[gpfsug-discuss] Write performances and filesystem size
Ivano Talamo
Ivano.Talamo at psi.ch
Wed Nov 22 08:23:22 GMT 2017
Hello Olaf,
thank you for your reply and for confirming that this is not expected,
as we also thought. We repeated the test with 2 vdisks only, without
dedicated metadata vdisks, but the result did not change.
We have now opened a PMR.
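For completeness, the two vdisks in that repeat run were plain dataAndMetadata
vdisks. As a rough sketch only (the names, sizes and block size below are
illustrative, not the exact values we used), each one was created with an
mmcrvdisk stanza along these lines:

  %vdisk: vdiskName=sf_g_01_data01
    rg=sf-g-01
    da=DA1
    blocksize=16m
    size=270t
    raidCode=8+2p
    diskUsage=dataAndMetadata
    pool=system

  # mmcrvdisk -F vdisk.stanza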
Thanks,
Ivano
On 16/11/17 17:08, Olaf Weiser wrote:
> Hi Ivano,
> from this output the performance degradation is not explainable. In my
> current environments, having multiple file systems (so multiple vdisks
> on one building block) works fine.
>
> As said, just open a PMR; I wouldn't consider this the "expected
> behavior".
> The only thing is that the MD vdisks are a bit small, so maybe redo your
> tests and, for a simple comparison between the 1/1, 1/2 and 1/4 capacity
> cases, test with 2 dataAndMetadata vdisks only.
> cheers
>
>
>
>
>
> From: Ivano Talamo <Ivano.Talamo at psi.ch>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date: 11/16/2017 08:52 AM
> Subject: Re: [gpfsug-discuss] Write performances and filesystem size
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------------------------------------------------
>
>
>
> Hi,
>
> as additional information I paste below the recovery group information for
> the full-size and half-size cases.
> In both cases:
> - data is on sf_g_01_vdisk01
> - metadata on sf_g_01_vdisk02
> - sf_g_01_vdisk07 is not used in the filesystem.
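>
> (For reference, the listings below are the kind of output you get from an
> invocation along the lines of the following, shown here only to make clear
> where the numbers come from:
>
>   # mmlsrecoverygroup sf-g-01 -L
> )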
>
> This is with the full-space filesystem:
>
> recovery group     declustered arrays  vdisks  pdisks  current format version  allowable format version
> -----------------  ------------------  ------  ------  ----------------------  ------------------------
> sf-g-01                             3       6      86  4.2.2.0                 4.2.2.0
>
> declustered  needs                            replace    free        scrub     background activity
> array        service  vdisks  pdisks  spares  threshold  space       duration  task   progress  priority
> -----------  -------  ------  ------  ------  ---------  ----------  --------  -----  --------  --------
> NVR          no            1       2     0,0          1    3632 MiB  14 days   scrub       95%  low
> DA1          no            4      83    2,44          1      57 TiB  14 days   scrub        0%  low
> SSD          no            1       1     0,0          1     372 GiB  14 days   scrub       79%  low
>
> vdisk                 RAID code        declustered array  vdisk size  block size  checksum granularity  state  remarks
> --------------------  ---------------  -----------------  ----------  ----------  --------------------  -----  ------------
> sf_g_01_logTip        2WayReplication  NVR                    48 MiB       2 MiB  4096                  ok     logTip
> sf_g_01_logTipBackup  Unreplicated     SSD                    48 MiB       2 MiB  4096                  ok     logTipBackup
> sf_g_01_logHome       4WayReplication  DA1                   144 GiB       2 MiB  4096                  ok     log
> sf_g_01_vdisk02       3WayReplication  DA1                   103 GiB       1 MiB  32 KiB                ok
> sf_g_01_vdisk07       3WayReplication  DA1                   103 GiB       1 MiB  32 KiB                ok
> sf_g_01_vdisk01       8+2p             DA1                   540 TiB      16 MiB  32 KiB                ok
>
> config data    declustered array  spare space  remarks
> -------------  -----------------  -----------  ----------------------------------
> rebuild space  DA1                53 pdisk     increasing VCD spares is suggested
>
> config data    disk group fault tolerance        remarks
> -------------  --------------------------------  ------------------------
> rg descriptor  1 enclosure + 1 drawer + 2 pdisk  limited by rebuild space
> system index   1 enclosure + 1 drawer + 2 pdisk  limited by rebuild space
>
> vdisk                 disk group fault tolerance        remarks
> --------------------  --------------------------------  ------------------------
> sf_g_01_logTip        1 pdisk
> sf_g_01_logTipBackup  0 pdisk
> sf_g_01_logHome       1 enclosure + 1 drawer + 1 pdisk  limited by rebuild space
> sf_g_01_vdisk02       1 enclosure + 1 drawer            limited by rebuild space
> sf_g_01_vdisk07       1 enclosure + 1 drawer            limited by rebuild space
> sf_g_01_vdisk01       2 pdisk
>
>
> This is with the half-space filesystem:
>
> recovery group     declustered arrays  vdisks  pdisks  current format version  allowable format version
> -----------------  ------------------  ------  ------  ----------------------  ------------------------
> sf-g-01                             3       6      86  4.2.2.0                 4.2.2.0
>
> declustered  needs                            replace    free        scrub     background activity
> array        service  vdisks  pdisks  spares  threshold  space       duration  task   progress  priority
> -----------  -------  ------  ------  ------  ---------  ----------  --------  -----  --------  --------
> NVR          no            1       2     0,0          1    3632 MiB  14 days   scrub        4%  low
> DA1          no            4      83    2,44          1     395 TiB  14 days   scrub        0%  low
> SSD          no            1       1     0,0          1     372 GiB  14 days   scrub       79%  low
>
> vdisk                 RAID code        declustered array  vdisk size  block size  checksum granularity  state  remarks
> --------------------  ---------------  -----------------  ----------  ----------  --------------------  -----  ------------
> sf_g_01_logTip        2WayReplication  NVR                    48 MiB       2 MiB  4096                  ok     logTip
> sf_g_01_logTipBackup  Unreplicated     SSD                    48 MiB       2 MiB  4096                  ok     logTipBackup
> sf_g_01_logHome       4WayReplication  DA1                   144 GiB       2 MiB  4096                  ok     log
> sf_g_01_vdisk02       3WayReplication  DA1                   103 GiB       1 MiB  32 KiB                ok
> sf_g_01_vdisk07       3WayReplication  DA1                   103 GiB       1 MiB  32 KiB                ok
> sf_g_01_vdisk01       8+2p             DA1                   270 TiB      16 MiB  32 KiB                ok
>
> config data    declustered array  spare space  remarks
> -------------  -----------------  -----------  ----------------------------------
> rebuild space  DA1                68 pdisk     increasing VCD spares is suggested
>
> config data    disk group fault tolerance  remarks
> -------------  --------------------------  ------------------------
> rg descriptor  1 node + 3 pdisk            limited by rebuild space
> system index   1 node + 3 pdisk            limited by rebuild space
>
> vdisk                 disk group fault tolerance  remarks
> --------------------  --------------------------  ------------------------
> sf_g_01_logTip        1 pdisk
> sf_g_01_logTipBackup  0 pdisk
> sf_g_01_logHome       1 node + 2 pdisk            limited by rebuild space
> sf_g_01_vdisk02       1 node + 1 pdisk            limited by rebuild space
> sf_g_01_vdisk07       1 node + 1 pdisk            limited by rebuild space
> sf_g_01_vdisk01       2 pdisk
>
>
> Thanks,
> Ivano
>
>
>
>
> On 16/11/17 13:03, Olaf Weiser wrote:
>> Thanks, that makes it a bit clearer. As your vdisk is big enough to span
>> all pdisks, each of your tests (1/1, 1/2 or 1/4 of the capacity) should
>> give the same performance.
>>
>> You mentioned something about the vdisk layout: so in the full-capacity
>> test you use just one vdisk per RG, i.e. 2 in total for 'data', right?
>>
>> What about MD; did you create separate vdisks for MD, and if so, what
>> size?
>>
>> Sent from IBM Verse
>>
>> Ivano Talamo --- Re: [gpfsug-discuss] Write performances and filesystem
>> size ---
>>
>> Von: "Ivano Talamo" <Ivano.Talamo at psi.ch>
>> An: "gpfsug main discussion list"
> <gpfsug-discuss at spectrumscale.org>
>> Datum: Do. 16.11.2017 03:49
>> Betreff: Re: [gpfsug-discuss] Write performances and
> filesystem size
>>
>> ------------------------------------------------------------------------
>>
>> Hello Olaf,
>>
>> yes, I confirm it is the Lenovo version of the ESS GL2, so 2
>> enclosures / 4 drawers / 166 disks in total.
>>
>> Each recovery group has one declustered array with all disks inside, so
>> vdisks use all the physical ones, even in the case of a vdisk that is
>> 1/4 of the total size.
>>
>> Regarding the allocation layout, we used scatter.
>>
>> The tests were done on the freshly created filesystem, so there is no
>> close-to-full effect. We ran gpfsperf write seq.
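>>
>> (As an illustration of the kind of run, not the exact command line we
>> used, each test was along these lines, with path, size, record size and
>> thread count being examples only:
>>
>>   # /usr/lpp/mmfs/samples/perf/gpfsperf write seq /gpfs/fs1/testfile -n 200g -r 16m -th 8
>> )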
>>
>> Thanks,
>> Ivano
>>
>>
>> On 16/11/17 04:42, Olaf Weiser wrote:
>>> Sure... as long as we assume that really all physical disks are used. The
>>> fact that 1/2 or 1/4 of the capacity was mentioned might mean that one or
>>> two complete enclosures are eliminated; that's why I was asking for more
>>> details.
>>>
>>> I don't see this degradation in my environments. As long as the vdisks
>>> are big enough to span all pdisks (which should be the case for
>>> capacities in the TB range), the performance stays the same.
>>>
>>> Sent from IBM Verse
>>>
>>> Jan-Frode Myklebust --- Re: [gpfsug-discuss] Write performances and
>>> filesystem size ---
>>>
>>> Von: "Jan-Frode Myklebust" <janfrode at tanso.net>
>>> An: "gpfsug main discussion list" <gpfsug-discuss at spectrumscale.org>
>>> Datum: Mi. 15.11.2017 21:35
>>> Betreff: Re: [gpfsug-discuss] Write performances and filesystem size
>>>
>>> ------------------------------------------------------------------------
>>>
>>> Olaf, this looks like a Lenovo «ESS GLxS» version. It should be using the
>>> same number of spindles for any filesystem size, so I would also expect
>>> them to perform the same.
>>>
>>>
>>>
>>> -jf
>>>
>>>
>>> On Wed, 15 Nov 2017 at 11:26, Olaf Weiser <olaf.weiser at de.ibm.com
>>> <mailto:olaf.weiser at de.ibm.com>> wrote:
>>>
>>> to add a comment... very simply, it depends on how you allocate the
>>> physical block storage. If you simply use fewer physical resources when
>>> reducing the capacity (in the same ratio), you get what you see.
>>>
>>> So you need to tell us how you allocate your block storage. (Do you use
>>> RAID controllers? Where are your LUNs coming from? Are fewer RAID groups
>>> involved when you reduce the capacity?)
>>>
>>> GPFS can be configured to give you pretty much what the hardware can
>>> deliver. If you reduce resources you'll get less; if you enhance your
>>> hardware you get more, almost regardless of the total capacity in
>>> #blocks.
>>>
>>>
>>>
>>>
>>>
>>>
>>> From: "Kumaran Rajaram" <kums at us.ibm.com
>>> <mailto:kums at us.ibm.com>>
>>> To: gpfsug main discussion list
>>> <gpfsug-discuss at spectrumscale.org
>>> <mailto:gpfsug-discuss at spectrumscale.org>>
>>> Date: 11/15/2017 11:56 AM
>>> Subject: Re: [gpfsug-discuss] Write performances and
>>> filesystem size
>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>> <mailto:gpfsug-discuss-bounces at spectrumscale.org>
>>>
>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>> Hi,
>>>
>>> >>Am I missing something? Is this an expected behaviour and someone
>>> has an explanation for this?
>>>
>>> Based on your scenario, write degradation as the file-system is
>>> populated is possible if you had formatted the file-system with "-j
>>> cluster".
>>>
>>> For consistent file-system performance, we recommend the mmcrfs "-j
>>> scatter" layoutMap option. Also, we need to ensure that the mmcrfs "-n"
>>> value is set properly.
>>>
>>> [snip from mmcrfs]
>>> # mmlsfs <fs> | egrep 'Block allocation| Estimated number'
>>>  -j         scatter    Block allocation type
>>>  -n         128        Estimated number of nodes that will mount file system
>>> [/snip]
>>>
>>>
>>> [snip from man mmcrfs]
>>> layoutMap={scatter|cluster}
>>>   Specifies the block allocation map type. When allocating blocks for a
>>>   given file, GPFS first uses a round-robin algorithm to spread the data
>>>   across all disks in the storage pool. After a disk is selected, the
>>>   location of the data block on the disk is determined by the block
>>>   allocation map type. If cluster is specified, GPFS attempts to allocate
>>>   blocks in clusters. Blocks that belong to a particular file are kept
>>>   adjacent to each other within each cluster. If scatter is specified,
>>>   the location of the block is chosen randomly.
>>>
>>>   The cluster allocation method may provide better disk performance for
>>>   some disk subsystems in relatively small installations. The benefits of
>>>   clustered block allocation diminish when the number of nodes in the
>>>   cluster or the number of disks in a file system increases, or when the
>>>   file system's free space becomes fragmented. The cluster allocation
>>>   method is the default for GPFS clusters with eight or fewer nodes and
>>>   for file systems with eight or fewer disks.
>>>
>>>   The scatter allocation method provides more consistent file system
>>>   performance by averaging out performance variations due to block
>>>   location (for many disk subsystems, the location of the data relative
>>>   to the disk edge has a substantial effect on performance). This
>>>   allocation method is appropriate in most cases and is the default for
>>>   GPFS clusters with more than eight nodes or file systems with more than
>>>   eight disks.
>>>
>>>   The block allocation map type cannot be changed after the storage pool
>>>   has been created.
>>>
>>> -n NumNodes
>>>   The estimated number of nodes that will mount the file system in the
>>>   local cluster and all remote clusters. This is used as a best guess for
>>>   the initial size of some file system data structures. The default is
>>>   32. This value can be changed after the file system has been created
>>>   but it does not change the existing data structures. Only the newly
>>>   created data structure is affected by the new value. For example, a new
>>>   storage pool.
>>>
>>>   When you create a GPFS file system, you might want to overestimate the
>>>   number of nodes that will mount the file system. GPFS uses this
>>>   information for creating data structures that are essential for
>>>   achieving maximum parallelism in file system operations (for more
>>>   information, see GPFS architecture in IBM Spectrum Scale: Concepts,
>>>   Planning, and Installation Guide). If you are sure there will never be
>>>   more than 64 nodes, allow the default value to be applied. If you are
>>>   planning to add nodes to your system, you should specify a number
>>>   larger than the default.
>>> [/snip from man mmcrfs]
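>>>
>>> As a purely illustrative sketch (the file system name, stanza file, block
>>> size and mount point below are examples, not a recommendation for this
>>> system), a creation with scatter allocation and an explicit node estimate
>>> would look something like:
>>>
>>>   # mmcrfs fs1 -F nsd.stanza -j scatter -n 128 -B 16M -T /gpfs/fs1
>>>
>>> and mmlsfs fs1 can then be used to verify the resulting settings, as in
>>> the snip above.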
>>>
>>> Regards,
>>> -Kums
>>>
>>>
>>>
>>>
>>>
>>> From: Ivano Talamo <Ivano.Talamo at psi.ch
>>> <mailto:Ivano.Talamo at psi.ch>>
>>> To: <gpfsug-discuss at spectrumscale.org
>>> <mailto:gpfsug-discuss at spectrumscale.org>>
>>> Date: 11/15/2017 11:25 AM
>>> Subject: [gpfsug-discuss] Write performances and filesystem size
>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>> <mailto:gpfsug-discuss-bounces at spectrumscale.org>
>>>
>> ------------------------------------------------------------------------
>>>
>>>
>>>
>>> Hello everybody,
>>>
>>> together with my colleagues I am currently running some tests on a new
>>> DSS G220 system and we see some unexpected behaviour.
>>>
>>> What we see is that write performance (we did not test reads yet)
>>> decreases as the filesystem size decreases.
>>>
>>> I will not go into the details of the tests, but here are some numbers:
>>>
>>> - with a filesystem using the full 1.2 PB space we get 14 GB/s as the
>>> sum of the disk activity on the two IO servers;
>>> - with a filesystem using half of the space we get 10 GB/s;
>>> - with a filesystem using 1/4 of the space we get 5 GB/s.
>>>
>>> We also saw that performance is not affected by the vdisk layout, i.e.
>>> taking the full space with one big vdisk or with 2 half-size vdisks per
>>> RG gives the same performance.
>>>
>>> To our understanding the IO should be spread evenly across all the
>>> pdisks in the declustered array, and looking at iostat all disks seem
>>> to be accessed. So there must be some other element that affects
>>> performance.
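>>>
>>> (To check that the load really hits all drives we simply watched
>>> per-device throughput on both IO servers during the runs with something
>>> like
>>>
>>>   # iostat -xm 5
>>>
>>> the exact flags and interval here are just an example.)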
>>>
>>> Am I missing something? Is this expected behaviour, and does someone
>>> have an explanation for it?
>>>
>>> Thank you,
>>> Ivano
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>