[gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

Luis Bolinches luis.bolinches at fi.ibm.com
Thu Jun 11 15:11:14 BST 2020


8 data * 256K = 2MB, which does not align with your 1MB block size.

RAID 6 is already not the best option for writes. I would look into using
block sizes that are multiples of that 2MB stripe.
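
To spell out the arithmetic (a rough sketch, assuming the 8 data + 2 parity
RAID 6 geometry discussed below in the thread):

   full RAID stripe = 8 data strips * 256K = 2MB
   1MB GPFS block   = half a stripe   -> partial-stripe write, parity read-modify-write
   4MB GPFS block   = 2 full stripes  -> full-stripe writes, no read-back of old data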

--
Cheers

> On 11. Jun 2020, at 17.07, Giovanni Bracco <giovanni.bracco at enea.it> wrote:
>
> 256K
>
> Giovanni
>
>> On 11/06/20 10:01, Luis Bolinches wrote:
>> On that RAID 6 what is the logical RAID block size? 128K, 256K, other?
>> --
>> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
>> / Salutacions
>> Luis Bolinches
>> Consultant IT Specialist
>> IBM Spectrum Scale development
>> ESS & client adoption teams
>> Mobile Phone: +358503112585
>>
>> https://www.youracclaim.com/user/luis-bolinches
>> Ab IBM Finland Oy
>> Laajalahdentie 23
>> 00330 Helsinki
>> Uusimaa - Finland
>>
>> *"If you always give you will always have" --  Anonymous*
>>
>>    ----- Original message -----
>>    From: Giovanni Bracco <giovanni.bracco at enea.it>
>>    Sent by: gpfsug-discuss-bounces at spectrumscale.org
>>    To: Jan-Frode Myklebust <janfrode at tanso.net>, gpfsug main discussion
>>    list <gpfsug-discuss at spectrumscale.org>
>>    Cc: Agostino Funel <agostino.funel at enea.it>
>>    Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance
>>    in simple spectrum scale/gpfs cluster with a storage-server SAN
>>    Date: Thu, Jun 11, 2020 10:53
>>    Comments and updates in the text:
>>
>>>    On 05/06/20 19:02, Jan-Frode Myklebust wrote:
>>> On Fri, 5 Jun 2020 at 15:53, Giovanni Bracco
>>> <giovanni.bracco at enea.it <mailto:giovanni.bracco at enea.it>> wrote:
>>>
>>>     answer in the text
>>>
>>>>     On 05/06/20 14:58, Jan-Frode Myklebust wrote:
>>>      >
>>>      > Could it maybe be interesting to drop the NSD servers, and
>>>      > let all nodes access the storage via srp?
>>>
>>>     no, we cannot: the production clusters' fabric is a mix of a
>>>     QDR-based cluster and an OPA-based cluster, and the NSD nodes
>>>     provide the service to both.
>>>
>>>
>>> You could potentially still do SRP from the QDR nodes, and go via NSD
>>> for your Omni-Path nodes. Going via NSD seems like a somewhat pointless
>>> indirection.
>>
>>    not really: both clusters, the 400 OPA nodes and the 300 QDR nodes,
>>    share the same data lake in Spectrum Scale/GPFS, so the NSD servers
>>    support the flexibility of the setup.
>>
>>    The NSD servers make use of an IB SAN fabric (Mellanox FDR switch)
>>    where, at the moment, 3 different generations of DDN storage are
>>    connected: 9900/QDR, 7700/FDR and 7990/EDR. The idea was to be able
>>    to add some less expensive storage, to be used when performance is
>>    not the first priority.
>>
>>>
>>>
>>>
>>>      >
>>>      > Maybe turn off readahead, since it can cause performance
>>>      > degradation when GPFS reads 1 MB blocks scattered on the NSDs,
>>>      > so that read-ahead always reads too much. This might be the
>>>      > cause of the slow read seen — maybe you’ll also overflow it if
>>>      > reading from both NSD-servers at the same time?
>>>
>>>     I have switched the readahead off and this produced a small (~10%)
>>>     increase in performance when reading from an NSD server, but no
>>>     change in the bad behaviour for the GPFS clients.
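>>>
>>>     (A minimal sketch of one common way to switch block-device
>>>     readahead off on an NSD server, assuming that is the readahead
>>>     meant here; the device path is only a placeholder:
>>>
>>>     # readahead is reported and set in 512-byte sectors
>>>     blockdev --getra /dev/mapper/lun01
>>>     blockdev --setra 0 /dev/mapper/lun01
>>>     )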
>>>
>>>
>>>      >
>>>      >
>>>      > Plus.. it’s always nice to give a bit more pagepool to the
>>>      > clients than the default.. I would prefer to start with 4 GB.
>>>
>>>     we'll do also that and we'll let you know!
>>>
>>>
>>> Could you show your mmlsconfig? Likely you should set maxMBpS to
>>> indicate what kind of throughput a client can do (affects GPFS
>>> readahead/writebehind). Would typically also increase workerThreads
>>> on your NSD servers.
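>>>
>>> As a rough sketch only (the values and the node lists/classes are
>>> illustrative placeholders, not a recommendation for this cluster, and
>>> the changes may need a GPFS restart on the affected nodes):
>>>
>>>    mmchconfig maxMBpS=6000 -N clientNodeList      # per-client throughput hint
>>>    mmchconfig workerThreads=512 -N nsdServerList  # more worker threads on NSD servers
>>>    mmchconfig pagepool=4G -N clientNodeList       # larger client pagepool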
>>
>>    At this moment this is the output of mmlsconfig
>>
>>    # mmlsconfig
>>    Configuration data for cluster GPFSEXP.portici.enea.it:
>>    -------------------------------------------------------
>>    clusterName GPFSEXP.portici.enea.it
>>    clusterId 13274694257874519577
>>    autoload no
>>    dmapiFileHandleSize 32
>>    minReleaseLevel 5.0.4.0
>>    ccrEnabled yes
>>    cipherList AUTHONLY
>>    verbsRdma enable
>>    verbsPorts qib0/1
>>    [cresco-gpfq7,cresco-gpfq8]
>>    verbsPorts qib0/2
>>    [common]
>>    pagepool 4G
>>    adminMode central
>>
>>    File systems in cluster GPFSEXP.portici.enea.it:
>>    ------------------------------------------------
>>    /dev/vsd_gexp2
>>    /dev/vsd_gexp3
>>
>>
>>>
>>>
>>> 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip size.
>>> When you write one GPFS block, less than half a RAID stripe is written,
>>> which means you need to read back some data to calculate new parities.
>>> I would prefer 4 MB block size, and maybe also change to 8+p+q so that
>>> one GPFS block is a multiple of a full 2 MB stripe.
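>>>
>>> A minimal sketch, assuming a new file system is created for this
>>> (the device and stanza file names are placeholders; the block size
>>> cannot be changed on an existing file system):
>>>
>>>    mmcrfs vsd_gexp4 -F nsd_stanza.txt -B 4M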
>>>
>>>
>>>     -jf
>>
>>    we have now added another file system based on 2 NSDs on RAID6 8+p+q,
>>    keeping the 1MB block size just so as not to change too many things at
>>    the same time, but there is no substantial change in the very low read
>>    performance, which is still of the order of 50 MB/s, while write
>>    performance is 1000 MB/s.
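>>
>>    A minimal sketch of one way to narrow down where the read path slows
>>    down (device path, mount point and sizes are placeholders; use a test
>>    file larger than the pagepool to avoid cache effects):
>>
>>    # raw sequential read from the LUN, run on an NSD server
>>    dd if=/dev/mapper/lun01 of=/dev/null bs=4M count=2000 iflag=direct
>>    # GPFS read of a large test file, first on an NSD server, then on a client
>>    dd if=/gpfs_mount/testfile of=/dev/null bs=4M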
>>
>>    Any other suggestion is welcomed!
>>
>>    Giovanni
>>
>>
>>
>>    --
>>    Giovanni Bracco
>>    phone  +39 351 8804788
>>    E-mail  giovanni.bracco at enea.it
>>    WWW  http://www.afs.enea.it/bracco

>>    _______________________________________________
>>    gpfsug-discuss mailing list
>>    gpfsug-discuss at spectrumscale.org
>>    http://gpfsug.org/mailman/listinfo/gpfsug-discuss

>>
>>
>> Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
>> Oy IBM Finland Ab
>> PL 265, 00101 Helsinki, Finland
>> Business ID, Y-tunnus: 0195876-3
>> Registered in Finland
>>
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

>>
>
> --
> Giovanni Bracco
> phone  +39 351 8804788
> E-mail  giovanni.bracco at enea.it
> WWW  http://www.afs.enea.it/bracco

>
Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
Oy IBM Finland Ab
PL 265, 00101 Helsinki, Finland
Business ID, Y-tunnus: 0195876-3 
Registered in Finland
