[gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN
Uwe Falke
UWEFALKE at de.ibm.com
Thu Jun 11 21:41:52 BST 2020
While that point (the block size should be an integer multiple of the RAID
stripe width) is a good one, violating it would explain slow writes, but
Giovanni reports slow reads ...
Mit freundlichen Grüßen / Kind regards
Dr. Uwe Falke
IT Specialist
Global Technology Services / Project Services Delivery / High Performance
Computing
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com
IBM Services
IBM Data Privacy Statement
IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Dr. Thomas Wolter, Sven Schooss
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122
From: "Luis Bolinches" <luis.bolinches at fi.ibm.com>
To: "Giovanni Bracco" <giovanni.bracco at enea.it>
Cc: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
agostino.funel at enea.it
Date: 11/06/2020 16:11
Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance
in simple spectrum scale/gpfs cluster with a storage-server SAN
Sent by: gpfsug-discuss-bounces at spectrumscale.org
8 data disks * 256K gives a 2 MB full stripe, which does not align with
your 1 MB block size.
RAID 6 is already not the best option for writes. I would look into using
block sizes that are multiples of 2 MB.
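That stripe arithmetic can be sanity-checked mechanically. A minimal sketch, using the geometry values quoted in the thread (8 data disks, 256 KiB strips, 1 MiB GPFS block); the script itself is only an illustration:

```shell
# Alignment check: 8 data disks x 256 KiB strips vs. a 1 MiB GPFS block.
strip_kib=256
data_disks=8
block_kib=1024                             # 1 MiB GPFS block size
stripe_kib=$((strip_kib * data_disks))     # full stripe = 2048 KiB (2 MiB)
if [ $((block_kib % stripe_kib)) -eq 0 ]; then
    echo "${block_kib} KiB block is a multiple of the ${stripe_kib} KiB stripe"
else
    echo "${block_kib} KiB block covers only part of a ${stripe_kib} KiB stripe"
fi
```

A 1 MiB block is half a stripe here, so every full-block write forces the controller to read back data or parity to recompute the parities.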
--
Cheers
> On 11. Jun 2020, at 17.07, Giovanni Bracco <giovanni.bracco at enea.it>
wrote:
>
> 256K
>
> Giovanni
>
>> On 11/06/20 10:01, Luis Bolinches wrote:
>> On that RAID 6 what is the logical RAID block size? 128K, 256K, other?
>> --
>> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
>> / Salutacions
>> Luis Bolinches
>> Consultant IT Specialist
>> IBM Spectrum Scale development
>> ESS & client adoption teams
>> Mobile Phone: +358503112585
>> https://www.youracclaim.com/user/luis-bolinches
>> Ab IBM Finland Oy
>> Laajalahdentie 23
>> 00330 Helsinki
>> Uusimaa - Finland
>>
>> *"If you always give you will always have" -- Anonymous*
>>
>> ----- Original message -----
>> From: Giovanni Bracco <giovanni.bracco at enea.it>
>> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>> To: Jan-Frode Myklebust <janfrode at tanso.net>, gpfsug main discussion
>> list <gpfsug-discuss at spectrumscale.org>
>> Cc: Agostino Funel <agostino.funel at enea.it>
>> Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance
>> in simple spectrum scale/gpfs cluster with a storage-server SAN
>> Date: Thu, Jun 11, 2020 10:53
>> Comments and updates in the text:
>>
>>> On 05/06/20 19:02, Jan-Frode Myklebust wrote:
>>> fre. 5. jun. 2020 kl. 15:53 skrev Giovanni Bracco
>>> <giovanni.bracco at enea.it <mailto:giovanni.bracco at enea.it>>:
>>>
>>> answer in the text
>>>
>>>> On 05/06/20 14:58, Jan-Frode Myklebust wrote:
>>> >
>>> > Could maybe be interesting to drop the NSD servers, and
>> let all
>>> nodes
>>> > access the storage via srp ?
>>>
>>> no we can not: the production clusters fabric is a mix of a
>> QDR based
>>> cluster and a OPA based cluster and NSD nodes provide the
>> service to
>>> both.
>>>
>>>
>>> You could potentially still do SRP from the QDR nodes, and go via NSD
>>> for your Omni-Path nodes. Going via NSD seems like a bit of pointless
>>> indirection.
>>
>> not really: both clusters, the 400 OPA nodes and the 300 QDR nodes,
>> share the same data lake in Spectrum Scale/GPFS, so the NSD servers
>> support the flexibility of the setup.
>>
>> The NSD servers make use of an IB SAN fabric (a Mellanox FDR switch) to
>> which, at the moment, 3 different generations of DDN storage systems are
>> connected: 9900/QDR, 7700/FDR and 7990/EDR. The idea was to be able to
>> add some less expensive storage, to be used when performance is not the
>> first priority.
>>
>>>
>>>
>>>
>>> >
>>> > Maybe turn off readahead, since it can cause performance degradation
>>> > when GPFS reads 1 MB blocks scattered on the NSDs, so that read-ahead
>>> > always reads too much. This might be the cause of the slow read seen?
>>> > Maybe you'll also overflow it if reading from both NSD servers at the
>>> > same time?
>>>
>>> I have switched the readahead off and this produced a small (~10%)
>>> increase in performance when reading from an NSD server, but no change
>>> in the bad behaviour for the GPFS clients.
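For reference, kernel readahead can be disabled per block device on the NSD servers; the device names below are placeholders, not taken from the thread:

```shell
# Show and then disable Linux kernel readahead on the NSD LUNs.
# The value is in 512-byte sectors; /dev/sdb and /dev/sdc are placeholders.
blockdev --getra /dev/sdb
blockdev --setra 0 /dev/sdb
blockdev --setra 0 /dev/sdc
```

With kernel readahead off, GPFS's own prefetcher is the only readahead in play, which makes the client-vs-server comparison cleaner.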
>>>
>>>
>>> >
>>> >
>>> > Plus, it's always nice to give the clients a bit more pagepool than
>>> > the default. I would prefer to start with 4 GB.
>>>
>>> we'll do also that and we'll let you know!
>>>
>>>
>>> Could you show your mmlsconfig? Likely you should set maxMBpS to
>>> indicate what kind of throughput a client can do (it affects GPFS
>>> readahead/writebehind). I would typically also increase workerThreads
>>> on your NSD servers.
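A sketch of those knobs as mmchconfig commands; the numeric values and the node-class names (clientNodes, nsdNodes) are illustrative assumptions, not recommendations for this particular cluster:

```shell
# Illustrative Spectrum Scale tuning; adjust the values to the real hardware.
mmchconfig pagepool=4G -N clientNodes      # more client-side cache
mmchconfig maxMBpS=6000 -N clientNodes     # throughput hint for prefetch/writebehind
mmchconfig workerThreads=512 -N nsdNodes   # more parallel I/O on the NSD servers
mmlsconfig                                 # verify the resulting settings
```

Most of these take effect on daemon restart, so they are usually changed during a maintenance window.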
>>
>> At this moment this is the output of mmlsconfig
>>
>> # mmlsconfig
>> Configuration data for cluster GPFSEXP.portici.enea.it:
>> -------------------------------------------------------
>> clusterName GPFSEXP.portici.enea.it
>> clusterId 13274694257874519577
>> autoload no
>> dmapiFileHandleSize 32
>> minReleaseLevel 5.0.4.0
>> ccrEnabled yes
>> cipherList AUTHONLY
>> verbsRdma enable
>> verbsPorts qib0/1
>> [cresco-gpfq7,cresco-gpfq8]
>> verbsPorts qib0/2
>> [common]
>> pagepool 4G
>> adminMode central
>>
>> File systems in cluster GPFSEXP.portici.enea.it:
>> ------------------------------------------------
>> /dev/vsd_gexp2
>> /dev/vsd_gexp3
>>
>>
>>>
>>>
>>> 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip size.
>>> When you write one GPFS block, less than half a RAID stripe is written,
>>> which means you need to read back some data to calculate the new
>>> parities. I would prefer a 4 MB block size, and maybe also change to
>>> 8+p+q so that one GPFS block is a multiple of a full 2 MB stripe.
>>>
>>>
>>> -jf
>>
>> we have now added another file system based on 2 NSDs on RAID6 8+p+q,
>> keeping the 1 MB block size so as not to change too many things at the
>> same time, but there is no substantial change in the very low read
>> performance, which is still on the order of 50 MB/s, while write
>> performance is 1000 MB/s.
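One way to narrow down where that read slowdown lives is to compare a direct read of the LUN on an NSD server with a read of the same amount of data through a GPFS client. A minimal sketch; every path below is a placeholder, not a name from this cluster:

```shell
# 1) On the NSD server: raw sequential read from the LUN, bypassing GPFS.
#    /dev/mapper/nsd_lun is a placeholder for the actual multipath device.
dd if=/dev/mapper/nsd_lun of=/dev/null bs=1M count=4096 iflag=direct

# 2) On a GPFS client: sequential read of a large file on the new file system.
#    /gpfs/gexp2/bigfile is a placeholder path; use a file bigger than pagepool.
dd if=/gpfs/gexp2/bigfile of=/dev/null bs=1M count=4096
```

If (1) runs near hardware speed and (2) stays around 50 MB/s, the bottleneck is in the GPFS/RDMA client path rather than in the SAN or the RAID geometry.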
>>
>> Any other suggestion is welcome!
>>
>> Giovanni
>>
>>
>>
>> --
>> Giovanni Bracco
>> phone +39 351 8804788
>> E-mail giovanni.bracco at enea.it
>> WWW http://www.afs.enea.it/bracco
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>>
>> Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
>> Oy IBM Finland Ab
>> PL 265, 00101 Helsinki, Finland
>> Business ID, Y-tunnus: 0195876-3
>> Registered in Finland
>>
>>
>>
>
> --
> Giovanni Bracco
> phone +39 351 8804788
> E-mail giovanni.bracco at enea.it
> WWW http://www.afs.enea.it/bracco
>
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss