[gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN
Giovanni Bracco
giovanni.bracco at enea.it
Thu Jun 11 15:06:45 BST 2020
256K
Giovanni
On 11/06/20 10:01, Luis Bolinches wrote:
> On that RAID 6 what is the logical RAID block size? 128K, 256K, other?
> --
> Ystävällisin terveisin / Kind regards / Saludos cordiales / Salutations
> / Salutacions
> Luis Bolinches
> Consultant IT Specialist
> IBM Spectrum Scale development
> ESS & client adoption teams
> Mobile Phone: +358503112585
> https://www.youracclaim.com/user/luis-bolinches
> Ab IBM Finland Oy
> Laajalahdentie 23
> 00330 Helsinki
> Uusimaa - Finland
>
> *"If you always give you will always have" -- Anonymous*
>
> ----- Original message -----
> From: Giovanni Bracco <giovanni.bracco at enea.it>
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> To: Jan-Frode Myklebust <janfrode at tanso.net>, gpfsug main discussion
> list <gpfsug-discuss at spectrumscale.org>
> Cc: Agostino Funel <agostino.funel at enea.it>
> Subject: [EXTERNAL] Re: [gpfsug-discuss] very low read performance
> in simple spectrum scale/gpfs cluster with a storage-server SAN
> Date: Thu, Jun 11, 2020 10:53
> Comments and updates in the text:
>
> On 05/06/20 19:02, Jan-Frode Myklebust wrote:
> > fre. 5. jun. 2020 kl. 15:53 skrev Giovanni Bracco
> > <giovanni.bracco at enea.it>:
> >
> > answer in the text
> >
> > On 05/06/20 14:58, Jan-Frode Myklebust wrote:
> > >
> > > Could maybe be interesting to drop the NSD servers, and let all nodes
> > > access the storage via srp ?
> >
> > no, we cannot: the production clusters' fabric is a mix of a QDR-based
> > cluster and an OPA-based cluster, and the NSD nodes provide the service
> > to both.
> >
> >
> > You could potentially still do SRP from the QDR nodes, and go via NSD
> > for your omnipath nodes. Going via NSD seems like a bit of a pointless
> > indirection.
>
> not really: both clusters, the 400 OPA nodes and the 300 QDR nodes, share
> the same data lake in Spectrum Scale/GPFS, so the NSD servers support the
> flexibility of the setup.
>
> The NSD servers use an IB SAN fabric (a Mellanox FDR switch) to which, at
> the moment, three different generations of DDN storage are connected:
> 9900/QDR, 7700/FDR and 7990/EDR. The idea was to be able to add some less
> expensive storage, to be used when performance is not the first priority.
>
> >
> >
> >
> > >
> > > Maybe turn off readahead, since it can cause performance degradation
> > > when GPFS reads 1 MB blocks scattered on the NSDs, so that read-ahead
> > > always reads too much. This might be the cause of the slow read seen —
> > > maybe you'll also overflow it if reading from both NSD-servers at the
> > > same time?
> >
> > I have switched the readahead off and this produced a small (~10%)
> > increase in performance when reading from an NSD server, but no change
> > in the bad behaviour for the GPFS clients.
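> >
> > (A sketch of one way this can be done, assuming it is the Linux
> > block-device readahead on the NSD servers that is being switched off,
> > and that /dev/sdX stands for one of the LUNs exported by the storage:
> >
> >     # show the current readahead, in 512-byte sectors
> >     blockdev --getra /dev/sdX
> >     # disable readahead for this device
> >     blockdev --setra 0 /dev/sdX
> >
> > Any prefetch done by the storage controller itself is a separate
> > setting in the DDN firmware.)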
> >
> >
> > >
> > >
> > > Plus.. it's always nice to give a bit more pagepool to the clients
> > > than the default.. I would prefer to start with 4 GB.
> >
> > we'll also do that and we'll let you know!
> >
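> > (A sketch of the corresponding command; "clientNodes" is only an
> > example of a node class grouping the client nodes:
> >
> >     mmchconfig pagepool=4G -N clientNodes
> >
> > The new value takes effect once GPFS is restarted on those nodes.)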
> >
> > Could you show your mmlsconfig? Likely you should set maxMBpS to
> > indicate what kind of throughput a client can do (affects GPFS
> > readahead/writebehind). Would typically also increase workerThreads on
> > your NSD servers.
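> >
> > (As a sketch, with purely illustrative values and assumed node classes
> > clientNodes and nsdNodes:
> >
> >     mmchconfig maxMBpS=6000 -N clientNodes    # roughly what one client can move
> >     mmchconfig workerThreads=512 -N nsdNodes
> >
> > maxMBpS is not a hard limit, it only sizes the prefetch/write-behind
> > effort; workerThreads takes effect after GPFS is restarted on the
> > affected nodes.)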
>
> At this moment this is the output of mmlsconfig
>
> # mmlsconfig
> Configuration data for cluster GPFSEXP.portici.enea.it:
> -------------------------------------------------------
> clusterName GPFSEXP.portici.enea.it
> clusterId 13274694257874519577
> autoload no
> dmapiFileHandleSize 32
> minReleaseLevel 5.0.4.0
> ccrEnabled yes
> cipherList AUTHONLY
> verbsRdma enable
> verbsPorts qib0/1
> [cresco-gpfq7,cresco-gpfq8]
> verbsPorts qib0/2
> [common]
> pagepool 4G
> adminMode central
>
> File systems in cluster GPFSEXP.portici.enea.it:
> ------------------------------------------------
> /dev/vsd_gexp2
> /dev/vsd_gexp3
>
>
> >
> >
> > A 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip
> > size. When you write one GPFS block, less than half a RAID stripe is
> > written, which means you need to read back some data to calculate the
> > new parities. I would prefer a 4 MB block size, and maybe also change to
> > 8+p+q so that one GPFS block is a multiple of a full 2 MB stripe.
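> >
> > (Spelling out the arithmetic: with 9 data strips of 256 KB the full
> > stripe is 9 x 256 KB = 2.25 MB, so a 1 MB GPFS block never covers a
> > full stripe and each write turns into a read-modify-write. With 8 data
> > strips the full stripe is 8 x 256 KB = 2 MB, and a 4 MB GPFS block is
> > exactly two full stripes, so full-stripe writes need no parity
> > read-back.)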
> >
> >
> > -jf
>
> we have now added another file system based on 2 NSDs on RAID6 8+p+q,
> keeping the 1 MB block size just so as not to change too many things at
> the same time, but there is no substantial change in the very low read
> performance, which is still of the order of 50 MB/s, while write
> performance is 1000 MB/s.
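>
> (If we later recreate it with the 4 MB block size suggested above, the
> command would be along these lines, where the device name and NSD
> stanza file are only placeholders:
>
>     mmcrfs vsd_gexp4 -F nsd_8pq.stanza -B 4M
>
> For now we kept 1 MB to compare like with like.)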
>
> Any other suggestion is welcomed!
>
> Giovanni
>
>
>
> --
> Giovanni Bracco
> phone +39 351 8804788
> E-mail giovanni.bracco at enea.it
> WWW http://www.afs.enea.it/bracco
>
>
> Ellei edellä ole toisin mainittu: / Unless stated otherwise above:
> Oy IBM Finland Ab
> PL 265, 00101 Helsinki, Finland
> Business ID, Y-tunnus: 0195876-3
> Registered in Finland
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
--
Giovanni Bracco
phone +39 351 8804788
E-mail giovanni.bracco at enea.it
WWW http://www.afs.enea.it/bracco