[gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN

Giovanni Bracco giovanni.bracco at enea.it
Thu Jun 11 08:53:01 BST 2020


Comments and updates in the text:

On 05/06/20 19:02, Jan-Frode Myklebust wrote:
> On Fri, 5 Jun 2020 at 15:53, Giovanni Bracco
> <giovanni.bracco at enea.it> wrote:
> 
>     answer in the text
> 
>     On 05/06/20 14:58, Jan-Frode Myklebust wrote:
>      >
>      > Could maybe be interesting to drop the NSD servers, and let all
>     nodes
>      > access the storage via srp ?
> 
>     no we can not: the production clusters' fabric is a mix of a QDR based
>     cluster and an OPA based cluster, and the NSD nodes provide the service
>     to both.
> 
> 
> You could potentially still do SRP from QDR nodes, and via NSD for your 
> omnipath nodes. Going via NSD seems like a bit of a pointless indirection.

not really: both clusters, the 400 OPA nodes and the 300 QDR nodes, share 
the same data lake in Spectrum Scale/GPFS, so the NSD servers provide the 
flexibility of the setup.

The NSD servers make use of an IB SAN fabric (Mellanox FDR switch) to which, 
at the moment, 3 different generations of DDN storage are connected: 
9900/QDR, 7700/FDR and 7990/EDR. The idea was to be able to add some less 
expensive storage, to be used when performance is not the first priority.
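
Just for reference, the mapping between the NSDs, the serving NSD servers 
and the local devices on our side can be listed with the standard commands 
(output omitted here, nothing unusual in it):

# mmlsnsd
# mmlsnsd -M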

> 
> 
> 
>      >
>      > Maybe turn off readahead, since it can cause performance degradation
>      > when GPFS reads 1 MB blocks scattered on the NSDs, so that
>     read-ahead
>      > always reads too much. This might be the cause of the slow read
>     seen —
>      > maybe you’ll also overflow it if reading from both NSD-servers at
>     the
>      > same time?
> 
>     I have switched the readahead off and this produced a small (~10%)
>     increase in performance when reading from an NSD server, but no change
>     in the bad behaviour for the GPFS clients
> 
> 
>      >
>      >
>      > Plus.. it’s always nice to give a bit more pagepool to the
>     clients than
>      > the default.. I would prefer to start with 4 GB.
> 
>     we'll also do that and we'll let you know!
> 
> 
> Could you show your mmlsconfig? Likely you should set maxMBpS to 
> indicate what kind of throughput a client can do (affects GPFS 
> readahead/writebehind).  Would typically also increase workerThreads on 
> your NSD servers.

At the moment this is the output of mmlsconfig:

# mmlsconfig
Configuration data for cluster GPFSEXP.portici.enea.it:
-------------------------------------------------------
clusterName GPFSEXP.portici.enea.it
clusterId 13274694257874519577
autoload no
dmapiFileHandleSize 32
minReleaseLevel 5.0.4.0
ccrEnabled yes
cipherList AUTHONLY
verbsRdma enable
verbsPorts qib0/1
[cresco-gpfq7,cresco-gpfq8]
verbsPorts qib0/2
[common]
pagepool 4G
adminMode central

File systems in cluster GPFSEXP.portici.enea.it:
------------------------------------------------
/dev/vsd_gexp2
/dev/vsd_gexp3
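
Thanks, we will also try maxMBpS and workerThreads as you suggest; just to 
check we have understood, something along these lines (the node classes 
"clientnodes" and "nsdnodes" are only placeholders for our node lists, and 
the values are a first guess still to be tuned):

# mmchconfig maxMBpS=6000 -N clientnodes
# mmchconfig workerThreads=512 -N nsdnodes
# mmdiag --config | grep -E 'maxMBpS|workerThreads|pagepool'

(followed, if needed, by a restart of the GPFS daemons on the involved nodes)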


> 
> 
> 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip size. 
> When you write one GPFS block, less than a half RAID stripe is written, 
> which means you need to read back some data to calculate new parities. 
> I would prefer 4 MB block size, and maybe also change to 8+p+q so that 
> one GPFS block is a multiple of a full 2 MB stripe.
> 
> 
>     -jf

we have now added another file system based on 2 NSDs on RAID6 8+p+q, 
keeping the 1 MB block size just so as not to change too many things at the 
same time, but there is no substantial change in the very low read 
performance, which is still of the order of 50 MB/s, while the write 
performance is about 1000 MB/s.
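
To try to narrow down where the reads become slow we also plan to compare a 
raw sequential read of a LUN on one NSD server with a read through GPFS on a 
client, and to look at the I/Os actually reaching the disks, roughly along 
these lines (device and file names are only placeholders for the real ones):

on the NSD server, raw read of the LUN backing one NSD:
# dd if=/dev/mapper/ddn_lun of=/dev/null bs=1M count=16384 iflag=direct

on a GPFS client, sequential read of a large file:
# dd if=/gexp2/bigfile of=/dev/null bs=1M count=16384

on the NSD server, while the client read is running, to see the size and 
latency of the individual I/Os:
# mmdiag --iohist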

Any other suggestion is welcome!

Giovanni



-- 
Giovanni Bracco
phone  +39 351 8804788
E-mail  giovanni.bracco at enea.it
WWW http://www.afs.enea.it/bracco


