[gpfsug-discuss] very low read performance in simple spectrum scale/gpfs cluster with a storage-server SAN: effect of ignorePrefetchLUNCount

Giovanni Bracco giovanni.bracco at enea.it
Tue Jun 16 14:32:53 BST 2020


On 11/06/20 12:13, Jan-Frode Myklebust wrote:
> On Thu, Jun 11, 2020 at 9:53 AM Giovanni Bracco <giovanni.bracco at enea.it> wrote:
> 
> 
>      >
>      > You could potentially still do SRP from QDR nodes, and via NSD for your
>      > omnipath nodes. Going via NSD seems like a bit pointless indirection.
> 
>     not really: both clusters, the 400 OPA nodes and the 300 QDR nodes
>     share
>     the same data lake in Spectrum Scale/GPFS so the NSD servers support
>     the
>     flexibility of the setup.
> 
> 
> Maybe there's something I don't understand, but couldn't you use the 
> NSD-servers to serve to your
> OPA nodes, and then SRP directly for your 300 QDR-nodes??

not in an easy way, without losing the flexibility of the system where 
the NSD servers are the hubs between the three different fabrics: QDR 
compute, OPA compute and the Mellanox FDR SAN.

The storage systems have QDR, FDR and EDR interfaces; Mellanox 
guarantees QDR-FDR and FDR-EDR compatibility but not, as far as I know, 
QDR-EDR. So in this configuration all the compute nodes can access all 
the storage systems.

> 
> 
>     At this moment this is the output of mmlsconfig
> 
>     # mmlsconfig
>     Configuration data for cluster GPFSEXP.portici.enea.it:
>     -------------------------------------------------------
>     clusterName GPFSEXP.portici.enea.it
>     clusterId 13274694257874519577
>     autoload no
>     dmapiFileHandleSize 32
>     minReleaseLevel 5.0.4.0
>     ccrEnabled yes
>     cipherList AUTHONLY
>     verbsRdma enable
>     verbsPorts qib0/1
>     [cresco-gpfq7,cresco-gpfq8]
>     verbsPorts qib0/2
>     [common]
>     pagepool 4G
>     adminMode central
> 
>     File systems in cluster GPFSEXP.portici.enea.it:
>     ------------------------------------------------
>     /dev/vsd_gexp2
>     /dev/vsd_gexp3
> 
> 
> 
> So, trivial close to default config.. assume the same for the client 
> cluster.
> 
> I would correct MaxMBpS -- put it at something reasonable, enable 
> verbsRdmaSend=yes and
> ignorePrefetchLUNCount=yes.

Now we have set:
verbsRdmaSend yes
ignorePrefetchLUNCount yes
maxMBpS 8000

but the only parameter which has a strong effect by itself is

ignorePrefetchLUNCount yes

and with it the read performance increased by a factor of at least 4, 
from 50 MB/s to 210 MB/s.
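
(For reference, the parameters were changed with mmchconfig, along these 
lines -- a sketch rather than a literal transcript of our session:

mmchconfig verbsRdmaSend=yes,ignorePrefetchLUNCount=yes,maxMBpS=8000

followed, if I remember correctly, by a restart of the GPFS daemon on 
the affected nodes, since at least the verbs* parameters do not take 
effect on a running daemon.)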

So from the client the situation is now:

Sequential write 800 MB/s, sequential read 200 MB/s, much better than 
before but still about a factor of 3 lower, for both write and read, 
than what is observed from the NSD node:

Sequential write 2300 MB/s, sequential read 600 MB/s

As far as the test is concerned, I have seen that the lmdd results are 
very similar to

fio --name=seqwrite --rw=write --buffered=1 --ioengine=posixaio --bs=1m --numjobs=1 --size=100G --runtime=60

fio --name=seqread --rw=read --buffered=1 --ioengine=posixaio --bs=1m --numjobs=1 --size=100G --runtime=60
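
Note that with --buffered=1 these runs go through the page cache; for a 
cache-independent comparison, a direct I/O variant could be run from 
inside the GPFS filesystem as well, something like (a sketch, not one of 
the tests reported above):

fio --name=seqread-dio --rw=read --direct=1 --ioengine=libaio --iodepth=16 --bs=1m --numjobs=1 --size=100G --runtime=60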



In the present situation the read-ahead settings on the RAID 
controllers have practically no effect; we have also checked that, by 
the way.
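
As for the gpfsperf suggestion quoted below, the corresponding 
sequential test should be roughly the following -- a sketch, with option 
names quoted from memory, so please check the README in 
/usr/lpp/mmfs/samples/perf, where gpfsperf has to be compiled first:

gpfsperf create seq /gpfs-mountpoint/gpfsperf.test -r 1m -n 100g -th 1
gpfsperf read seq /gpfs-mountpoint/gpfsperf.test -r 1m -n 100g -th 1

where /gpfs-mountpoint just stands for the mount point of one of the 
test filesystems.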

Giovanni

> 
> 
>      >
>      >
>      > 1 MB blocksize is a bit bad for your 9+p+q RAID with 256 KB strip size.
>      > When you write one GPFS block, less than a half RAID stripe is written,
>      > which means you need to read back some data to calculate new parities.
>      > I would prefer 4 MB block size, and maybe also change to 8+p+q so that
>      > one GPFS block is a multiple of a full 2 MB stripe.
>      >
>      >
>      >     -jf
> 
>     we have now added another file system based on 2 NSD on RAID6 8+p+q,
>     keeping the 1MB block size just not to change too many things at the
>     same time, but no substantial change in very low readout performances,
>     that are still of the order of 50 MB/s while write performance are
>     1000MB/s
> 
>     Any other suggestion is welcomed!
> 
> 
> 
> Maybe rule out the storage, and check if you get proper throughput from 
> nsdperf?
> 
> Maybe also benchmark using "gpfsperf" instead of "lmdd", and show your 
> full settings -- so that
> we see that the benchmark is sane :-)
> 
> 
> 
>    -jf

-- 
Giovanni Bracco
phone  +39 351 8804788
E-mail  giovanni.bracco at enea.it
WWW http://www.afs.enea.it/bracco


