[gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

Sven Oehme oehmes at gmail.com
Wed Oct 17 17:22:05 BST 2018


While most of what has been said here is correct, it can't explain the performance of 200 files/sec, and I couldn't resist jumping in here :-D

 

Let's assume for a second that each operation is synchronous and done by just one thread. 200 files/sec means 5 ms on average per file write. Let's be generous and say the network layer costs 100 usec per round-trip network hop (including code processing on the protocol node or client), and for visualization let's assume the setup looks like this:

 

ESS Node ---ethernet--- Protocol Node ---ethernet--- Client Node

 

Let's say the ESS write cache can absorb a small I/O at a fixed cost of 300 usec if the heads are ethernet connected and not using IB (with IB it would be more in the 250 usec range). That's 300 + 100 (net1) + 100 (net2) usec, or 500 usec in total, so you are a factor of 10 off from your number. Even if we assume a create + write needs more than one round trip worth of synchronization, say two full synchronous round trips (one for the create and one for the stable write), that's 1 ms, still 5x off from your 5 ms.
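To make the budget explicit, here is a back-of-the-envelope sketch of the arithmetic above in Python (nothing here is measured; the 300 and 100 usec figures are just the assumptions stated above):

ess_write_cache_usec = 300   # assumed cost of a small stable write absorbed by the ESS write cache
net_hop_usec = 100           # assumed cost per round trip per ethernet hop, incl. protocol/client code path

one_roundtrip_usec = ess_write_cache_usec + 2 * net_hop_usec   # 500 usec
two_roundtrips_usec = 2 * one_roundtrip_usec                   # 1 ms for create + stable write

observed_usec_per_file = 1_000_000 / 200                       # 200 files/sec -> 5000 usec per file
print(observed_usec_per_file / two_roundtrips_usec)            # ~5x slower than even the generous budget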

 

So either there is a bug in the NFS server or the NFS client, or the storage is not behaving properly. To verify this, the best approach is to run the following test:

 

Create a file on the ESS node itself in the shared filesystem, like this:

 

/usr/lpp/mmfs/samples/perf/gpfsperf create seq -nongpfs -r 4k -n 1m -th 1 -dio /sharedfs/test

 

Now run the following command on one of the ESS nodes, then on the protocol node, and last on the NFS client:

 

/usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio /sharedfs/test

 

This will issue 256 stable 4k write I/Os to the storage system. I picked that number just to get a statistically relevant number of I/Os; you can change 1m to 2m or 4m, but don't make it too high or you might see variations due to de-staging or other side effects on the storage system, which you don't care about at this point since you want to see the round-trip time at each layer.
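(For reference, the I/O count simply falls out of the parameters: with -r 4k and -n 1m you get 1 MiB / 4 KiB = 256 stable writes; -n 2m or 4m would give 512 or 1024.)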

 

The gpfsperf command will spit out a line like:

 

Data rate was XYZ Kbytes/sec, Op Rate was XYZ Ops/sec, Avg Latency was 0.266 milliseconds, thread utilization 1.000, bytesTransferred 1048576

 

The only number here that matters is the average latency; write it down.
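If you want to collect that field from several runs without eyeballing the output, a tiny helper along these lines works (just a sketch that assumes the exact output wording shown above; the script name is made up):

import re, sys

# grab_latency.py: pull the "Avg Latency was X milliseconds" figure from gpfsperf output on stdin
pattern = re.compile(r"Avg Latency was ([0-9.]+) milliseconds")
for line in sys.stdin:
    m = pattern.search(line)
    if m:
        print("average latency: %.0f usec" % (float(m.group(1)) * 1000))

e.g. /usr/lpp/mmfs/samples/perf/gpfsperf write seq -nongpfs -r 4k -n 1m -th 1 -dio /sharedfs/test | python3 grab_latency.py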

 

What I would expect to get back is something like:

 

On the ESS node: 300 usec average per I/O

On the protocol node: 400 usec average per I/O

On the client: 500 usec average per I/O

 

If you get anything higher than the numbers above, something fundamental is wrong (in fact, on a fast system you may see no more than 200-300 usec response time from the client), and the problem will be in the layer in between or below the point where you test.

If all the numbers are roughly in line with mine above, it clearly points to a problem in NFS itself and the way it communicates with GPFS. Marc, myself and others have debugged numerous issues in this space in the past; the last one was fixed at the beginning of this year and ended up in a Scale 5.0.1.X release. Debugging this is very hard and most of the time only possible with GPFS source code access, which I no longer have.

 

You would start with something like strace -Ttt -f -o tar-debug.out tar -xvf ….. and check exactly which system calls are made by the NFS client and how long each takes. You would then run a similar strace on the NFS server to see how many individual system calls are made to GPFS and how long each takes. This will allow you to narrow down where the issue really is. But I suggest starting with the simpler test above, as it might already point to a much simpler problem.
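Once you have the strace output, something along these lines can summarize where the time goes per system call (a rough sketch only; it assumes the line format produced by strace -Ttt -f -o file, i.e. an optional pid, a timestamp, the syscall, and the duration in angle brackets at the end of each line; unfinished/resumed calls are simply skipped):

import re, sys
from collections import defaultdict

# summarize per-syscall time from e.g. tar-debug.out (pass the file name as the first argument)
line_re = re.compile(r"^\s*\d*\s*[\d:.]+\s+(\w+)\(.*<([\d.]+)>\s*$")

totals, counts = defaultdict(float), defaultdict(int)
with open(sys.argv[1]) as f:
    for line in f:
        m = line_re.match(line)
        if m:
            syscall, secs = m.group(1), float(m.group(2))
            totals[syscall] += secs
            counts[syscall] += 1

for syscall in sorted(totals, key=totals.get, reverse=True):
    print("%-20s calls=%6d total=%9.2f ms avg=%8.1f usec" %
          (syscall, counts[syscall], totals[syscall] * 1000,
           totals[syscall] / counts[syscall] * 1e6))

Run it against both the client-side and the server-side traces and compare the per-call averages for the create/write related calls.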

 

Btw, I will also be speaking at the UG meeting at SC18 in Dallas, in case somebody wants to catch up …

 

Sven

 

From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Jan-Frode Myklebust <janfrode at tanso.net>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Wednesday, October 17, 2018 at 6:50 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

 

Also beware there are 2 different Linux NFS "async" settings: a client-side setting (mount -o async), which still causes a sync on file close(), and a server-side (knfs) setting (/etc/exports) that violates the NFS protocol and returns requests before data has hit stable storage.

 

 

  -jf

 

On Wed, Oct 17, 2018 at 9:41 AM Tomer Perry <TOMP at il.ibm.com> wrote:

Hi,

Without going into too much detail, AFAIR, Ontap integrates NVRAM into the NFS write cache (as it was developed as a NAS product).
Ontap uses the STABLE bit, which kind of tells the client "hey, I have no write cache at all, everything is written to stable storage - thus, don't bother with commit (sync) commands - they are meaningless".


Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: tomp at il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel:    +1 720 3422758
Israel Tel:      +972 3 9188625
Mobile:         +972 52 2554625




From:        "Keigo Matsubara" <MKEIGO at jp.ibm.com>
To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:        17/10/2018 16:35
Subject:        Re: [gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS
Sent by:        gpfsug-discuss-bounces at spectrumscale.org




I also wonder how many products actually exploit NFS async mode to improve I/O performance at the risk of file system consistency:

gpfsug-discuss-bounces at spectrumscale.org wrote on 2018/10/17 22:26:52:
>               Using this option usually improves performance, but at
> the cost that an unclean server restart (i.e. a crash) can cause 
> data to be lost or corrupted."

For instance, NetApp (at the very least the FAS 3220 running Data OnTap 8.1.2p4 7-mode, which I tested with) will forcibly *promote* async mode to sync mode.
Promoting means that even if the NFS client requests async mount mode, the NFS server ignores the request and allows only sync mount mode.

Best Regards,
---
Keigo Matsubara, Storage Solutions Client Technical Specialist, IBM Japan
TEL: +81-50-3150-0595, T/L: 6205-0595
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



