[gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

Jan-Frode Myklebust janfrode at tanso.net
Wed Oct 17 14:35:22 BST 2018


My thinking was mainly that single-threaded 200 files/second == 5 ms/file.
Where do those 5 ms go? Is it NFS protocol overhead, or is it waiting for
I/O, so that it can be fixed with a lower-latency storage backend?


 -jf

On Wed, Oct 17, 2018 at 9:15 AM Olaf Weiser <olaf.weiser at de.ibm.com> wrote:

> Hallo Jan,
> you can expect slightly improved numbers from the lower response times
> of HAWC ... but the loss of performance comes from the fact that GPFS
> (or async kNFS) writes with multiple parallel threads, whereas e.g. tar
> via Ganesha NFS runs single threaded with an fsync on each file.
>
> One single, fast, synchronous thread will never outperform e.g. 128
> parallel threads (each maybe slower) running write-behind.
>
> So, as Alex suggests: if possible, use the GPFS client or kNFS for those
> types of workloads.
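>
> A back-of-the-envelope illustration of that point - all numbers below are
> assumptions, purely to show why many slower write-behind threads beat one
> fast but synchronous thread:
>
> #include <stdio.h>
>
> int main(void)
> {
>     /* assumed numbers for illustration, not measurements */
>     double ms_per_sync_file = 5.0;   /* one small file forced to stable storage */
>     double files_per_s_sync = 1000.0 / ms_per_sync_file;        /* ~200 files/s */
>     double in_flight        = 128.0; /* parallel write-behind requests          */
>
>     printf("1 synchronous thread : %6.0f files/s\n", files_per_s_sync);
>     printf("%3.0f writes in flight : %6.0f files/s\n",
>            in_flight, files_per_s_sync * in_flight);
>     return 0;
> }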
>
> From:        Jan-Frode Myklebust <janfrode at tanso.net>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        10/17/2018 02:24 PM
> Subject:        Re: [gpfsug-discuss] Preliminary conclusion: single
> client, single thread, small files - native Scale vs NFS
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> Do you know if the slow throughput is caused by the network/nfs-protocol
> layer, or does it help to use faster storage (ssd)? If on storage, have you
> considered if HAWC can help?
>
> I’m thinking about adding an SSD pool as a first tier to hold the active
> dataset for a similar setup, but that’s mainly to solve the small file
> read workload (i.e. random I/O).
>
>
> -jf
> On Wed, 17 Oct 2018 at 07:47, Alexander Saupp <Alexander.Saupp at de.ibm.com> wrote:
> Dear Mailing List readers,
>
> I've come to a preliminary conclusion that explains the behavior in an
> appropriate manner, so I'm summarizing my current thinking for this
> audience.
>
> *Problem statement:*
> Big performance difference between native GPFS (fast) and a loopback NFS
> mount on the same node (way slower) for a single client, single thread,
> small files workload.
>
>
> *Current explanation:*
> tar seems to just close() each file, without an fsync(). That is an
> application choice and common behavior; the idea is to let OS write
> caching speed up the process run time.
>
> When running locally on ext3 / xfs / GPFS / .. that allows async destaging
> of data down to disk, trading some data safety for better performance.
> As we're talking about write caching on the same node that the application
> runs on, a crash is a misfortune, but it stays in the same failure domain.
> E.g. if you run a compile job that includes extraction of a tar and the
> node crashes, you'll have to restart the entire job anyhow.
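>
> A minimal sketch of that pattern (simplified, not tar's actual code; the
> function name is made up):
>
> /* write one extracted file, relying on OS write-back caching */
> #include <fcntl.h>
> #include <unistd.h>
>
> static int extract_one(const char *path, const char *buf, size_t len)
> {
>     int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
>     if (fd < 0)
>         return -1;
>     ssize_t n = write(fd, buf, len);   /* lands in the page cache        */
>     close(fd);                         /* returns before data is on disk */
>     return (n == (ssize_t)len) ? 0 : -1;
> }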
>
> The NFSv2 spec defined that NFS I/Os are to be 'sync', probably so that
> e.g. a compile job on the NFS client would survive a crash of the NFS
> server - the failure domain would be different.
>
> NFSv3 (RFC 1813, quoted below) acknowledged the performance impact and
> introduced unsafe ('async') writes for NFS, which handle I/Os similarly
> to local I/Os, allowing data to be destaged in the background.
>
> Keep in mind: applications, whether running locally or via NFS, can
> always decide to call fsync(), which ensures that the data is destaged
> to persistent storage right away.
> But it is the application's choice whether that is really mandatory or
> whether performance has the higher priority.
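>
> The durable variant of the sketch above costs one extra call - and one
> storage round trip - per file (again only a sketch, names made up):
>
> #include <fcntl.h>
> #include <unistd.h>
>
> static int extract_one_durable(const char *path, const char *buf, size_t len)
> {
>     int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
>     if (fd < 0)
>         return -1;
>     ssize_t n  = write(fd, buf, len);
>     int     rc = fsync(fd);   /* blocks until data + metadata reach stable storage */
>     close(fd);
>     return (n == (ssize_t)len && rc == 0) ? 0 : -1;
> }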
>
> The Linux 'sync' tool (man sync) flushes 'dirty' memory cache down to
> disk - largely independent of the file system.
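>
> The programmatic counterpart is sync(2) - a minimal sketch:
>
> #include <unistd.h>
>
> int main(void)
> {
>     sync();    /* POSIX only requires the dirty pages to be scheduled for     */
>     return 0;  /* write-out; they are not necessarily on disk when it returns */
> }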
>
> -> A single client, single thread, small files workload on GPFS can be
> destaged asynchronously, which hides latency and parallelizes disk I/Os.
> -> NFS client I/Os are sync, so the second I/O can only be started after
> the first one has hit non-volatile storage -> much higher latency.
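>
> Back-of-the-envelope, with assumed latencies, of how a few synchronous
> steps per file add up to something on the order of 5 ms/file and ~200
> files/s for a single thread:
>
> #include <stdio.h>
>
> int main(void)
> {
>     /* assumed example latencies, for illustration only */
>     double ops_per_file = 2.0;   /* e.g. one create/set-attr + one stable write */
>     double ms_per_op    = 2.5;   /* network round trip + commit to stable media */
>     double ms_per_file  = ops_per_file * ms_per_op;
>
>     printf("%.1f ms/file -> %.0f files/s single threaded\n",
>            ms_per_file, 1000.0 / ms_per_file);     /* 5.0 ms -> 200 files/s */
>     return 0;
> }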
>
>
> The Spectrum Scale NFS implementation (based on Ganesha) does not support
> the async mount option, which is a bit of a pity. There might also be
> implementation differences compared to kernel NFS; I did not investigate
> in that direction.
>
> However, the above behavior explains the principle of the difference
> well enough for me.
>
> One workaround that I have seen working well for multiple customers is to
> replace the NFS client with a Spectrum Scale NSD client.
> That has two advantages, but is certainly not suitable in all cases:
> - Improved speed through the efficient NSD protocol and NSD client side
> write caching
> - Write caching in the same failure domain as the application (on the NSD
> client), which seems more reasonable than NFS server side write caching.
>
> *References:*
>
> NFS sync vs async
> https://tools.ietf.org/html/rfc1813
> "The write throughput bottleneck caused by the synchronous definition of
> write in the NFS version 2 protocol has been addressed by adding support
> so that the NFS server can do unsafe writes. Unsafe writes are writes
> which have not been committed to stable storage before the operation
> returns. This specification defines a method for committing these unsafe
> writes to stable storage in a reliable way."
>
>
> *sync() vs fsync()*
>
> https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
> - An application program makes an fsync() call for a specified file. This
> causes all of the pages that contain modified data for that file to be
> written to disk. The writing is complete when the fsync() call returns to
> the program.
>
> - An application program makes a sync() call. This causes all of the file
> pages in memory that contain modified data to be scheduled for writing to
> disk. The writing is not necessarily complete when the sync() call returns
> to the program.
>
> - A user can enter the sync command, which in turn issues a sync() call.
> Again, some of the writes may not be complete when the user is prompted for
> input (or the next command in a shell script is processed).
>
>
> *close() vs fsync()*
> A successful close does not guarantee that the data has been successfully
> saved to disk, as the kernel defers writes. It is not common for a file
> system to flush the buffers when the stream is closed. If you need to be
> sure that the data is physically stored, use fsync(2). (It will depend on
> the disk hardware at this point.)
>
>
> Mit freundlichen Grüßen / Kind regards
>
> *Alexander Saupp*
>
> IBM Systems, Storage Platform, EMEA Storage Competence Center
> ------------------------------
> Phone: +49 7034-643-1512
> Mobile: +49-172 7251072
> Email: alexander.saupp at de.ibm.com
>
> IBM Deutschland GmbH
> Am Weiher 24
> 65451 Kelsterbach
> Germany
> ------------------------------
> IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter
> Geschäftsführung: Matthias Hartmann (Vorsitzender), Norbert Janzen, Stefan
> Lutz, Nicole Reimer, Dr. Klaus Seifert, Wolfgang Wendt
> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart,
> HRB 14562 / WEEE-Reg.-Nr. DE 99369940
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

