[gpfsug-discuss] Preliminary conclusion: single client, single thread, small files - native Scale vs NFS

Jan-Frode Myklebust janfrode at tanso.net
Wed Oct 17 13:24:01 BST 2018


Do you know if the slow throughput is caused by the network/NFS-protocol
layer, or does it help to use faster storage (SSD)? If it is on the storage
side, have you considered whether HAWC can help?

I’m thinking about adding an SSD pool as a first tier to hold the active
dataset for a similar setup, but that’s mainly to solve the small-file read
workload (i.e. random I/O).


-jf
On Wed, 17 Oct 2018 at 07:47, Alexander Saupp <Alexander.Saupp at de.ibm.com> wrote:

> Dear Mailing List readers,
>
> I've come to a preliminary conclusion that explains the behavior in an
> appropriate manner, so I'm trying to summarize my current thinking for
> this audience.
>
> *Problem statement: *
>
>    Big performance deviation between native GPFS (fast) and a loopback NFS
>    mount on the same node (much slower) for a single-client, single-thread,
>    small-files workload.
>
>
>
> *Current explanation:*
>
>    tar seems to use close() on files, not fclose(). That is an
>    application choice and common behavior. The idea is to allow OS write
>    caching to speed up process run time.
>
>    When running locally on ext3 / xfs / GPFS / ... that allows async
>    destaging of data down to disk, trading some data safety for better
>    performance.
>    As we're talking about write caching on the same node that the
>    application runs on, a crash is a misfortune, but it stays in the same
>    failure domain. E.g. if you run a compile job that includes extracting a
>    tar archive and the node crashes, you'll have to restart the entire job
>    anyhow.
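>
>    To make that concrete, below is a minimal sketch (file name and content
>    are made up for illustration) of what an extracting tool effectively does
>    per file, and where an fsync() would change the semantics:
>
>      #include <fcntl.h>
>      #include <stdio.h>
>      #include <string.h>
>      #include <unistd.h>
>
>      /* Illustrative only: write one small file the way tar does. */
>      int main(void)
>      {
>          const char *data = "small file payload\n";
>          int fd = open("extracted_file.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
>          if (fd < 0) { perror("open"); return 1; }
>
>          if (write(fd, data, strlen(data)) < 0) { perror("write"); return 1; }
>
>          /* tar-style: close() returns as soon as the data sits in the page
>             cache; the kernel destages it to disk asynchronously. To force
>             destaging here, an application would call fsync(fd) first; that
>             wait is exactly the latency the async path hides. */
>
>          if (close(fd) < 0) { perror("close"); return 1; }
>          return 0;
>      }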
>
>    The NFSv2 spec defined that NFS I/Os are to be 'sync', probably
>    because the compile job on the NFS client should survive if the NFS server
>    crashes, so the failure domain would be different.
>
>    NFSv3, in RFC 1813 (referenced below), acknowledged the performance impact
>    and introduced 'async' (unsafe) writes for NFS, which handle I/Os similarly
>    to local I/Os and allow data to be destaged in the background.
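>
>    For reference, RFC 1813 expresses this per WRITE call through the
>    stable_how argument; rendered as a C enum (values as defined for the
>    WRITE procedure in the RFC):
>
>      /* How far the NFSv3 server must have committed a WRITE before replying. */
>      enum stable_how {
>          UNSTABLE  = 0, /* may reply before data reaches stable storage;
>                            the client later issues a COMMIT for those ranges */
>          DATA_SYNC = 1, /* file data committed, metadata may still be pending */
>          FILE_SYNC = 2  /* data and metadata committed before the reply */
>      };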
>
>    Keep in mind - applications, whether running locally or via NFS,
>    can always decide to call fsync(), which ensures that data is destaged to
>    persistent storage right away.
>    But it's the application's choice whether that is really mandatory or
>    whether performance has higher priority.
>
>    The Linux 'sync' tool (man sync) allows flushing the 'dirty' memory cache
>    down to disk, in a largely filesystem-independent way.
>
>
> -> A single-client, single-thread, small-files workload on native GPFS can be
> destaged asynchronously, which hides latency and parallelizes disk I/Os.
> -> NFS client I/Os are sync, so the second I/O can only be started after
> the first one has hit non-volatile memory -> much higher latency.
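>
> To put a rough number on it (illustrative figures, not measurements from
> this setup): if every small-file write has to reach stable storage before
> the next one may start, a single thread with ~1 ms per commit tops out at
> ~1,000 files/s, so 100,000 small files cost at least ~100 s in commit
> latency alone. With async destaging the same writes land in the page cache
> in microseconds and are flushed in parallel in the background.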
>
>
>
>    The Spectrum Scale NFS implementation (based on Ganesha) does not
>    support the async mount option, which is a bit of a pity. There might also
>    be implementation differences compared to kernel NFS; I did not investigate
>    in that direction.
>
>    However, the principle behind the difference is, for me, explained by the
>    behavior above.
>
>    One workaround that I have seen working well for multiple customers was to
>    replace the NFS client with a Spectrum Scale NSD client.
>    That has two advantages, but it is certainly not suitable in all cases:
>       - Improved speed through the efficient NSD protocol and NSD client-side
>       write caching
>       - Write caching in the same failure domain as the application (on the
>       NSD client), which seems more reasonable than NFS server-side
>       write caching.
>
>
> *References:*
>
> NFS sync vs async
> https://tools.ietf.org/html/rfc1813
> *The write throughput bottleneck caused by the synchronous definition of
> write in the NFS version 2 protocol has been addressed by adding support so
> that the NFS server can do unsafe writes.*
> Unsafe writes are writes which have not been committed to stable storage
> before the operation returns. This specification defines a method for
> committing these unsafe writes to stable storage in a reliable way.
>
>
> *sync() vs fsync()*
>
> https://www.ibm.com/support/knowledgecenter/en/ssw_aix_72/com.ibm.aix.performance/using_sync_fsync_calls.htm
> - An application program makes an fsync() call for a specified file. This
> causes all of the pages that contain modified data for that file to be
> written to disk. The writing is complete when the fsync() call returns to
> the program.
>
> - An application program makes a sync() call. This causes all of the file
> pages in memory that contain modified data to be scheduled for writing to
> disk. The writing is not necessarily complete when the sync() call returns
> to the program.
>
> - A user can enter the sync command, which in turn issues a sync() call.
> Again, some of the writes may not be complete when the user is prompted for
> input (or the next command in a shell script is processed).
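>
> In code, the distinction described above looks roughly like this (sketch
> only; fd stands for some already-open file descriptor):
>
>   #include <unistd.h>
>
>   void flush_one_file(int fd)
>   {
>       /* fsync(): returns only after this file's modified pages (and the
>          required metadata) have been written to disk. */
>       fsync(fd);
>   }
>
>   void flush_everything(void)
>   {
>       /* sync(): asks the kernel to write out all modified pages system-wide;
>          as the text above notes, completion is not necessarily guaranteed
>          by the time the call returns. */
>       sync();
>   }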
>
>
> *close() vs fclose()*
> A successful close does not guarantee that the data has been successfully
> saved to disk, as the kernel defers writes. It is not common for a file
> system to flush the buffers when the stream is closed. If you need to be
> sure that the data is physically stored, use fsync(2). (It will depend on
> the disk hardware at this point.)
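>
> A sketch (illustrative helper; the function name is made up) of how an
> application using stdio could actually guarantee persistence before closing:
>
>   #include <stdio.h>
>   #include <unistd.h>
>
>   /* fp is some open stdio stream the application has written to. */
>   int close_durably(FILE *fp)
>   {
>       if (fflush(fp) != 0)         /* push stdio's user-space buffer to the kernel */
>           return -1;
>       if (fsync(fileno(fp)) != 0)  /* force the kernel to destage to stable storage */
>           return -1;
>       return fclose(fp);           /* fclose()/close() alone gives no such guarantee */
>   }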
>
>
> Mit freundlichen Grüßen / Kind regards
>
> *Alexander Saupp*
>
> IBM Systems, Storage Platform, EMEA Storage Competence Center
> ------------------------------
> Phone: +49 7034-643-1512
> Mobile: +49-172 7251072
> Email: alexander.saupp at de.ibm.com
> IBM Deutschland GmbH, Am Weiher 24, 65451 Kelsterbach, Germany
> ------------------------------
> IBM Deutschland GmbH / Chairman of the Supervisory Board: Martin Jetter
> Management: Matthias Hartmann (Chairman), Norbert Janzen, Stefan
> Lutz, Nicole Reimer, Dr. Klaus Seifert, Wolfgang Wendt
> Registered office: Ehningen / Commercial register: Amtsgericht Stuttgart,
> HRB 14562 / WEEE-Reg.-Nr. DE 99369940
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

