[gpfsug-discuss] mmfsd write behavior

Aaron Knister aaron.s.knister at nasa.gov
Tue Oct 10 00:19:20 BST 2017


Thanks, Sven.

I think my goal was for the REQ_FUA flag to be used in alignment with
the consistency expectations of the filesystem. Meaning that if I were
writing to a file on a filesystem (e.g. dd if=/dev/zero of=/gpfs/fs0/file1),
the write requests for the disk addresses containing the file's data
wouldn't be issued with REQ_FUA, but once the file was closed the
close() wouldn't return until a disk cache flush had occurred. For more
important operations (e.g. metadata updates, log operations) I would
expect REQ_FUA to be issued more frequently.
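
To illustrate the model I have in mind from an application's point of
view (just a sketch of the desired semantics, not a claim about what
mmfsd or GPFS does today; the path comes from the dd example above and
the sizes are made up): the individual data writes carry no flush/FUA
semantics, and the single sync at the end is the point where any
volatile device cache should be forced to stable storage.

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* example path from the dd command above; block count is arbitrary */
    int fd = open("/gpfs/fs0/file1", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    void *buf;

    if (fd < 0 || posix_memalign(&buf, 4096, 4096))
        return 1;
    memset(buf, 0, 4096);

    /* plain data writes: no per-write flush or FUA expected here */
    for (int i = 0; i < 1024; i++)
        if (pwrite(fd, buf, 4096, (off_t)i * 4096) != 4096)
            return 1;

    /* the one well-defined point where pending data (including anything
     * sitting in a volatile device cache) should be forced to stable
     * storage before the file is considered written */
    if (fdatasync(fd) != 0)
        return 1;

    return close(fd);
}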

The advantage is that it would allow GPFS to run on top of block devices
that don't perform well with mmfsd's present synchronous workload (e.g.
ZFS and various other software-defined storage or hardware appliances)
but that can perform well when only periodically (e.g. every few
seconds) asked to flush pending data to disk. I also think this would be
*really* important in an FPO environment, where individual drives will
probably have their write caches on by default and I'm not sure direct
I/O is sufficient to force Linux to issue SCSI SYNCHRONIZE CACHE
commands to those devices.
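
As a concrete check for that last point, the block layer does expose
whether it thinks a device has a volatile write cache; a small sketch
(the device name is just an example, and the sysfs attribute only
exists on reasonably recent kernels):

#include <stdio.h>
#include <string.h>

int main(void)
{
    char buf[64] = "";
    FILE *f = fopen("/sys/block/sda/queue/write_cache", "r");

    if (!f) {
        perror("open write_cache attribute");
        return 1;
    }
    if (!fgets(buf, sizeof(buf), f)) {
        fclose(f);
        return 1;
    }
    fclose(f);

    /* "write back" means flushes/REQ_FUA actually matter for durability */
    if (strncmp(buf, "write back", 10) == 0)
        printf("volatile write cache enabled: flushes/REQ_FUA required\n");
    else
        printf("write through (or no volatile cache reported)\n");
    return 0;
}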

I'm guessing that this is far from easy but I figured I'd ask.

-Aaron

On 10/9/17 5:07 PM, Sven Oehme wrote:
> Hi,
> 
> yeah, sorry, I intended to reply back before my vacation and forgot
> about it, and then the vacation flushed it all away :-D
> So right now the assumption in Scale/GPFS is that the underlying
> storage doesn't have any form of volatile write cache enabled. The
> problem seems to be that even if we set REQ_FUA, some stacks or devices
> may not have implemented it at all, or not correctly, so even if we set
> it there is no guarantee that it will do what you think it does. Adding
> the flag would at least allow us to blame everything on the underlying
> stack/device, but I am not sure that will make anybody happy if bad
> things happen, so a non-volatile device will still be required at all
> times underneath Scale.
> So if you think we should do this, please open a PMR with the details
> of your test so it can go through the regular support path. You can
> mention me in the PMR as a reference, as we already looked at the
> places where the request would have to be added.
> 
> Sven
> 
> 
> On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
> 
>     Hi Sven,
> 
>     Just wondering if you've had any additional thoughts/conversations about
>     this.
> 
>     -Aaron
> 
>     On 9/8/17 5:21 PM, Sven Oehme wrote:
>     > Hi,
>     >
>     > the code assumption is that the underlying device has no volatile
>     > write cache. I was absolutely sure we had that somewhere in the FAQ,
>     > but I couldn't find it, so I will talk to somebody to correct this.
>     > If I understand
>     > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt
>     > correctly, one could enforce this by setting REQ_FUA, but that's not
>     > explicitly set today, at least I can't see it. I will discuss this
>     > with one of our devs who owns this code and come back.
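
(For reference, a minimal sketch of what setting that flag looks like at
the bio layer, following the writeback_cache_control document above.
This is not GPFS code, the helper is hypothetical, and the exact kernel
APIs vary by version; it roughly matches a 4.x-era kernel.)

#include <linux/bio.h>
#include <linux/blkdev.h>

/* Hypothetical helper, for illustration only: write one page and force
 * it to stable storage. REQ_PREFLUSH flushes the device's volatile cache
 * before the write; REQ_FUA forces the write itself past that cache. */
static int write_page_durable(struct block_device *bdev, struct page *page,
                              sector_t sector)
{
        struct bio *bio = bio_alloc(GFP_NOIO, 1);
        int ret;

        if (!bio)
                return -ENOMEM;

        bio_set_dev(bio, bdev);        /* older kernels: bio->bi_bdev = bdev */
        bio->bi_iter.bi_sector = sector;
        bio_add_page(bio, page, PAGE_SIZE, 0);
        bio->bi_opf = REQ_OP_WRITE | REQ_PREFLUSH | REQ_FUA;

        ret = submit_bio_wait(bio);    /* synchronous, for simplicity */
        bio_put(bio);
        return ret;
}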
>     >
>     > sven
>     >
>     >
>     > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>     >
>     >     Thanks Sven. I didn't think GPFS itself was caching anything on
>     >     that layer, but it's my understanding that O_DIRECT isn't
>     >     sufficient to force I/O to be flushed (e.g. the device itself
>     >     might have a volatile caching layer). Take someone using ZFS
>     >     zvols as NSDs: I can write() all day long to that zvol (even
>     >     with O_DIRECT) but there is absolutely no guarantee those writes
>     >     have been committed to stable storage and aren't just sitting in
>     >     RAM until an fsync() occurs (or some other bio operation that
>     >     causes a flush). I also don't believe writing to a SATA drive
>     >     with O_DIRECT will force cache flushes of the drive's writeback
>     >     cache... although I just tested that one and it seems to
>     >     actually trigger a SCSI cache sync. Interesting.
>     >
>     >     -Aaron
>     >
>     >     On 9/7/17 10:55 PM, Sven Oehme wrote:
>     >      > I am not sure what exactly you are looking for, but all block
>     >      > devices are opened with O_DIRECT; we never cache anything on
>     >      > this layer.
>     >      >
>     >      >
>     >      > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister <aaron.s.knister at nasa.gov> wrote:
>     >      >
>     >      >     Hi Everyone,
>     >      >
>     >      >     This is something that's come up in the past and has
>     >      >     recently resurfaced with a project I've been working on,
>     >      >     and it is this: it seems to me as though mmfsd never
>     >      >     attempts to flush the cache of the block devices it's
>     >      >     writing to (looking at blktrace output seems to confirm
>     >      >     this). Is this actually the case? I've looked at the GPL
>     >      >     headers for Linux and I don't see any sign of
>     >      >     blkdev_fsync, blkdev_issue_flush, WRITE_FLUSH, or
>     >      >     REQ_FLUSH. I'm sure there are other ways to trigger this
>     >      >     behavior that GPFS may very well be using that I've
>     >      >     missed. That's why I'm asking :)
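
(For reference, the explicit cache-flush primitive among those looks
roughly like this in kernels of that era; the signature has since
changed, the helper name is made up, and this is only a sketch, not
anything taken from the GPFS GPL layer.)

#include <linux/blkdev.h>

/* Hypothetical helper: ask the device to flush its volatile write cache
 * and wait for completion. In ~4.x kernels blkdev_issue_flush() takes a
 * gfp mask and an optional error sector; newer kernels dropped both. */
static int flush_volatile_cache(struct block_device *bdev)
{
        return blkdev_issue_flush(bdev, GFP_KERNEL, NULL);
}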
>     >      >
>     >      >     I figure that with FPO being pushed as an HDFS
>     >      >     replacement using commodity drives, this feature has
>     >      >     *got* to be in the code somewhere.
>     >      >
>     >      >     -Aaron

-- 
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776


