[gpfsug-discuss] Can't take snapshots while re-striping

Sven Oehme oehmes at gmail.com
Thu Oct 18 19:09:56 BST 2018


Peter, 

If the two operations weren't compatible, you would have gotten a different message.
To understand what the message means, one needs to understand how the snapshot code works.
When GPFS wants to take a snapshot it goes through multiple phases: it first flushes all dirty data, then flushes newly dirtied data a second time, and then tries to quiesce the filesystem. How this is done is quite complex, so let me try to explain.

How much parallelism is used for the two sync phases is controlled by the sync worker settings:

  sync1WorkerThreads 64
  sync2WorkerThreads 64
  syncBackgroundThreads 64
  syncWorkerThreads 64

If my memory serves me correctly, the sync1 number is for the first flush and the sync2 number for the second flush, while syncWorkerThreads are used explicitly by commands such as mmcrsnapshot to flush dirty data (I am sure somebody from IBM will correct me if I state something wrong; I have mixed them up before).
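
If you want to see what your cluster is actually using, you can dump the active configuration and filter for these settings. A minimal sketch (the prompt and node name are illustrative; output format differs between releases):

root@node:~# mmdiag --config | grep -i sync     <--- active values on this node
root@node:~# mmlsconfig | grep -i sync          <--- configured values for the cluster

(mmdiag --config shows the live daemon configuration; mmlsconfig shows what is stored in the cluster configuration.)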

Background flushing of dirty data is triggered by the OS:

root@dgx-1-01:~# sysctl -a | grep -i vm.dirty
vm.dirty_background_bytes = 0
vm.dirty_background_ratio = 10
vm.dirty_bytes = 0
vm.dirty_expire_centisecs = 3000
vm.dirty_ratio = 20
vm.dirty_writeback_centisecs = 500   <--- this is 5 seconds

as well as GPFS settings : 

  syncInterval 5
  syncIntervalStrict 0

Here both the OS writeback interval and the GPFS syncInterval are set to 5 seconds, so every 5 seconds a periodic background flush happens.
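
If you want to inspect or change the GPFS side of this, something like the following works (a sketch; whether a different interval helps your workload is something you would have to test, not a recommendation):

root@node:~# mmlsconfig syncInterval            <--- show the configured value
root@node:~# mmchconfig syncInterval=10 -i      <--- example only: -i applies immediately and persists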

Why explain all this? Because it is very easy for a thread doing buffered I/O to make data dirty: a single thread can do hundreds of thousands of I/Os into memory, so dirtying data is cheap. The worker threads described above then need to clean all of it, meaning stabilize it onto media, and this is where it gets complicated.

You are already running a rebalance, which puts a lot of load on the disks, and on top of that I assume you don't have an idle filesystem, so people keep dirtying data while the threads above compete to flush it. It's a battle they can't really win unless you have very fast storage, or at least very fast and large caches in the storage, so that the 64 threads in the example above can clean data faster than new data gets dirtied.
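If you want to watch this happen, start a buffered writer and watch the kernel's dirty-page counters climb (a sketch; /gpfs/fs1 is a made-up mount point):

root@node:~# dd if=/dev/zero of=/gpfs/fs1/dirtytest bs=1M count=1024 &    <--- buffered writes, dirties the page cache
root@node:~# watch -n1 'grep -E "Dirty|Writeback" /proc/meminfo'          <--- Dirty climbs, then drains as the flushers catch up
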

So your choices are:
1. Reduce workerThreads, so less data gets dirtied at a time.
2. Turn writes into stable writes: mmchconfig forceOSyncWrites=yes (you can use -I while running; see the sketch below). This will slow down all write operations on your system, as every write is now done synchronously, but because of that they can't make anything dirty, so the flushers don't have any work left to do.
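
If you want to try option 2 just around the nightly snapshot, a minimal sketch (forceOSyncWrites and -I are as above; with -I the change takes effect immediately but does not survive a GPFS restart, and 'home' is the filesystem from your mmcrsnapshot command):

root@node:~# mmchconfig forceOSyncWrites=yes -I    <--- all writes become stable
root@node:~# mmcrsnapshot home test                <--- the quiesce no longer races new dirty data
root@node:~# mmchconfig forceOSyncWrites=no -I     <--- back to normal buffered behaviour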

While back at IBM I proposed changing the code to switch into O_SYNC mode dynamically between sync1 and sync2. For a second or two all writes would then be done synchronously, so nothing could be made dirty and the quiesce wouldn't get delayed; as soon as the quiesce had happened, the temporarily enforced stable flag would be removed again. But that proposal never got anywhere, as no customer pushed for it. Maybe it would be worth an RFE.
 

Btw, I described some of these parameters in more detail here --> http://files.gpfsug.org/presentations/2014/UG10_GPFS_Performance_Session_v10.pdf
Some of that is outdated by now, but it is probably still the best summary presentation out there.

Sven

On 10/18/18, 8:32 AM, "Peter Childs" <gpfsug-discuss-bounces at spectrumscale.org on behalf of p.childs at qmul.ac.uk> wrote:

    We've just added 9 RAID volumes to our main storage (5 RAID6 arrays
    for data and 4 RAID1 arrays for metadata).
    
    We are now attempting to rebalance our data across all the volumes.
    
    We started with the metadata, doing an "mmrestripefs -r", as we'd
    changed the failure groups on our metadata disks and wanted to ensure
    we had all our metadata on known good SSDs. No issues here; we could
    take snapshots, and I even tested it. (New SSDs on a new failure
    group, and all the old SSDs moved to the same failure group.)
    
    We're now doing an "mmrestripefs -b" to rebalance the data across all
    21 volumes; however, when we attempt to take a snapshot, as we do
    every night at 11pm, it fails with:
    
    sudo /usr/lpp/mmfs/bin/mmcrsnapshot home test
    Flushing dirty data for snapshot :test...
    Quiescing all file system operations.
    Unable to quiesce all nodes; some processes are busy or holding
    required resources.
    mmcrsnapshot: Command failed. Examine previous error messages to
    determine cause.
    
    Are you meant to be able to take snapshots while re-striping or not? 
    
    I know a rebalance of the data is probably unnecessary, but we'd like
    to get the best possible speed out of the system, and we also kind of
    like balance.
    
    Thanks
    
    
    -- 
    Peter Childs
    ITS Research Storage
    Queen Mary, University of London
    