<html><body><p><tt>> Under both 3.2 and 3.3 mmbackup would always lock up our cluster when  <br>> using snapshot. I never understood the behavior without snapshot, and  <br>> the lock up was intermittent in the carved-out small test cluster, so  <br>> I never felt confident enough to deploy over the larger 4000+ clients  <br>> cluster.<br></tt><br><tt>Back then, the GPFS code had a deficiency: migrating very large files didn't work well with snapshots (or with some other mm commands).  To create a snapshot, we need the file system to be in a consistent state for a moment, and we get there by performing a "quiesce" operation.  This is done by flushing all dirty buffers to disk, stopping any new incoming file system operations at the gates, and waiting for all in-flight operations to finish.  This works well when all in-flight operations actually finish reasonably quickly.</tt><br><br><tt>That assumption broke when an external utility, e.g. mmapplypolicy, called the gpfs_restripe_file API on a very large file, e.g. to migrate the file's blocks to a different storage pool.  The quiesce operation had to wait for that API call to finish, since it is an in-flight operation, but migrating a multi-TB file could take a while, and during that time all new file system operations were blocked.  This was solved several years ago by changing the API and its callers to do the migration one block range at a time, which keeps each individual syscall short and lets quiesce barge in and do its thing.  All currently supported levels of GPFS have this fix.  I believe mmbackup was affected by the same GPFS deficiency and benefited from the same fix.</tt><br><br><tt>yuri</tt><BR>
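<br><tt>To illustrate why splitting the migration into block ranges helps, here is a toy model in Python. It is only a sketch, not GPFS code: the function names, the block counts, and the "one time unit per block" cost are all made-up assumptions. It computes how long a quiesce request has to wait for the single in-flight call to finish, first when the whole file is migrated in one call and then when it is migrated one block range per call.</tt><br>

```python
def quiesce_wait(op_lengths, request_time):
    """Toy model: operations run back to back, each taking `length` time units.

    Returns how long a quiesce requested at `request_time` must wait for the
    operation that is in flight at that moment. New operations are assumed to
    be held at the gate once quiesce has been requested.
    """
    clock = 0
    for length in op_lengths:
        finish = clock + length
        if clock <= request_time < finish:
            return finish - request_time
        clock = finish
    return 0  # nothing in flight at request_time


def split_into_ranges(total_blocks, range_size):
    """Split a migration of `total_blocks` into per-call block counts."""
    return [min(range_size, total_blocks - start)
            for start in range(0, total_blocks, range_size)]


# Hypothetical numbers: one time unit per block migrated.
TOTAL_BLOCKS = 1_000_000
REQUEST_TIME = 12_345

# Old behavior: one call migrates the whole file.
whole_file = quiesce_wait([TOTAL_BLOCKS], REQUEST_TIME)

# Fixed behavior: one call per block range of 1000 blocks.
chunked = quiesce_wait(split_into_ranges(TOTAL_BLOCKS, 1_000), REQUEST_TIME)

print(whole_file, chunked)  # 987655 655
```

<tt>In this model the wait is bounded by the size of one block range rather than by the size of the file, which is the point of the fix described above.</tt><br>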

</body></html>