[gpfsug-discuss] naive question about rsync: run it on a client or on NSD server?

Sanchez, Paul Paul.Sanchez at deshaw.com
Fri Feb 14 16:24:40 GMT 2020


Some (perhaps obvious) points to consider:

 - There are some corner cases (e.g. preserving hard-linked files or sparseness) which require special options.

 - Depending on your level of churn, it may be helpful to pre-stage the sync before your cutover so that there is less data movement required, and you're primarily comparing metadata.

 - Files on the source filesystem might change (and become internally inconsistent) during your rsync, so you should generally sync from a snapshot on the source.

 - If users can still modify the source filesystem, then you might not get everything.  For the final sync, you may need to make the source read-only, or unmount it on clients, kill user processes, or some combination to prevent all new writes from succeeding.  (If you're going to use the clients for MPI sync, you obviously need the filesystem to remain mounted there so you may need to take other measures to keep users away.)

 - If you decide to do a final "offline" sync, you want it to be fast so users can get back to work sooner, so parallelism is usually a must.  If you have lots of filesets, then that's a convenient way to split the work.

 - If you have any filesets with many more inodes than the others, keep in mind that those will likely take the longest to complete.

 - Test, test, test.  You usually won't get this right on the first go or know how long a full sync takes without practice.  Remember that you'll need to employ options to delete extraneous files on the target when you're syncing over the top of a previous attempt, since files intentionally deleted on the source aren't usually welcome if they reappear after a migration.

 - Verify.  Whether you use rsync or dsync, repeating the process with dry-run/no-op flags which report differences can be helpful to increase your confidence in the process.  If you don't have time to verify after the final offline sync, hopefully you were able to fit this in during testing.
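
To make the option handling above concrete, here is a minimal sketch (not a recommended final command line) of how the rsync invocation might be assembled for the pre-stage, final, and verify passes. The helper name build_rsync_cmd and the paths in the example are hypothetical; the flags themselves are standard rsync options: -a (archive), -H (preserve hard links), -S (handle sparse files), --delete (remove extraneous files on the target), -n (dry run) and -i (itemize changes).

```python
def build_rsync_cmd(src, dst, delete=False, dry_run=False):
    """Return an rsync argument list for one pass (hypothetical helper)."""
    # -a: archive mode, -H: preserve hard links, -S: handle sparse files
    cmd = ["rsync", "-a", "-H", "-S"]
    if delete:
        # Remove files on the target that no longer exist on the source.
        cmd.append("--delete")
    if dry_run:
        # -n: change nothing, -i: itemize what would differ (verify pass).
        cmd += ["-n", "-i"]
    # A trailing slash on the source copies the directory's contents
    # rather than creating a nested directory on the target.
    cmd += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return cmd


if __name__ == "__main__":
    # Hypothetical paths: a snapshot of the source and a target directory.
    snap = "/gpfs/old/.snapshots/presync/projects"
    target = "/gpfs/new/projects"
    print(" ".join(build_rsync_cmd(snap, target, delete=True)))
    print(" ".join(build_rsync_cmd(snap, target, delete=True, dry_run=True)))
```

Running the second form after a real pass and seeing no itemized output is one way to do the verification step; whether that is sufficient depends on how much you trust size and mtime comparisons in your environment.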


Some thoughts about whether it's appropriate to use NSD servers as sync hosts...

 - If they are the managers and they have the best (direct) connectivity to the metadata NSDs, then I would at least consider them before ruling this out, with caveats...
     - do they have enough available RAM and CPU?
     - where do they get their software? Do you trust the version of kernel/libc/rsync there to behave as you expect?
     - if the data NSDs aren't local to these NSD servers, do they have sufficient network connectivity to not cause other problems during the sync?

 - Test at low parallelism and work your way up.  You can also compare performance of this method with any other, on a small scale, in your environment to see what you can expect from each.
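
As a rough illustration of splitting the work by fileset and ramping up parallelism, the sketch below starts the filesets with the most inodes first and caps the number of concurrent syncs. Everything named here is an assumption for illustration: the fileset names and inode counts are made up, and sync_one stands in for whatever actually runs your per-fileset rsync or dsync.

```python
from concurrent.futures import ThreadPoolExecutor


def order_longest_first(inode_counts):
    """Return fileset names sorted by inode count, largest first."""
    return sorted(inode_counts, key=inode_counts.get, reverse=True)


def run_parallel(filesets, sync_one, workers=2):
    """Call sync_one(name) per fileset, at most `workers` at a time.

    Tasks are queued in the given order, so the filesets listed first
    are the first to start.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(zip(filesets, pool.map(sync_one, filesets)))


if __name__ == "__main__":
    # Made-up inode counts per fileset, for illustration only.
    counts = {"home": 40_000_000, "scratch": 250_000_000, "apps": 2_000_000}

    def sync_one(name):
        # Placeholder: a real version would run the per-fileset sync
        # and return its exit status.
        return 0

    order = order_longest_first(counts)
    # Start with a small worker count and raise it between test runs.
    print(run_parallel(order, sync_one, workers=2))
```

Starting the largest filesets first keeps them from becoming the tail of the run, and raising workers one step at a time between test runs makes it easier to see where the sync hosts or the storage stop scaling.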

Good luck, 
Paul

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Simon Thompson
Sent: Friday, February 14, 2020 09:57
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] naive question about rsync: run it on a client or on NSD server?


I wouldn't run it on an NSD server. Ideally you want to avoid running other processes etc on there.

If you are running on clients, you also might want to look at: https://github.com/hpc/mpifileutils

And use MPI to parallelise the find and copy.

Simon

On 14/02/2020, 14:25, "gpfsug-discuss-bounces at spectrumscale.org on behalf of giovanni.bracco at enea.it" <gpfsug-discuss-bounces at spectrumscale.org on behalf of giovanni.bracco at enea.it> wrote:

    We must replicate about 100 TB data between two filesystems supported by
    two different storages (DDN9900 and DDN7990) both connected to the same
    NSD servers (6 of them) and we plan to use rsync.

    No special GPFS attributes, just the standard POSIX ones, so we plan to
    use the standard rsync.

    The question:
    is there any advantage in running the rsync on one of the NSD servers,
    or is it better to run it on a client?

    The environment:
    GPFS 4.2.3.19, NSD CentOS7.4, clients mostly CentOS6.4 (connected by IB
    QDR) and CentOS7.3 (connected by OPA), connection between NSD and
    storage with IB QDR

    Giovanni

    --
    Giovanni Bracco
    phone  +39 351 8804788
    E-mail  giovanni.bracco at enea.it
    WWW http://www.afs.enea.it/bracco
    _______________________________________________
    gpfsug-discuss mailing list
    gpfsug-discuss at spectrumscale.org
    http://gpfsug.org/mailman/listinfo/gpfsug-discuss

