[gpfsug-discuss] Fwd: FW: Backing up GPFS with Rsync

Tagliavini, Enrico enrico.tagliavini at fmi.ch
Thu Mar 11 09:22:46 GMT 2021


Hello William,

Your email was forwarded to me by another user, and I decided to subscribe to give you my two cents.

I would like to warn you about the risks of doing what you have in mind. Using the GPFS policy engine to get a list of files to rsync
can easily leave you with missing data in the backup, because there are cases it does not cover. For example, if you mv a folder with
many nested subfolders and files, none of those subfolders or files will show up in your list of files to be updated, since the rename
does not touch the timestamps of anything inside the moved tree.
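
[Editor's note: to make that failure mode concrete, here is a quick illustration (plain Linux shell, hypothetical scratch paths, not
the original poster's setup) of why a timestamp-based scan misses the contents of a renamed directory:]

    mkdir -p /gpfs/fs0/projA/sub
    echo data > /gpfs/fs0/projA/sub/file.dat
    stat -c '%n mtime=%Y ctime=%Z' /gpfs/fs0/projA/sub/file.dat   # note the timestamps
    mv /gpfs/fs0/projA /gpfs/fs0/projB                            # rename the whole tree
    stat -c '%n mtime=%Y ctime=%Z' /gpfs/fs0/projB/sub/file.dat   # timestamps are unchanged:
                                                                  # only the path changed, so a
                                                                  # policy rule selecting on
                                                                  # mtime/ctime never lists it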

The DM API would be the way to go, as you could replicate the mv on the backup side, but you must not miss a single event, which
scares me enough not to go that route.

What I ended up doing instead: we run GPFS on both sides, main and backup storage, so I use the policy engine on both sides and simply
compute the differences between the two file lists. We have about 250 million files and this is surprisingly fast. On top of that, I
add all the files whose ctime changed in the last couple of days (to pick up metadata-only updates).
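
[Editor's note: a rough, untested sketch of that scheme follows. The filesystem name /gpfs/fs0, the paths, and the handling of the
mmapplypolicy list files are made up, and the exact list-file format differs between releases, so treat the sed commands as
placeholders to adapt rather than something to trust as-is.]

    changed-files.pol (run this LIST rule on BOTH the primary and the backup cluster):

      RULE 'extall' EXTERNAL LIST 'allfiles' EXEC ''
      RULE 'all' LIST 'allfiles'
           SHOW(VARCHAR(FILE_SIZE) || ' ' || VARCHAR(MODIFICATION_TIME))

    recent-ctime.pol (files whose ctime changed in the last 2 days, for metadata-only updates):

      RULE 'extrecent' EXTERNAL LIST 'recent' EXEC ''
      RULE 'recentctime' LIST 'recent'
           WHERE (CURRENT_TIMESTAMP - CHANGE_TIME) < INTERVAL '2' DAYS

    Then, roughly:

      # defer execution so mmapplypolicy just writes the file lists to disk
      mmapplypolicy /gpfs/fs0 -P changed-files.pol -I defer -f /tmp/primary
      # (same command on the backup cluster, producing e.g. /tmp/backup.list.allfiles)

      # drop the per-cluster inode/gen/snapshot prefix, keep "<size> <mtime> -- <path>", sort
      sed 's/^[0-9][0-9]* [0-9][0-9]* [0-9][0-9]* *//' /tmp/primary.list.allfiles | sort > primary.sorted
      sed 's/^[0-9][0-9]* [0-9][0-9]* [0-9][0-9]* *//' /tmp/backup.list.allfiles  | sort > backup.sorted

      # lines present (or different) on the primary only -> files to copy
      comm -23 primary.sorted backup.sorted | sed 's/.* -- //' > files-to-rsync
      # (comm -13 would give the backup-only lines, i.e. deletion candidates)

      # add the recent-ctime list, then feed everything to rsync
      mmapplypolicy /gpfs/fs0 -P recent-ctime.pol -I defer -f /tmp/recent
      sed 's/.* -- //' /tmp/recent.list.recent >> files-to-rsync
      sort -u -o files-to-rsync files-to-rsync
      rsync -aH --files-from=files-to-rsync / backuphost:/backup/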

Good luck.
Kind regards.

-- 

Enrico Tagliavini
Systems / Software Engineer

enrico.tagliavini at fmi.ch

Friedrich Miescher Institute for Biomedical Research
Informatics

Maulbeerstrasse 66
4058 Basel
Switzerland




-------- Forwarded Message --------
> 
> -----Original Message-----
> From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Ryan Novosielski
> Sent: Wednesday, March 10, 2021 3:22 AM
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Subject: Re: [gpfsug-discuss] Backing up GPFS with Rsync
> 
> Yup, you want to use the policy engine:
> 
> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_policyrules.htm
> 
> Something in here ought to help. We do something like this (but I’m reluctant to provide examples as I’m actually suspicious that we
> don’t have it quite right and are passing far too much stuff to rsync).
> 
> --
> #BlackLivesMatter
> ____
> || \\UTGERS,      |---------------------------*O*---------------------------
> ||_// the State   |         Ryan Novosielski - novosirj at rutgers.edu
> || \\ University  | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
> ||  \\    of NJ   | Office of Advanced Research Computing - MSB C630, Newark
>      `'
> 
> > On Mar 9, 2021, at 9:19 PM, William Burke <bill.burke.860 at gmail.com> wrote:
> > 
> >  I would like to know what files were modified/created/deleted (only for the current day) on the GPFS file system, so that I
> > could rsync ONLY those files to a predetermined external location. I am running GPFS 4.2.3.9.
> > 
> > Is there a way to access GPFS's metadata directly, so that I do not have to traverse the filesystem looking for these files? If
> > I use the rsync tool it will scan the file system, which holds 400+ million files. Obviously it will be problematic to complete a
> > scan in a day, if it would ever complete single-threaded. There are tools and scripts that run multithreaded rsync, but that is
> > still a brute-force approach, and it would be nice to know the delta of files that have changed.
> > 
> > I began looking at the Spectrum Scale Data Management (DM) API, but I am not sure if this is the best approach to looking at the
> > GPFS metadata - inodes, modify times, creation times, etc.
> > 
> > 
> > 
> > --
> > 
> > Best Regards,
> > 
> > William Burke (he/him)
> > Lead HPC Engineer
> > Advance Research Computing
> > 860.255.8832 m | LinkedIn
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

