[gpfsug-discuss] Backing up GPFS with Rsync

Honwai Leong honwai.leong at sydney.edu.au
Thu Mar 11 22:28:57 GMT 2021


This paper might provide some ideas. It is not the best solution, but it works fine:

https://github.com/HPCSYSPROS/Workshop20/blob/master/Parallelized_data_replication_of_multi-petabyte_storage_systems/ws_hpcsysp103s1-file1.pdf

It is a two-part workflow to replicate files from the production site to the DR site. It uses the snapshot ID to determine which files were created or modified after a snapshot was taken. That does not catch deletions or files moved from one directory to another, so dsync is used to take care of that part. 
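
For anyone who wants to prototype the snapshot-driven half of that workflow, here is a minimal sketch. The filesystem name, paths and the one-day window are invented, and the policy rule uses a simple modification-time cut-off rather than the paper's snapshot-ID comparison, so treat it as a starting point rather than the authors' actual tooling:

    # /tmp/changed.pol -- list files modified in the last day (the paper derives
    # the window from snapshot IDs instead of a fixed interval)
    RULE 'ext' EXTERNAL LIST 'changed' EXEC ''
    RULE 'recent' LIST 'changed'
      WHERE (CURRENT_TIMESTAMP - MODIFICATION_TIME) < INTERVAL '1' DAYS

    # Take a snapshot so the DR copy has a consistent point in time to work from
    mmcrsnapshot prodfs repl_20210311

    # -I defer only writes the candidate list (/tmp/repl.list.changed), it executes nothing
    mmapplypolicy prodfs -P /tmp/changed.pol -f /tmp/repl -I defer

    # List entries look like "inode gen snapid -- /full/path"; keep just the path,
    # then copy only what changed (dsync still has to reconcile deletes and renames)
    sed 's/.* -- //' /tmp/repl.list.changed > /tmp/changed.paths
    rsync -a --files-from=/tmp/changed.paths / drhost:/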

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of gpfsug-discuss-request at spectrumscale.org
Sent: Friday, March 12, 2021 3:08 AM
To: gpfsug-discuss at spectrumscale.org
Subject: gpfsug-discuss Digest, Vol 110, Issue 20

Send gpfsug-discuss mailing list submissions to
	gpfsug-discuss at spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
	https://protect-au.mimecast.com/s/NW07Cq71mwf8878NEuZEWhS?domain=gpfsug.org
or, via email, send a message with subject or body 'help' to
	gpfsug-discuss-request at spectrumscale.org

You can reach the person managing the list at
	gpfsug-discuss-owner at spectrumscale.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..."


Today's Topics:

   1. Re: Fwd: FW:  Backing up GPFS with Rsync (Steven Daniels)


----------------------------------------------------------------------

Message: 1
Date: Thu, 11 Mar 2021 09:08:11 -0700
From: "Steven Daniels" <sadaniel at us.ibm.com>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Cc: gpfsug-discuss-bounces at spectrumscale.org, bill.burke.860 at gmail.com
Subject: Re: [gpfsug-discuss] Fwd: FW:  Backing up GPFS with Rsync
Message-ID:
	<OF7D742C62.38489C0C-ON00258695.00580EF0-87258695.0058A404 at notes.na.collabserv.com>
	
Content-Type: text/plain; charset="utf-8"

Also, be aware there have been massive improvements in AFM in terms of usability, reliability and performance.

I just completed a project where we moved about 3/4 PB during 7x24 operations to retire a very old storage system (1st Gen IBM GSS) in favor of a new ESS. We were able to get considerable performance, though not without effort, and it allowed the client to continue operations and migrate to the new hardware seamlessly.

The new v5.1 release adds support for filesystem-level AFM, which would have greatly simplified the effort, and I believe it will make AFM vastly easier to implement in the general case.
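
(For reference only: a classic fileset-level AFM relationship is created roughly as in the sketch below. The filesystem, fileset and target names are invented, and the new filesystem-level variant in 5.1 has its own setup that the documentation covers, so check there for the exact syntax.)

    # Hedged sketch: single-writer AFM fileset pointing at a remote GPFS target
    mmcrfileset prodfs drcache \
        -p afmmode=sw,afmtarget=gpfs:///gpfs/remotefs/drhome \
        --inode-space new
    mmlinkfileset prodfs drcache -J /gpfs/prodfs/drcache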

I'll leave it to Venkat and others on the development team to share more details about improvements.


Steven A. Daniels
Cross-brand Client Architect
Senior Certified IT Specialist
National Programs
Fax and Voice: 3038101229
sadaniel at us.ibm.com
https://protect-au.mimecast.com/s/ZnryCr81nyt88D8ZkuztwY-?domain=ibm.com




From:	Stephen Ulmer <ulmer at ulmer.org>
To:	gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Cc:	bill.burke.860 at gmail.com
Date:	03/11/2021 06:47 AM
Subject:	[EXTERNAL] Re: [gpfsug-discuss] Fwd: FW:  Backing up GPFS with
            Rsync
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



Thank you! Would you mind letting me know in what era you made your evaluation?

I'm not suggesting you should change anything at all, but when I make recommendations for my own customers I like to be able to associate the level of GPFS with the anecdotes. I view the software as more of a stream of features and capabilities than as a set product.

Different clients have different requirements, so every implementation could be different. When I add someone else's judgement to my own, I just like getting as close to their actual evaluation scenario as possible.

Your original post was very thoughtful, and I appreciate your time.

 --
Stephen


      On Mar 11, 2021, at 7:58 AM, Tagliavini, Enrico
      <enrico.tagliavini at fmi.ch> wrote:

      Hello Stephen,

      actually not a dumb question at all. We evaluated AFM quite a bit
      before turning it down.

      The horror stories about it and massive data loss are too scary, and
      we had actual reports of very bad performance. Personally I think AFM
      is very complicated, overcomplicated for what we need. We need the
      data safe; we don't need active/active DR or anything like that.
      While AFM can technically do what we need, the complexity of its
      design makes it too easy to make a mistake and cause a service
      disruption or, even worse, data loss. We are a very small institute
      with a small IT team, so investing the time to get it right was not
      worth it for us given the high TCO.

      Kind regards.

      --
      Enrico Tagliavini
      Systems / Software Engineer

      enrico.tagliavini at fmi.ch

      Friedrich Miescher Institute for Biomedical Research
      Informatics

      Maulbeerstrasse 66
      4058 Basel
      Switzerland





      On Thu, 2021-03-11 at 08:17 -0500, Stephen Ulmer wrote:
        I'm going to ask what may be a dumb question:

        Given that you have GPFS on both ends, what made you decide to NOT
        use AFM?

         --
        Stephen


         On Mar 11, 2021, at 3:56 AM, Tagliavini, Enrico
         <enrico.tagliavini at fmi.ch> wrote:

          Hello William,

          I got your email forwarded by another user and decided to
          subscribe to give you my two cents.

          I would like to warn you about the risk of doing what you have in
          mind. Using the GPFS policy engine to get a list of files to
          rsync can easily leave you with missing data in the backup,
          because there are cases it does not cover. For example, if you mv
          a folder with a lot of nested subfolders and files, none of the
          subfolders will show up in your list of files to be updated.
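
          To make that failure mode concrete, here is a tiny illustration
          (the paths are invented): the rename below changes the ctime of
          the moved directory itself, but every file underneath it keeps
          its old mtime and ctime, so a time-based policy scan never
          selects them, and a path-based rsync of the resulting list misses
          them on the backup side.

              mv /gpfs/prodfs/groupX/projectA /gpfs/prodfs/archive/projectA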

         DM API would be the way to go, as you could replicate the mv on
         the backup side, but you must not miss any event, which scares me
         enough not to go that route.

          What I ended up doing instead: we run GPFS on both sides, main
          and backup storage, so I use the policy engine on both sides and
          just compute the differences between the two file lists. We have
          about 250 million files and this is surprisingly fast. On top of
          that I add all the files whose ctime changed in the last couple
          of days (to update metadata info).
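
          A hypothetical sketch of that two-sided comparison (the policy
          file, device names and mount points are made up, and the real
          scripts handle escaping, batching and the actual rsync/delete
          step):

              # /tmp/listall.pol -- plain LIST of every file (run on both clusters)
              RULE 'ext' EXTERNAL LIST 'all' EXEC ''
              RULE 'all' LIST 'all'

              # -I defer only writes the candidate lists, it executes nothing
              mmapplypolicy prodfs   -P /tmp/listall.pol -f /tmp/prod   -I defer   # main site
              mmapplypolicy backupfs -P /tmp/listall.pol -f /tmp/backup -I defer   # backup site

              # Keep only the path, relative to the mount point, so the two sides compare
              sed 's/.* -- //; s|^/gpfs/prodfs/||'   /tmp/prod.list.all   | sort > /tmp/prod.paths
              sed 's/.* -- //; s|^/gpfs/backupfs/||' /tmp/backup.list.all | sort > /tmp/backup.paths

              # Presence differences: new files to copy, stale files to remove on the backup
              comm -23 /tmp/prod.paths /tmp/backup.paths > /tmp/to_copy
              comm -13 /tmp/prod.paths /tmp/backup.paths > /tmp/to_delete

              # A second rule on the main site, for example
              #   WHERE (CURRENT_TIMESTAMP - CHANGE_TIME) < INTERVAL '2' DAYS
              # picks up recently changed files so content and metadata get re-synced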

         Good luck.
         Kind regards.

         --

         Enrico Tagliavini
         Systems / Software Engineer

         enrico.tagliavini at fmi.ch

         Friedrich Miescher Institute for Biomedical Research
          Informatics

         Maulbeerstrasse 66
         4058 Basel
         Switzerland




         -------- Forwarded Message --------

           -----Original Message-----
           From: gpfsug-discuss-bounces at spectrumscale.org
           <gpfsug-discuss-bounces at spectrumscale.org> On Behalf Of Ryan
           Novosielski
           Sent: Wednesday, March 10, 2021 3:22 AM
           To: gpfsug main discussion list
           <gpfsug-discuss at spectrumscale.org>
           Subject: Re: [gpfsug-discuss] Backing up GPFS with Rsync

           Yup, you want to use the policy engine:

           https://protect-au.mimecast.com/s/5FXFCvl1rKi77y78YhzCNU5?domain=ibm.com

           Something in here ought to help. We do something like this (but
           I'm reluctant to provide examples, as I'm actually suspicious
           that we don't have it quite right and are passing far too much
           stuff to rsync).

           --
           #BlackLivesMatter
           ____
           || \\UTGERS,    |---------------------------*O*---------------------------
           ||_// the State |         Ryan Novosielski - novosirj at rutgers.edu
           || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus
           ||  \\    of NJ | Office of Advanced Research Computing - MSB C630, Newark
                `'

            On Mar 9, 2021, at 9:19 PM, William Burke
            <bill.burke.860 at gmail.com> wrote:

             I would like to know which files were modified/created/deleted
             (only for the current day) on the GPFS file system, so that I
             could rsync ONLY those files to a predetermined external
             location. I am running GPFS 4.2.3.9.

             Is there a way to access GPFS's metadata directly so that I do
             not have to traverse the filesystem looking for these files?
             If I use the rsync tool it will scan the file system, which is
             400+ million files. Obviously it will be problematic to
             complete a scan in a day, if it would ever complete
             single-threaded. There are tools and scripts that run
             multithreaded rsync, but that is still a brute-force approach,
             and it would be nice to know what the delta of changed files
             is.

             I began looking at the Spectrum Scale Data Management (DM)
             API, but I am not sure it is the best approach to looking at
             the GPFS metadata - inodes, modify times, creation times, etc.



            --

            Best Regards,

            William Burke (he/him)
            Lead HPC Engineer
             Advanced Research Computing
            860.255.8832 m | LinkedIn




------------------------------

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
https://protect-au.mimecast.com/s/NW07Cq71mwf8878NEuZEWhS?domain=gpfsug.org


End of gpfsug-discuss Digest, Vol 110, Issue 20
***********************************************



