[gpfsug-discuss] snapshots causing filesystem quiesce

Jan-Frode Myklebust janfrode at tanso.net
Wed Feb 2 11:53:50 GMT 2022


Also, if snapshotting multiple filesets, it's important to group these into
a single mmcrsnapshot command. Then you get a single quiesce, instead of
one per fileset.

i.e. do:

    snapname=$(date --utc +@GMT-%Y.%m.%d-%H.%M.%S)
    mmcrsnapshot gpfs0 fileset1:$snapname,fileset2:$snapname,fileset3:$snapname

instead of:

    mmcrsnapshot gpfs0 fileset1:$snapname
    mmcrsnapshot gpfs0 fileset2:$snapname
    mmcrsnapshot gpfs0 fileset3:$snapname
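
A minimal sketch of how one could build that comma-separated argument when
there are many filesets (assuming bash; the fileset names below are just
placeholders):

    snapname=$(date --utc +@GMT-%Y.%m.%d-%H.%M.%S)
    filesets=(fileset1 fileset2 fileset3)   # replace with your fileset names
    args=""
    for f in "${filesets[@]}"; do
        args+="${f}:${snapname},"
    done
    # one mmcrsnapshot call -> one quiesce for all filesets
    mmcrsnapshot gpfs0 "${args%,}"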


  -jf


On Wed, Feb 2, 2022 at 12:07 PM Jordi Caubet Serrabou <
jordi.caubet at es.ibm.com> wrote:

> Ivano,
>
> if it happens frequently, I would recommend opening a support case.
>
> The creation or deletion of a snapshot requires a quiesce of the nodes to
> obtain a consistent point-in-time image of the file system and/or to
> update some internal structures, afaik. The quiesce covers the nodes of
> the storage cluster but also those of remote clusters. Quiesce means
> stopping activities (incl. I/O) for a short period of time to get such a
> consistent image, and waiting for any in-flight data to be flushed to
> disk, since unflushed data would prevent a consistent point-in-time image.
>
> Nodes receive a quiesce request and acknowledge when ready. When all nodes
> have acknowledged, the snapshot operation proceeds and I/O can resume
> immediately. It usually takes a few seconds at most and the operation
> itself is short, but the time I/O is stopped depends on how long it takes
> to quiesce the nodes. If some node takes longer to stop its activities, it
> delays the completion of the quiesce and keeps I/O paused on the rest.
> Many different things can cause a node to delay its quiesce ack.
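>
> To get a first hint of which nodes are slow to acknowledge, one rough
> approach (just a sketch, assuming you can run commands on the cluster
> nodes, e.g. via mmdsh) is to look at the waiters while the snapshot
> command is stuck:
>
>     # run on the nodes (or fan out with mmdsh) while mmcrsnapshot or
>     # mmdelsnapshot is hanging; long-running waiters hint at the nodes
>     # that are delaying the quiesce
>     mmdiag --waiters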
>
> The larger the cluster, the more difficult it gets; the more network
> congestion or I/O load, the more difficult it gets. I recommend opening a
> support ticket to identify which nodes are not acknowledging the quiesce
> and to find the root cause. If I recall a previous thread correctly, the
> default timeout is 60 seconds, which matches your log message. After that
> timeout the snapshot operation is considered failed.
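>
> To see how often this happens and how long the quiesce actually took, it
> should be enough to grep the GPFS log on the servers (the path below is
> the usual default location, adjust if yours differs):
>
>     grep 'Snapshot whole quiesce' /var/adm/ras/mmfs.log.latest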
>
> Support might help you understand the root cause and provide some
> recommendations if it happens frequently.
>
> Best Regards,
> --
> Jordi Caubet Serrabou
> IBM Storage Client Technical Specialist (IBM Spain)
>
>
> ----- Original message -----
> From: "Talamo Ivano Giuseppe (PSI)" <ivano.talamo at psi.ch>
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> To: "gpfsug main discussion list" <gpfsug-discuss at spectrumscale.org>
> Cc:
> Subject: [EXTERNAL] Re: [gpfsug-discuss] snapshots causing filesystem
> quiesce
> Date: Wed, Feb 2, 2022 11:45 AM
>
>
> Hello Andrew,
>
>
>
> Thanks for your questions.
>
>
>
> We're not experiencing any other issue/slowness during normal activity.
>
> The storage is a Lenovo DSS appliance with a dedicated SSD enclosure/pool
> for metadata only.
>
>
>
> The two NSD servers have 750 GB of RAM, of which 618 GB is configured as
> pagepool.
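>
> (For anyone wanting to verify such values, a quick check on the NSD
> servers, assuming standard Spectrum Scale commands, would be:)
>
>     mmlsconfig pagepool   # configured pagepool size
>     mmdiag --memory       # memory currently in use by the mmfsd daemon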
>
>
>
> The issue we see happens on both of the filesystems we have:
>
>
>
> - perf filesystem:
>
>  - 1.8 PB size (71% in use)
>
>  - 570 million inodes (24% in use)
>
>
>
> - tiered filesystem:
>
>  - 400 TB size (34% in use)
>
>  - 230 million files (60% in use)
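>
> (For reference, capacity and inode usage like the above can be pulled
> with mmdf, assuming the device names are simply perf and tiered:)
>
>     mmdf perf     # capacity per pool plus inode usage
>     mmdf tiered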
>
>
>
> Cheers,
>
> Ivano
>
>
>
>
>
>
> __________________________________________
> Paul Scherrer Institut
> Ivano Talamo
> WHGA/038
> Forschungsstrasse 111
> 5232 Villigen PSI
> Schweiz
>
> Telefon: +41 56 310 47 11
> E-Mail: ivano.talamo at psi.ch
>
>
>
>
> ------------------------------
> *From:* gpfsug-discuss-bounces at spectrumscale.org <
> gpfsug-discuss-bounces at spectrumscale.org> on behalf of Andrew Beattie <
> abeattie at au1.ibm.com>
> *Sent:* Wednesday, February 2, 2022 10:33 AM
> *To:* gpfsug main discussion list
> *Subject:* Re: [gpfsug-discuss] snapshots causing filesystem quiesce
>
> Ivano,
>
> How big is the filesystem in terms of number of files?
> How big is the filesystem in terms of capacity?
> Is the Metadata on Flash or Spinning disk?
> Do you see issues when users do an ls of the filesystem, or only when you
> are doing snapshots?
>
> How much memory do the NSD servers have?
> How much is allocated to the OS / Spectrum Scale pagepool?
>
> Regards
>
> Andrew Beattie
> Technical Specialist - Storage for Big Data & AI
> IBM Technology Group
> IBM Australia & New Zealand
> P. +61 421 337 927
> E. abeattie at au1.IBM.com
>
>
>
>
> On 2 Feb 2022, at 19:14, Talamo Ivano Giuseppe (PSI) <Ivano.Talamo at psi.ch>
> wrote:
>
>
>
>
>
> Dear all,
>
> For a while now we have been experiencing an issue when dealing with
> snapshots. Basically, when deleting a fileset snapshot (and maybe also when
> creating new ones) the filesystem becomes inaccessible on the clients for
> the duration of the operation (which can take a few minutes).
>
> The clients and the storage are on two different clusters, using remote
> cluster mount for the access.
>
> In the log files, many lines like the following appear (on both clusters):
> Snapshot whole quiesce of SG perf from xbldssio1 on this node lasted 60166
> msec
>
> By looking around I see we're not the first ones to hit this. I am
> wondering whether that's considered an unavoidable part of snapshotting
> and whether there's any tunable that can improve the situation, since when
> this occurs all the clients are stuck and users are very quick to complain.
>
> If it can help, the clients are running GPFS 5.1.2-1 while the storage
> cluster is on 5.1.1-0.
>
> Thanks,
> Ivano
>
>
>
>
>
>
>
>
>
>
>
>
> Salvo indicado de otro modo más arriba / Unless stated otherwise above:
>
> International Business Machines, S.A.
>
> Santa Hortensia, 26-28, 28002 Madrid
>
> Registro Mercantil de Madrid; Folio 1; Tomo 1525; Hoja M-28146
>
> CIF A28-010791
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>