<div dir="auto">Might this be a case of being overbuilt?  In the old days you could really mess up an Oracle DW by giving it too much RAM. It would spend all day reading data in and out of RAM that it didn't really need, because the SGA was large enough to load the whole table.<div dir="auto"><br></div><div dir="auto">Perhaps the pagepool is so large that the time it takes to flush that much RAM is what actually hits the timeout?<div dir="auto"><br></div><div dir="auto">My environment has only a million files, but quite a bit more storage, and only an 8 GB pagepool.  It seems you are saying you have 618 GB of RAM for pagepool. Even at 8 GB/second, that would take about 77 seconds to flush out.</div><div dir="auto"><br></div><div dir="auto">Perhaps cut the pagepool in half and see if your timeout adjusts accordingly?</div><div dir="auto"><br></div><div dir="auto">Alec</div></div><br><br><div class="gmail_quote" dir="auto"><div dir="ltr" class="gmail_attr">On Wed, Feb 2, 2022, 4:09 AM Olaf Weiser <<a href="mailto:olaf.weiser@de.ibm.com">olaf.weiser@de.ibm.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr" style="font-family:Arial,Helvetica,sans-serif;font-size:10pt"><div dir="ltr">Keep in mind: creating many snapshots means ;-) you'll have to delete many snapshots.</div>

<div dir="ltr">At a certain scale, which depends on #files, #directories, workload, #nodes, #networks, etc., we've seen cases where taking just full snapshots (whole file system) is a better approach than maintaining snapshots for each fileset individually.</div>

<div dir="ltr"> </div>

<div dir="ltr">Sure, this has other side effects, like space consumption etc.</div>

<div dir="ltr">So, as always, it depends.</div>

<div dir="ltr"> </div>

<div dir="ltr"> </div>

<div dir="ltr"> </div>

<blockquote dir="ltr" style="border-left:solid #aaaaaa 2px;margin-left:5px;padding-left:5px;direction:ltr;margin-right:0px">----- Original message -----<br>From: "Jan-Frode Myklebust" <<a href="mailto:janfrode@tanso.net" target="_blank" rel="noreferrer">janfrode@tanso.net</a>><br>Sent by: <a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank" rel="noreferrer">gpfsug-discuss-bounces@spectrumscale.org</a><br>To: "gpfsug main discussion list" <<a href="mailto:gpfsug-discuss@spectrumscale.org" target="_blank" rel="noreferrer">gpfsug-discuss@spectrumscale.org</a>><br>Cc:<br>Subject: [EXTERNAL] Re: [gpfsug-discuss] snapshots causing filesystem quiesce<br>Date: Wed, 2 Feb 2022 12:54<br> 

<div dir="ltr">Also, if snapshotting multiple filesets, it's important to group these into a single mmcrsnapshot command. Then you get a single q<span style="font-size:13.333333015441895px">uiesce, instead of one per fileset.</span>

<div> </div>

<div><span style="font-size:13.333333015441895px">i.e. do:</span></div>

<div> </div>

<div><span style="font-size:13.333333015441895px">    snapname=</span>$(date --utc +@GMT-%Y.%m.%d-%H.%M.%S)</div>

<div><span style="font-size:13.333333015441895px">    mmcrsnapshot gpfs0 fileset1:$snapname,fileset2:$snapname,fileset3:$snapname</span></div>

<div> </div>

<div><span style="font-size:13.333333015441895px">instead of:</span></div>

<div> </div>

<div><span style="font-size:13.333333015441895px">    </span><span style="font-size:13.333333015441895px">mmcrsnapshot gpfs0 fileset1:$snapname</span></div>

<div><span style="font-size:13.333333015441895px">    </span><span style="font-size:13.333333015441895px">mmcrsnapshot gpfs0 fileset2:$snapname</span></div>

<div><span style="font-size:13.333333015441895px">    </span><span style="font-size:13.333333015441895px">mmcrsnapshot gpfs0 fileset3:$snapname</span><span style="font-size:13.333333015441895px">   </span></div>
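<div>A minimal sketch of building that single grouped argument in a script, assuming bash and GNU date. The helper name build_snapshot_args is illustrative and not part of GPFS, the fileset names and the gpfs0 device are the placeholders from the example above, and the mmcrsnapshot command is only echoed here rather than run:</div>

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GPFS): joins "fileset:snapname" pairs with
# commas so that all filesets go into ONE mmcrsnapshot call (one quiesce).
build_snapshot_args() {
    local snapname=$1; shift
    local args="" fileset
    for fileset in "$@"; do
        # Prepend a comma only when args already holds an earlier pair.
        args+="${args:+,}${fileset}:${snapname}"
    done
    printf '%s\n' "$args"
}

snapname=$(date --utc +@GMT-%Y.%m.%d-%H.%M.%S)
# Placeholder device and fileset names; drop the "echo" to actually run it.
echo mmcrsnapshot gpfs0 "$(build_snapshot_args "$snapname" fileset1 fileset2 fileset3)"
```

<div>Keeping the list in one place also makes it harder to drop a "$" or mistype a fileset name in one of the pairs.</div>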

<div> </div>

<div> </div>

<div><span style="font-size:13.333333015441895px">  -jf</span></div>

<div> </div></div> 


<div><div dir="ltr">On Wed, Feb 2, 2022 at 12:07 PM Jordi Caubet Serrabou <<a href="mailto:jordi.caubet@es.ibm.com" target="_blank" rel="noreferrer">jordi.caubet@es.ibm.com</a>> wrote:</div>

<blockquote style="margin:0px 0px 0px 0.8ex;border-left-width:1px;border-left-style:solid;border-left-color:rgb(204,204,204);padding-left:1ex"><div dir="ltr" style="font-family:Arial,Helvetica,sans-serif;font-size:10pt"><div dir="ltr"><div dir="ltr">Ivano,</div>

<div dir="ltr"> </div>

<div dir="ltr">If it happens frequently, I would recommend opening a support case.</div>

<div dir="ltr"> </div>

<div dir="ltr">The creation or deletion of a snapshot requires a quiesce of the nodes to obtain a consistent point-in-time image of the file system and/or to update some internal structures, afaik. Quiesce is required for nodes in the storage cluster but also in remote clusters. Quiesce means stopping activities (incl. I/O) for a short period of time to get such a consistent image. It also means waiting to flush to disk any in-flight data that would otherwise prevent a consistent point-in-time image.</div>

<div dir="ltr"> </div>

<div dir="ltr">Nodes receive a quiesce request and acknowledge when ready. When all nodes have acknowledged, the snapshot operation can proceed and I/O can resume immediately. It usually takes a few seconds at most and the operation performed is short, but the time I/O is stopped depends on how long it takes to quiesce the nodes. If some node takes longer to agree to stop its activities, that node will delay the completion of the quiesce and keep I/O paused on the rest.</div>

<div dir="ltr">There could be many reasons why some nodes delay the quiesce ack.</div>

<div dir="ltr"> </div>

<div dir="ltr">The larger the cluster, the more difficult it gets. The more network congestion or I/O load, the more difficult it gets. I recommend opening a ticket with support to try to identify which nodes do not acknowledge the quiesce, and maybe find the root cause. If I recall some previous thread correctly, the default timeout was 60 seconds, which matches your log message. After such a timeout, the snapshot is considered to have failed to complete.</div>

<div dir="ltr"> </div>

<div dir="ltr">Support might help you understand the root cause and provide some recommendations if it happens frequently.</div>

<div dir="ltr"> </div>

<div dir="ltr">Best Regards,<br>--<br>Jordi Caubet Serrabou<br>IBM Storage Client Technical Specialist (IBM Spain)</div></div>

<div dir="ltr"> </div>

<blockquote dir="ltr" style="border-left-width:2px;border-left-style:solid;border-left-color:rgb(170,170,170);margin-left:5px;padding-left:5px;direction:ltr;margin-right:0px">----- Original message -----<br>From: "Talamo Ivano Giuseppe (PSI)" <<a href="mailto:ivano.talamo@psi.ch" target="_blank" rel="noreferrer">ivano.talamo@psi.ch</a>><br>Sent by: <a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank" rel="noreferrer">gpfsug-discuss-bounces@spectrumscale.org</a><br>To: "gpfsug main discussion list" <<a href="mailto:gpfsug-discuss@spectrumscale.org" target="_blank" rel="noreferrer">gpfsug-discuss@spectrumscale.org</a>><br>Cc:<br>Subject: [EXTERNAL] Re: [gpfsug-discuss] snapshots causing filesystem quiesce<br>Date: Wed, Feb 2, 2022 11:45 AM<br> 

<div dir="ltr" id="m_-6051020369072585167gmail-m_-3359539029430753890divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif"><p>Hello Andrew,</p>

<p> </p>

<p><span style="font-family:Calibri,Helvetica,sans-serif,EmojiFont,"Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols;font-size:16px">Thanks for your questions.</span></p>

<p> </p>

<p>We're not experiencing any other issue/slowness during normal activity.</p>

<p>The storage is a Lenovo DSS appliance with a<span style="font-size:12pt"> dedicated SSD enclosure/pool for metadata only.</span></p>

<p> </p>

<p>The two NSD servers have 750 GB of RAM, of which 618 GB are configured as pagepool.</p>

<p> </p>

<p>The issue we see happens on both of the filesystems we have:</p>

<p> </p>

<p>- perf filesystem:</p>

<p> - 1.8 PB size (71% in use)</p>

<p> - 570 million inodes (24% in use)</p>

<p> </p>

<p>- tiered filesystem:</p>

<p> - 400 TB size (34% in use)</p>

<p> - 230 million files (60% in use)</p>

<p> </p>

<p>Cheers,</p>

<p>Ivano</p>

<p> </p>

<p> </p>

<div id="m_-6051020369072585167gmail-m_-3359539029430753890Signature"><div dir="ltr" id="m_-6051020369072585167gmail-m_-3359539029430753890divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif,EmojiFont,"Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols"><p> </p>

<div>__________________________________________</div>

<div>Paul Scherrer Institut</div>

<div>Ivano Talamo</div>

<div>WHGA/038</div>

<div>Forschungsstrasse 111</div>

<div>5232 Villigen PSI</div>

<div>Schweiz</div>

<div> </div>

<div>Telefon: +41 56 310 47 11</div>

<div>E-Mail: <a href="mailto:ivano.talamo@psi.ch" target="_blank" rel="noreferrer">ivano.talamo@psi.ch</a></div> 


<p> </p></div></div> 


<div style="color:rgb(0,0,0)"><hr style="display:inline-block;width:98%"><div dir="ltr" id="m_-6051020369072585167gmail-m_-3359539029430753890divRplyFwdMsg"><font style="font-size:11pt" face="Calibri, sans-serif" color="#000000"><b>From:</b> <a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank" rel="noreferrer">gpfsug-discuss-bounces@spectrumscale.org</a> <<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" target="_blank" rel="noreferrer">gpfsug-discuss-bounces@spectrumscale.org</a>> on behalf of Andrew Beattie <<a href="mailto:abeattie@au1.ibm.com" target="_blank" rel="noreferrer">abeattie@au1.ibm.com</a>><br><b>Sent:</b> Wednesday, February 2, 2022 10:33 AM<br><b>To:</b> gpfsug main discussion list<br><b>Subject:</b> Re: [gpfsug-discuss] snapshots causing filesystem quiesce</font>

<div> </div></div>

<div>Ivano,

<div> </div>

<div>How big is the filesystem in terms of number of files?</div>

<div>How big is the filesystem in terms of capacity? </div>

<div>Is the Metadata on Flash or Spinning disk? </div>

<div>Do you see issues when users do an ls of the filesystem, or only when you are doing snapshots?</div>

<div> </div>

<div>How much memory do the NSD servers have?</div>

<div>How much is allocated to the OS / Spectrum Scale pagepool?</div>

<div><br>

<div dir="ltr">Regards</div>

<div dir="ltr"> </div>

<div dir="ltr">Andrew Beattie</div>

<div dir="ltr">Technical Specialist - Storage for Big Data & AI</div>

<div dir="ltr">IBM Technology Group</div>

<div dir="ltr">IBM Australia & New Zealand</div>

<div dir="ltr">P. +61 421 337 927</div>

<div dir="ltr">E. <a href="mailto:abeattie@au1.IBM.com" target="_blank" rel="noreferrer">abeattie@au1.IBM.com</a></div>

<div dir="ltr"> </div>

<div dir="ltr"> </div>

<div dir="ltr"> 

<blockquote type="cite">On 2 Feb 2022, at 19:14, Talamo Ivano Giuseppe (PSI) <<a href="mailto:Ivano.Talamo@psi.ch" target="_blank" rel="noreferrer">Ivano.Talamo@psi.ch</a>> wrote:<br> </blockquote></div>

<blockquote type="cite"><div dir="ltr">

<div dir="ltr" id="m_-6051020369072585167gmail-m_-3359539029430753890divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif"><p> </p>

<div>Dear all,</div>

<div> </div>

<div>For a while now we have been experiencing an issue when dealing with snapshots.</div>

<div>Basically, when deleting a fileset snapshot (and maybe also when creating new ones), the filesystem becomes inaccessible on the clients for the duration of the operation, which can take a few minutes.</div>

<div> </div>

<div>The clients and the storage are on two different clusters, using remote cluster mount for the access.</div>

<div> </div>

<div>In the log files, many lines like the following appear (on both clusters):</div>

<div>Snapshot whole quiesce of SG perf from xbldssio1 on this node lasted 60166 msec</div>
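<div>A small sketch for pulling the slow quiesces out of a log, assuming lines in exactly the format shown above. The function name slow_quiesces and the 30000 msec threshold are illustrative choices, not GPFS tooling:</div>

```shell
# Hypothetical helper: reads log lines on stdin and prints "<msec> <full line>"
# for every quiesce message whose duration is at or above the threshold (msec).
slow_quiesces() {
    local threshold_ms=$1
    awk -v limit="$threshold_ms" '
        /quiesce of SG .* lasted [0-9]+ msec/ {
            # Find the word "lasted"; the next field is the duration in msec.
            for (i = 1; i < NF; i++)
                if ($i == "lasted" && $(i + 1) + 0 >= limit)
                    print $(i + 1), $0
        }'
}

# Example using the log line quoted above:
echo "Snapshot whole quiesce of SG perf from xbldssio1 on this node lasted 60166 msec" \
    | slow_quiesces 30000
```

<div>On a node this could be fed from the GPFS log, for example slow_quiesces 30000 &lt; /var/adm/ras/mmfs.log.latest, assuming that is where the log lives on your installation.</div>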

<div> </div>

<div>Looking around, I see we're not the first ones. I am wondering if that's considered an unavoidable part of snapshotting and if there's any tunable that can improve the situation, since when this occurs all the clients are stuck and users are very quick to complain.</div>

<div> </div>

<div>If it can help, the clients are running GPFS 5.1.2-1 while the storage cluster is on 5.1.1-0.</div>

<div> </div>

<div>Thanks,</div>

<div>Ivano</div>

<p> </p>

<div id="m_-6051020369072585167gmail-m_-3359539029430753890Signature"><div dir="ltr" id="m_-6051020369072585167gmail-m_-3359539029430753890divtagdefaultwrapper" style="font-size:12pt;color:rgb(0,0,0);font-family:Calibri,Helvetica,sans-serif,EmojiFont,"Apple Color Emoji","Segoe UI Emoji",NotoColorEmoji,"Segoe UI Symbol","Android Emoji",EmojiSymbols"><p> </p></div></div></div></div></blockquote></div><br> </div></div></div>

<div><font size="2" face="Default Monospace,Courier New,Courier,monospace">_______________________________________________<br>gpfsug-discuss mailing list<br>gpfsug-discuss at <a href="http://spectrumscale.org" target="_blank" rel="noreferrer">spectrumscale.org</a><br><a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss" target="_blank" rel="noreferrer">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a> </font></div></blockquote>

<div dir="ltr"> </div></div></blockquote></div>

</blockquote>

<div dir="ltr"> </div></div><br>


</blockquote></div></div>