From madhu.punjabi at in.ibm.com Mon Nov 2 08:17:23 2020 From: madhu.punjabi at in.ibm.com (Madhu P Punjabi) Date: Mon, 2 Nov 2020 08:17:23 +0000 Subject: [gpfsug-discuss] [NFS-Ganesha-Support] 'ganesha_mgr display_export - client not listed In-Reply-To: <660DD807-C723-44EF-BC51-57EFB296FFC4@id.ethz.ch> References: <660DD807-C723-44EF-BC51-57EFB296FFC4@id.ethz.ch> Message-ID: An HTML attachment was scrubbed... URL: From christian.vieser at 1und1.de Mon Nov 2 13:44:50 2020 From: christian.vieser at 1und1.de (Christian Vieser) Date: Mon, 2 Nov 2020 14:44:50 +0100 Subject: [gpfsug-discuss] Alternative to Scale S3 API. In-Reply-To: <1109480230.484366.1603799162955@privateemail.com> References: <1109480230.484366.1603799162955@privateemail.com> Message-ID: <1dffa509-1bcc-0d5a-bf79-fa82746dca07@1und1.de> Hi Andi, we suffer from the same issue. IBM support told me that Spectrum Scale 5.1 will come with a new release of the underlying Openstack components, so we still hope that some/most of limitations will vanish then. But I already know, that the new S3 policies won't be available, only the "legacy" S3 ACLs. We also tried MinIO but deemed that it's not "production ready". It's fine for quickly setting up a S3 service for development, but they release too often and with breaking changes, and documentation is lacking all aspects regarding maintenance. Regards, Christian Am 27.10.20 um 12:46 schrieb Andi Christiansen: > Hi all, > > We have over a longer period used the S3 API within spectrum Scale.. > And that has shown that it does not support very many applications > because of limitations of the API.. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmtick at us.ibm.com Tue Nov 3 00:21:43 2020 From: jmtick at us.ibm.com (Jacob M Tick) Date: Tue, 3 Nov 2020 00:21:43 +0000 Subject: [gpfsug-discuss] Use cases for file audit logging and clustered watch folder Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 3 17:00:54 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 3 Nov 2020 17:00:54 +0000 Subject: [gpfsug-discuss] SSUG::Digital Scalable multi-node training for AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale Message-ID: Apologies, looks like the calendar invite for this week?s SSUG::Digital didn?t get sent! Nvidia and IBM did a complex proof-of-concept to demonstrate the scaling of AI workload using Nvidia DGX, Red Hat OpenShift and IBM Spectrum Scale at the example of ResNet-50 and the segmentation of images using the Audi A2D2 dataset. The project team published an IBM Redpaper with all the technical details and will present the key learnings and results. >>> Join Here <<< This episode will start 15 minutes later as usual. * San Francisco, USA at 08:15 PST * New York, USA at 11:15 EST * London, United Kingdom at 16:15 GMT * Frankfurt, Germany at 17:15 CET * Pune, India at 21:45 IST -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/calendar Size: 2488 bytes Desc: not available URL: From andi at christiansen.xxx Wed Nov 4 07:14:41 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Wed, 4 Nov 2020 08:14:41 +0100 (CET) Subject: [gpfsug-discuss] Alternative to Scale S3 API. 
In-Reply-To: <1dffa509-1bcc-0d5a-bf79-fa82746dca07@1und1.de> References: <1109480230.484366.1603799162955@privateemail.com> <1dffa509-1bcc-0d5a-bf79-fa82746dca07@1und1.de> Message-ID: <1512108314.679947.1604474081488@privateemail.com> Hi Christian, Thanks for the information! My question also triggered IBM to tell me the same so i think we will stay on S3 with Scale and hoping the same with the new release.. Yes, MinIO is really lacking some good documentation.. but definatly a cool software package that i will keep an eye on in the future... Best Regards Andi Christiansen > On 11/02/2020 2:44 PM Christian Vieser wrote: > > > > Hi Andi, > > we suffer from the same issue. IBM support told me that Spectrum Scale 5.1 will come with a new release of the underlying Openstack components, so we still hope that some/most of limitations will vanish then. But I already know, that the new S3 policies won't be available, only the "legacy" S3 ACLs. > > We also tried MinIO but deemed that it's not "production ready". It's fine for quickly setting up a S3 service for development, but they release too often and with breaking changes, and documentation is lacking all aspects regarding maintenance. > > Regards, > > Christian > > Am 27.10.20 um 12:46 schrieb Andi Christiansen: > > > > Hi all, > > > > We have over a longer period used the S3 API within spectrum Scale.. And that has shown that it does not support very many applications because of limitations of the API.. > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joe at excelero.com Wed Nov 4 12:19:07 2020 From: joe at excelero.com (joe at excelero.com) Date: Wed, 4 Nov 2020 06:19:07 -0600 Subject: [gpfsug-discuss] Accepted: gpfsug-discuss Digest, Vol 106, Issue 3 Message-ID: <924bb673-0b2a-420a-8ce2-be24c5e6e4e8@Spark> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: reply.ics Type: application/ics Size: 0 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Wed Nov 4 16:05:50 2020 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Wed, 4 Nov 2020 16:05:50 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 106, Issue 3 In-Reply-To: References: Message-ID: Could someone share the password for the event today? Thanks! Thanks, Oluwasijibomi (Siji) Saula HPC Systems Administrator / Information Technology Research 2 Building 220B / Fargo ND 58108-6050 p: 701.231.7749 / www.ndsu.edu [cid:image001.gif at 01D57DE0.91C300C0] ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Wednesday, November 4, 2020 6:00 AM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 106, Issue 3 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. 
SSUG::Digital Scalable multi-node training for AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale (Simon Thompson) 2. Re: Alternative to Scale S3 API. (Andi Christiansen) ---------------------------------------------------------------------- Message: 1 Date: Tue, 3 Nov 2020 17:00:54 +0000 From: Simon Thompson To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] SSUG::Digital Scalable multi-node training for AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale Message-ID: Content-Type: text/plain; charset="utf-8" Apologies, looks like the calendar invite for this week?s SSUG::Digital didn?t get sent! Nvidia and IBM did a complex proof-of-concept to demonstrate the scaling of AI workload using Nvidia DGX, Red Hat OpenShift and IBM Spectrum Scale at the example of ResNet-50 and the segmentation of images using the Audi A2D2 dataset. The project team published an IBM Redpaper with all the technical details and will present the key learnings and results. >>> Join Here <<< This episode will start 15 minutes later as usual. * San Francisco, USA at 08:15 PST * New York, USA at 11:15 EST * London, United Kingdom at 16:15 GMT * Frankfurt, Germany at 17:15 CET * Pune, India at 21:45 IST -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/calendar Size: 2488 bytes Desc: not available URL: ------------------------------ Message: 2 Date: Wed, 4 Nov 2020 08:14:41 +0100 (CET) From: Andi Christiansen To: gpfsug main discussion list , Christian Vieser Subject: Re: [gpfsug-discuss] Alternative to Scale S3 API. Message-ID: <1512108314.679947.1604474081488 at privateemail.com> Content-Type: text/plain; charset="utf-8" Hi Christian, Thanks for the information! My question also triggered IBM to tell me the same so i think we will stay on S3 with Scale and hoping the same with the new release.. Yes, MinIO is really lacking some good documentation.. but definatly a cool software package that i will keep an eye on in the future... Best Regards Andi Christiansen > On 11/02/2020 2:44 PM Christian Vieser wrote: > > > > Hi Andi, > > we suffer from the same issue. IBM support told me that Spectrum Scale 5.1 will come with a new release of the underlying Openstack components, so we still hope that some/most of limitations will vanish then. But I already know, that the new S3 policies won't be available, only the "legacy" S3 ACLs. > > We also tried MinIO but deemed that it's not "production ready". It's fine for quickly setting up a S3 service for development, but they release too often and with breaking changes, and documentation is lacking all aspects regarding maintenance. > > Regards, > > Christian > > Am 27.10.20 um 12:46 schrieb Andi Christiansen: > > > > Hi all, > > > > We have over a longer period used the S3 API within spectrum Scale.. And that has shown that it does not support very many applications because of limitations of the API.. > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 106, Issue 3 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From herrmann at sprintmail.com Sat Nov 7 21:10:36 2020 From: herrmann at sprintmail.com (Ron H) Date: Sat, 7 Nov 2020 16:10:36 -0500 Subject: [gpfsug-discuss] Use cases for file audit logging and clusteredwatch folder In-Reply-To: References: Message-ID: <8F771847BDEB4447919D30A16FE48FAB@rone8PC> Hi Jacob, Can you point me to a good overview of each of these features? I know File Audit and Watch is part of the DME Scale edition license, but I can?t seem to find a good explanation of what these features can offer. Thanks Ron From: Jacob M Tick Sent: Monday, November 02, 2020 7:21 PM To: gpfsug-discuss at spectrumscale.org Cc: April Brown Subject: [gpfsug-discuss] Use cases for file audit logging and clusteredwatch folder Hi All, I am reaching out on behalf of the Spectrum Scale development team to get some insight on how our customers are using the file audit logging and the clustered watch folder features. If you have it enabled in your test or production environment, could you please elaborate on how and why you are using the feature? Also, knowing how you have the function configured (ie: watching or auditing for certain events, only enabling on certain filesets, ect..) would help us out. Please respond back to April, John (both on CC), and I with any info you are willing to provide. Thanks in advance! Regards, Jake Tick Manager Spectrum Scale - Scalable Data Interfaces IBM Systems Group Email:jmtick at us.ibm.com IBM -------------------------------------------------------------------------------- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmtick at us.ibm.com Mon Nov 9 17:31:00 2020 From: jmtick at us.ibm.com (Jacob M Tick) Date: Mon, 9 Nov 2020 17:31:00 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Use_cases_for_file_audit_logging_and?= =?utf-8?q?=09clusteredwatch_folder?= In-Reply-To: <8F771847BDEB4447919D30A16FE48FAB@rone8PC> References: <8F771847BDEB4447919D30A16FE48FAB@rone8PC>, Message-ID: An HTML attachment was scrubbed... URL: From Kamil.Czauz at Squarepoint-Capital.com Wed Nov 11 22:29:31 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Wed, 11 Nov 2020 22:29:31 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Message-ID: We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. 
Is there something similar to iotop or a trace that I can enable that can tell me what files/processes are being heavily used by the mmfsd process on the client?

-Kamil

Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From UWEFALKE at de.ibm.com Thu Nov 12 01:56:46 2020
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Thu, 12 Nov 2020 02:56:46 +0100
Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
In-Reply-To:
References:
Message-ID:

Hi, Kamil,

I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes.

In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client:

   -> /var/adm/ras/mmfs.log.latest
   mmdiag --waiters

That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example).

GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) IO trace can be achieved by

   mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N

then, when the issue is seen, stop the trace by

   mmtracectl --stop -N

Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.* (usually in /tmp/mmfs, check the command output).

There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. Example:

   0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150

-> inode is 248415

There is a utility, tsfindinode, to translate that into the file path. You need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i

For the IO trace analysis there is an older tool: /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README).

Hope that helps a bit.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr.
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From luis.bolinches at fi.ibm.com Thu Nov 12 13:19:05 2020 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 12 Nov 2020 13:19:05 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu_?= =?utf-8?q?usage_of=09mmfsd_process?= In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From jyyum at kr.ibm.com Thu Nov 12 14:10:17 2020 From: jyyum at kr.ibm.com (Jae Yoon Yum) Date: Thu, 12 Nov 2020 14:10:17 +0000 Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163322.png Type: image/png Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163323.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: From Eric.Wendel at ibm.com Thu Nov 12 15:43:46 2020 From: Eric.Wendel at ibm.com (Eric Wendel - Eric.Wendel@ibm.com) Date: Thu, 12 Nov 2020 15:43:46 +0000 Subject: [gpfsug-discuss] Problems reading emails to the mailing list Message-ID: <31233620a4324240885aed7ad18a729a@ibm.com> Hi Folks, As you are no doubt aware, Lotus Notes and its ecosystem is virtually extinct. 
For those of us who have moved on to more modern email clients (including an increasing number of IBMERs like me), the email links we receive from SSUG (for example) 'OF0433B7F4.580A7B75-ON0025861E.004DD432-0025861E.004DD8A4 at notes.na.collabserv.com are useless because they can only be read if you have the Notes client installed. This is especially problematic for Linux users as the Linux client for Notes is discontinued. It would be very helpful if the SSUG could move to a modern email platform. Thanks, Eric Wendel eric.wendel at ibm.com -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of gpfsug-discuss-request at spectrumscale.org Sent: Thursday, November 12, 2020 8:10 AM To: gpfsug-discuss at spectrumscale.org Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 106, Issue 8 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Poor client performance with high cpu usage of mmfsd process (Luis Bolinches) 2. Question about the Clearing Spectrum Scale GUI event (Jae Yoon Yum) ---------------------------------------------------------------------- Message: 1 Date: Thu, 12 Nov 2020 13:19:05 +0000 From: "Luis Bolinches" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Thu, 12 Nov 2020 14:10:17 +0000 From: "Jae Yoon Yum" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163322.png Type: image/png Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163323.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 106, Issue 8 ********************************************** From stefan.roth at de.ibm.com Thu Nov 12 17:13:38 2020 From: stefan.roth at de.ibm.com (Stefan Roth) Date: Thu, 12 Nov 2020 18:13:38 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Question_about_the_Clearing_Spectrum_S?= =?utf-8?q?cale_GUI=09event?= In-Reply-To: References: Message-ID: Hello Jay, as long as those errors are still shown by "mmhealth node show" CLI command, they will again appear in the GUI. In the GUI events table you can show an "Event Type" column which is hidden by default. Events that have event type "Notice" can be cleared by the "Mark as Read" action. Events that have event type "State" can not be cleared by the "Mark as Read" action. They have to disappear by solving the problem. 
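As a rough illustration (only a sketch pulling together commands already mentioned in this thread, not an official procedure), the health state that feeds the GUI can be checked from the CLI before trying to clear anything:

   # Check what the health monitor itself reports, per node and cluster-wide
   mmhealth node show
   mmhealth cluster show

   # If the CLI is clean but the GUI still lists old events, reset the GUI event list
   /usr/lpp/mmfs/gui/cli/lshealth --reset
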
If a problem is solved the error should disappear from "mmhealth node show" and after that it will disappear from the GUI as well. Mit freundlichen Gr??en / Kind regards Stefan Roth Spectrum Scale Developement Phone: +49 162 4159934 IBM Deutschland Research & Development GmbH Email: stefan.roth at de.ibm.com Am Weiher 24 65451 Kelsterbach IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Jae Yoon Yum" To: gpfsug-discuss at spectrumscale.org Date: 12.11.2020 15:10 Subject: [EXTERNAL] [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? events on the Scale GUI. As you know, there is ?mark as read? for ?warning? messages but there isn?t for ?ERROR?. (In fact, the ?mark as read? button is exist but it does not work.) So I sent him to run this command on cli. /usr/lpp/mmfs/gui/cli/lshealth --reset On my test VM, all of the error messages has been cleared when I run the command?. But, for the client?s system, client said that ?All of the error / warning messages had been appeared again include the one which I had delete by clicking ?mark as read?.? Does anyone who has similar experience like this? and How Could I solve this problem? Or, Is there any way to clear the event one by one? * I sent the same message to the Slack 'scale-help' channel. Thanks. Jay. Best Regards, JaeYoon(Jay) IBM Korea, Three IFC, Yum 10 Gukjegeumyung-ro, Yeongdeungpo-gu, IBM Systems Seoul, Korea Hardware, Storage Technical Sales Mobile : +82-10-4995-4814 07326 e-mail: jyyum at kr.ibm.com ? ??? ??? ??, ??? ??, ???? ???? ???? ?????. ??? ??? ??? ??? ????, ????? ??? ?? ?????,??IBM ??? ? ?? ??? ????, ????? ???? ????. (If you don't wish to receive e-mail from sender, please send e-mail directly. For IBM e-mail, please click here). ??? ???? ??,??, ?? ??? ???????(??: 02-3781-7800,? ?? mktg at kr.ibm.com )? ?? ?? ???? ?? ???? ? ????. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E506389.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E764757.gif Type: image/gif Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E982001.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: From arc at b4restore.com Thu Nov 12 17:33:01 2020 From: arc at b4restore.com (=?utf-8?B?QW5kaSBOw7hyIENocmlzdGlhbnNlbg==?=) Date: Thu, 12 Nov 2020 17:33:01 +0000 Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event In-Reply-To: References: Message-ID: Hi Jay, First of you need to make sure your system is actually healthy. 
Events that are not fixed will reappear. I have had a lot of ?stale? entries happening over the last years and more often than not ?/usr/lpp/mmfs/gui/cli/lshealth ?reset? clears the entries if they are not actual faults.. As Stefan says if the errors/warnings are shown in ?mmhealth node show or mmhealth cluster show? they will reappear as they should. (I have sometimes seen stale entries there aswell) When I have encountered stale entries which wasn?t cleared with ?lshealth ?reset? I could clear them with ?mmsysmoncontrol restart?. I think I actually run that command maybe once or twice every month because of stale entries in the GUI og mmhealth itself.. don?t know why they happen but they seem to appear more frequently for me atleast.. I have high hopes for the 5.1.0.0/5.1.0.1 release as I have heard there should be some new things for the GUI as well.. not sure what they are yet though 😊 Hope this helps. Cheers A. Christiansen Fra: gpfsug-discuss-bounces at spectrumscale.org P? vegne af Stefan Roth Sendt: Thursday, November 12, 2020 6:14 PM Til: gpfsug main discussion list Emne: Re: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Hello Jay, as long as those errors are still shown by "mmhealth node show" CLI command, they will again appear in the GUI. In the GUI events table you can show an "Event Type" column which is hidden by default. Events that have event type "Notice" can be cleared by the "Mark as Read" action. Events that have event type "State" can not be cleared by the "Mark as Read" action. They have to disappear by solving the problem. If a problem is solved the error should disappear from "mmhealth node show" and after that it will disappear from the GUI as well. Mit freundlichen Gr??en / Kind regards Stefan Roth Spectrum Scale Developement ________________________________ Phone: +49 162 4159934 IBM Deutschland Research & Development GmbH [cid:image002.gif at 01D6B922.3FE99E70] Email: stefan.roth at de.ibm.com Am Weiher 24 65451 Kelsterbach ________________________________ IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 [cid:image003.gif at 01D6B922.3FE99E70]"Jae Yoon Yum" ---12.11.2020 15:10:35---Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? ev From: "Jae Yoon Yum" > To: gpfsug-discuss at spectrumscale.org Date: 12.11.2020 15:10 Subject: [EXTERNAL] [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? events on the Scale GUI. As you know, there is ?mark as read? for ?warning? messages but there isn?t for ?ERROR?. (In fact, the ?mark as read? button is exist but it does not work.) So I sent him to run this command on cli. /usr/lpp/mmfs/gui/cli/lshealth --reset On my test VM, all of the error messages has been cleared when I run the command?. But, for the client?s system, client said that ?All of the error / warning messages had been appeared again include the one which I had delete by clicking ?mark as read?.? Does anyone who has similar experience like this? and How Could I solve this problem? Or, Is there any way to clear the event one by one? * I sent the same message to the Slack 'scale-help' channel. Thanks. Jay. 
Best Regards, JaeYoon(Jay) Yum IBM Korea, Three IFC, [cid:image005.jpg at 01D6B922.3FE99E70] 10 Gukjegeumyung-ro, Yeongdeungpo-gu, IBM Systems Hardware, Storage Technical Sales Seoul, Korea Mobile : +82-10-4995-4814 07326 e-mail: jyyum at kr.ibm.com ? ??? ??? ??, ??? ??, ???? ???? ???? ?????. ??? ??? ??? ??? ????, ????? ??? ?? ?????,??IBM ??? ??? ??? ????, ????? ???? ????. (If you don't wish to receive e-mail from sender, please send e-mail directly. For IBM e-mail, please click here). ??? ???? ??,??, ?? ??? ???????(??: 02-3781-7800,??? mktg at kr.ibm.com )? ?? ?? ???? ?? ???? ? ????. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 1851 bytes Desc: image002.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 105 bytes Desc: image003.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image005.jpg Type: image/jpeg Size: 2457 bytes Desc: image005.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image006.png Type: image/png Size: 166 bytes Desc: image006.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image007.png Type: image/png Size: 616 bytes Desc: image007.png URL: From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 13 02:33:17 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 13 Nov 2020 02:33:17 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? 
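For reference, this is roughly the sequence I used to capture and summarize the traces (a sketch based on your earlier instructions; the node name and the trcrpt file name below are placeholders):

   # low-level IO trace on the affected client, started before / stopped after the hang
   mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N <client>
   mmtracectl --stop -N <client>

   # summarize the resulting ASCII trace report
   awk -f /usr/lpp/mmfs/samples/debugtools/trsum.awk details=0 /tmp/mmfs/trcrpt.<timestamp>

The two captures below are the summaries this produced.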
Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 
61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 
0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From UWEFALKE at de.ibm.com Fri Nov 13 09:21:17 2020
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Fri, 13 Nov 2020 10:21:17 +0100
Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
In-Reply-To:
References:
Message-ID:

Hi, Kamil,

looks like your trace file setting has been too low:

   all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here
   trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here

means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. Try running with something like

   mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N .

However, if you say "no major waiter" - how many waiters did you see at any time? What kind of waiters were the oldest, and how long had they waited? It could indeed well be that some job is just creating a killer workload. The very short cycle time of the trace points, OTOH, to high activity; OTOH the trace file setting appears quite low (trace=io doesn't collect many trace infos, just basic IO stuff).

If I might ask: what version of GPFS are you running?

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services
IBM Data Privacy Statement
IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122

From: "Czauz, Kamil"
To: gpfsug main discussion list
Date: 13/11/2020 03:33
Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi Uwe -

I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart.

The beginning of the traces look something like this:

Overwrite trace parameters:
buffer size: 134217728
64 kernel trace streams, indices 0-63 (selected by low bits of processor ID)
128 daemon trace streams, indices 64-191 (selected by low bits of thread ID)
Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles
Measured cycle count update rate to be 2599997559 per second <---- using this value
OS reported cycle count update rate as 2599999000 per second
Trace milestones:
kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220)
daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152)
all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here
trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here

Approximate number of times the trace buffer was filled: 553.529

Here is the output of trsum.awk details=0. I'm not quite sure what to make of it, can you help me decipher it?
The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 
0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 
2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Fri Nov 13 09:37:04 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 13 Nov 2020 10:37:04 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu_?= =?utf-8?q?usage=09of=09mmfsd_process?= In-Reply-To: References: Message-ID: Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? 
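One rough way to condense the "User thread stats" sections in the captures below is to sort them by the GPFS-time column; this is only a sketch, assuming the trsum.awk output was saved to a file (trsum.out is a made-up name) and that each thread occupies one line with the six columns shown:

  # print the per-thread lines after the "User thread stats" header,
  # sorted by GPFS time (column 2), largest first
  awk '/User thread stats/ {found=1; next} found && NF == 6 && $1 ~ /^[0-9]+$/' trsum.out | sort -k2,2 -gr | head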
Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 
61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 
0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 13 13:31:21 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 13 Nov 2020 13:31:21 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Hi Uwe - Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s). We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? 
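For completeness, the capture cycle behind traces like the ones in this thread, assembled from the commands quoted above (the node name client-node-01 is only a placeholder and the 20-second window is an arbitrary example):

  # start a cyclic I/O trace with the larger buffer suggested earlier
  mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N client-node-01

  # keep the trace running while the problem is visible, e.g. about 20 seconds
  sleep 20

  # stop the trace; the ASCII trcrpt.* report is then written, usually under /tmp/mmfs
  mmtracectl --stop -N client-node-01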
Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 
4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
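As a minimal first-look sketch of the steps just named (the log path and the mmdiag flags are the ones quoted above; the tail length and the repeat interval are arbitrary choices, not anything GPFS prescribes):

  # recent GPFS log entries on the affected client
  tail -n 100 /var/adm/ras/mmfs.log.latest

  # currently waiting threads; repeat a few times to see whether the same waiters persist
  for i in 1 2 3; do mmdiag --waiters; sleep 5; done

  # most recent I/Os with their service times and sizes
  mmdiag --iohist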
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Fri Nov 13 13:38:48 2020 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 13 Nov 2020 13:38:48 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu?= =?utf-8?q?=09usage=09of=09mmfsd_process?= In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From kkr at lbl.gov Fri Nov 13 21:11:16 2020 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Fri, 13 Nov 2020 13:11:16 -0800 Subject: [gpfsug-discuss] REMINDER - SC20 Sessions - Monday Nov. 16 and Wednesday Nov. 18 Message-ID: <7B85E526-88D4-44AE-B034-4EC5A61E524C@lbl.gov> Hi all, A Reminder to attend and also submit any panel questions for the Wednesday session. So far, there are 3 questions around these topics: 1) excessive prefetch when reading small fractions of many large files 2) improved the integration between TSM and GPFS 3) number of security vulnerabilities in GPFS, the GUI, ESS, or something else related Bring on your tough questions and make it interesting. 
Cheers, Kristy ?original email--- The Spectrum Scale User Group will be hosting two 90 minute sessions at SC20 this year and we hope you can join us. The first one is: "Storage for AI" and will be held Monday, Nov. 16th, from 11:00-12:30 EST and the second one is "What's new in Spectrum Scale 5.1?" and will be held Wednesday, Nov. 18th from 11:00-12:30 EST. Please see the calendar at https://www.spectrumscaleug.org/eventslist/2020-11/ and register by clicking on a session on the calendar and then the "Please register here to join the session" link. Best, Kristy Kristy Kallback-Rose Senior HPC Storage Systems Analyst National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory From UWEFALKE at de.ibm.com Mon Nov 16 13:45:57 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Mon, 16 Nov 2020 14:45:57 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu?= =?utf-8?q?=09usage=09of=09mmfsd_process?= In-Reply-To: References: Message-ID: Hi, while the other nodes can well block the local one, as Frederick suggests, there should at least be something visible locally waiting for these other nodes. Looking at all waiters might be a good thing, but this case looks strange in other ways. Mind statement there are almost no local waiters and none of them gets older than 10 ms. I am no developer nor do I have the code, so don't expect too much. Can you tell what lookups you see (check in the trcrpt file, could be like gpfs_i_lookup or gpfs_v_lookup)? Lookups are metadata ops, do you have a separate pool for your metadata? How is that pool set up (doen to the physical block devices)? Your trcsum down revealed 36 lookups, each one on avg taking >30ms. That is a lot (albeit the respective waiters won't show up at first glance as suspicious ...). So, which waiters did you see (hope you saved them, if not, do it next time). What are the node you see this on and the whole cluster used for? What is the MaxFilesToCache setting (for that node and for others)? what HW is that, how big are your nodes (memory,CPU)? To check the unreasonably short trace capture time: how large are the trcrpt files you obtain? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 14:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s). 
We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 
Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. 
That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used, which might slow down things, for example).

GPFS has a nice tracing facility which you can configure, or just run the default trace. Running a dedicated (low-level) IO trace can be achieved by
mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N 
then, when the issue is seen, stop the trace by
mmtracectl --stop -N 
Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.* (usually in /tmp/mmfs, check the command output).

There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. Example:
0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150
-> inode is 248415

There is a utility, tsfindinode, to translate that into the file path. You need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make, then run ./tsfindinode -i 

For the IO trace analysis there is an older tool: /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README).

Hope that helps a bit.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services
IBM Data Privacy Statement
IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122

From: "Czauz, Kamil"
To: "gpfsug-discuss at spectrumscale.org"
Date: 11/11/2020 23:36
Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
Sent by: gpfsug-discuss-bounces at spectrumscale.org

We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like an ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu.

The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one.

My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes are being heavily used by the mmfsd process on the client?

-Kamil
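For reference, the capture-and-analyse sequence Uwe describes above can be run roughly as follows. This is only a minimal sketch: the node name, mount point and report file names are placeholders, and the exact tsfindinode arguments and trsum.awk invocation are assumptions that should be checked against the READMEs in the samples directories before relying on them.

    # start a low-level IO trace on the affected client
    mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N client01

    # ... reproduce or wait for the hang, then stop promptly so the
    # cyclic buffer still holds the interesting window
    mmtracectl --stop -N client01

    # the ASCII report normally lands in /tmp/mmfs as trcrpt.*
    ls -lh /tmp/mmfs/trcrpt.*

    # summarise the trace; details=0 gives the compact summary seen in this thread
    awk -f /usr/lpp/mmfs/samples/debugtools/trsum.awk details=0 /tmp/mmfs/trcrpt.<date>.<node>

    # pull out the FIO records and read the inode number after the "tag" keyword
    grep 'FIO' /tmp/mmfs/trcrpt.<date>.<node> | head

    # build tsfindinode once, then map an inode (e.g. 248415) back to a path
    cd /usr/lpp/mmfs/samples/util && make
    ./tsfindinode -i 248415 /gpfs/fs1   # assumed argument order - check the usage message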
From andi at christiansen.xxx  Mon Nov 16 19:44:14 2020
From: andi at christiansen.xxx (Andi Christiansen)
Date: Mon, 16 Nov 2020 20:44:14 +0100 (CET)
Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS?
Message-ID: <1388247256.209171.1605555854969@privateemail.com>

Hi all,

I have got a case where a customer wants 700TB migrated from Isilon to Scale, and the only way for him is exporting the same directory on NFS from two different nodes...

As of now we are using multiple rsync processes on different parts of folders within the main directory. This is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2..

Does anyone know of a way to speed it up? Right now we see from 1Gbit to 3Gbit if we are lucky (total bandwidth), and there is a total of 30Gbit from the Scale nodes and 20Gbit from the Isilon, so we should be able to reach just under 20Gbit...

If anyone has any ideas they are welcome!
Thanks in advance Andi Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Mon Nov 16 21:44:30 2020 From: stockf at us.ibm.com (Frederick Stock) Date: Mon, 16 Nov 2020 21:44:30 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: An HTML attachment was scrubbed... URL: From skylar2 at uw.edu Mon Nov 16 21:58:19 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Mon, 16 Nov 2020 13:58:19 -0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <20201116215819.wda6nophekamzs3v@thargelion> When we did a similar (though larger, at ~2.5PB) migration, we used rsync as well, but ran one rsync process per Isilon node, and made sure the NFS clients were hitting separate Isilon nodes for their reads. We also didn't have more than one rsync process running per client, as the Linux NFS client (at least in CentOS 6) was terrible when it came to concurrent access. Whatever method you end up using, I can guarantee you will be much happier once you are on GPFS. :) On Mon, Nov 16, 2020 at 08:44:14PM +0100, Andi Christiansen wrote: > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... > > as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From jonathan.buzzard at strath.ac.uk Mon Nov 16 22:58:49 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 16 Nov 2020 22:58:49 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <4de1fa02-a074-0901-cf12-31be9e843f5f@strath.ac.uk> On 16/11/2020 19:44, Andi Christiansen wrote: > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS > from two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching > from 2.. > > does anyone know of a way to speed it up? 
right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit > from scale nodes and 20Gbits from isilon so we should be able to reach > just under 20Gbit... > > > if anyone have any ideas they are welcome! > My biggest recommendation when doing this is to use a sqlite database to keep track of what is going on. The main issue is that you are almost certainly going to need to do more than one rsync pass unless your source Isilon system has no user activity, and with 700TB to move that seems unlikely. Typically you do an initial rsync to move the bulk of the data while the users are still live, then shutdown user access to the source system and do the final rsync which hopefully has a significantly smaller amount of data to actually move. So this is what I have done on a number of occasions now. I create a very simple sqlite DB with a list of source and destination folders and a status code. Initially the status code is set to -1. Then I have a perl script which looks at the sqlite DB, picks a row with a status code of -1, and sets the status code to -2, aka that directory is in progress. It then proceeds to run the rsync and when it finishes it updates the status code to the exit code of the rsync process. As long as all the rsync processes have access to the same copy of the sqlite DB (simplest to put it on either the source or destination file system) then all is good. You can fire off multiple rsync's on multiple nodes and they will all keep churning away till there is no more work to be done. The advantage is you can easily interrogate the DB to find out the state of play. That is how many of your transfers have completed, how many are yet to be done, which ones are currently being transferred etc. without logging onto multiple nodes. *MOST* importantly you can see if any of the rsync's had an error, by simply looking for status codes greater than zero. I cannot stress how important this is. Noting that if the source is still active you will see errors down to files being deleted on the source file system before rsync has a chance to copy them. However this has a specific exit code (24) so is easy to spot and not worry about. Finally it is also very simple to set the status codes to -1 again and set the process away again. So the final run is easier to do. If you want to mail me off list I can dig out a copy of the perl code I used if your interested. There are several version as I have tended to tailor to each transfer. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From jonathan.buzzard at strath.ac.uk Mon Nov 16 23:12:47 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 16 Nov 2020 23:12:47 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <20201116215819.wda6nophekamzs3v@thargelion> References: <1388247256.209171.1605555854969@privateemail.com> <20201116215819.wda6nophekamzs3v@thargelion> Message-ID: <8d4d2987-77dd-e3e1-1c98-a635f1b96ddd@strath.ac.uk> On 16/11/2020 21:58, Skylar Thompson wrote: > When we did a similar (though larger, at ~2.5PB) migration, we used rsync > as well, but ran one rsync process per Isilon node, and made sure the NFS > clients were hitting separate Isilon nodes for their reads. 
We also didn't > have more than one rsync process running per client, as the Linux NFS > client (at least in CentOS 6) was terrible when it came to concurrent access. > The million dollar question IMHO is the number of files and their sizes. Basically if you have a million 1KB files to move it is going to take much longer than a 100 1GB files. That is the overhead of dealing with each file is a real bitch and kills your attainable transfer speed stone dead. One option I have used in the past is to use your last backup and restore to the new system, then rsync in the changes. That way you don't impact the source file system which is live. Another option I have used is to inform users in advance that data will be transferred based on a metric of how many files and how much data they have. So the less data and fewer files the quicker you will get access to the new system once access to the old system is turned off. It is amazing how much users clear up junk under this scenario. Last time I did this a single user went from over 17 million files to 11 thousand! In total many many TB of data just vanished from the system (around half of the data when puff) as users actually got around to some house keeping LOL. Moving less data and files is always less painful. > Whatever method you end up using, I can guarantee you will be much happier > once you are on GPFS. :) > Goes without saying :-) JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From UWEFALKE at de.ibm.com Tue Nov 17 08:50:56 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 17 Nov 2020 09:50:56 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Andi, what about leaving NFS completeley out and using rsync (multiple rsyncs in parallel, of course) directly between your source and target servers? I am not sure how many TCP connections (suppose it is NFS4) in parallel are opened between client and server, using a 2x bonded interface well requires at least two. That combined with the DB approach suggested by Jonathan to control the activity of the rsync streams would be my best guess. If you have many small files, the overhead might still kill you. Tarring them up into larger aggregates for transfer would help a lot, but then you must be sure they won't change or you need to implement your own version control for that class of files. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Andi Christiansen To: "gpfsug-discuss at spectrumscale.org" Date: 16/11/2020 20:44 Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... 
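As a rough illustration of the tarring idea (host names and paths here are invented, and as said this is only safe for subtrees that will not change during the copy):

    # aggregate many small files into one stream instead of per-file NFS round trips,
    # assuming the source tree is visible on this host and a Scale node is reachable by ssh
    tar -C /mnt/isilon/projectA -cf - . | ssh scale-node01 'tar -C /gpfs/fs1/projectA -xpf -'

    # recent GNU tar builds can also carry ACLs and extended attributes, where supported on both ends
    tar -C /mnt/isilon/projectA --acls --xattrs -cf - . | ssh scale-node01 'tar -C /gpfs/fs1/projectA --acls --xattrs -xpf -'

Each stream like this still has to be driven per directory, so it combines naturally with the status-tracking approach suggested earlier in the thread.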
as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Tue Nov 17 08:57:07 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 17 Nov 2020 09:57:07 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but it is over two nodes, so no bonding. But still, I'd expect to open several TCP connections in parallel per source-target pair (like with several rsyncs per source node) would bear an advantage (and still I thing NFS doesn't do that, but I can be wrong). If more nodes have access to the Isilon data they could also participate (and don't need NFS exports for that). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 17/11/2020 09:50 Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Hi Andi, what about leaving NFS completeley out and using rsync (multiple rsyncs in parallel, of course) directly between your source and target servers? I am not sure how many TCP connections (suppose it is NFS4) in parallel are opened between client and server, using a 2x bonded interface well requires at least two. That combined with the DB approach suggested by Jonathan to control the activity of the rsync streams would be my best guess. If you have many small files, the overhead might still kill you. Tarring them up into larger aggregates for transfer would help a lot, but then you must be sure they won't change or you need to implement your own version control for that class of files. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Andi Christiansen To: "gpfsug-discuss at spectrumscale.org" Date: 16/11/2020 20:44 Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From andi at christiansen.xxx Tue Nov 17 11:51:58 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 12:51:58 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <616234716.258600.1605613918767@privateemail.com> Hi all, thanks for all the information, there was some interesting things amount it.. I kept on going with rsync and ended up making a file with all top level user directories and splitting them into chunks of 347 per rsync session(total 42000 ish folders). yesterday we had only 14 sessions with 3000 folders in each and that was too much work for one rsync session.. i divided them out among all GPFS nodes to have them fetch an area each and actually doing that 3 times on each node and that has now boosted the bandwidth usage from 3Gbit to around 16Gbit in total.. all nodes have been seing doing work above 7Gbit individual which is actually near to what i was expecting without any modifications to the NFS server or TCP tuning.. CPU is around 30-50% on each server and mostly below or around 30% so it seems like it could have handled abit more sessions.. Small files are really a killer but with all 96+ sessions we have now its not often all sessions are handling small files at the same time so we have an average of about 10-12Gbit bandwidth usage. Thanks all! ill keep you in mind if for some reason we see it slowing down again but for now i think we will try to see if it will go the last mile with a bit more sessions on each :) Best Regards Andi Christiansen > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but > it is over two nodes, so no bonding. But still, I'd expect to open several > TCP connections in parallel per source-target pair (like with several > rsyncs per source node) would bear an advantage (and still I thing NFS > doesn't do that, but I can be wrong). > If more nodes have access to the Isilon data they could also participate > (and don't need NFS exports for that). > > Mit freundlichen Gr??en / Kind regards > > Dr. Uwe Falke > IT Specialist > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > Services > +49 175 575 2877 Mobile > Rathausstr. 
7, 09111 Chemnitz, Germany > uwefalke at de.ibm.com > > IBM Services > > IBM Data Privacy Statement > > IBM Deutschland Business & Technology Services GmbH > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > Sitz der Gesellschaft: Ehningen > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > From: Uwe Falke/Germany/IBM > To: gpfsug main discussion list > Date: 17/11/2020 09:50 > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > from Isilon to Scale over NFS? > > > Hi Andi, > > what about leaving NFS completeley out and using rsync (multiple rsyncs > in parallel, of course) directly between your source and target servers? > I am not sure how many TCP connections (suppose it is NFS4) in parallel > are opened between client and server, using a 2x bonded interface well > requires at least two. That combined with the DB approach suggested by > Jonathan to control the activity of the rsync streams would be my best > guess. > If you have many small files, the overhead might still kill you. Tarring > them up into larger aggregates for transfer would help a lot, but then you > must be sure they won't change or you need to implement your own version > control for that class of files. > > Mit freundlichen Gr??en / Kind regards > > Dr. Uwe Falke > IT Specialist > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > Services > +49 175 575 2877 Mobile > Rathausstr. 7, 09111 Chemnitz, Germany > uwefalke at de.ibm.com > > IBM Services > > IBM Data Privacy Statement > > IBM Deutschland Business & Technology Services GmbH > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > Sitz der Gesellschaft: Ehningen > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > From: Andi Christiansen > To: "gpfsug-discuss at spectrumscale.org" > > Date: 16/11/2020 20:44 > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > Isilon to Scale over NFS? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS from > two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching from > 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > scale nodes and 20Gbits from isilon so we should be able to reach just > under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Tue Nov 17 12:07:30 2020 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 17 Nov 2020 13:07:30 +0100 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <616234716.258600.1605613918767@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: Nice to see it working well! But, what about ACLs? 
Does you rsync pull in all needed metadata, or do you also need to sync ACLs ? Any plans for how to solve that ? On Tue, Nov 17, 2020 at 12:52 PM Andi Christiansen wrote: > Hi all, > > thanks for all the information, there was some interesting things amount > it.. > > I kept on going with rsync and ended up making a file with all top level > user directories and splitting them into chunks of 347 per rsync > session(total 42000 ish folders). yesterday we had only 14 sessions with > 3000 folders in each and that was too much work for one rsync session.. > > i divided them out among all GPFS nodes to have them fetch an area each > and actually doing that 3 times on each node and that has now boosted the > bandwidth usage from 3Gbit to around 16Gbit in total.. > > all nodes have been seing doing work above 7Gbit individual which is > actually near to what i was expecting without any modifications to the NFS > server or TCP tuning.. > > CPU is around 30-50% on each server and mostly below or around 30% so it > seems like it could have handled abit more sessions.. > > Small files are really a killer but with all 96+ sessions we have now its > not often all sessions are handling small files at the same time so we have > an average of about 10-12Gbit bandwidth usage. > > Thanks all! ill keep you in mind if for some reason we see it slowing down > again but for now i think we will try to see if it will go the last mile > with a bit more sessions on each :) > > Best Regards > Andi Christiansen > > > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, > but > > it is over two nodes, so no bonding. But still, I'd expect to open > several > > TCP connections in parallel per source-target pair (like with several > > rsyncs per source node) would bear an advantage (and still I thing NFS > > doesn't do that, but I can be wrong). > > If more nodes have access to the Isilon data they could also participate > > (and don't need NFS exports for that). > > > > Mit freundlichen Gr??en / Kind regards > > > > Dr. Uwe Falke > > IT Specialist > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > Services > > +49 175 575 2877 Mobile > > Rathausstr. 7, 09111 Chemnitz, Germany > > uwefalke at de.ibm.com > > > > IBM Services > > > > IBM Data Privacy Statement > > > > IBM Deutschland Business & Technology Services GmbH > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > Sitz der Gesellschaft: Ehningen > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > From: Uwe Falke/Germany/IBM > > To: gpfsug main discussion list > > Date: 17/11/2020 09:50 > > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > > from Isilon to Scale over NFS? > > > > > > Hi Andi, > > > > what about leaving NFS completeley out and using rsync (multiple rsyncs > > in parallel, of course) directly between your source and target servers? > > I am not sure how many TCP connections (suppose it is NFS4) in parallel > > are opened between client and server, using a 2x bonded interface well > > requires at least two. That combined with the DB approach suggested by > > Jonathan to control the activity of the rsync streams would be my best > > guess. > > If you have many small files, the overhead might still kill you. Tarring > > them up into larger aggregates for transfer would help a lot, but then > you > > must be sure they won't change or you need to implement your own version > > control for that class of files. 
> > > > Mit freundlichen Gr??en / Kind regards > > > > Dr. Uwe Falke > > IT Specialist > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > Services > > +49 175 575 2877 Mobile > > Rathausstr. 7, 09111 Chemnitz, Germany > > uwefalke at de.ibm.com > > > > IBM Services > > > > IBM Data Privacy Statement > > > > IBM Deutschland Business & Technology Services GmbH > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > Sitz der Gesellschaft: Ehningen > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > From: Andi Christiansen > > To: "gpfsug-discuss at spectrumscale.org" > > > > Date: 16/11/2020 20:44 > > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > > Isilon to Scale over NFS? > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > Hi all, > > > > i have got a case where a customer wants 700TB migrated from isilon to > > Scale and the only way for him is exporting the same directory on NFS > from > > two different nodes... > > > > as of now we are using multiple rsync processes on different parts of > > folders within the main directory. this is really slow and will take > > forever.. right now 14 rsync processes spread across 3 nodes fetching > from > > 2.. > > > > does anyone know of a way to speed it up? right now we see from 1Gbit to > > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit > from > > scale nodes and 20Gbits from isilon so we should be able to reach just > > under 20Gbit... > > > > > > if anyone have any ideas they are welcome! > > > > > > Thanks in advance > > Andi Christiansen _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andi at christiansen.xxx Tue Nov 17 12:24:22 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 13:24:22 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: <1023406427.259407.1605615862969@privateemail.com> Hi Jan, We are syncing ACLs, groups, owners and timestamps aswell :) /Andi Christiansen > On 11/17/2020 1:07 PM Jan-Frode Myklebust wrote: > > > Nice to see it working well! > > But, what about ACLs? Does you rsync pull in all needed metadata, or do you also need to sync ACLs ? Any plans for how to solve that ? > > On Tue, Nov 17, 2020 at 12:52 PM Andi Christiansen wrote: > > > > Hi all, > > > > thanks for all the information, there was some interesting things amount it.. > > > > I kept on going with rsync and ended up making a file with all top level user directories and splitting them into chunks of 347 per rsync session(total 42000 ish folders). yesterday we had only 14 sessions with 3000 folders in each and that was too much work for one rsync session.. 
> > > > i divided them out among all GPFS nodes to have them fetch an area each and actually doing that 3 times on each node and that has now boosted the bandwidth usage from 3Gbit to around 16Gbit in total.. > > > > all nodes have been seing doing work above 7Gbit individual which is actually near to what i was expecting without any modifications to the NFS server or TCP tuning.. > > > > CPU is around 30-50% on each server and mostly below or around 30% so it seems like it could have handled abit more sessions.. > > > > Small files are really a killer but with all 96+ sessions we have now its not often all sessions are handling small files at the same time so we have an average of about 10-12Gbit bandwidth usage. > > > > Thanks all! ill keep you in mind if for some reason we see it slowing down again but for now i think we will try to see if it will go the last mile with a bit more sessions on each :) > > > > Best Regards > > Andi Christiansen > > > > > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > > > > > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but > > > it is over two nodes, so no bonding. But still, I'd expect to open several > > > TCP connections in parallel per source-target pair (like with several > > > rsyncs per source node) would bear an advantage (and still I thing NFS > > > doesn't do that, but I can be wrong). > > > If more nodes have access to the Isilon data they could also participate > > > (and don't need NFS exports for that). > > > > > > Mit freundlichen Gr??en / Kind regards > > > > > > Dr. Uwe Falke > > > IT Specialist > > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > > Services > > > +49 175 575 2877 Mobile > > > Rathausstr. 7, 09111 Chemnitz, Germany > > > uwefalke at de.ibm.com mailto:uwefalke at de.ibm.com > > > > > > IBM Services > > > > > > IBM Data Privacy Statement > > > > > > IBM Deutschland Business & Technology Services GmbH > > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > > Sitz der Gesellschaft: Ehningen > > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > > > From: Uwe Falke/Germany/IBM > > > To: gpfsug main discussion list > > > Date: 17/11/2020 09:50 > > > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > > > from Isilon to Scale over NFS? > > > > > > > > > Hi Andi, > > > > > > what about leaving NFS completeley out and using rsync (multiple rsyncs > > > in parallel, of course) directly between your source and target servers? > > > I am not sure how many TCP connections (suppose it is NFS4) in parallel > > > are opened between client and server, using a 2x bonded interface well > > > requires at least two. That combined with the DB approach suggested by > > > Jonathan to control the activity of the rsync streams would be my best > > > guess. > > > If you have many small files, the overhead might still kill you. Tarring > > > them up into larger aggregates for transfer would help a lot, but then you > > > must be sure they won't change or you need to implement your own version > > > control for that class of files. > > > > > > Mit freundlichen Gr??en / Kind regards > > > > > > Dr. Uwe Falke > > > IT Specialist > > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > > Services > > > +49 175 575 2877 Mobile > > > Rathausstr. 
7, 09111 Chemnitz, Germany > > > uwefalke at de.ibm.com mailto:uwefalke at de.ibm.com > > > > > > IBM Services > > > > > > IBM Data Privacy Statement > > > > > > IBM Deutschland Business & Technology Services GmbH > > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > > Sitz der Gesellschaft: Ehningen > > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > > > > > > From: Andi Christiansen > > > To: "gpfsug-discuss at spectrumscale.org mailto:gpfsug-discuss at spectrumscale.org " > > > > > > Date: 16/11/2020 20:44 > > > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > > > Isilon to Scale over NFS? > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org mailto:gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > > > > > Hi all, > > > > > > i have got a case where a customer wants 700TB migrated from isilon to > > > Scale and the only way for him is exporting the same directory on NFS from > > > two different nodes... > > > > > > as of now we are using multiple rsync processes on different parts of > > > folders within the main directory. this is really slow and will take > > > forever.. right now 14 rsync processes spread across 3 nodes fetching from > > > 2.. > > > > > > does anyone know of a way to speed it up? right now we see from 1Gbit to > > > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > > > scale nodes and 20Gbits from isilon so we should be able to reach just > > > under 20Gbit... > > > > > > > > > if anyone have any ideas they are welcome! > > > > > > > > > Thanks in advance > > > Andi Christiansen _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss athttp://spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss athttp://spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss athttp://spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Tue Nov 17 13:53:43 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 17 Nov 2020 13:53:43 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <616234716.258600.1605613918767@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> On 17/11/2020 11:51, Andi Christiansen wrote: > Hi all, > > thanks for all the information, there was some interesting things > amount it.. > > I kept on going with rsync and ended up making a file with all top > level user directories and splitting them into chunks of 347 per > rsync session(total 42000 ish folders). yesterday we had only 14 > sessions with 3000 folders in each and that was too much work for one > rsync session.. 
Unless you use something similar to my DB suggestion it is almost inevitable that some of those rsync sessions are going to have issues and you will have no way to track it or even know it has happened unless you do a single final giant catchup/check rsync. I should add that a copy of the sqlite DB is cover your backside protection when a user pops up claiming that you failed to transfer one of their vitally important files six months down the line and the old system is turned off and scrapped. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From skylar2 at uw.edu Tue Nov 17 14:59:43 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Tue, 17 Nov 2020 06:59:43 -0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> Message-ID: <20201117145943.5cxyfpfyrk7udmn4@thargelion> On Tue, Nov 17, 2020 at 01:53:43PM +0000, Jonathan Buzzard wrote: > On 17/11/2020 11:51, Andi Christiansen wrote: > > Hi all, > > > > thanks for all the information, there was some interesting things > > amount it.. > > > > I kept on going with rsync and ended up making a file with all top > > level user directories and splitting them into chunks of 347 per > > rsync session(total 42000 ish folders). yesterday we had only 14 > > sessions with 3000 folders in each and that was too much work for one > > rsync session.. > > Unless you use something similar to my DB suggestion it is almost inevitable > that some of those rsync sessions are going to have issues and you will have > no way to track it or even know it has happened unless you do a single final > giant catchup/check rsync. > > I should add that a copy of the sqlite DB is cover your backside protection > when a user pops up claiming that you failed to transfer one of their > vitally important files six months down the line and the old system is > turned off and scrapped. That's not a bad idea, and I like it more than the method I setup where we captured the output of find from both sides of the transfer and preserved it for posterity, but obviously did require a hard-stop date on the source. Fortunately, we seem committed to GPFS so it might be we never have to do another bulk transfer outside of the filesystem... -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From S.J.Thompson at bham.ac.uk Tue Nov 17 15:55:41 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 17 Nov 2020 15:55:41 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <20201117145943.5cxyfpfyrk7udmn4@thargelion> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> <20201117145943.5cxyfpfyrk7udmn4@thargelion> Message-ID: <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> > Fortunately, we seem committed to GPFS so it might be we never have to do > another bulk transfer outside of the filesystem... Until you want to move a v3 or v4 created file-system to v5 block sizes __ I hopes we won't be doing that sort of thing again... 
Simon From jonathan.buzzard at strath.ac.uk Tue Nov 17 19:45:29 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 17 Nov 2020 19:45:29 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> <20201117145943.5cxyfpfyrk7udmn4@thargelion> <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> Message-ID: <1a1be12b-a4f2-f2b3-4cdf-e34bc5eace24@strath.ac.uk> On 17/11/2020 15:55, Simon Thompson wrote: > >> Fortunately, we seem committed to GPFS so it might be we never have to do >> another bulk transfer outside of the filesystem... > > Until you want to move a v3 or v4 created file-system to v5 block sizes __ You forget the v2 to v3 for more than two billion files switch. Either that or you where not using it back then. Then there was the v3.2 if you ever want to mount it on Windows. > > I hopes we won't be doing that sort of thing again... > Yep, going to be recycling my scripts in the coming week for a v4 to v5 with capacity upgrade on our DSS-G. That basically involves a trashing of the file system and a restore from backup. Going to be doing the your data will be restored based on a metric of how many files and how much data you have ploy again :-) I too hope that will be the last time I have to do anything similar but my experience of the last couple of decades says that is likely to be a forlorn hope :-( I speculate that one day the 10,000 file set limit will be lifted, but only if you reformat your file system... JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From andi at christiansen.xxx Tue Nov 17 20:40:39 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 21:40:39 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> Message-ID: <82434297.276248.1605645639435@privateemail.com> Hi Jonathan, yes you are correct! but we plan to resync this once or twice every week for the next 3-4months to be sure everything is as it should be. Right now we are focused on getting them synced up and then we will run scheduled resyncs/checks once or twice a week depending on the data growth :) Thanks Andi Christiansen > On 11/17/2020 2:53 PM Jonathan Buzzard wrote: > > > On 17/11/2020 11:51, Andi Christiansen wrote: > > Hi all, > > > > thanks for all the information, there was some interesting things > > amount it.. > > > > I kept on going with rsync and ended up making a file with all top > > level user directories and splitting them into chunks of 347 per > > rsync session(total 42000 ish folders). yesterday we had only 14 > > sessions with 3000 folders in each and that was too much work for one > > rsync session.. > > Unless you use something similar to my DB suggestion it is almost > inevitable that some of those rsync sessions are going to have issues > and you will have no way to track it or even know it has happened unless > you do a single final giant catchup/check rsync. 
> > I should add that a copy of the sqlite DB is cover your backside > protection when a user pops up claiming that you failed to transfer one > of their vitally important files six months down the line and the old > system is turned off and scrapped. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From chris.schlipalius at pawsey.org.au Tue Nov 17 23:17:18 2020 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Wed, 18 Nov 2020 07:17:18 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Message-ID: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> So at my last job we used to rsync data between isilons across campus, and isilon to Windows File Cluster (and back). I recommend using dry run to generate a list of files and then use this to run with rysnc. This allows you also to be able to break up the transfer into batches, and check if files have changed before sync (say if your isilon files are not RO. Also ensure you have a recent version of rsync that preserves extended attributes and check your ACLS. A dry run example: https://unix.stackexchange.com/a/261372 I always felt more comfortable having a list of files before a sync?. Regards, Chris Schlipalius Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey Supercomputing Centre (CSIRO) 1 Bryce Avenue Kensington WA 6151 Australia Tel +61 8 6436 8815 Email chris.schlipalius at pawsey.org.au Web www.pawsey.org.au -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Nov 18 11:48:52 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 18 Nov 2020 11:48:52 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> Message-ID: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> On 17/11/2020 23:17, Chris Schlipalius wrote: > So at my last job we used to rsync data between isilons across campus, > and isilon to Windows File Cluster (and back). > > I recommend using dry run to generate a list of files and then use this > to run with rysnc. > > This allows you also to be able to break up the transfer into batches, > and check if files have changed before sync (say if your isilon files > are not RO. > > Also ensure you have a recent version of rsync that preserves extended > attributes and check your ACLS. > > A dry run example: > > https://unix.stackexchange.com/a/261372 > > I always felt more comfortable having a list of files before a sync?. > I would counsel in the strongest possible terms against that approach. Basically you have to be assured that none of your file names have "wacky" characters in them, because handling "wacky" characters in file names is exceedingly difficult. I cannot stress how hard it is and the above example does not handle all "wacky" characters in file names. So what do I mean by "wacky" characters. 
Well remember a file name can have just about anything in it on Linux with the exception of '/', and users especially when using a GUI, and even more so if they are Mac users can and do use what I will call "wacky" characters in their file names. The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. Then there is the use of the wildcard characters, especially '?' but also '*'. Not too difficult to handle you might say. Right now deal with a file name with a newline character in it :-) Don't ask me how or why you even do that but let me assure you that I have seen them on more than one occasion. And now your dry run list is broken... Not only that if you have a few hundred million files to move a list just becomes unwieldy anyway. One thing I didn't mention is that I would run anything with in a screen (or tmux if that is your poison) and turn on logging. For those interested I am in the process of cleaning up the script a bit and will post it somewhere in due course. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From andi at christiansen.xxx Wed Nov 18 11:54:47 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Wed, 18 Nov 2020 12:54:47 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> Message-ID: <1947408989.293430.1605700487095@privateemail.com> Hi Jonathan, i would be very interested in seeing your scripts when they are posted. Let me know where to get them! Thanks a bunch! Andi Christiansen > On 11/18/2020 12:48 PM Jonathan Buzzard wrote: > > > On 17/11/2020 23:17, Chris Schlipalius wrote: > > So at my last job we used to rsync data between isilons across campus, > > and isilon to Windows File Cluster (and back). > > > > I recommend using dry run to generate a list of files and then use this > > to run with rysnc. > > > > This allows you also to be able to break up the transfer into batches, > > and check if files have changed before sync (say if your isilon files > > are not RO. > > > > Also ensure you have a recent version of rsync that preserves extended > > attributes and check your ACLS. > > > > A dry run example: > > > > https://unix.stackexchange.com/a/261372 > > > > I always felt more comfortable having a list of files before a sync?. > > > > I would counsel in the strongest possible terms against that approach. > > Basically you have to be assured that none of your file names have > "wacky" characters in them, because handling "wacky" characters in file > names is exceedingly difficult. I cannot stress how hard it is and the > above example does not handle all "wacky" characters in file names. > > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and > users especially when using a GUI, and even more so if they are Mac > users can and do use what I will call "wacky" characters in their file > names. > > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. > > Not too difficult to handle you might say. 
Right now deal with a file > name with a newline character in it :-) Don't ask me how or why you even > do that but let me assure you that I have seen them on more than one > occasion. And now your dry run list is broken... > > Not only that if you have a few hundred million files to move a list > just becomes unwieldy anyway. > > One thing I didn't mention is that I would run anything with in a screen > (or tmux if that is your poison) and turn on logging. > > For those interested I am in the process of cleaning up the script a bit > and will post it somewhere in due course. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From cal.sawyer at framestore.com Wed Nov 18 12:18:57 2020 From: cal.sawyer at framestore.com (Cal Sawyer) Date: Wed, 18 Nov 2020 12:18:57 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 106, Issue 21 In-Reply-To: References: Message-ID: Hello Not a Scale user per se (we run a 3rdparty offshoot of Scale). In a past life managing Nexenta with OpenSolaris DR storage, I used nc/netcat for bulk data sync, which is far more efficient than rsync. With a bit of planning and analysis of directory structure on the target, nc runs could be parallelised as well, although not quite in the same way as running rsync via parallels. Of course, nc has to be available on Isilon but i have no experience with that platform. The only caveat in using nc is the amount of change to the target data as copying progresses (is the target datastore static or still seeing changes?). nc has to be followed with rsync to apply any changes and/or verify the integrity of the bulk copy. https://nakkaya.com/2009/04/15/using-netcat-for-file-transfers/ Are your Isilon and Scale systems located in the same network space? I'd also suggest that if possible, add a quad-port 10GbE (or larger: 25/100GbE) NIC to your servers to gain a wider data path and conduct your copy operations on those interfaces regards [image: Framestore] Cal Sawyer ? Senior Systems Engineer London ? New York ? Los Angeles ? Chicago ? Montr?al ? Mumbai 28 Chancery Lane London WC2A 1LB [T] +44 (0)20 7344 8000 W3W: warm.soil.patio On Wed, 18 Nov 2020 at 12:00, wrote: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Chris Schlipalius) > 2. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Jonathan Buzzard) > 3. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Andi Christiansen) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 18 Nov 2020 07:17:18 +0800 > From: Chris Schlipalius > To: > Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to > Scale over NFS? 
> [snip: remainder of the quoted digest trimmed; it repeats verbatim the messages from Chris Schlipalius, Jonathan Buzzard and Andi Christiansen that already appear in full earlier in this thread]
> > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 106, Issue 21 > *********************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Wed Nov 18 23:05:40 2020 From: valdis.kletnieks at vt.edu (Valdis Kl=?utf-8?Q?=c4=93?=tnieks) Date: Wed, 18 Nov 2020 18:05:40 -0500 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> Message-ID: <39863.1605740740@turing-police> On Wed, 18 Nov 2020 11:48:52 +0000, Jonathan Buzzard said: > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and You want to see some fireworks? At least at one time, it was possible to use a file system debugger that's all too trusting of hexadecimal input and create a directory entry of '../'. Let's just say that fs/namei.c was also far too trusting, and fsck was more than happy to make *different* errors than the kernel was.... > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. Don't forget ESC, CR, LF, backticks, forward ticks, semicolons, and pretty much anything else that will give a shell indigestion. SQL isn't the only thing prone to injection attacks.. :) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 832 bytes Desc: not available URL: From chris.schlipalius at pawsey.org.au Wed Nov 18 23:57:26 2020 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Thu, 19 Nov 2020 07:57:26 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: Message-ID: <6288DF78-A9DF-4BE9-B166-4478EF8C2A20@pawsey.org.au> ? I would counsel in the strongest possible terms against that approach. ? Basically you have to be assured that none of your file names have "wacky" characters in them, because handling "wacky" characters in file ? names is exceedingly difficult. I cannot stress how hard it is and the above example does not handle all "wacky" characters in file names. Well that?s indeed another kettle of fish if you have irregular/special naming of files, no I didn?t cover that and if you have millions of files, yes a list would be unwieldy, then I would be tarring up dirs. before moving? and then untarring on GPFS ?or breaking up the list into sets or sub lists. If you have these wacky types of file names well there are fixes as in the rsync manpages? yes not easy but possible.. Ie 1. -s, --protect-args 2. As per usual you can escape the spaces, or substitute for spaces. rsync -avuz user at server1.com:"${remote_path// /\\ }" . 3. Single quote the file name and path inside double quotes. ? 
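For names with embedded newlines a NUL-delimited list sidesteps the quoting problem entirely. A minimal sketch (the mount points and project directory below are placeholders, not paths from this thread):

# Build a NUL-separated list so spaces, tabs, wildcards and even newlines
# in file names survive intact, then hand the list to rsync with --from0.
cd /mnt/isilon
find ./projectA -print0 > /tmp/projectA.list
rsync -aHAX --from0 --files-from=/tmp/projectA.list ./ /mnt/scale/projectA/
# For pull or push over ssh, adding -s (--protect-args) stops the remote
# shell from re-interpreting spaces and wildcards in the paths.

Because every path is listed explicitly no recursion is needed, and -aHAX also carries hard links, ACLs and extended attributes, which touches on the ACL question raised earlier in the thread.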
One thing I didn't mention is that I would run anything with in a screen (or tmux if that is your poison) and turn on logging. Absolutely agree? ? For those interested I am in the process of cleaning up the script a bit and will post it somewhere in due course. ? JAB. Would be interesting to see?. I?ve also had success on GPFS with DCP and possibly this would be another option Regards, Chris Schlipalius Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey Supercomputing Centre (CSIRO) 1 Bryce Avenue Kensington WA 6151 Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 15:34:39 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 15:34:39 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Message-ID: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Nov 19 15:49:30 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 15:49:30 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Message-ID: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. 
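If the concern is ordering, i.e. the parent file system must be mounted before the one nested inside it, here is a rough sketch of two ways to handle it (the file system names are made up for illustration):

# GPFS mount priority: lower non-zero numbers are mounted first at daemon
# startup, so the parent comes up before the file system nested inside it.
mmchfs projects --mount-priority 1
mmchfs newproject --mount-priority 2
# Alternative: keep the new file system on its own mount point and bind
# mount it into place rather than nesting one GPFS mount inside another.
mount --bind /gpfs/newproject /projects/newproject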
JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From spectrumscale at kiranghag.com Thu Nov 19 16:40:47 2020 From: spectrumscale at kiranghag.com (KG) Date: Thu, 19 Nov 2020 22:10:47 +0530 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: You can also set mount priority on filesystems so that gpfs can try to mount them in order...parent first On Thu, Nov 19, 2020, 21:19 Jonathan Buzzard wrote: > On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > > Hi, > > > > > > I have a filesystem holding many projects (i.e., mounted under > > /projects), each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects > > of it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > Technically, mounting a filesystem on top of an existing filesystem > > should be possible, but, is this discouraged for any reason? How GPFS > > would behave with that and is there a technical reason for avoiding this > > setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > This has all the hallmarks of either a Windows admin or a newbie > Linux/Unix admin :-) > > Simply put /projects is mounted on top of whatever file system is > providing the root file system in the first place LOL. > > Linux/Unix and/or GPFS does not give a monkeys about mounting another > file system *ANYWHERE* in it period because there is no other way of > doing it. > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Nov 19 16:42:07 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 19 Nov 2020 16:42:07 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Message-ID: <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> If it is a remote cluster mount from your clients (hopefully!), you might want to look at priority to order mounting of the file-systems. I don?t know what would happen if the overmounted file-system went away, you would likely want to test. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 15:39 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). 
Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 16:48:07 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 16:48:07 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch>, <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: Hi Jonathan, thanks for sharing your opinions. In the sentence "Technically, mounting a filesystem on top of an existing filesystem should be possible" , I guess I was referring to that... I was concerned about other technical reasons, such like how would this would affect GPFS policies, or how to properly proceed with proper mounting, or any other technical reasons to consider. For the GPFS policies, I usually applied some of the existing GPFS policies based on directories, but after checking I realized that one can manage via device (never used policies in that way, at least for the simple but necessary use cases I have on the existing filesystems). Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Jonathan Buzzard Sent: Thursday, November 19, 2020 4:49:30 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. 
Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 17:01:37 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 17:01:37 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch>, <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> Message-ID: <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Hi Simon, that's a very good point, thanks a lot :) I have it remotely mounted on a client cluster, so I will consider priorities when mounting the filesystems with remote cluster mount. That's very useful. Also, as far as I saw, same approach can be also applied to local mounts (via mmchfs) during daemon startup with the same option --mount-priority. Thanks a lot for the hints, these are very useful. I'll test that. Cheers, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson Sent: Thursday, November 19, 2020 5:42:07 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem If it is a remote cluster mount from your clients (hopefully!), you might want to look at priority to order mounting of the file-systems. I don?t know what would happen if the overmounted file-system went away, you would likely want to test. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 15:39 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. 
Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Thu Nov 19 17:34:05 2020 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 19 Nov 2020 18:34:05 +0100 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: I would not mount a GPFS filesystem within a GPFS filesystem. Technically it should work, but I?d expect it to cause surprises if ever the lower filesystem experienced problems. Alone, a filesystem might recover automatically by remounting. But if there?s another filesystem mounted within, I expect it will be a problem.. Much better to use symlinks. -jf tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < marc.caubet at psi.ch>: > Hi Simon, > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > client cluster, so I will consider priorities when mounting the filesystems > with remote cluster mount. That's very useful. > > Also, as far as I saw, same approach can be also applied to local mounts > (via mmchfs) during daemon startup with the same option --mount-priority. > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > Cheers, > > Marc > _________________________________________________________ > Paul Scherrer Institut > High Performance Computing & Emerging Technologies > Marc Caubet Serrabou > Building/Room: OHSA/014 > Forschungsstrasse, 111 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > ------------------------------ > *From:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > S.J.Thompson at bham.ac.uk> > *Sent:* Thursday, November 19, 2020 5:42:07 PM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > filesystem > > > If it is a remote cluster mount from your clients (hopefully!), you might > want to look at priority to order mounting of the file-systems. I don?t > know what would happen if the overmounted file-system went away, you would > likely want to test. > > > > Simon > > > > *From: * on behalf of " > marc.caubet at psi.ch" > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Thursday, 19 November 2020 at 15:39 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > filesystem > > > > Hi, > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects of > it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > > Technically, mounting a filesystem on top of an existing filesystem should > be possible, but, is this discouraged for any reason? 
How GPFS would behave > with that and is there a technical reason for avoiding this setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. > > > > Thanks a lot, > > Marc > > _________________________________________________________ > Paul Scherrer Institut > High Performance Computing & Emerging Technologies > Marc Caubet Serrabou > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar2 at uw.edu Thu Nov 19 17:38:07 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Thu, 19 Nov 2020 09:38:07 -0800 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: <20201119173807.kormirvbweqs3un6@thargelion> Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. > > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. 
> > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:08:13 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:08:13 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: On 19/11/2020 17:34, Jan-Frode Myklebust wrote: > > I would not mount a GPFS filesystem within a GPFS filesystem. > Technically it should work, but I?d expect it to cause surprises if ever > the lower filesystem experienced problems. Alone, a filesystem might > recover automatically by remounting. But if there?s another filesystem > mounted within, I expect it will be a problem.. > > Much better to use symlinks. > Think about that for a minute... I guess if you are worried about /projects going away (which would suggest something really bad has happened anyway) would be to mount the GPFS file system that is currently holding /projects somewhere else and then bind mount everything into /projects At this point I would note that bind mounts are much better than symlinks which suck for this sort of application. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:12:03 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:12:03 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: <2f789d09-3704-2d41-ef2a-953de178dce2@strath.ac.uk> On 19/11/2020 16:40, KG wrote: > You can also set mount priority on filesystems so that gpfs can try to > mount them in order...parent first > One of the things that systemd brings to the table https://github.com/systemd/systemd/commit/3519d230c8bafe834b2dac26ace49fcfba139823 JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From marc.caubet at psi.ch Thu Nov 19 18:13:08 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 18:13:08 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <20201119173807.kormirvbweqs3un6@thargelion> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> , <20201119173807.kormirvbweqs3un6@thargelion> Message-ID: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Hi all, thanks a lot for your comments. Agreed, I better avoid it for now. I was concerned about how GPFS would behave in such case. For production I will take the safe route, but, just out of curiosity, I'll give it a try on a couple of test filesystems. Thanks a lot for your help, it was very helpful, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Skylar Thompson Sent: Thursday, November 19, 2020 6:38:07 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. > > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. 
> > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. > > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:32:39 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:32:39 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> <20201119173807.kormirvbweqs3un6@thargelion> <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Message-ID: <5b8edf06-a4ab-a39e-5a02-86fd7565b90a@strath.ac.uk> On 19/11/2020 18:13, Caubet Serrabou Marc (PSI) wrote: > > Hi all, > > > thanks a lot for your comments. Agreed, I?better avoid it for now. I was > concerned about how GPFS would behave in such case. For production I > will take the safe route, but, just out of curiosity, I'll give it a try > on a couple of test filesystems. > Don't use symlinks there is a range of applications that will break and you will confuse the hell out of your users as the fact you are not under /projects/new but /random/new is not hidden. Besides which if the symlink goes away because /projects goes away then it is all a bust anyway. If you are worried about /projects going away then the best plan is to mount the GPFS file systems somewhere else and then bind mount the directories into /projects on all the machines where they are mounted. GPFS is quite happy with this. We bind mount /gpfs/users into /users and /gpfs/software into /opt/software by default. In the past I have bind mounted random paths for every user (hundred plus) into /home JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From novosirj at rutgers.edu Thu Nov 19 18:34:09 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 19 Nov 2020 18:34:09 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: > On Nov 19, 2020, at 10:49 AM, Jonathan Buzzard wrote: > > On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: >> Hi, >> I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. >> I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. >> Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? >> Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. > > This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) > > Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. > > Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. Some others have said, but I disagree. It wasn?t that long ago that GPFS acted really screwy with systemd because it did something in a way other than Linux expected. 
As it is now, their devices are not /dev/whatever or server:/wherever like just about every other filesystem type. Not unreasonable to believe it would "act funny" compared to other FS. I like GPFS a lot, but this is not one of my favorite characteristics of it. -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From UWEFALKE at de.ibm.com Thu Nov 19 19:18:41 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 19 Nov 2020 20:18:41 +0100 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch><0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: Just the risk your parent system dies which will block your access to the child file system mounted on a mount point within. If that is not bothering you, go ahead and mount stacks. As for the symlink though: it is also gone if the parent dies :-). Mit freundlichen Grüßen / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: KG To: gpfsug main discussion list Date: 19/11/2020 17:41 Subject: [EXTERNAL] Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Sent by: gpfsug-discuss-bounces at spectrumscale.org You can also set mount priority on filesystems so that gpfs can try to mount them in order...parent first On Thu, Nov 19, 2020, 21:19 Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Thu Nov 19 19:37:52 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 19 Nov 2020 19:37:52 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> <20201119173807.kormirvbweqs3un6@thargelion> <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Message-ID: <738D41AC-6A07-453E-A2D1-C1882BE52EDC@bham.ac.uk> My understanding was that this was perfectly acceptable in a GPFS system. i.e. mounting parts of file-systems in others. It has been suggested to us as a way of using different vendor GPFS systems (e.g. an ESS with someone elses) as a way of working round the licensing rules about ESS and anything else, but still giving a single user ?name space?. We didn?t go that route, and of course I might have misunderstood what was being suggested. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 18:13 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi all, thanks a lot for your comments. Agreed, I better avoid it for now. I was concerned about how GPFS would behave in such case. For production I will take the safe route, but, just out of curiosity, I'll give it a try on a couple of test filesystems. Thanks a lot for your help, it was very helpful, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Skylar Thompson Sent: Thursday, November 19, 2020 6:38:07 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. 
> > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. > > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. 
> > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 20 19:13:41 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 20 Nov 2020 19:13:41 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Here is the output of waiters on 2 hosts that were having the issue today: HOST 1 [2020-11-20 09:07:53 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === Waiting 0.0035 sec since 09:08:07, monitored, thread 135497 FileBlockReadFetchHandlerThread: on ThCond 0x7F615C152468 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.180 Waiting 0.0036 sec since 09:08:07, monitored, thread 139228 PrefetchWorkerThread: on ThCond 0x7F627000D5D8 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.181 [2020-11-20 09:08:07 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === HOST 2 [2020-11-20 09:08:49 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === Waiting 0.0034 sec since 09:08:50, monitored, thread 345318 SharedHashTabFetchHandlerThread: on ThCond 0x7F049C001F08 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.133 [2020-11-20 09:08:50 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === [2020-11-20 09:08:52 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === You can see the waiters go from 0 to 1-2 , but they are hardly blocking. Yes there are separate pools for metadata for all of the filesystems here. 
I did another trace today when the problem was happening - this time I was able to get a longer trace using the following command: /usr/lpp/mmfs/bin/mmtracectl --start --trace=io --trace-file-size=512M --tracedev-write-mode=blocking --tracedev-buffer-size=64M -N nyzls149m This is what the trsum output looks like: Elapsed trace time: 62.412092000 seconds Elapsed trace time from first VFS call to last: 62.412091999 Time idle between VFS calls: 0.002913000 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 0.003487000 9 387.444 rdwr 0.273721000 183 1495.743 read_inode2 0.007304000 325 22.474 follow_link 0.013952000 58 240.552 pagein 0.025974000 66 393.545 getattr 0.002792000 26 107.385 revalidate 0.009406000 2172 4.331 create 66.194479000 3 22064826.333 open 1.725505000 88 19608.011 unlink 18.685099000 1 18685099.000 setattr 0.011627000 14 830.500 lookup 2379.215514000 502 4739473.135 delete_inode 0.015553000 328 47.418 rename 98.099073000 5 19619814.600 release 0.050574000 89 568.247 permission 0.007454000 73 102.110 getxattr 0.002346000 32 73.312 statfs 0.000081000 6 13.500 mmap 0.049809000 18 2767.167 removexattr 0.000827000 14 59.071 llseek 0.000441000 47 9.383 readdir 0.002667000 34 78.441 Ops 4093 Secs 62.409178999 Ops/Sec 65.583 MaxFilesToCache is set to 12000 : [common] maxFilesToCache 12000 I only see gpfs_i_lookup in the tracefile, no gpfs_v_lookups # grep gpfs_i_lookup trcrpt.2020-11-20_09.20.38.283986.nyzls149m |wc -l 1097 They mostly look like this - 62.346560 238895 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922178971A40 iNum 21980113 (0x14F63D1) cnP 0xFFFF922178971C88 retP 0x0 code 0 rc 0 62.346955 238895 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF91A8A4991E00 dentryP 0xFFFF92C545A93500 name '20170323.txt' d_flags 0x80 d_count 1 unhashed 1 62.367701 218442 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922071300000 iNum 29629892 (0x1C41DC4) cnP 0xFFFF922071300248 retP 0x0 code 0 rc 0 62.367734 218444 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF9193CF457800 dentryP 0xFFFF9229527A89C0 name 'node.py' d_flags 0x80 d_count 1 unhashed 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Monday, November 16, 2020 8:46 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, while the other nodes can well block the local one, as Frederick suggests, there should at least be something visible locally waiting for these other nodes. Looking at all waiters might be a good thing, but this case looks strange in other ways. Mind statement there are almost no local waiters and none of them gets older than 10 ms. I am no developer nor do I have the code, so don't expect too much. Can you tell what lookups you see (check in the trcrpt file, could be like gpfs_i_lookup or gpfs_v_lookup)? Lookups are metadata ops, do you have a separate pool for your metadata? How is that pool set up (doen to the physical block devices)? Your trcsum down revealed 36 lookups, each one on avg taking >30ms. That is a lot (albeit the respective waiters won't show up at first glance as suspicious ...). So, which waiters did you see (hope you saved them, if not, do it next time). What are the node you see this on and the whole cluster used for? What is the MaxFilesToCache setting (for that node and for others)? what HW is that, how big are your nodes (memory,CPU)? 
To check the unreasonably short trace capture time: how large are the trcrpt files you obtain? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 14:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s). We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? 
Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 
4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hooft at natlab.research.philips.com Sat Nov 21 00:37:01 2020 From: hooft at natlab.research.philips.com (Peter van Hooft) Date: Sat, 21 Nov 2020 01:37:01 +0100 Subject: [gpfsug-discuss] mmchdisk /dev/fs start -a progress Message-ID: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> Hello, Is it possible to find out the progress of the 'mmchdisk /dev/fs start -a' command when the controlling terminal had been lost? We can see the task running on the fs manager node with 'mmdiag --commands' with attributes 'hold PIT/disk waitTime 0' We are starting to worry the mmchdisk is taking too long, and see continuously waiters like Waiting 3.1946 sec since 01:28:23, ignored, thread 22092 TSCHDISKCmdThread: on ThCond 0x180267573D0 (SGManagementMgrDataCondvar), reason 'waiting for stripe group to recover' Thanks for any hints. Peter van Hooft Philips Research From jonathan.buzzard at strath.ac.uk Sat Nov 21 10:13:42 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Sat, 21 Nov 2020 10:13:42 +0000 Subject: [gpfsug-discuss] mmchdisk /dev/fs start -a progress In-Reply-To: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> References: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> Message-ID: On 21/11/2020 00:37, Peter van Hooft wrote: > > Hello, > > Is it possible to find out the progress of the 'mmchdisk /dev/fs start -a' > command when the controlling terminal had been lost? > I don't think so. You are lucky it is still running > We can see the task running on the fs manager node with 'mmdiag --commands' with > attributes 'hold PIT/disk waitTime 0' > We are starting to worry the mmchdisk is taking too long, and see continuously waiters like > Waiting 3.1946 sec since 01:28:23, ignored, thread 22092 TSCHDISKCmdThread: on ThCond 0x180267573D0 (SGManagementMgrDataCondvar), reason 'waiting for stripe group to recover' > > Thanks for any hints. > Not that this is going to help this time, but it is why you should *ALWAYS* without exception run these sorts of commands within a screen/tmux session so when you loose the connection to the server you can just reconnect and pick it up again. This is introductory system administration 101. No critical or long running command should ever be dependant on a remote controlling terminal. If you can't run them locally then run them in a screen or tmux session. There are plenty of good howto's for both screen and tmux on the internet. Depending on which distribution you use I would note that RedHat have very annoyingly and for completely specious reasons removed screen from RHEL8 and left tmux. So if you are starting from scratch tmux is the one to learn :-( JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From robert.horton at icr.ac.uk Mon Nov 23 15:06:05 2020 From: robert.horton at icr.ac.uk (Robert Horton) Date: Mon, 23 Nov 2020 15:06:05 +0000 Subject: [gpfsug-discuss] AFM experiences? Message-ID: Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. 
What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea being to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accommodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be interesting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. From novosirj at rutgers.edu Mon Nov 23 15:30:47 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 23 Nov 2020 15:30:47 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that's how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I've seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can't think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dean.flanders at fmi.ch Mon Nov 23 17:58:12 2020 From: dean.flanders at fmi.ch (Flanders, Dean) Date: Mon, 23 Nov 2020 17:58:12 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> References: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> Message-ID: Hello Rob, We looked at AFM years ago for DR, but after reading the bug reports, we avoided it, and also have had seen a case where it had to be removed from one customer, so we have kept things simple. Now looking again a few years later there are still issues, IBM Spectrum Scale Active File Management (AFM) issues which may result in undetected data corruption, and that was just my first google hit. We have kept it simple, and use a parallel rsync process with policy engine and can hit wire speed for copying of millions of small files in order to have isolation between the sites at GB/s. I am not saying it is bad, just that it needs an appropriate risk/reward ratio to implement as it increases overall complexity. 
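To make the parallel rsync approach above a little more concrete, here is a minimal sketch; the paths and the worker count of 8 are placeholders, and at scale the file list would typically come from an mmapplypolicy LIST rule rather than find:

  # build a relative file list on the source side (a policy-engine LIST rule scales better than find)
  cd /gpfs/source && find . -type f > /tmp/filelist
  # split the list into 8 chunks without breaking lines, then run 8 rsync workers in parallel
  split -n l/8 /tmp/filelist /tmp/chunk.
  ls /tmp/chunk.* | xargs -P 8 -I{} rsync -a --files-from={} /gpfs/source/ /gpfs/destination/
  # add -A -X to rsync if POSIX ACLs and extended attributes need to carry over

Whether a run like this approaches wire speed depends mostly on how evenly the chunks spread the small files across the workers.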
Kind regards, Dean From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Ryan Novosielski Sent: Monday, November 23, 2020 4:31 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM experiences? We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that?s how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I?ve seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can?t think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton > wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 
534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Nov 23 21:54:39 2020 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 23 Nov 2020 21:54:39 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: Message-ID: Rob, Talk to Jake Carroll from the University of Queensland, he has done a number of presentations at Scale User Groups of UQ?s MeDiCI data fabric which is based on Spectrum Scale and does very aggressive use of AFM. Their use of AFM is not only on campus, but to remote Storage clusters between 30km and 1500km away from their Home cluster. They have also tested AFM between Australia, Japan, and USA Sent from my iPhone > On 24 Nov 2020, at 01:20, Robert Horton wrote: > > ?Hi all, > > We're thinking about deploying AFM and would be interested in hearing > from anyone who has used it in anger - particularly independent writer. > > Our scenario is we have a relatively large but slow (mainly because it > is stretched over two sites with a 10G link) cluster for long/medium- > term storage and a smaller but faster cluster for scratch storage in > our HPC system. What we're thinking of doing is using some/all of the > scratch capacity as an IW cache of some/all of the main cluster, the > idea to reduce the need for people to manually move data between the > two. > > It seems to generally work as expected in a small test environment, > although we have a few concerns: > > - Quota management on the home cluster - we need a way of ensuring > people don't write data to the cache which can't be accomodated on > home. Probably not insurmountable but needs a bit of thought... > > - It seems inodes on the cache only get freed when they are deleted on > the cache cluster - not if they get deleted from the home cluster or > when the blocks are evicted from the cache. Does this become an issue > in time? > > If anyone has done anything similar I'd be interested to hear how you > got on. It would be intresting to know if you created a cache fileset > for each home fileset or just one for the whole lot, as well as any > other pearls of wisdom you may have to offer. > > Thanks! > Rob > > -- > Robert Horton | Research Data Storage Lead > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > Twitter @ICR_London > Facebook: www.facebook.com/theinstituteofcancerresearch > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. 
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Mon Nov 23 23:14:08 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 23 Nov 2020 23:14:08 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: <2C7317A6-B9DF-450A-92A6-AE156396204A@rutgers.edu> Ours are about 50 and 100 km from the home cluster, but it?s over 100Gb fiber. > On Nov 23, 2020, at 4:54 PM, Andrew Beattie wrote: > > Rob, > > Talk to Jake Carroll from the University of Queensland, he has done a number of presentations at Scale User Groups of UQ?s MeDiCI data fabric which is based on Spectrum Scale and does very aggressive use of AFM. > > Their use of AFM is not only on campus, but to remote Storage clusters between 30km and 1500km away from their Home cluster. They have also tested AFM between Australia, Japan, and USA > > Sent from my iPhone > > > On 24 Nov 2020, at 01:20, Robert Horton wrote: > > > > ?Hi all, > > > > We're thinking about deploying AFM and would be interested in hearing > > from anyone who has used it in anger - particularly independent writer. > > > > Our scenario is we have a relatively large but slow (mainly because it > > is stretched over two sites with a 10G link) cluster for long/medium- > > term storage and a smaller but faster cluster for scratch storage in > > our HPC system. What we're thinking of doing is using some/all of the > > scratch capacity as an IW cache of some/all of the main cluster, the > > idea to reduce the need for people to manually move data between the > > two. > > > > It seems to generally work as expected in a small test environment, > > although we have a few concerns: > > > > - Quota management on the home cluster - we need a way of ensuring > > people don't write data to the cache which can't be accomodated on > > home. Probably not insurmountable but needs a bit of thought... > > > > - It seems inodes on the cache only get freed when they are deleted on > > the cache cluster - not if they get deleted from the home cluster or > > when the blocks are evicted from the cache. Does this become an issue > > in time? > > > > If anyone has done anything similar I'd be interested to hear how you > > got on. It would be intresting to know if you created a cache fileset > > for each home fileset or just one for the whole lot, as well as any > > other pearls of wisdom you may have to offer. > > > > Thanks! > > Rob > > > > -- > > Robert Horton | Research Data Storage Lead > > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > > Twitter @ICR_London > > Facebook: www.facebook.com/theinstituteofcancerresearch > > > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > > > This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. 
-- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From vpuvvada at in.ibm.com Tue Nov 24 02:32:01 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:02:01 +0530 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: >- Quota management on the home cluster - we need a way of ensuring >people don't write data to the cache which can't be accomodated on >home. Probably not insurmountable but needs a bit of thought... You could set same quotas between cache and home clusters. AFM does not support replication of filesystem metadata like quotas, fileset configuration etc... >- It seems inodes on the cache only get freed when they are deleted on >the cache cluster - not if they get deleted from the home cluster or >when the blocks are evicted from the cache. Does this become an issue >in time? AFM periodically revalidates with home cluster. If the files/dirs were already deleted at home cluster, AFM moves them to /.ptrash directory at cache cluster during the revalidation. These files can be removed manually by user or auto eviction process. If the .ptrash directory is not cleaned up on time, it might result into quota issues at cache cluster. ~Venkat (vpuvvada at in.ibm.com) From: Robert Horton To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2020 08:51 PM Subject: [EXTERNAL] [gpfsug-discuss] AFM experiences? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. 
This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Tue Nov 24 02:37:18 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:07:18 +0530 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> Message-ID: Dean, This is one of the corner case which is associated with sparse files at the home cluster. You could try with latest versions of scale, AFM indepedent-writer mode have many performance/functional improvements in newer releases. ~Venkat (vpuvvada at in.ibm.com) From: "Flanders, Dean" To: gpfsug main discussion list Date: 11/23/2020 11:44 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] AFM experiences? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello Rob, We looked at AFM years ago for DR, but after reading the bug reports, we avoided it, and also have had seen a case where it had to be removed from one customer, so we have kept things simple. Now looking again a few years later there are still issues, IBM Spectrum Scale Active File Management (AFM) issues which may result in undetected data corruption, and that was just my first google hit. We have kept it simple, and use a parallel rsync process with policy engine and can hit wire speed for copying of millions of small files in order to have isolation between the sites at GB/s. I am not saying it is bad, just that it needs an appropriate risk/reward ratio to implement as it increases overall complexity. Kind regards, Dean From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Ryan Novosielski Sent: Monday, November 23, 2020 4:31 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM experiences? We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that?s how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I?ve seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can?t think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Tue Nov 24 02:41:21 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:11:21 +0530 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: AFM provides near zero downtime for migration. As of today, AFM migration does not support ACLs or other EAs migration from non scale (GPFS) source. 
https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm ~Venkat (vpuvvada at in.ibm.com) From: "Frederick Stock" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 11/17/2020 03:14 AM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Have you considered using the AFM feature of Spectrum Scale? I doubt it will provide any speed improvement but it would allow for data to be accessed as it was being migrated. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: Andi Christiansen Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Date: Mon, Nov 16, 2020 2:44 PM Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at googlemail.com Tue Nov 24 12:16:55 2020 From: luke.raimbach at googlemail.com (Luke Raimbach) Date: Tue, 24 Nov 2020 12:16:55 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: Hi Rob, Some things to think about from experiences a year or so ago... If you intend to perform any HPC workload (writing / updating / deleting files) inside a cache, then appropriately specified gateway nodes will be your friend: 1. When creating, updating or deleting files in the cache, each operation requires acknowledgement from the gateway handling that particular cache, before returning ACK to the application. This will add a latency overhead to the workload - if your storage is IB connected to the compute cluster and using verbsRdmaSend for example, this will increase your happiness. Connecting low-spec gateway nodes over 10GbE with the expectation that they will "drain down" over time was a sore learning experience in the early days of AFM for me. 2. AFM queues can quickly eat up memory. I think around 350bytes of memory is consumed for each operation in the AFM queue, so if you have huge file churn inside a cache then the queue will grow very quickly. If you run out of memory, the node dies and you enter cache recovery when it comes back up (or another node takes over). 
This can end up cycling the node as it tries to revalidate a cache and keep up with any other queues. Get more memory! I've not used AFM for a while now and I think the latter enormity has some mitigation against create / delete cycles (i.e. the create operation is expunged from the queue instead of two operations being played back to the home). I expect IBM experts will tell you more about those improvements. Also, several smaller caches are better than one large one (parallel execution of queues helps utilise the available bandwidth and you have a better failover spread if you have multiple gateways, for example). Independent Writer mode comes with some small danger (user error or impatience mainly) inasmuch as whoever updates a file last will win; e.g. home user A writes a file, then cache user B updates the file after reading it and tells user A the update is complete, when really the gateway queue is long and the change is waiting to go back home. User A uses the file expecting the changes are made, then updates it with some results. Meanwhile the AFM queue drains down and user B's change arrives after user A has completed their changes. The interim version of the file user B modified will persist at home and user A's latest changes are lost. Some careful thought about workflow (or good user training about eventual consistency) will save some potential misery on this front. Hope this helps, Luke On Mon, 23 Nov 2020 at 15:19, Robert Horton wrote: > Hi all, > > We're thinking about deploying AFM and would be interested in hearing > from anyone who has used it in anger - particularly independent writer. > > Our scenario is we have a relatively large but slow (mainly because it > is stretched over two sites with a 10G link) cluster for long/medium- > term storage and a smaller but faster cluster for scratch storage in > our HPC system. What we're thinking of doing is using some/all of the > scratch capacity as an IW cache of some/all of the main cluster, the > idea to reduce the need for people to manually move data between the > two. > > It seems to generally work as expected in a small test environment, > although we have a few concerns: > > - Quota management on the home cluster - we need a way of ensuring > people don't write data to the cache which can't be accomodated on > home. Probably not insurmountable but needs a bit of thought... > > - It seems inodes on the cache only get freed when they are deleted on > the cache cluster - not if they get deleted from the home cluster or > when the blocks are evicted from the cache. Does this become an issue > in time? > > If anyone has done anything similar I'd be interested to hear how you > got on. It would be intresting to know if you created a cache fileset > for each home fileset or just one for the whole lot, as well as any > other pearls of wisdom you may have to offer. > > Thanks! > Rob > > -- > Robert Horton | Research Data Storage Lead > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > Twitter @ICR_London > Facebook: www.facebook.com/theinstituteofcancerresearch > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable > Company Limited by Guarantee, Registered in England under Company No. > 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > This e-mail message is confidential and for use by the addressee only. 
If > the message is received by anyone other than the addressee, please return > the message to the sender by replying to it and then delete the message > from your computer and network. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yeep at robust.my Tue Nov 24 14:09:34 2020 From: yeep at robust.my (T.A. Yeep) Date: Tue, 24 Nov 2020 22:09:34 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Venkat, If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? On Tue, Nov 24, 2020 at 10:41 AM Venkateswara R Puvvada wrote: > AFM provides near zero downtime for migration. As of today, AFM > migration does not support ACLs or other EAs migration from non scale > (GPFS) source. > > > https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm > > ~Venkat (vpuvvada at in.ibm.com) > > > > From: "Frederick Stock" > To: gpfsug-discuss at spectrumscale.org > Cc: gpfsug-discuss at spectrumscale.org > Date: 11/17/2020 03:14 AM > Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data > from Isilon to Scale over NFS? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Have you considered using the AFM feature of Spectrum Scale? I doubt it > will provide any speed improvement but it would allow for data to be > accessed as it was being migrated. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 > stockf at us.ibm.com > > > ----- Original message ----- > From: Andi Christiansen > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: "gpfsug-discuss at spectrumscale.org" > Cc: > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon > to Scale over NFS? > Date: Mon, Nov 16, 2020 2:44 PM > > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS from > two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching from > 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > scale nodes and 20Gbits from isilon so we should be able to reach just > under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Best regards *T.A. 
Yeep*Mobile: +6-016-719 8506 | Tel: +6-03-7628 0526 | www.robusthpc.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Nov 24 09:39:47 2020 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Tue, 24 Nov 2020 09:39:47 +0000 Subject: [gpfsug-discuss] SSUG::Digital with CIUK Message-ID: <> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: meeting.ics Type: text/calendar Size: 2623 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 3499622 bytes Desc: not available URL: From prasad.surampudi at theatsgroup.com Tue Nov 24 16:05:19 2020 From: prasad.surampudi at theatsgroup.com (Prasad Surampudi) Date: Tue, 24 Nov 2020 16:05:19 +0000 Subject: [gpfsug-discuss] mmhealth reports fserrinvalid errors on CNFS servers Message-ID: We are seeing fserrinvalid error on couple of filesystems in Spectrum Scale cluster. These errors are reported but mmhealth only couple of nodes (CNFS servers) in the cluster, but mmhealth on other nodes shows no issues. Any idea what this error means? And why its reported on CNFS servers and not on other nodes? What need to be done to fix this issue? sudo /usr/lpp/mmfs/bin/mmhealth node show FILESYSTEM -v Node name: cnfs05-gpfs Component Status Reasons ------------------------------------------------------------------- FILESYSTEM DEGRADED fserrinvalid(vol) argus HEALTHY - dytech HEALTHY - enlnt_E HEALTHY - enlnt_Es HEALTHY - haaforfs HEALTHY - haaforfs2 HEALTHY - historical HEALTHY - prcfs HEALTHY - qmtfs HEALTHY - research HEALTHY - research2 HEALTHY - schon_raw HEALTHY - uhdb_vol1 HEALTHY - vol DEGRADED fserrinvalid(vol) Event Parameter Severity Event Message ---------------------------------------------------------------------------------------------------------- fserrinvalid vol ERROR FS=vol,ErrNo=1124,Unknown error=0464000000010000000180A108BC000079B4000000000000003400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 -------------- next part -------------- An HTML attachment was scrubbed... URL: From NSCHULD at de.ibm.com Tue Nov 24 16:44:35 2020 From: NSCHULD at de.ibm.com (Norbert Schuld) Date: Tue, 24 Nov 2020 17:44:35 +0100 Subject: [gpfsug-discuss] =?utf-8?q?mmhealth_reports_fserrinvalid_errors_o?= =?utf-8?q?n_CNFS=09servers?= In-Reply-To: References: Message-ID: To get an explanation for any event one can ask the system: # mmhealth event show fserrinvalid Event Name: fserrinvalid Event ID: 999338 Description: Unrecognized FSSTRUCT error received. Check documentation Cause: A filesystem corruption detected User Action: Check error message for details and the mmfs.log.latest log for further details. See the topic Checking and repairing a file system in the IBM Spectrum Scale documentation: Administering. Managing file systems. If the file system is severely damaged, the best course of action is to follow the procedures in section: Additional information to collect for file system corruption or MMFS_FSSTRUCT errors Severity: ERROR State: DEGRADED The event is triggered by a callback which may not fire on all nodes, that is why only a subset of nodes have the information. 
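For a quick view of which nodes are currently raising the event, the standard mmhealth views should show it (FILESYSTEM is the component name from the output above; nothing here is specific to this particular event):

  # cluster-wide rollup of the FILESYSTEM component across all nodes
  mmhealth cluster show FILESYSTEM
  # full event details on a node that reports the degradation
  mmhealth node show FILESYSTEM -v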
Depending on the version of scale the procedure to remove the event varies: For newer release please use # mmhealth event resolve Missing arguments. Usage: mmhealth event resolve {EventName} [Identifier] For older releases it is described here: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_fsstruc.htm mmsysmonc event filesystem fsstruct_fixed Mit freundlichen Gr??en / Kind regards Norbert Schuld M925:IBM Spectrum Scale Software Development Phone: +49-160 70 70 335 IBM Deutschland Research & Development GmbH Email: nschuld at de.ibm.com Am Weiher 24 65451 Kelsterbach Knowing is not enough; we must apply. Willing is not enough; we must do. IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Prasad Surampudi To: "gpfsug-discuss at spectrumscale.org" Date: 24.11.2020 17:05 Subject: [EXTERNAL] [gpfsug-discuss] mmhealth reports fserrinvalid errors on CNFS servers Sent by: gpfsug-discuss-bounces at spectrumscale.org We are seeing fserrinvalid error on couple of filesystems in Spectrum Scale cluster. These errors are reported but mmhealth only couple of nodes (CNFS servers) in the cluster, but mmhealth on other nodes shows no issues. Any idea what this error means? And why its reported on CNFS servers and not on other nodes? What need to be done to fix this issue? sudo /usr/lpp/mmfs/bin/mmhealth node show FILESYSTEM -v Node name: cnfs05-gpfs Component Status Reasons ------------------------------------------------------------------- FILESYSTEM DEGRADED fserrinvalid(vol) argus HEALTHY - dytech HEALTHY - enlnt_E HEALTHY - enlnt_Es HEALTHY - haaforfs HEALTHY - haaforfs2 HEALTHY - historical HEALTHY - prcfs HEALTHY - qmtfs HEALTHY - research HEALTHY - research2 HEALTHY - schon_raw HEALTHY - uhdb_vol1 HEALTHY - vol DEGRADED fserrinvalid(vol) Event Parameter Severity Event Message ---------------------------------------------------------------------------------------------------------- fserrinvalid vol ERROR FS=vol,ErrNo=1124,Unknown error=0464000000010000000180A108BC000079B4000000000000003400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1D963707.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From jake.carroll at uq.edu.au Wed Nov 25 21:29:24 2020 From: jake.carroll at uq.edu.au (Jake Carroll) Date: Wed, 25 Nov 2020 21:29:24 +0000 Subject: [gpfsug-discuss] IB routers in ESS configuration + 3 different subnets - valid config? Message-ID: Hi. I am just in the process of sanity-checking a potential future configuration. 
Let's say I have an ESS 5000 and an ESS 3000 placed on the data centre floor to form the basis of a new scratch array. Let's then suppose that I have three existing supercomputers in that same location. Each of those supercomputers has a separate IB subnet and their networks are unrelated to each other, IB-wise. My understanding is that it is valid and possible to use MLNX EDR IB *routers* in order to be able to transport NSD communications back and forth across those separate subnets, back to the ESS (which lives on its own unique subnet). So at this point, I've got four unique subnets - one for the ESS, one for each super. As I understand it, there is an upper limit of *SIX* unique subnets on those EDR IB routers. As I understand it - for IPoIB transport, I'd also need some "gateway" boxes more or less - essentially some decent servers which I put EDR/HDR cards in as dog legs that act as an IPoIB gateway interface to each subnet. I appreciate that there is devil in the detail - but what I'm asking is if it is valid to "route" NSD with IB Routers (not switches) this way to separate subnets. Colleagues at IBM have all said "yeah....should work....we've not done it....but should be fine?" Colleagues at Mellanox (uhhh...nvidia...) say "Yes, this is valid and does exactly as the IB Router should and there is nothing unusual about this". If someone has experience doing this or could call out any oddity/weirdness/gotchas, I'd be very appreciative. I'm fairly sure this is all very low risk - but given nobody locally could tell me "Yeah, all certified and valid!" I'd like the wisdom of the wider crowd. Thank you. --jc -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Fri Nov 27 11:46:05 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Fri, 27 Nov 2020 17:16:05 +0530 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Yeep, >If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? rsync can be used to just fix metadata like ACLs and EAs. AFM does not revalidate the files with source system if rsync changes the ACLs on them. So ACLs can only be fixed after or during the cutover. ACL inheritance may be used by setting on ACLs on required parent dirs upfront if this option is sufficient, there was an user who migrated to scale using this method. ~Venkat (vpuvvada at in.ibm.com) From: "T.A. Yeep" To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 11/24/2020 07:40 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Venkat, If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? On Tue, Nov 24, 2020 at 10:41 AM Venkateswara R Puvvada < vpuvvada at in.ibm.com> wrote: AFM provides near zero downtime for migration. As of today, AFM migration does not support ACLs or other EAs migration from non scale (GPFS) source. 
https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm ~Venkat (vpuvvada at in.ibm.com) From: "Frederick Stock" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 11/17/2020 03:14 AM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Have you considered using the AFM feature of Spectrum Scale? I doubt it will provide any speed improvement but it would allow for data to be accessed as it was being migrated. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: Andi Christiansen Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Date: Mon, Nov 16, 2020 2:44 PM Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Best regards T.A. Yeep Mobile: +6-016-719 8506 | Tel: +6-03-7628 0526 | www.robusthpc.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Mon Nov 30 13:49:12 2020 From: carlz at us.ibm.com (Carl Zetie - carlz@us.ibm.com) Date: Mon, 30 Nov 2020 13:49:12 +0000 Subject: [gpfsug-discuss] Licensing costs for data lakes (SSUG follow-up) Message-ID: I am seeking some help on a topic I know many of you care deeply about: licensing costs I am trying to gather some more information about a request that has come up a couple of times, pricing for ?data lakes?. I would like to understand better what people are looking for here. - Is it as simple as ?much steeper discounts for very large deployments?? Or is a ?data lake? something specific, e.g. a large deployment that is not performance/latency sensitive; a storage pool that is [primarily] HDD; a tier that has specific read/write patterns such as moving entire large datasets in or out; or something else? 
Bear in mind that if we have special licensing for data lakes, we need a rigorous definition so that both you and we know whether your use of that licensing is compliant. Nobody likes ambiguity in licensing! - Are you expecting pricing to get very flat/discounting to get steep for large deployments? Or a different price tier/structure for "data lakes" if we can rigorously define what one means? Do you agree or disagree with the proposition that if you keep adding storage hardware/capacity, that the software licensing cost should rise in proportion (even if that proportion is much smaller for a "data lake" than for a performance tier)? - Feel free to be creative and imaginative. For example, would you be interested in a low-cost pricing model for storage that is an AFM Home and is _only_ accessed by using AFM to move data in and out of an AFM Cache (probably on the performance tier)? This would be conceptually similar to the way you can now (5.1) use AFM-Object to park data in a cheap object store. - Also feel free to answer questions I didn't ask... If you prefer to discuss this in Slack rather than email, I started a discussion there a little while ago (please thread your comments!): https://ssug-poweraiug.slack.com/archives/CEVVCEE8M/p1605815075188800 Carl Zetie Program Director Offering Management Spectrum Scale ---- (919) 473 3318 ][ Research Triangle Park carlz at us.ibm.com [signature_1545794140] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 69558 bytes Desc: image001.png URL:
From david_johnson at brown.edu Mon Nov 30 21:41:30 2020 From: david_johnson at brown.edu (David Johnson) Date: Mon, 30 Nov 2020 16:41:30 -0500 Subject: [gpfsug-discuss] internal details on GPFS inode expansion Message-ID: When GPFS needs to add inodes to the filesystem, it seems to pre-create about 4 million of them. Judging by the logs, it seems it only takes a few (13 maybe) seconds to do this. However we are suspecting that this might only be to request the additional inodes and that there is some background activity for some time afterwards. Would someone who has knowledge of the actual internals be willing to confirm or deny this, and if there is background activity, is it on all nodes in the cluster, NSD nodes, "default worker nodes"? Thanks, -- ddj Dave Johnson ddj at brown.edu
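As an aside for anyone wanting to observe an expansion from the outside, a few stock commands show the inode counts and whether related work is still queued; gpfs0 is a placeholder device name, and this only observes the behaviour rather than answering the internals question:

  # allocated vs. maximum inodes per fileset / inode space
  mmlsfileset gpfs0 -L
  # file-system-wide inode counts
  mmdf gpfs0 -F
  # long-running commands and current waiters on the file system manager
  mmdiag --commands
  mmdiag --waiters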
>>> Join Here <<< This episode will start 15 minutes later as usual. * San Francisco, USA at 08:15 PST * New York, USA at 11:15 EST * London, United Kingdom at 16:15 GMT * Frankfurt, Germany at 17:15 CET * Pune, India at 21:45 IST -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/calendar Size: 2488 bytes Desc: not available URL: ------------------------------ Message: 2 Date: Wed, 4 Nov 2020 08:14:41 +0100 (CET) From: Andi Christiansen To: gpfsug main discussion list , Christian Vieser Subject: Re: [gpfsug-discuss] Alternative to Scale S3 API. Message-ID: <1512108314.679947.1604474081488 at privateemail.com> Content-Type: text/plain; charset="utf-8" Hi Christian, Thanks for the information! My question also triggered IBM to tell me the same so i think we will stay on S3 with Scale and hoping the same with the new release.. Yes, MinIO is really lacking some good documentation.. but definatly a cool software package that i will keep an eye on in the future... Best Regards Andi Christiansen > On 11/02/2020 2:44 PM Christian Vieser wrote: > > > > Hi Andi, > > we suffer from the same issue. IBM support told me that Spectrum Scale 5.1 will come with a new release of the underlying Openstack components, so we still hope that some/most of limitations will vanish then. But I already know, that the new S3 policies won't be available, only the "legacy" S3 ACLs. > > We also tried MinIO but deemed that it's not "production ready". It's fine for quickly setting up a S3 service for development, but they release too often and with breaking changes, and documentation is lacking all aspects regarding maintenance. > > Regards, > > Christian > > Am 27.10.20 um 12:46 schrieb Andi Christiansen: > > > > Hi all, > > > > We have over a longer period used the S3 API within spectrum Scale.. And that has shown that it does not support very many applications because of limitations of the API.. > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 106, Issue 3 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From herrmann at sprintmail.com Sat Nov 7 21:10:36 2020 From: herrmann at sprintmail.com (Ron H) Date: Sat, 7 Nov 2020 16:10:36 -0500 Subject: [gpfsug-discuss] Use cases for file audit logging and clusteredwatch folder In-Reply-To: References: Message-ID: <8F771847BDEB4447919D30A16FE48FAB@rone8PC> Hi Jacob, Can you point me to a good overview of each of these features? I know File Audit and Watch is part of the DME Scale edition license, but I can?t seem to find a good explanation of what these features can offer. 
Thanks Ron From: Jacob M Tick Sent: Monday, November 02, 2020 7:21 PM To: gpfsug-discuss at spectrumscale.org Cc: April Brown Subject: [gpfsug-discuss] Use cases for file audit logging and clusteredwatch folder Hi All, I am reaching out on behalf of the Spectrum Scale development team to get some insight on how our customers are using the file audit logging and the clustered watch folder features. If you have it enabled in your test or production environment, could you please elaborate on how and why you are using the feature? Also, knowing how you have the function configured (ie: watching or auditing for certain events, only enabling on certain filesets, ect..) would help us out. Please respond back to April, John (both on CC), and I with any info you are willing to provide. Thanks in advance! Regards, Jake Tick Manager Spectrum Scale - Scalable Data Interfaces IBM Systems Group Email:jmtick at us.ibm.com IBM -------------------------------------------------------------------------------- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmtick at us.ibm.com Mon Nov 9 17:31:00 2020 From: jmtick at us.ibm.com (Jacob M Tick) Date: Mon, 9 Nov 2020 17:31:00 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Use_cases_for_file_audit_logging_and?= =?utf-8?q?=09clusteredwatch_folder?= In-Reply-To: <8F771847BDEB4447919D30A16FE48FAB@rone8PC> References: <8F771847BDEB4447919D30A16FE48FAB@rone8PC>, Message-ID: An HTML attachment was scrubbed... URL: From Kamil.Czauz at Squarepoint-Capital.com Wed Nov 11 22:29:31 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Wed, 11 Nov 2020 22:29:31 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Message-ID: We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Thu Nov 12 01:56:46 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 12 Nov 2020 02:56:46 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu_?= =?utf-8?q?usage_of=09mmfsd_process?= In-Reply-To: References: Message-ID: Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. 
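A minimal shell sketch of the capture-and-analysis workflow described above, assuming the default sample paths under /usr/lpp/mmfs and trace reports landing in /tmp/mmfs; the node name, mount point and inode number are placeholders, and the exact tsfindinode argument form is an assumption since the message above elides it:

NODE=client01          # affected client node (placeholder)
FSMOUNT=/gpfs/fs1      # mount point of the affected file system (placeholder)

# Start a cyclic (overwrite-mode) I/O trace on the affected client
mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N "$NODE"

# ... reproduce or observe the slowdown, then stop promptly so the
# interesting window is not overwritten by the cyclic buffer ...
mmtracectl --stop -N "$NODE"

# Stopping converts the binary trace into an ASCII report, usually /tmp/mmfs/trcrpt.*
TRCRPT=$(ls -t /tmp/mmfs/trcrpt.* 2>/dev/null | head -1)

# Summarize the trace with the sample analysis script shipped with GPFS
awk -f /usr/lpp/mmfs/samples/debugtools/trsum.awk details=0 "$TRCRPT" | less

# Build tsfindinode once, then map an inode taken from an FIO line (e.g. "tag 248415")
# back to a file path on the mounted file system
(cd /usr/lpp/mmfs/samples/util && make)
/usr/lpp/mmfs/samples/util/tsfindinode -i 248415 "$FSMOUNT"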
The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From luis.bolinches at fi.ibm.com Thu Nov 12 13:19:05 2020 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 12 Nov 2020 13:19:05 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu_?= =?utf-8?q?usage_of=09mmfsd_process?= In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From jyyum at kr.ibm.com Thu Nov 12 14:10:17 2020 From: jyyum at kr.ibm.com (Jae Yoon Yum) Date: Thu, 12 Nov 2020 14:10:17 +0000 Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163322.png Type: image/png Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163323.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: From Eric.Wendel at ibm.com Thu Nov 12 15:43:46 2020 From: Eric.Wendel at ibm.com (Eric Wendel - Eric.Wendel@ibm.com) Date: Thu, 12 Nov 2020 15:43:46 +0000 Subject: [gpfsug-discuss] Problems reading emails to the mailing list Message-ID: <31233620a4324240885aed7ad18a729a@ibm.com> Hi Folks, As you are no doubt aware, Lotus Notes and its ecosystem is virtually extinct. For those of us who have moved on to more modern email clients (including an increasing number of IBMERs like me), the email links we receive from SSUG (for example) 'OF0433B7F4.580A7B75-ON0025861E.004DD432-0025861E.004DD8A4 at notes.na.collabserv.com are useless because they can only be read if you have the Notes client installed. This is especially problematic for Linux users as the Linux client for Notes is discontinued. It would be very helpful if the SSUG could move to a modern email platform. 
Thanks, Eric Wendel eric.wendel at ibm.com -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of gpfsug-discuss-request at spectrumscale.org Sent: Thursday, November 12, 2020 8:10 AM To: gpfsug-discuss at spectrumscale.org Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 106, Issue 8 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Poor client performance with high cpu usage of mmfsd process (Luis Bolinches) 2. Question about the Clearing Spectrum Scale GUI event (Jae Yoon Yum) ---------------------------------------------------------------------- Message: 1 Date: Thu, 12 Nov 2020 13:19:05 +0000 From: "Luis Bolinches" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Thu, 12 Nov 2020 14:10:17 +0000 From: "Jae Yoon Yum" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163322.png Type: image/png Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163323.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 106, Issue 8 ********************************************** From stefan.roth at de.ibm.com Thu Nov 12 17:13:38 2020 From: stefan.roth at de.ibm.com (Stefan Roth) Date: Thu, 12 Nov 2020 18:13:38 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Question_about_the_Clearing_Spectrum_S?= =?utf-8?q?cale_GUI=09event?= In-Reply-To: References: Message-ID: Hello Jay, as long as those errors are still shown by "mmhealth node show" CLI command, they will again appear in the GUI. In the GUI events table you can show an "Event Type" column which is hidden by default. Events that have event type "Notice" can be cleared by the "Mark as Read" action. Events that have event type "State" can not be cleared by the "Mark as Read" action. They have to disappear by solving the problem. If a problem is solved the error should disappear from "mmhealth node show" and after that it will disappear from the GUI as well. 
Mit freundlichen Gr??en / Kind regards Stefan Roth Spectrum Scale Developement Phone: +49 162 4159934 IBM Deutschland Research & Development GmbH Email: stefan.roth at de.ibm.com Am Weiher 24 65451 Kelsterbach IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Jae Yoon Yum" To: gpfsug-discuss at spectrumscale.org Date: 12.11.2020 15:10 Subject: [EXTERNAL] [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? events on the Scale GUI. As you know, there is ?mark as read? for ?warning? messages but there isn?t for ?ERROR?. (In fact, the ?mark as read? button is exist but it does not work.) So I sent him to run this command on cli. /usr/lpp/mmfs/gui/cli/lshealth --reset On my test VM, all of the error messages has been cleared when I run the command?. But, for the client?s system, client said that ?All of the error / warning messages had been appeared again include the one which I had delete by clicking ?mark as read?.? Does anyone who has similar experience like this? and How Could I solve this problem? Or, Is there any way to clear the event one by one? * I sent the same message to the Slack 'scale-help' channel. Thanks. Jay. Best Regards, JaeYoon(Jay) IBM Korea, Three IFC, Yum 10 Gukjegeumyung-ro, Yeongdeungpo-gu, IBM Systems Seoul, Korea Hardware, Storage Technical Sales Mobile : +82-10-4995-4814 07326 e-mail: jyyum at kr.ibm.com ? ??? ??? ??, ??? ??, ???? ???? ???? ?????. ??? ??? ??? ??? ????, ????? ??? ?? ?????,??IBM ??? ? ?? ??? ????, ????? ???? ????. (If you don't wish to receive e-mail from sender, please send e-mail directly. For IBM e-mail, please click here). ??? ???? ??,??, ?? ??? ???????(??: 02-3781-7800,? ?? mktg at kr.ibm.com )? ?? ?? ???? ?? ???? ? ????. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E506389.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E764757.gif Type: image/gif Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E982001.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: From arc at b4restore.com Thu Nov 12 17:33:01 2020 From: arc at b4restore.com (=?utf-8?B?QW5kaSBOw7hyIENocmlzdGlhbnNlbg==?=) Date: Thu, 12 Nov 2020 17:33:01 +0000 Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event In-Reply-To: References: Message-ID: Hi Jay, First of you need to make sure your system is actually healthy. Events that are not fixed will reappear. I have had a lot of ?stale? 
entries happening over the last years and more often than not ?/usr/lpp/mmfs/gui/cli/lshealth ?reset? clears the entries if they are not actual faults.. As Stefan says if the errors/warnings are shown in ?mmhealth node show or mmhealth cluster show? they will reappear as they should. (I have sometimes seen stale entries there aswell) When I have encountered stale entries which wasn?t cleared with ?lshealth ?reset? I could clear them with ?mmsysmoncontrol restart?. I think I actually run that command maybe once or twice every month because of stale entries in the GUI og mmhealth itself.. don?t know why they happen but they seem to appear more frequently for me atleast.. I have high hopes for the 5.1.0.0/5.1.0.1 release as I have heard there should be some new things for the GUI as well.. not sure what they are yet though 😊 Hope this helps. Cheers A. Christiansen Fra: gpfsug-discuss-bounces at spectrumscale.org P? vegne af Stefan Roth Sendt: Thursday, November 12, 2020 6:14 PM Til: gpfsug main discussion list Emne: Re: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Hello Jay, as long as those errors are still shown by "mmhealth node show" CLI command, they will again appear in the GUI. In the GUI events table you can show an "Event Type" column which is hidden by default. Events that have event type "Notice" can be cleared by the "Mark as Read" action. Events that have event type "State" can not be cleared by the "Mark as Read" action. They have to disappear by solving the problem. If a problem is solved the error should disappear from "mmhealth node show" and after that it will disappear from the GUI as well. Mit freundlichen Gr??en / Kind regards Stefan Roth Spectrum Scale Developement ________________________________ Phone: +49 162 4159934 IBM Deutschland Research & Development GmbH [cid:image002.gif at 01D6B922.3FE99E70] Email: stefan.roth at de.ibm.com Am Weiher 24 65451 Kelsterbach ________________________________ IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 [cid:image003.gif at 01D6B922.3FE99E70]"Jae Yoon Yum" ---12.11.2020 15:10:35---Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? ev From: "Jae Yoon Yum" > To: gpfsug-discuss at spectrumscale.org Date: 12.11.2020 15:10 Subject: [EXTERNAL] [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? events on the Scale GUI. As you know, there is ?mark as read? for ?warning? messages but there isn?t for ?ERROR?. (In fact, the ?mark as read? button is exist but it does not work.) So I sent him to run this command on cli. /usr/lpp/mmfs/gui/cli/lshealth --reset On my test VM, all of the error messages has been cleared when I run the command?. But, for the client?s system, client said that ?All of the error / warning messages had been appeared again include the one which I had delete by clicking ?mark as read?.? Does anyone who has similar experience like this? and How Could I solve this problem? Or, Is there any way to clear the event one by one? * I sent the same message to the Slack 'scale-help' channel. Thanks. Jay. 
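A short sketch pulling together the commands mentioned by Stefan and Andi above, assuming the GUI CLI sits in /usr/lpp/mmfs/gui/cli as on a default install; run it on the GUI node, and only after the underlying problems reported by mmhealth are actually fixed:

# 1. Anything still reported here will reappear in the GUI regardless of clearing
mmhealth node show
mmhealth cluster show

# 2. If only stale entries remain, reset the GUI event list
/usr/lpp/mmfs/gui/cli/lshealth --reset

# 3. If stale entries survive the reset, restart the system health monitor
mmsysmoncontrol restart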
Best Regards, JaeYoon(Jay) Yum IBM Korea, Three IFC, [cid:image005.jpg at 01D6B922.3FE99E70] 10 Gukjegeumyung-ro, Yeongdeungpo-gu, IBM Systems Hardware, Storage Technical Sales Seoul, Korea Mobile : +82-10-4995-4814 07326 e-mail: jyyum at kr.ibm.com ? ??? ??? ??, ??? ??, ???? ???? ???? ?????. ??? ??? ??? ??? ????, ????? ??? ?? ?????,??IBM ??? ??? ??? ????, ????? ???? ????. (If you don't wish to receive e-mail from sender, please send e-mail directly. For IBM e-mail, please click here). ??? ???? ??,??, ?? ??? ???????(??: 02-3781-7800,??? mktg at kr.ibm.com )? ?? ?? ???? ?? ???? ? ????. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 1851 bytes Desc: image002.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 105 bytes Desc: image003.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image005.jpg Type: image/jpeg Size: 2457 bytes Desc: image005.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image006.png Type: image/png Size: 166 bytes Desc: image006.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image007.png Type: image/png Size: 616 bytes Desc: image007.png URL: From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 13 02:33:17 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 13 Nov 2020 02:33:17 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? 
Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 
61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 
0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Fri Nov 13 09:21:17 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 13 Nov 2020 10:21:17 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu_?= =?utf-8?q?usage=09of=09mmfsd_process?= In-Reply-To: References: Message-ID: Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? 
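A minimal sketch of the follow-up capture suggested above, combining the larger trace buffer with a periodic waiter snapshot; the 512M size comes from the message above, while the node name, sampling interval and loop length are placeholder values:

NODE=client01    # affected client (placeholder)

# Cyclic I/O trace with a larger trace file so more than a few milliseconds fit
mmtracectl --start --trace-file-size=512M --trace=io \
    --tracedev-write-mode=overwrite -N "$NODE"

# While the slowdown is visible, sample the waiters so their age and kind can be reviewed later
for i in $(seq 1 30); do date; mmdiag --waiters; sleep 2; done > /tmp/waiters.log 2>&1

mmtracectl --stop -N "$NODE"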
Thank you for your cooperation.
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From UWEFALKE at de.ibm.com Fri Nov 13 09:37:04 2020
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Fri, 13 Nov 2020 10:37:04 +0100
Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
In-Reply-To:
References:
Message-ID:

Hi Kamil,

in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange: even under high load, the trace should then capture a longer period than 10 secs, and, most of all, it should contain many more activities than just the few you had.

That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, and trsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your critical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services
IBM Data Privacy Statement
IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122

From: Uwe Falke/Germany/IBM
To: gpfsug main discussion list
Date: 13/11/2020 10:21
Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process

Hi, Kamil,

looks like your tracefile setting has been too low:

all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here
trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here

means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. Try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N .

However, if you say "no major waiter" - how many waiters did you see at any time? What kind of waiters were the oldest, and how long had they waited? It could indeed well be that some job is just creating a killer workload. The very short cycle time of the trace, OTOH, points to high activity; OTOH the trace file setting appears quite low (trace=io doesn't collect many trace infos, just basic IO stuff).

If I might ask: what version of GPFS are you running?

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr.
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? 
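(For anyone following along: a capture and summary like the one below can be reproduced roughly as follows. This is only a sketch assembled from the commands mentioned earlier in this thread - the node name "client01", the trcrpt file pattern and the mount point /gpfs/fs1 are placeholders, the inode 248415 is simply the one from the earlier FIO example, and the tsfindinode argument order "-i <inode> <mountpoint>" is an assumption since the exact arguments were elided above.)

  # start a cyclic IO trace on the affected client, reproduce the slowdown, then stop it
  mmtracectl --start --trace=io --trace-file-size=512M --tracedev-write-mode=overwrite -N client01
  mmtracectl --stop -N client01

  # summarize the converted ASCII trace report (written to /tmp/mmfs by default, per the earlier advice)
  awk -f /usr/lpp/mmfs/samples/debugtools/trsum.awk details=0 /tmp/mmfs/trcrpt.*

  # map an inode seen in an FIO record (the number after "tag") back to a file path
  cd /usr/lpp/mmfs/samples/util && make
  ./tsfindinode -i 248415 /gpfs/fs1
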
Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 
61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 
0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 13 13:31:21 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 13 Nov 2020 13:31:21 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Hi Uwe - Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s). We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? 
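(A sketch of one way to dig into what those lookups touch: pull the raw lookup records out of the converted trace report. The record names gpfs_i_lookup / gpfs_v_lookup are the ones mentioned later in this thread, and the trcrpt path assumes the default /tmp/mmfs location noted earlier.)

  # count lookup records in the converted trace, then sample a few of them
  grep -c -E 'gpfs_i_lookup|gpfs_v_lookup' /tmp/mmfs/trcrpt.*
  grep -E 'gpfs_i_lookup|gpfs_v_lookup' /tmp/mmfs/trcrpt.* | head -20
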
Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 
4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
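A minimal first-look sequence on the hanging client, using just the log file and commands named above, might be:

  # recent GPFS log entries on the affected client
  tail -n 50 /var/adm/ras/mmfs.log.latest

  # currently waiting threads - run it a few times to see whether waiters persist or churn
  mmdiag --waiters

  # recent IO history with per-IO service times and sizes
  mmdiag --iohist
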
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Fri Nov 13 13:38:48 2020 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 13 Nov 2020 13:38:48 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu?= =?utf-8?q?=09usage=09of=09mmfsd_process?= In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From kkr at lbl.gov Fri Nov 13 21:11:16 2020 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Fri, 13 Nov 2020 13:11:16 -0800 Subject: [gpfsug-discuss] REMINDER - SC20 Sessions - Monday Nov. 16 and Wednesday Nov. 18 Message-ID: <7B85E526-88D4-44AE-B034-4EC5A61E524C@lbl.gov> Hi all, A Reminder to attend and also submit any panel questions for the Wednesday session. So far, there are 3 questions around these topics: 1) excessive prefetch when reading small fractions of many large files 2) improved the integration between TSM and GPFS 3) number of security vulnerabilities in GPFS, the GUI, ESS, or something else related Bring on your tough questions and make it interesting. 
Cheers,
Kristy

--- original email ---

The Spectrum Scale User Group will be hosting two 90 minute sessions at SC20 this year and we hope you can join us. The first one is: "Storage for AI" and will be held Monday, Nov. 16th, from 11:00-12:30 EST and the second one is "What's new in Spectrum Scale 5.1?" and will be held Wednesday, Nov. 18th from 11:00-12:30 EST.

Please see the calendar at https://www.spectrumscaleug.org/eventslist/2020-11/ and register by clicking on a session on the calendar and then the "Please register here to join the session" link.

Best,
Kristy

Kristy Kallback-Rose
Senior HPC Storage Systems Analyst
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory

From UWEFALKE at de.ibm.com Mon Nov 16 13:45:57 2020
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Mon, 16 Nov 2020 14:45:57 +0100
Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
In-Reply-To:
References:
Message-ID:

Hi,

while the other nodes can well block the local one, as Frederick suggests, there should at least be something visible locally waiting for these other nodes. Looking at all waiters might be a good thing, but this case looks strange in other ways. Mind the statement that there are almost no local waiters and none of them gets older than 10 ms. I am no developer nor do I have the code, so don't expect too much.

Can you tell what lookups you see (check in the trcrpt file, could be like gpfs_i_lookup or gpfs_v_lookup)? Lookups are metadata ops; do you have a separate pool for your metadata? How is that pool set up (down to the physical block devices)? Your trsum output further down revealed 36 lookups, each one on avg taking >30ms. That is a lot (albeit the respective waiters won't show up at first glance as suspicious ...). So, which waiters did you see (hope you saved them, if not, do it next time).

What are the node you see this on and the whole cluster used for? What is the MaxFilesToCache setting (for that node and for others)? What HW is that, how big are your nodes (memory, CPU)? To check the unreasonably short trace capture time: how large are the trcrpt files you obtain?

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services
IBM Data Privacy Statement
IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122

From: "Czauz, Kamil"
To: gpfsug main discussion list
Date: 13/11/2020 14:33
Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi Uwe -

Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s).
We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 
Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. 
That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, and the trsum.awk output should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your critical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Grüßen / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, it looks like your trace file size setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to be. Try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? What kind of waiters were the oldest, and how long had they waited? It could indeed well be that some job is just creating a killer workload. The very short cycle time of the trace points, OTOH, to high activity; OTOH the trace file size setting appears quite low (trace=io doesn't collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Grüßen / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart.
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
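As a rough way to condense that waiter list into counts and ages, something along the following lines can be used. It is a sketch only, not an IBM-provided tool; the exact "mmdiag --waiters" line format differs between Scale releases, so the patterns here are assumptions to adapt:

#!/usr/bin/env python3
# Rough sketch: summarise "mmdiag --waiters" by age and reason.
# Assumes lines containing "waiting <seconds> sec" and a quoted reason text;
# the format varies between Scale releases, so adjust the regular expressions.
import re
import subprocess
from collections import Counter

out = subprocess.run(["/usr/lpp/mmfs/bin/mmdiag", "--waiters"],
                     capture_output=True, text=True, check=True).stdout

ages = []            # seconds each waiter has been waiting
reasons = Counter()  # number of waiters per reason text
for line in out.splitlines():
    m = re.search(r"waiting\s+([0-9.]+)\s*sec", line, re.IGNORECASE)
    if not m:
        continue
    ages.append(float(m.group(1)))
    r = re.search(r"reason\s+'([^']+)'", line)
    reasons[r.group(1) if r else "unspecified"] += 1

print(f"{len(ages)} waiters, oldest {max(ages, default=0.0):.3f}s")
for reason, count in reasons.most_common(5):
    print(f"{count:5d}  {reason}")

Run on the affected client while it is slow, that answers the "how many waiters, how old, waiting on what" questions quickly.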
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
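To turn the trace workflow described in the replies above (mmtracectl, the FIO records in the converted trcrpt file, tsfindinode) into something closer to the iotop-style view asked about at the start of the thread, a small post-processing sketch like the following can rank inodes by I/O activity. It is an illustration only; the field positions and the 512-byte sector size are assumptions taken from the example FIO line quoted earlier, not a documented format:

#!/usr/bin/env python3
# Rough sketch (not an IBM tool): rank inodes by I/O activity in a converted
# ASCII trace report (trcrpt.* in /tmp/mmfs). Assumes FIO lines of the form
# shown above, where the inode follows the "tag" keyword and the transfer
# size in sectors follows "nSectors"; 512-byte sectors are assumed.
import sys
from collections import Counter

ios = Counter()       # inode -> number of FIO records
sectors = Counter()   # inode -> sectors transferred

with open(sys.argv[1], errors="replace") as trc:
    for line in trc:
        if "FIO:" not in line:
            continue
        fields = line.split()
        if "tag" not in fields:
            continue
        inode = fields[fields.index("tag") + 1]
        ios[inode] += 1
        if "nSectors" in fields:
            sectors[inode] += int(fields[fields.index("nSectors") + 1])

for inode, count in ios.most_common(20):
    print(f"inode {inode}: {count} IOs, ~{sectors[inode] * 512 // 1024} KiB")

The busiest inodes can then be mapped back to paths with tsfindinode, built and run as described in the earlier reply, and from there it is usually clear which job or process to look at.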
From andi at christiansen.xxx Mon Nov 16 19:44:14 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Mon, 16 Nov 2020 20:44:14 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Message-ID: <1388247256.209171.1605555854969@privateemail.com> Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome!
Thanks in advance Andi Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Mon Nov 16 21:44:30 2020 From: stockf at us.ibm.com (Frederick Stock) Date: Mon, 16 Nov 2020 21:44:30 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: An HTML attachment was scrubbed... URL: From skylar2 at uw.edu Mon Nov 16 21:58:19 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Mon, 16 Nov 2020 13:58:19 -0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <20201116215819.wda6nophekamzs3v@thargelion> When we did a similar (though larger, at ~2.5PB) migration, we used rsync as well, but ran one rsync process per Isilon node, and made sure the NFS clients were hitting separate Isilon nodes for their reads. We also didn't have more than one rsync process running per client, as the Linux NFS client (at least in CentOS 6) was terrible when it came to concurrent access. Whatever method you end up using, I can guarantee you will be much happier once you are on GPFS. :) On Mon, Nov 16, 2020 at 08:44:14PM +0100, Andi Christiansen wrote: > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... > > as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From jonathan.buzzard at strath.ac.uk Mon Nov 16 22:58:49 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 16 Nov 2020 22:58:49 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <4de1fa02-a074-0901-cf12-31be9e843f5f@strath.ac.uk> On 16/11/2020 19:44, Andi Christiansen wrote: > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS > from two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching > from 2.. > > does anyone know of a way to speed it up? 
right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit > from scale nodes and 20Gbits from isilon so we should be able to reach > just under 20Gbit... > > > if anyone have any ideas they are welcome! > My biggest recommendation when doing this is to use a sqlite database to keep track of what is going on. The main issue is that you are almost certainly going to need to do more than one rsync pass unless your source Isilon system has no user activity, and with 700TB to move that seems unlikely. Typically you do an initial rsync to move the bulk of the data while the users are still live, then shut down user access to the source system and do the final rsync which hopefully has a significantly smaller amount of data to actually move. So this is what I have done on a number of occasions now. I create a very simple sqlite DB with a list of source and destination folders and a status code. Initially the status code is set to -1. Then I have a perl script which looks at the sqlite DB, picks a row with a status code of -1, and sets the status code to -2, aka that directory is in progress. It then proceeds to run the rsync and when it finishes it updates the status code to the exit code of the rsync process. As long as all the rsync processes have access to the same copy of the sqlite DB (simplest to put it on either the source or destination file system) then all is good. You can fire off multiple rsyncs on multiple nodes and they will all keep churning away till there is no more work to be done. The advantage is you can easily interrogate the DB to find out the state of play. That is, how many of your transfers have completed, how many are yet to be done, which ones are currently being transferred etc. without logging onto multiple nodes. *MOST* importantly you can see if any of the rsyncs had an error, by simply looking for status codes greater than zero. I cannot stress how important this is. Note that if the source is still active you will see errors down to files being deleted on the source file system before rsync has a chance to copy them. However this has a specific exit code (24) so is easy to spot and not worry about. Finally it is also very simple to set the status codes to -1 again and set the process away again. So the final run is easier to do. If you want to mail me off list I can dig out a copy of the perl code I used if you're interested. There are several versions as I have tended to tailor it to each transfer. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
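A minimal sketch of such a worker, for illustration only - this is not the perl script referred to above, and the table layout, paths and rsync options are assumptions to adapt:

#!/usr/bin/env python3
# Sketch of the sqlite-tracked rsync scheme described above (illustrative only,
# not the original perl). Assumed table, created once:
#   CREATE TABLE transfers(id INTEGER PRIMARY KEY, dir TEXT, status INTEGER DEFAULT -1);
# status -1 = pending, -2 = in progress, anything else = rsync exit code
# (24 just means files vanished on the live source during the run).
import sqlite3
import subprocess

DB  = "/gpfs/target/migrate.db"   # assumed: DB file on the destination file system
SRC = "isilon-nfs:/ifs/data"      # assumed source prefix
DST = "/gpfs/target/data"         # assumed destination prefix

conn = sqlite3.connect(DB, timeout=300, isolation_level=None)   # autocommit mode

def claim():
    """Atomically pick one pending directory and mark it in progress."""
    conn.execute("BEGIN IMMEDIATE")   # take the write lock before selecting
    row = conn.execute(
        "SELECT id, dir FROM transfers WHERE status = -1 LIMIT 1").fetchone()
    if row is None:
        conn.execute("COMMIT")
        return None
    conn.execute("UPDATE transfers SET status = -2 WHERE id = ?", (row[0],))
    conn.execute("COMMIT")
    return row

while True:
    claimed = claim()
    if claimed is None:
        break                         # no pending directories left
    rowid, d = claimed
    # -aAX --numeric-ids keeps owners, POSIX ACLs and xattrs without uid remapping
    rc = subprocess.run(["rsync", "-aAX", "--numeric-ids",
                         f"{SRC}/{d}/", f"{DST}/{d}/"]).returncode
    conn.execute("UPDATE transfers SET status = ? WHERE id = ?", (rc, rowid))

Several of these can run on each node against the same DB file; SELECT status, COUNT(*) FROM transfers GROUP BY status; shows the state of play, any status greater than zero flags a failed directory, and resetting statuses to -1 queues everything up again for the next pass.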
From jonathan.buzzard at strath.ac.uk Mon Nov 16 23:12:47 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 16 Nov 2020 23:12:47 +0000 Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <20201116215819.wda6nophekamzs3v@thargelion> References: <1388247256.209171.1605555854969@privateemail.com> <20201116215819.wda6nophekamzs3v@thargelion> Message-ID: <8d4d2987-77dd-e3e1-1c98-a635f1b96ddd@strath.ac.uk> On 16/11/2020 21:58, Skylar Thompson wrote: > When we did a similar (though larger, at ~2.5PB) migration, we used rsync > as well, but ran one rsync process per Isilon node, and made sure the NFS > clients were hitting separate Isilon nodes for their reads.
We also didn't > have more than one rsync process running per client, as the Linux NFS > client (at least in CentOS 6) was terrible when it came to concurrent access. > The million dollar question IMHO is the number of files and their sizes. Basically if you have a million 1KB files to move it is going to take much longer than a 100 1GB files. That is the overhead of dealing with each file is a real bitch and kills your attainable transfer speed stone dead. One option I have used in the past is to use your last backup and restore to the new system, then rsync in the changes. That way you don't impact the source file system which is live. Another option I have used is to inform users in advance that data will be transferred based on a metric of how many files and how much data they have. So the less data and fewer files the quicker you will get access to the new system once access to the old system is turned off. It is amazing how much users clear up junk under this scenario. Last time I did this a single user went from over 17 million files to 11 thousand! In total many many TB of data just vanished from the system (around half of the data when puff) as users actually got around to some house keeping LOL. Moving less data and files is always less painful. > Whatever method you end up using, I can guarantee you will be much happier > once you are on GPFS. :) > Goes without saying :-) JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From UWEFALKE at de.ibm.com Tue Nov 17 08:50:56 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 17 Nov 2020 09:50:56 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Andi, what about leaving NFS completeley out and using rsync (multiple rsyncs in parallel, of course) directly between your source and target servers? I am not sure how many TCP connections (suppose it is NFS4) in parallel are opened between client and server, using a 2x bonded interface well requires at least two. That combined with the DB approach suggested by Jonathan to control the activity of the rsync streams would be my best guess. If you have many small files, the overhead might still kill you. Tarring them up into larger aggregates for transfer would help a lot, but then you must be sure they won't change or you need to implement your own version control for that class of files. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Andi Christiansen To: "gpfsug-discuss at spectrumscale.org" Date: 16/11/2020 20:44 Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... 
as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Tue Nov 17 08:57:07 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 17 Nov 2020 09:57:07 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but it is over two nodes, so no bonding. But still, I'd expect to open several TCP connections in parallel per source-target pair (like with several rsyncs per source node) would bear an advantage (and still I thing NFS doesn't do that, but I can be wrong). If more nodes have access to the Isilon data they could also participate (and don't need NFS exports for that). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 17/11/2020 09:50 Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Hi Andi, what about leaving NFS completeley out and using rsync (multiple rsyncs in parallel, of course) directly between your source and target servers? I am not sure how many TCP connections (suppose it is NFS4) in parallel are opened between client and server, using a 2x bonded interface well requires at least two. That combined with the DB approach suggested by Jonathan to control the activity of the rsync streams would be my best guess. If you have many small files, the overhead might still kill you. Tarring them up into larger aggregates for transfer would help a lot, but then you must be sure they won't change or you need to implement your own version control for that class of files. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Andi Christiansen To: "gpfsug-discuss at spectrumscale.org" Date: 16/11/2020 20:44 Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From andi at christiansen.xxx Tue Nov 17 11:51:58 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 12:51:58 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <616234716.258600.1605613918767@privateemail.com> Hi all, thanks for all the information, there was some interesting things amount it.. I kept on going with rsync and ended up making a file with all top level user directories and splitting them into chunks of 347 per rsync session(total 42000 ish folders). yesterday we had only 14 sessions with 3000 folders in each and that was too much work for one rsync session.. i divided them out among all GPFS nodes to have them fetch an area each and actually doing that 3 times on each node and that has now boosted the bandwidth usage from 3Gbit to around 16Gbit in total.. all nodes have been seing doing work above 7Gbit individual which is actually near to what i was expecting without any modifications to the NFS server or TCP tuning.. CPU is around 30-50% on each server and mostly below or around 30% so it seems like it could have handled abit more sessions.. Small files are really a killer but with all 96+ sessions we have now its not often all sessions are handling small files at the same time so we have an average of about 10-12Gbit bandwidth usage. Thanks all! ill keep you in mind if for some reason we see it slowing down again but for now i think we will try to see if it will go the last mile with a bit more sessions on each :) Best Regards Andi Christiansen > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but > it is over two nodes, so no bonding. But still, I'd expect to open several > TCP connections in parallel per source-target pair (like with several > rsyncs per source node) would bear an advantage (and still I thing NFS > doesn't do that, but I can be wrong). > If more nodes have access to the Isilon data they could also participate > (and don't need NFS exports for that). > > Mit freundlichen Gr??en / Kind regards > > Dr. Uwe Falke > IT Specialist > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > Services > +49 175 575 2877 Mobile > Rathausstr. 
7, 09111 Chemnitz, Germany > uwefalke at de.ibm.com > > IBM Services > > IBM Data Privacy Statement > > IBM Deutschland Business & Technology Services GmbH > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > Sitz der Gesellschaft: Ehningen > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > From: Uwe Falke/Germany/IBM > To: gpfsug main discussion list > Date: 17/11/2020 09:50 > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > from Isilon to Scale over NFS? > > > Hi Andi, > > what about leaving NFS completeley out and using rsync (multiple rsyncs > in parallel, of course) directly between your source and target servers? > I am not sure how many TCP connections (suppose it is NFS4) in parallel > are opened between client and server, using a 2x bonded interface well > requires at least two. That combined with the DB approach suggested by > Jonathan to control the activity of the rsync streams would be my best > guess. > If you have many small files, the overhead might still kill you. Tarring > them up into larger aggregates for transfer would help a lot, but then you > must be sure they won't change or you need to implement your own version > control for that class of files. > > Mit freundlichen Gr??en / Kind regards > > Dr. Uwe Falke > IT Specialist > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > Services > +49 175 575 2877 Mobile > Rathausstr. 7, 09111 Chemnitz, Germany > uwefalke at de.ibm.com > > IBM Services > > IBM Data Privacy Statement > > IBM Deutschland Business & Technology Services GmbH > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > Sitz der Gesellschaft: Ehningen > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > From: Andi Christiansen > To: "gpfsug-discuss at spectrumscale.org" > > Date: 16/11/2020 20:44 > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > Isilon to Scale over NFS? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS from > two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching from > 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > scale nodes and 20Gbits from isilon so we should be able to reach just > under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Tue Nov 17 12:07:30 2020 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 17 Nov 2020 13:07:30 +0100 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <616234716.258600.1605613918767@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: Nice to see it working well! But, what about ACLs? 
Does you rsync pull in all needed metadata, or do you also need to sync ACLs ? Any plans for how to solve that ? On Tue, Nov 17, 2020 at 12:52 PM Andi Christiansen wrote: > Hi all, > > thanks for all the information, there was some interesting things amount > it.. > > I kept on going with rsync and ended up making a file with all top level > user directories and splitting them into chunks of 347 per rsync > session(total 42000 ish folders). yesterday we had only 14 sessions with > 3000 folders in each and that was too much work for one rsync session.. > > i divided them out among all GPFS nodes to have them fetch an area each > and actually doing that 3 times on each node and that has now boosted the > bandwidth usage from 3Gbit to around 16Gbit in total.. > > all nodes have been seing doing work above 7Gbit individual which is > actually near to what i was expecting without any modifications to the NFS > server or TCP tuning.. > > CPU is around 30-50% on each server and mostly below or around 30% so it > seems like it could have handled abit more sessions.. > > Small files are really a killer but with all 96+ sessions we have now its > not often all sessions are handling small files at the same time so we have > an average of about 10-12Gbit bandwidth usage. > > Thanks all! ill keep you in mind if for some reason we see it slowing down > again but for now i think we will try to see if it will go the last mile > with a bit more sessions on each :) > > Best Regards > Andi Christiansen > > > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, > but > > it is over two nodes, so no bonding. But still, I'd expect to open > several > > TCP connections in parallel per source-target pair (like with several > > rsyncs per source node) would bear an advantage (and still I thing NFS > > doesn't do that, but I can be wrong). > > If more nodes have access to the Isilon data they could also participate > > (and don't need NFS exports for that). > > > > Mit freundlichen Gr??en / Kind regards > > > > Dr. Uwe Falke > > IT Specialist > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > Services > > +49 175 575 2877 Mobile > > Rathausstr. 7, 09111 Chemnitz, Germany > > uwefalke at de.ibm.com > > > > IBM Services > > > > IBM Data Privacy Statement > > > > IBM Deutschland Business & Technology Services GmbH > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > Sitz der Gesellschaft: Ehningen > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > From: Uwe Falke/Germany/IBM > > To: gpfsug main discussion list > > Date: 17/11/2020 09:50 > > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > > from Isilon to Scale over NFS? > > > > > > Hi Andi, > > > > what about leaving NFS completeley out and using rsync (multiple rsyncs > > in parallel, of course) directly between your source and target servers? > > I am not sure how many TCP connections (suppose it is NFS4) in parallel > > are opened between client and server, using a 2x bonded interface well > > requires at least two. That combined with the DB approach suggested by > > Jonathan to control the activity of the rsync streams would be my best > > guess. > > If you have many small files, the overhead might still kill you. Tarring > > them up into larger aggregates for transfer would help a lot, but then > you > > must be sure they won't change or you need to implement your own version > > control for that class of files. 
> > > > Mit freundlichen Gr??en / Kind regards > > > > Dr. Uwe Falke > > IT Specialist > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > Services > > +49 175 575 2877 Mobile > > Rathausstr. 7, 09111 Chemnitz, Germany > > uwefalke at de.ibm.com > > > > IBM Services > > > > IBM Data Privacy Statement > > > > IBM Deutschland Business & Technology Services GmbH > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > Sitz der Gesellschaft: Ehningen > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > From: Andi Christiansen > > To: "gpfsug-discuss at spectrumscale.org" > > > > Date: 16/11/2020 20:44 > > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > > Isilon to Scale over NFS? > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > Hi all, > > > > i have got a case where a customer wants 700TB migrated from isilon to > > Scale and the only way for him is exporting the same directory on NFS > from > > two different nodes... > > > > as of now we are using multiple rsync processes on different parts of > > folders within the main directory. this is really slow and will take > > forever.. right now 14 rsync processes spread across 3 nodes fetching > from > > 2.. > > > > does anyone know of a way to speed it up? right now we see from 1Gbit to > > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit > from > > scale nodes and 20Gbits from isilon so we should be able to reach just > > under 20Gbit... > > > > > > if anyone have any ideas they are welcome! > > > > > > Thanks in advance > > Andi Christiansen _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andi at christiansen.xxx Tue Nov 17 12:24:22 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 13:24:22 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: <1023406427.259407.1605615862969@privateemail.com> Hi Jan, We are syncing ACLs, groups, owners and timestamps aswell :) /Andi Christiansen > On 11/17/2020 1:07 PM Jan-Frode Myklebust wrote: > > > Nice to see it working well! > > But, what about ACLs? Does you rsync pull in all needed metadata, or do you also need to sync ACLs ? Any plans for how to solve that ? > > On Tue, Nov 17, 2020 at 12:52 PM Andi Christiansen wrote: > > > > Hi all, > > > > thanks for all the information, there was some interesting things amount it.. > > > > I kept on going with rsync and ended up making a file with all top level user directories and splitting them into chunks of 347 per rsync session(total 42000 ish folders). yesterday we had only 14 sessions with 3000 folders in each and that was too much work for one rsync session.. 
> > > > i divided them out among all GPFS nodes to have them fetch an area each and actually doing that 3 times on each node and that has now boosted the bandwidth usage from 3Gbit to around 16Gbit in total.. > > > > all nodes have been seing doing work above 7Gbit individual which is actually near to what i was expecting without any modifications to the NFS server or TCP tuning.. > > > > CPU is around 30-50% on each server and mostly below or around 30% so it seems like it could have handled abit more sessions.. > > > > Small files are really a killer but with all 96+ sessions we have now its not often all sessions are handling small files at the same time so we have an average of about 10-12Gbit bandwidth usage. > > > > Thanks all! ill keep you in mind if for some reason we see it slowing down again but for now i think we will try to see if it will go the last mile with a bit more sessions on each :) > > > > Best Regards > > Andi Christiansen > > > > > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > > > > > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but > > > it is over two nodes, so no bonding. But still, I'd expect to open several > > > TCP connections in parallel per source-target pair (like with several > > > rsyncs per source node) would bear an advantage (and still I thing NFS > > > doesn't do that, but I can be wrong). > > > If more nodes have access to the Isilon data they could also participate > > > (and don't need NFS exports for that). > > > > > > Mit freundlichen Gr??en / Kind regards > > > > > > Dr. Uwe Falke > > > IT Specialist > > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > > Services > > > +49 175 575 2877 Mobile > > > Rathausstr. 7, 09111 Chemnitz, Germany > > > uwefalke at de.ibm.com mailto:uwefalke at de.ibm.com > > > > > > IBM Services > > > > > > IBM Data Privacy Statement > > > > > > IBM Deutschland Business & Technology Services GmbH > > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > > Sitz der Gesellschaft: Ehningen > > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > > > From: Uwe Falke/Germany/IBM > > > To: gpfsug main discussion list > > > Date: 17/11/2020 09:50 > > > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > > > from Isilon to Scale over NFS? > > > > > > > > > Hi Andi, > > > > > > what about leaving NFS completeley out and using rsync (multiple rsyncs > > > in parallel, of course) directly between your source and target servers? > > > I am not sure how many TCP connections (suppose it is NFS4) in parallel > > > are opened between client and server, using a 2x bonded interface well > > > requires at least two. That combined with the DB approach suggested by > > > Jonathan to control the activity of the rsync streams would be my best > > > guess. > > > If you have many small files, the overhead might still kill you. Tarring > > > them up into larger aggregates for transfer would help a lot, but then you > > > must be sure they won't change or you need to implement your own version > > > control for that class of files. > > > > > > Mit freundlichen Gr??en / Kind regards > > > > > > Dr. Uwe Falke > > > IT Specialist > > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > > Services > > > +49 175 575 2877 Mobile > > > Rathausstr. 
7, 09111 Chemnitz, Germany > > > uwefalke at de.ibm.com mailto:uwefalke at de.ibm.com > > > > > > IBM Services > > > > > > IBM Data Privacy Statement > > > > > > IBM Deutschland Business & Technology Services GmbH > > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > > Sitz der Gesellschaft: Ehningen > > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > > > > > > From: Andi Christiansen > > > To: "gpfsug-discuss at spectrumscale.org mailto:gpfsug-discuss at spectrumscale.org " > > > > > > Date: 16/11/2020 20:44 > > > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > > > Isilon to Scale over NFS? > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org mailto:gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > > > > > Hi all, > > > > > > i have got a case where a customer wants 700TB migrated from isilon to > > > Scale and the only way for him is exporting the same directory on NFS from > > > two different nodes... > > > > > > as of now we are using multiple rsync processes on different parts of > > > folders within the main directory. this is really slow and will take > > > forever.. right now 14 rsync processes spread across 3 nodes fetching from > > > 2.. > > > > > > does anyone know of a way to speed it up? right now we see from 1Gbit to > > > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > > > scale nodes and 20Gbits from isilon so we should be able to reach just > > > under 20Gbit... > > > > > > > > > if anyone have any ideas they are welcome! > > > > > > > > > Thanks in advance > > > Andi Christiansen _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss athttp://spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss athttp://spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss athttp://spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Tue Nov 17 13:53:43 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 17 Nov 2020 13:53:43 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <616234716.258600.1605613918767@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> On 17/11/2020 11:51, Andi Christiansen wrote: > Hi all, > > thanks for all the information, there was some interesting things > amount it.. > > I kept on going with rsync and ended up making a file with all top > level user directories and splitting them into chunks of 347 per > rsync session(total 42000 ish folders). yesterday we had only 14 > sessions with 3000 folders in each and that was too much work for one > rsync session.. 
Unless you use something similar to my DB suggestion it is almost inevitable that some of those rsync sessions are going to have issues and you will have no way to track it or even know it has happened unless you do a single final giant catchup/check rsync. I should add that a copy of the sqlite DB is cover your backside protection when a user pops up claiming that you failed to transfer one of their vitally important files six months down the line and the old system is turned off and scrapped. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From skylar2 at uw.edu Tue Nov 17 14:59:43 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Tue, 17 Nov 2020 06:59:43 -0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> Message-ID: <20201117145943.5cxyfpfyrk7udmn4@thargelion> On Tue, Nov 17, 2020 at 01:53:43PM +0000, Jonathan Buzzard wrote: > On 17/11/2020 11:51, Andi Christiansen wrote: > > Hi all, > > > > thanks for all the information, there was some interesting things > > amount it.. > > > > I kept on going with rsync and ended up making a file with all top > > level user directories and splitting them into chunks of 347 per > > rsync session(total 42000 ish folders). yesterday we had only 14 > > sessions with 3000 folders in each and that was too much work for one > > rsync session.. > > Unless you use something similar to my DB suggestion it is almost inevitable > that some of those rsync sessions are going to have issues and you will have > no way to track it or even know it has happened unless you do a single final > giant catchup/check rsync. > > I should add that a copy of the sqlite DB is cover your backside protection > when a user pops up claiming that you failed to transfer one of their > vitally important files six months down the line and the old system is > turned off and scrapped. That's not a bad idea, and I like it more than the method I setup where we captured the output of find from both sides of the transfer and preserved it for posterity, but obviously did require a hard-stop date on the source. Fortunately, we seem committed to GPFS so it might be we never have to do another bulk transfer outside of the filesystem... -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From S.J.Thompson at bham.ac.uk Tue Nov 17 15:55:41 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 17 Nov 2020 15:55:41 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <20201117145943.5cxyfpfyrk7udmn4@thargelion> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> <20201117145943.5cxyfpfyrk7udmn4@thargelion> Message-ID: <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> > Fortunately, we seem committed to GPFS so it might be we never have to do > another bulk transfer outside of the filesystem... Until you want to move a v3 or v4 created file-system to v5 block sizes __ I hopes we won't be doing that sort of thing again... 
Simon From jonathan.buzzard at strath.ac.uk Tue Nov 17 19:45:29 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 17 Nov 2020 19:45:29 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> <20201117145943.5cxyfpfyrk7udmn4@thargelion> <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> Message-ID: <1a1be12b-a4f2-f2b3-4cdf-e34bc5eace24@strath.ac.uk> On 17/11/2020 15:55, Simon Thompson wrote: > >> Fortunately, we seem committed to GPFS so it might be we never have to do >> another bulk transfer outside of the filesystem... > > Until you want to move a v3 or v4 created file-system to v5 block sizes __ You forget the v2 to v3 for more than two billion files switch. Either that or you where not using it back then. Then there was the v3.2 if you ever want to mount it on Windows. > > I hopes we won't be doing that sort of thing again... > Yep, going to be recycling my scripts in the coming week for a v4 to v5 with capacity upgrade on our DSS-G. That basically involves a trashing of the file system and a restore from backup. Going to be doing the your data will be restored based on a metric of how many files and how much data you have ploy again :-) I too hope that will be the last time I have to do anything similar but my experience of the last couple of decades says that is likely to be a forlorn hope :-( I speculate that one day the 10,000 file set limit will be lifted, but only if you reformat your file system... JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From andi at christiansen.xxx Tue Nov 17 20:40:39 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 21:40:39 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> Message-ID: <82434297.276248.1605645639435@privateemail.com> Hi Jonathan, yes you are correct! but we plan to resync this once or twice every week for the next 3-4months to be sure everything is as it should be. Right now we are focused on getting them synced up and then we will run scheduled resyncs/checks once or twice a week depending on the data growth :) Thanks Andi Christiansen > On 11/17/2020 2:53 PM Jonathan Buzzard wrote: > > > On 17/11/2020 11:51, Andi Christiansen wrote: > > Hi all, > > > > thanks for all the information, there was some interesting things > > amount it.. > > > > I kept on going with rsync and ended up making a file with all top > > level user directories and splitting them into chunks of 347 per > > rsync session(total 42000 ish folders). yesterday we had only 14 > > sessions with 3000 folders in each and that was too much work for one > > rsync session.. > > Unless you use something similar to my DB suggestion it is almost > inevitable that some of those rsync sessions are going to have issues > and you will have no way to track it or even know it has happened unless > you do a single final giant catchup/check rsync. 
> > I should add that a copy of the sqlite DB is cover your backside > protection when a user pops up claiming that you failed to transfer one > of their vitally important files six months down the line and the old > system is turned off and scrapped. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From chris.schlipalius at pawsey.org.au Tue Nov 17 23:17:18 2020 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Wed, 18 Nov 2020 07:17:18 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Message-ID: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> So at my last job we used to rsync data between isilons across campus, and isilon to Windows File Cluster (and back). I recommend using dry run to generate a list of files and then use this to run with rysnc. This allows you also to be able to break up the transfer into batches, and check if files have changed before sync (say if your isilon files are not RO. Also ensure you have a recent version of rsync that preserves extended attributes and check your ACLS. A dry run example: https://unix.stackexchange.com/a/261372 I always felt more comfortable having a list of files before a sync?. Regards, Chris Schlipalius Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey Supercomputing Centre (CSIRO) 1 Bryce Avenue Kensington WA 6151 Australia Tel +61 8 6436 8815 Email chris.schlipalius at pawsey.org.au Web www.pawsey.org.au -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Nov 18 11:48:52 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 18 Nov 2020 11:48:52 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> Message-ID: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> On 17/11/2020 23:17, Chris Schlipalius wrote: > So at my last job we used to rsync data between isilons across campus, > and isilon to Windows File Cluster (and back). > > I recommend using dry run to generate a list of files and then use this > to run with rysnc. > > This allows you also to be able to break up the transfer into batches, > and check if files have changed before sync (say if your isilon files > are not RO. > > Also ensure you have a recent version of rsync that preserves extended > attributes and check your ACLS. > > A dry run example: > > https://unix.stackexchange.com/a/261372 > > I always felt more comfortable having a list of files before a sync?. > I would counsel in the strongest possible terms against that approach. Basically you have to be assured that none of your file names have "wacky" characters in them, because handling "wacky" characters in file names is exceedingly difficult. I cannot stress how hard it is and the above example does not handle all "wacky" characters in file names. So what do I mean by "wacky" characters. 
Well remember a file name can have just about anything in it on Linux with the exception of '/', and users especially when using a GUI, and even more so if they are Mac users can and do use what I will call "wacky" characters in their file names. The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. Then there is the use of the wildcard characters, especially '?' but also '*'. Not too difficult to handle you might say. Right now deal with a file name with a newline character in it :-) Don't ask me how or why you even do that but let me assure you that I have seen them on more than one occasion. And now your dry run list is broken... Not only that if you have a few hundred million files to move a list just becomes unwieldy anyway. One thing I didn't mention is that I would run anything with in a screen (or tmux if that is your poison) and turn on logging. For those interested I am in the process of cleaning up the script a bit and will post it somewhere in due course. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From andi at christiansen.xxx Wed Nov 18 11:54:47 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Wed, 18 Nov 2020 12:54:47 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> Message-ID: <1947408989.293430.1605700487095@privateemail.com> Hi Jonathan, i would be very interested in seeing your scripts when they are posted. Let me know where to get them! Thanks a bunch! Andi Christiansen > On 11/18/2020 12:48 PM Jonathan Buzzard wrote: > > > On 17/11/2020 23:17, Chris Schlipalius wrote: > > So at my last job we used to rsync data between isilons across campus, > > and isilon to Windows File Cluster (and back). > > > > I recommend using dry run to generate a list of files and then use this > > to run with rysnc. > > > > This allows you also to be able to break up the transfer into batches, > > and check if files have changed before sync (say if your isilon files > > are not RO. > > > > Also ensure you have a recent version of rsync that preserves extended > > attributes and check your ACLS. > > > > A dry run example: > > > > https://unix.stackexchange.com/a/261372 > > > > I always felt more comfortable having a list of files before a sync?. > > > > I would counsel in the strongest possible terms against that approach. > > Basically you have to be assured that none of your file names have > "wacky" characters in them, because handling "wacky" characters in file > names is exceedingly difficult. I cannot stress how hard it is and the > above example does not handle all "wacky" characters in file names. > > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and > users especially when using a GUI, and even more so if they are Mac > users can and do use what I will call "wacky" characters in their file > names. > > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. > > Not too difficult to handle you might say. 
Right now deal with a file > name with a newline character in it :-) Don't ask me how or why you even > do that but let me assure you that I have seen them on more than one > occasion. And now your dry run list is broken... > > Not only that if you have a few hundred million files to move a list > just becomes unwieldy anyway. > > One thing I didn't mention is that I would run anything with in a screen > (or tmux if that is your poison) and turn on logging. > > For those interested I am in the process of cleaning up the script a bit > and will post it somewhere in due course. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From cal.sawyer at framestore.com Wed Nov 18 12:18:57 2020 From: cal.sawyer at framestore.com (Cal Sawyer) Date: Wed, 18 Nov 2020 12:18:57 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 106, Issue 21 In-Reply-To: References: Message-ID: Hello Not a Scale user per se (we run a 3rdparty offshoot of Scale). In a past life managing Nexenta with OpenSolaris DR storage, I used nc/netcat for bulk data sync, which is far more efficient than rsync. With a bit of planning and analysis of directory structure on the target, nc runs could be parallelised as well, although not quite in the same way as running rsync via parallels. Of course, nc has to be available on Isilon but i have no experience with that platform. The only caveat in using nc is the amount of change to the target data as copying progresses (is the target datastore static or still seeing changes?). nc has to be followed with rsync to apply any changes and/or verify the integrity of the bulk copy. https://nakkaya.com/2009/04/15/using-netcat-for-file-transfers/ Are your Isilon and Scale systems located in the same network space? I'd also suggest that if possible, add a quad-port 10GbE (or larger: 25/100GbE) NIC to your servers to gain a wider data path and conduct your copy operations on those interfaces regards [image: Framestore] Cal Sawyer ? Senior Systems Engineer London ? New York ? Los Angeles ? Chicago ? Montr?al ? Mumbai 28 Chancery Lane London WC2A 1LB [T] +44 (0)20 7344 8000 W3W: warm.soil.patio On Wed, 18 Nov 2020 at 12:00, wrote: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Chris Schlipalius) > 2. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Jonathan Buzzard) > 3. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Andi Christiansen) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 18 Nov 2020 07:17:18 +0800 > From: Chris Schlipalius > To: > Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to > Scale over NFS? 
> Message-ID: <578BE691-DEE4-43AC-97D2-546AC406E14A at pawsey.org.au> > Content-Type: text/plain; charset="utf-8" > > So at my last job we used to rsync data between isilons across campus, and > isilon to Windows File Cluster (and back). > > I recommend using dry run to generate a list of files and then use this to > run with rysnc. > > This allows you also to be able to break up the transfer into batches, and > check if files have changed before sync (say if your isilon files are not > RO. > > Also ensure you have a recent version of rsync that preserves extended > attributes and check your ACLS. > > > > A dry run example: > > https://unix.stackexchange.com/a/261372 > > > > I always felt more comfortable having a list of files before a sync?. > > > > > > > > Regards, > > Chris Schlipalius > > > > Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey > Supercomputing Centre (CSIRO) > > 1 Bryce Avenue > > Kensington WA 6151 > > Australia > > > > Tel +61 8 6436 8815 > > Email chris.schlipalius at pawsey.org.au > > Web www.pawsey.org.au > > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20201118/c99c2fb1/attachment-0001.html > > > > ------------------------------ > > Message: 2 > Date: Wed, 18 Nov 2020 11:48:52 +0000 > From: Jonathan Buzzard > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to > Scale over NFS? > Message-ID: <7983810e-f51c-8cf7-a750-5c3285870bd4 at strath.ac.uk> > Content-Type: text/plain; charset=utf-8; format=flowed > > On 17/11/2020 23:17, Chris Schlipalius wrote: > > So at my last job we used to rsync data between isilons across campus, > > and isilon to Windows File Cluster (and back). > > > > I recommend using dry run to generate a list of files and then use this > > to run with rysnc. > > > > This allows you also to be able to break up the transfer into batches, > > and check if files have changed before sync (say if your isilon files > > are not RO. > > > > Also ensure you have a recent version of rsync that preserves extended > > attributes and check your ACLS. > > > > A dry run example: > > > > https://unix.stackexchange.com/a/261372 > > > > I always felt more comfortable having a list of files before a sync?. > > > > I would counsel in the strongest possible terms against that approach. > > Basically you have to be assured that none of your file names have > "wacky" characters in them, because handling "wacky" characters in file > names is exceedingly difficult. I cannot stress how hard it is and the > above example does not handle all "wacky" characters in file names. > > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and > users especially when using a GUI, and even more so if they are Mac > users can and do use what I will call "wacky" characters in their file > names. > > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. > > Not too difficult to handle you might say. Right now deal with a file > name with a newline character in it :-) Don't ask me how or why you even > do that but let me assure you that I have seen them on more than one > occasion. And now your dry run list is broken... 
> > Not only that if you have a few hundred million files to move a list > just becomes unwieldy anyway. > > One thing I didn't mention is that I would run anything with in a screen > (or tmux if that is your poison) and turn on logging. > > For those interested I am in the process of cleaning up the script a bit > and will post it somewhere in due course. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > ------------------------------ > > Message: 3 > Date: Wed, 18 Nov 2020 12:54:47 +0100 (CET) > From: Andi Christiansen > To: gpfsug main discussion list , > Jonathan Buzzard > Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to > Scale over NFS? > Message-ID: <1947408989.293430.1605700487095 at privateemail.com> > Content-Type: text/plain; charset=UTF-8 > > Hi Jonathan, > > i would be very interested in seeing your scripts when they are posted. > Let me know where to get them! > > Thanks a bunch! > Andi Christiansen > > > On 11/18/2020 12:48 PM Jonathan Buzzard > wrote: > > > > > > On 17/11/2020 23:17, Chris Schlipalius wrote: > > > So at my last job we used to rsync data between isilons across campus, > > > and isilon to Windows File Cluster (and back). > > > > > > I recommend using dry run to generate a list of files and then use > this > > > to run with rysnc. > > > > > > This allows you also to be able to break up the transfer into batches, > > > and check if files have changed before sync (say if your isilon files > > > are not RO. > > > > > > Also ensure you have a recent version of rsync that preserves extended > > > attributes and check your ACLS. > > > > > > A dry run example: > > > > > > https://unix.stackexchange.com/a/261372 > > > > > > I always felt more comfortable having a list of files before a sync?. > > > > > > > I would counsel in the strongest possible terms against that approach. > > > > Basically you have to be assured that none of your file names have > > "wacky" characters in them, because handling "wacky" characters in file > > names is exceedingly difficult. I cannot stress how hard it is and the > > above example does not handle all "wacky" characters in file names. > > > > So what do I mean by "wacky" characters. Well remember a file name can > > have just about anything in it on Linux with the exception of '/', and > > users especially when using a GUI, and even more so if they are Mac > > users can and do use what I will call "wacky" characters in their file > > names. > > > > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > > Then there is the use of the wildcard characters, especially '?' but > > also '*'. > > > > Not too difficult to handle you might say. Right now deal with a file > > name with a newline character in it :-) Don't ask me how or why you even > > do that but let me assure you that I have seen them on more than one > > occasion. And now your dry run list is broken... > > > > Not only that if you have a few hundred million files to move a list > > just becomes unwieldy anyway. > > > > One thing I didn't mention is that I would run anything with in a screen > > (or tmux if that is your poison) and turn on logging. > > > > For those interested I am in the process of cleaning up the script a bit > > and will post it somewhere in due course. > > > > > > JAB. > > > > -- > > Jonathan A. Buzzard Tel: +44141-5483420 > > HPC System Administrator, ARCHIE-WeSt. 
> > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 106, Issue 21 > *********************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Wed Nov 18 23:05:40 2020 From: valdis.kletnieks at vt.edu (Valdis Kl=?utf-8?Q?=c4=93?=tnieks) Date: Wed, 18 Nov 2020 18:05:40 -0500 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> Message-ID: <39863.1605740740@turing-police> On Wed, 18 Nov 2020 11:48:52 +0000, Jonathan Buzzard said: > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and You want to see some fireworks? At least at one time, it was possible to use a file system debugger that's all too trusting of hexadecimal input and create a directory entry of '../'. Let's just say that fs/namei.c was also far too trusting, and fsck was more than happy to make *different* errors than the kernel was.... > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. Don't forget ESC, CR, LF, backticks, forward ticks, semicolons, and pretty much anything else that will give a shell indigestion. SQL isn't the only thing prone to injection attacks.. :) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 832 bytes Desc: not available URL: From chris.schlipalius at pawsey.org.au Wed Nov 18 23:57:26 2020 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Thu, 19 Nov 2020 07:57:26 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: Message-ID: <6288DF78-A9DF-4BE9-B166-4478EF8C2A20@pawsey.org.au> ? I would counsel in the strongest possible terms against that approach. ? Basically you have to be assured that none of your file names have "wacky" characters in them, because handling "wacky" characters in file ? names is exceedingly difficult. I cannot stress how hard it is and the above example does not handle all "wacky" characters in file names. Well that?s indeed another kettle of fish if you have irregular/special naming of files, no I didn?t cover that and if you have millions of files, yes a list would be unwieldy, then I would be tarring up dirs. before moving? and then untarring on GPFS ?or breaking up the list into sets or sub lists. If you have these wacky types of file names well there are fixes as in the rsync manpages? yes not easy but possible.. Ie 1. -s, --protect-args 2. As per usual you can escape the spaces, or substitute for spaces. rsync -avuz user at server1.com:"${remote_path// /\\ }" . 3. Single quote the file name and path inside double quotes. ? 
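A further option along the same lines is a NUL-delimited file list,
since NUL (and '/') are the only bytes that cannot appear in a Linux
file name -- embedded newlines included. A rough sketch, with made-up
paths:

    # generate and consume a NUL-delimited list; survives spaces, quotes
    # and even newlines in file names
    cd /mnt/isilon_nfs
    find . -mindepth 1 -print0 > /tmp/filelist.0
    rsync -aH -0 --files-from=/tmp/filelist.0 /mnt/isilon_nfs/ /gpfs/projects/
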
One thing I didn't mention is that I would run anything with in a screen (or tmux if that is your poison) and turn on logging. Absolutely agree? ? For those interested I am in the process of cleaning up the script a bit and will post it somewhere in due course. ? JAB. Would be interesting to see?. I?ve also had success on GPFS with DCP and possibly this would be another option Regards, Chris Schlipalius Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey Supercomputing Centre (CSIRO) 1 Bryce Avenue Kensington WA 6151 Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 15:34:39 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 15:34:39 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Message-ID: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Nov 19 15:49:30 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 15:49:30 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Message-ID: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. 
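For concreteness, the nested layout being asked about amounts to
something along these lines (a hedged sketch with invented device names;
--mount-priority, which comes up later in the thread, simply orders the
mounts at daemon startup so the parent is there first):

    # parent file system gets the lower priority number so it mounts first
    # (changing -T requires the file system to be unmounted everywhere)
    mmchfs projects_fs   -T /projects              --mount-priority 1
    mmchfs newproject_fs -T /projects/newproject   --mount-priority 2
    mmmount projects_fs -a
    mmmount newproject_fs -a
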
JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From spectrumscale at kiranghag.com Thu Nov 19 16:40:47 2020 From: spectrumscale at kiranghag.com (KG) Date: Thu, 19 Nov 2020 22:10:47 +0530 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: You can also set mount priority on filesystems so that gpfs can try to mount them in order...parent first On Thu, Nov 19, 2020, 21:19 Jonathan Buzzard wrote: > On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > > Hi, > > > > > > I have a filesystem holding many projects (i.e., mounted under > > /projects), each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects > > of it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > Technically, mounting a filesystem on top of an existing filesystem > > should be possible, but, is this discouraged for any reason? How GPFS > > would behave with that and is there a technical reason for avoiding this > > setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > This has all the hallmarks of either a Windows admin or a newbie > Linux/Unix admin :-) > > Simply put /projects is mounted on top of whatever file system is > providing the root file system in the first place LOL. > > Linux/Unix and/or GPFS does not give a monkeys about mounting another > file system *ANYWHERE* in it period because there is no other way of > doing it. > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Nov 19 16:42:07 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 19 Nov 2020 16:42:07 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Message-ID: <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> If it is a remote cluster mount from your clients (hopefully!), you might want to look at priority to order mounting of the file-systems. I don?t know what would happen if the overmounted file-system went away, you would likely want to test. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 15:39 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). 
Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 16:48:07 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 16:48:07 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch>, <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: Hi Jonathan, thanks for sharing your opinions. In the sentence "Technically, mounting a filesystem on top of an existing filesystem should be possible" , I guess I was referring to that... I was concerned about other technical reasons, such like how would this would affect GPFS policies, or how to properly proceed with proper mounting, or any other technical reasons to consider. For the GPFS policies, I usually applied some of the existing GPFS policies based on directories, but after checking I realized that one can manage via device (never used policies in that way, at least for the simple but necessary use cases I have on the existing filesystems). Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Jonathan Buzzard Sent: Thursday, November 19, 2020 4:49:30 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. 
Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 17:01:37 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 17:01:37 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch>, <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> Message-ID: <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Hi Simon, that's a very good point, thanks a lot :) I have it remotely mounted on a client cluster, so I will consider priorities when mounting the filesystems with remote cluster mount. That's very useful. Also, as far as I saw, same approach can be also applied to local mounts (via mmchfs) during daemon startup with the same option --mount-priority. Thanks a lot for the hints, these are very useful. I'll test that. Cheers, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson Sent: Thursday, November 19, 2020 5:42:07 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem If it is a remote cluster mount from your clients (hopefully!), you might want to look at priority to order mounting of the file-systems. I don?t know what would happen if the overmounted file-system went away, you would likely want to test. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 15:39 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. 
Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Thu Nov 19 17:34:05 2020 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 19 Nov 2020 18:34:05 +0100 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: I would not mount a GPFS filesystem within a GPFS filesystem. Technically it should work, but I?d expect it to cause surprises if ever the lower filesystem experienced problems. Alone, a filesystem might recover automatically by remounting. But if there?s another filesystem mounted within, I expect it will be a problem.. Much better to use symlinks. -jf tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < marc.caubet at psi.ch>: > Hi Simon, > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > client cluster, so I will consider priorities when mounting the filesystems > with remote cluster mount. That's very useful. > > Also, as far as I saw, same approach can be also applied to local mounts > (via mmchfs) during daemon startup with the same option --mount-priority. > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > Cheers, > > Marc > _________________________________________________________ > Paul Scherrer Institut > High Performance Computing & Emerging Technologies > Marc Caubet Serrabou > Building/Room: OHSA/014 > Forschungsstrasse, 111 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > ------------------------------ > *From:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > S.J.Thompson at bham.ac.uk> > *Sent:* Thursday, November 19, 2020 5:42:07 PM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > filesystem > > > If it is a remote cluster mount from your clients (hopefully!), you might > want to look at priority to order mounting of the file-systems. I don?t > know what would happen if the overmounted file-system went away, you would > likely want to test. > > > > Simon > > > > *From: * on behalf of " > marc.caubet at psi.ch" > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Thursday, 19 November 2020 at 15:39 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > filesystem > > > > Hi, > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects of > it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > > Technically, mounting a filesystem on top of an existing filesystem should > be possible, but, is this discouraged for any reason? 
How GPFS would behave > with that and is there a technical reason for avoiding this setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. > > > > Thanks a lot, > > Marc > > _________________________________________________________ > Paul Scherrer Institut > High Performance Computing & Emerging Technologies > Marc Caubet Serrabou > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar2 at uw.edu Thu Nov 19 17:38:07 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Thu, 19 Nov 2020 09:38:07 -0800 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: <20201119173807.kormirvbweqs3un6@thargelion> Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. > > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. 
> > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:08:13 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:08:13 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: On 19/11/2020 17:34, Jan-Frode Myklebust wrote: > > I would not mount a GPFS filesystem within a GPFS filesystem. > Technically it should work, but I?d expect it to cause surprises if ever > the lower filesystem experienced problems. Alone, a filesystem might > recover automatically by remounting. But if there?s another filesystem > mounted within, I expect it will be a problem.. > > Much better to use symlinks. > Think about that for a minute... I guess if you are worried about /projects going away (which would suggest something really bad has happened anyway) would be to mount the GPFS file system that is currently holding /projects somewhere else and then bind mount everything into /projects At this point I would note that bind mounts are much better than symlinks which suck for this sort of application. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:12:03 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:12:03 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: <2f789d09-3704-2d41-ef2a-953de178dce2@strath.ac.uk> On 19/11/2020 16:40, KG wrote: > You can also set mount priority on filesystems so that gpfs can try to > mount them in order...parent first > One of the things that systemd brings to the table https://github.com/systemd/systemd/commit/3519d230c8bafe834b2dac26ace49fcfba139823 JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From marc.caubet at psi.ch Thu Nov 19 18:13:08 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 18:13:08 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <20201119173807.kormirvbweqs3un6@thargelion> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> , <20201119173807.kormirvbweqs3un6@thargelion> Message-ID: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Hi all, thanks a lot for your comments. Agreed, I better avoid it for now. I was concerned about how GPFS would behave in such case. For production I will take the safe route, but, just out of curiosity, I'll give it a try on a couple of test filesystems. Thanks a lot for your help, it was very helpful, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Skylar Thompson Sent: Thursday, November 19, 2020 6:38:07 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. > > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. 
> > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. > > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:32:39 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:32:39 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> <20201119173807.kormirvbweqs3un6@thargelion> <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Message-ID: <5b8edf06-a4ab-a39e-5a02-86fd7565b90a@strath.ac.uk> On 19/11/2020 18:13, Caubet Serrabou Marc (PSI) wrote: > > Hi all, > > > thanks a lot for your comments. Agreed, I?better avoid it for now. I was > concerned about how GPFS would behave in such case. For production I > will take the safe route, but, just out of curiosity, I'll give it a try > on a couple of test filesystems. > Don't use symlinks there is a range of applications that will break and you will confuse the hell out of your users as the fact you are not under /projects/new but /random/new is not hidden. Besides which if the symlink goes away because /projects goes away then it is all a bust anyway. If you are worried about /projects going away then the best plan is to mount the GPFS file systems somewhere else and then bind mount the directories into /projects on all the machines where they are mounted. GPFS is quite happy with this. We bind mount /gpfs/users into /users and /gpfs/software into /opt/software by default. In the past I have bind mounted random paths for every user (hundred plus) into /home JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From novosirj at rutgers.edu Thu Nov 19 18:34:09 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 19 Nov 2020 18:34:09 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: > On Nov 19, 2020, at 10:49 AM, Jonathan Buzzard wrote: > > On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: >> Hi, >> I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. >> I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. >> Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? >> Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. > > This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) > > Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. > > Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. Some others have said, but I disagree. It wasn?t that long ago that GPFS acted really screwy with systemd because it did something in a way other than Linux expected. 
As it is now, their devices are not /dev/whatever or server:/wherever like just about every other filesystem type. Not unreasonable to believe it would 'act funny' compared to other FS. I like GPFS a lot, but this is not one of my favorite characteristics of it. -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From UWEFALKE at de.ibm.com Thu Nov 19 19:18:41 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 19 Nov 2020 20:18:41 +0100 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch><0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: Just the risk that your parent file system dies, which will block your access to the child file system mounted on a mount point within it. If that does not bother you, go ahead and mount stacks. As for the symlink though: it is also gone if the parent dies :-). Mit freundlichen Grüßen / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: KG To: gpfsug main discussion list Date: 19/11/2020 17:41 Subject: [EXTERNAL] Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Sent by: gpfsug-discuss-bounces at spectrumscale.org You can also set mount priority on filesystems so that gpfs can try to mount them in order...parent first On Thu, Nov 19, 2020, 21:19 Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow.
G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Thu Nov 19 19:37:52 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 19 Nov 2020 19:37:52 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> <20201119173807.kormirvbweqs3un6@thargelion> <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Message-ID: <738D41AC-6A07-453E-A2D1-C1882BE52EDC@bham.ac.uk> My understanding was that this was perfectly acceptable in a GPFS system. i.e. mounting parts of file-systems in others. It has been suggested to us as a way of using different vendor GPFS systems (e.g. an ESS with someone elses) as a way of working round the licensing rules about ESS and anything else, but still giving a single user ?name space?. We didn?t go that route, and of course I might have misunderstood what was being suggested. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 18:13 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi all, thanks a lot for your comments. Agreed, I better avoid it for now. I was concerned about how GPFS would behave in such case. For production I will take the safe route, but, just out of curiosity, I'll give it a try on a couple of test filesystems. Thanks a lot for your help, it was very helpful, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Skylar Thompson Sent: Thursday, November 19, 2020 6:38:07 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. 
> > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. > > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. 
> > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 20 19:13:41 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 20 Nov 2020 19:13:41 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Here is the output of waiters on 2 hosts that were having the issue today: HOST 1 [2020-11-20 09:07:53 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === Waiting 0.0035 sec since 09:08:07, monitored, thread 135497 FileBlockReadFetchHandlerThread: on ThCond 0x7F615C152468 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.180 Waiting 0.0036 sec since 09:08:07, monitored, thread 139228 PrefetchWorkerThread: on ThCond 0x7F627000D5D8 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.181 [2020-11-20 09:08:07 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === HOST 2 [2020-11-20 09:08:49 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === Waiting 0.0034 sec since 09:08:50, monitored, thread 345318 SharedHashTabFetchHandlerThread: on ThCond 0x7F049C001F08 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.133 [2020-11-20 09:08:50 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === [2020-11-20 09:08:52 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === You can see the waiters go from 0 to 1-2 , but they are hardly blocking. Yes there are separate pools for metadata for all of the filesystems here. 
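Since these waiters live for only a few milliseconds, a single mmdiag call can easily miss them; one crude way to catch them is to sample in a loop on the affected node while the problem is happening. This is only a sketch - the one-second interval, the five-minute duration and the log path are arbitrary choices:

# sample the waiter list once a second for roughly five minutes on the affected node
for i in $(seq 1 300); do
    date
    /usr/lpp/mmfs/bin/mmdiag --waiters
    sleep 1
done > /tmp/waiters.$(hostname -s).log

The resulting log can then be scanned afterwards for the longest-lived waiters and the nodes they point at.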
I did another trace today when the problem was happening - this time I was able to get a longer trace using the following command: /usr/lpp/mmfs/bin/mmtracectl --start --trace=io --trace-file-size=512M --tracedev-write-mode=blocking --tracedev-buffer-size=64M -N nyzls149m This is what the trsum output looks like: Elapsed trace time: 62.412092000 seconds Elapsed trace time from first VFS call to last: 62.412091999 Time idle between VFS calls: 0.002913000 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 0.003487000 9 387.444 rdwr 0.273721000 183 1495.743 read_inode2 0.007304000 325 22.474 follow_link 0.013952000 58 240.552 pagein 0.025974000 66 393.545 getattr 0.002792000 26 107.385 revalidate 0.009406000 2172 4.331 create 66.194479000 3 22064826.333 open 1.725505000 88 19608.011 unlink 18.685099000 1 18685099.000 setattr 0.011627000 14 830.500 lookup 2379.215514000 502 4739473.135 delete_inode 0.015553000 328 47.418 rename 98.099073000 5 19619814.600 release 0.050574000 89 568.247 permission 0.007454000 73 102.110 getxattr 0.002346000 32 73.312 statfs 0.000081000 6 13.500 mmap 0.049809000 18 2767.167 removexattr 0.000827000 14 59.071 llseek 0.000441000 47 9.383 readdir 0.002667000 34 78.441 Ops 4093 Secs 62.409178999 Ops/Sec 65.583 MaxFilesToCache is set to 12000 : [common] maxFilesToCache 12000 I only see gpfs_i_lookup in the tracefile, no gpfs_v_lookups # grep gpfs_i_lookup trcrpt.2020-11-20_09.20.38.283986.nyzls149m |wc -l 1097 They mostly look like this - 62.346560 238895 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922178971A40 iNum 21980113 (0x14F63D1) cnP 0xFFFF922178971C88 retP 0x0 code 0 rc 0 62.346955 238895 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF91A8A4991E00 dentryP 0xFFFF92C545A93500 name '20170323.txt' d_flags 0x80 d_count 1 unhashed 1 62.367701 218442 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922071300000 iNum 29629892 (0x1C41DC4) cnP 0xFFFF922071300248 retP 0x0 code 0 rc 0 62.367734 218444 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF9193CF457800 dentryP 0xFFFF9229527A89C0 name 'node.py' d_flags 0x80 d_count 1 unhashed 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Monday, November 16, 2020 8:46 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, while the other nodes can well block the local one, as Frederick suggests, there should at least be something visible locally waiting for these other nodes. Looking at all waiters might be a good thing, but this case looks strange in other ways. Mind statement there are almost no local waiters and none of them gets older than 10 ms. I am no developer nor do I have the code, so don't expect too much. Can you tell what lookups you see (check in the trcrpt file, could be like gpfs_i_lookup or gpfs_v_lookup)? Lookups are metadata ops, do you have a separate pool for your metadata? How is that pool set up (doen to the physical block devices)? Your trcsum down revealed 36 lookups, each one on avg taking >30ms. That is a lot (albeit the respective waiters won't show up at first glance as suspicious ...). So, which waiters did you see (hope you saved them, if not, do it next time). What are the node you see this on and the whole cluster used for? What is the MaxFilesToCache setting (for that node and for others)? what HW is that, how big are your nodes (memory,CPU)? 
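Just as a sketch, both the configured value and the value the local daemon is actually running with can be checked along these lines (the grep pattern is only there to pick the relevant line out of the full config dump):

# value in the cluster configuration (may differ per node or node class)
mmlsconfig maxFilesToCache

# value currently in effect in the local mmfsd
/usr/lpp/mmfs/bin/mmdiag --config | grep -i maxfilestocache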
To check the unreasonably short trace capture time: how large are the trcrpt files you obtain? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 14:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s). We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? 
Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 
4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
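A rough sketch of grabbing all three of those in one go on the hanging client, the moment the problem is visible (the output file name is arbitrary and the commands need to run as root):

{
    date
    # last lines of the local GPFS log
    tail -n 50 /var/adm/ras/mmfs.log.latest
    # currently waiting threads
    /usr/lpp/mmfs/bin/mmdiag --waiters
    # recent IOs with service times and sizes
    /usr/lpp/mmfs/bin/mmdiag --iohist
} > /tmp/gpfs-snapshot.$(date +%Y%m%d-%H%M%S).txt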
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From hooft at natlab.research.philips.com Sat Nov 21 00:37:01 2020 From: hooft at natlab.research.philips.com (Peter van Hooft) Date: Sat, 21 Nov 2020 01:37:01 +0100 Subject: [gpfsug-discuss] mmchdisk /dev/fs start -a progress Message-ID: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> Hello, Is it possible to find out the progress of the 'mmchdisk /dev/fs start -a' command when the controlling terminal has been lost? We can see the task running on the fs manager node with 'mmdiag --commands' with attributes 'hold PIT/disk waitTime 0' We are starting to worry the mmchdisk is taking too long, and continuously see waiters like Waiting 3.1946 sec since 01:28:23, ignored, thread 22092 TSCHDISKCmdThread: on ThCond 0x180267573D0 (SGManagementMgrDataCondvar), reason 'waiting for stripe group to recover' Thanks for any hints. Peter van Hooft Philips Research From jonathan.buzzard at strath.ac.uk Sat Nov 21 10:13:42 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Sat, 21 Nov 2020 10:13:42 +0000 Subject: [gpfsug-discuss] mmchdisk /dev/fs start -a progress In-Reply-To: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> References: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> Message-ID: On 21/11/2020 00:37, Peter van Hooft wrote: > > Hello, > > Is it possible to find out the progress of the 'mmchdisk /dev/fs start -a' > command when the controlling terminal has been lost? > I don't think so. You are lucky it is still running. > We can see the task running on the fs manager node with 'mmdiag --commands' with > attributes 'hold PIT/disk waitTime 0' > We are starting to worry the mmchdisk is taking too long, and continuously see waiters like > Waiting 3.1946 sec since 01:28:23, ignored, thread 22092 TSCHDISKCmdThread: on ThCond 0x180267573D0 (SGManagementMgrDataCondvar), reason 'waiting for stripe group to recover' > > Thanks for any hints. > Not that this is going to help this time, but it is why you should *ALWAYS* without exception run these sorts of commands within a screen/tmux session so when you lose the connection to the server you can just reconnect and pick it up again. This is introductory system administration 101. No critical or long running command should ever be dependent on a remote controlling terminal. If you can't run them locally then run them in a screen or tmux session. There are plenty of good howtos for both screen and tmux on the internet. Depending on which distribution you use I would note that RedHat have very annoyingly and for completely specious reasons removed screen from RHEL8 and left tmux. So if you are starting from scratch tmux is the one to learn :-( JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From robert.horton at icr.ac.uk Mon Nov 23 15:06:05 2020 From: robert.horton at icr.ac.uk (Robert Horton) Date: Mon, 23 Nov 2020 15:06:05 +0000 Subject: [gpfsug-discuss] AFM experiences? Message-ID: Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium-term storage and a smaller but faster cluster for scratch storage in our HPC system.
What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea being to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accommodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be interesting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. From novosirj at rutgers.edu Mon Nov 23 15:30:47 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 23 Nov 2020 15:30:47 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that's how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I've seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can't think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr.
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton wrote: Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium-term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea being to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accommodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be interesting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dean.flanders at fmi.ch Mon Nov 23 17:58:12 2020 From: dean.flanders at fmi.ch (Flanders, Dean) Date: Mon, 23 Nov 2020 17:58:12 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> References: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> Message-ID: Hello Rob, We looked at AFM years ago for DR, but after reading the bug reports we avoided it, and we have also seen a case where it had to be removed from one customer, so we have kept things simple. Now, looking again a few years later, there are still issues: "IBM Spectrum Scale Active File Management (AFM) issues which may result in undetected data corruption" was just my first Google hit. We have kept it simple, and use a parallel rsync process driven by the policy engine, and can hit wire speed for copying of millions of small files in order to have isolation between the sites at GB/s. I am not saying it is bad, just that it needs an appropriate risk/reward ratio to implement as it increases overall complexity.
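For illustration only, one way such a policy-driven parallel copy can be wired together; the filesystem path, the policy rule, the chunk count and the destination host below are all made up rather than the actual setup described above:

# policy file (e.g. /tmp/mirror.policy) selecting what to copy, here files changed in the last day:
#   RULE EXTERNAL LIST 'tocopy' EXEC ''
#   RULE 'mirror' LIST 'tocopy' WHERE (CURRENT_TIMESTAMP - MODIFICATION_TIME) < INTERVAL '1' DAYS

# have the policy engine build the candidate list without acting on it
mmapplypolicy /gpfs/fs1/projects -P /tmp/mirror.policy -f /tmp/mirror -I defer

# the list file puts the path after ' -- '; strip the leading bookkeeping columns
# (filenames containing newlines or other oddities need more care than this)
sed 's/^.* -- //' /tmp/mirror.list.tocopy > /tmp/mirror.paths

# split into 8 chunks and run one rsync per chunk in parallel
split -n l/8 /tmp/mirror.paths /tmp/mirror.chunk.
for f in /tmp/mirror.chunk.*; do
    rsync -a --files-from="$f" / desthost:/ &
done
wait

The attraction of this approach is that the policy engine does the filesystem scan in parallel, so building the candidate list stays fast even with many millions of files.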
Kind regards, Dean From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Ryan Novosielski Sent: Monday, November 23, 2020 4:31 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM experiences? We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that?s how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I?ve seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can?t think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton > wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 
534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Nov 23 21:54:39 2020 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 23 Nov 2020 21:54:39 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: Message-ID: Rob, Talk to Jake Carroll from the University of Queensland, he has done a number of presentations at Scale User Groups of UQ?s MeDiCI data fabric which is based on Spectrum Scale and does very aggressive use of AFM. Their use of AFM is not only on campus, but to remote Storage clusters between 30km and 1500km away from their Home cluster. They have also tested AFM between Australia, Japan, and USA Sent from my iPhone > On 24 Nov 2020, at 01:20, Robert Horton wrote: > > ?Hi all, > > We're thinking about deploying AFM and would be interested in hearing > from anyone who has used it in anger - particularly independent writer. > > Our scenario is we have a relatively large but slow (mainly because it > is stretched over two sites with a 10G link) cluster for long/medium- > term storage and a smaller but faster cluster for scratch storage in > our HPC system. What we're thinking of doing is using some/all of the > scratch capacity as an IW cache of some/all of the main cluster, the > idea to reduce the need for people to manually move data between the > two. > > It seems to generally work as expected in a small test environment, > although we have a few concerns: > > - Quota management on the home cluster - we need a way of ensuring > people don't write data to the cache which can't be accomodated on > home. Probably not insurmountable but needs a bit of thought... > > - It seems inodes on the cache only get freed when they are deleted on > the cache cluster - not if they get deleted from the home cluster or > when the blocks are evicted from the cache. Does this become an issue > in time? > > If anyone has done anything similar I'd be interested to hear how you > got on. It would be intresting to know if you created a cache fileset > for each home fileset or just one for the whole lot, as well as any > other pearls of wisdom you may have to offer. > > Thanks! > Rob > > -- > Robert Horton | Research Data Storage Lead > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > Twitter @ICR_London > Facebook: www.facebook.com/theinstituteofcancerresearch > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. 
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Mon Nov 23 23:14:08 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 23 Nov 2020 23:14:08 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: <2C7317A6-B9DF-450A-92A6-AE156396204A@rutgers.edu> Ours are about 50 and 100 km from the home cluster, but it?s over 100Gb fiber. > On Nov 23, 2020, at 4:54 PM, Andrew Beattie wrote: > > Rob, > > Talk to Jake Carroll from the University of Queensland, he has done a number of presentations at Scale User Groups of UQ?s MeDiCI data fabric which is based on Spectrum Scale and does very aggressive use of AFM. > > Their use of AFM is not only on campus, but to remote Storage clusters between 30km and 1500km away from their Home cluster. They have also tested AFM between Australia, Japan, and USA > > Sent from my iPhone > > > On 24 Nov 2020, at 01:20, Robert Horton wrote: > > > > ?Hi all, > > > > We're thinking about deploying AFM and would be interested in hearing > > from anyone who has used it in anger - particularly independent writer. > > > > Our scenario is we have a relatively large but slow (mainly because it > > is stretched over two sites with a 10G link) cluster for long/medium- > > term storage and a smaller but faster cluster for scratch storage in > > our HPC system. What we're thinking of doing is using some/all of the > > scratch capacity as an IW cache of some/all of the main cluster, the > > idea to reduce the need for people to manually move data between the > > two. > > > > It seems to generally work as expected in a small test environment, > > although we have a few concerns: > > > > - Quota management on the home cluster - we need a way of ensuring > > people don't write data to the cache which can't be accomodated on > > home. Probably not insurmountable but needs a bit of thought... > > > > - It seems inodes on the cache only get freed when they are deleted on > > the cache cluster - not if they get deleted from the home cluster or > > when the blocks are evicted from the cache. Does this become an issue > > in time? > > > > If anyone has done anything similar I'd be interested to hear how you > > got on. It would be intresting to know if you created a cache fileset > > for each home fileset or just one for the whole lot, as well as any > > other pearls of wisdom you may have to offer. > > > > Thanks! > > Rob > > > > -- > > Robert Horton | Research Data Storage Lead > > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > > Twitter @ICR_London > > Facebook: www.facebook.com/theinstituteofcancerresearch > > > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > > > This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. 
-- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From vpuvvada at in.ibm.com Tue Nov 24 02:32:01 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:02:01 +0530 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: >- Quota management on the home cluster - we need a way of ensuring >people don't write data to the cache which can't be accomodated on >home. Probably not insurmountable but needs a bit of thought... You could set same quotas between cache and home clusters. AFM does not support replication of filesystem metadata like quotas, fileset configuration etc... >- It seems inodes on the cache only get freed when they are deleted on >the cache cluster - not if they get deleted from the home cluster or >when the blocks are evicted from the cache. Does this become an issue >in time? AFM periodically revalidates with home cluster. If the files/dirs were already deleted at home cluster, AFM moves them to /.ptrash directory at cache cluster during the revalidation. These files can be removed manually by user or auto eviction process. If the .ptrash directory is not cleaned up on time, it might result into quota issues at cache cluster. ~Venkat (vpuvvada at in.ibm.com) From: Robert Horton To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2020 08:51 PM Subject: [EXTERNAL] [gpfsug-discuss] AFM experiences? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. 
This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Tue Nov 24 02:37:18 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:07:18 +0530 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> Message-ID: Dean, This is one of the corner case which is associated with sparse files at the home cluster. You could try with latest versions of scale, AFM indepedent-writer mode have many performance/functional improvements in newer releases. ~Venkat (vpuvvada at in.ibm.com) From: "Flanders, Dean" To: gpfsug main discussion list Date: 11/23/2020 11:44 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] AFM experiences? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello Rob, We looked at AFM years ago for DR, but after reading the bug reports, we avoided it, and also have had seen a case where it had to be removed from one customer, so we have kept things simple. Now looking again a few years later there are still issues, IBM Spectrum Scale Active File Management (AFM) issues which may result in undetected data corruption, and that was just my first google hit. We have kept it simple, and use a parallel rsync process with policy engine and can hit wire speed for copying of millions of small files in order to have isolation between the sites at GB/s. I am not saying it is bad, just that it needs an appropriate risk/reward ratio to implement as it increases overall complexity. Kind regards, Dean From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Ryan Novosielski Sent: Monday, November 23, 2020 4:31 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM experiences? We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that?s how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I?ve seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can?t think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Tue Nov 24 02:41:21 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:11:21 +0530 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: AFM provides near zero downtime for migration. As of today, AFM migration does not support ACLs or other EAs migration from non scale (GPFS) source. 
https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm ~Venkat (vpuvvada at in.ibm.com) From: "Frederick Stock" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 11/17/2020 03:14 AM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Have you considered using the AFM feature of Spectrum Scale? I doubt it will provide any speed improvement but it would allow for data to be accessed as it was being migrated. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: Andi Christiansen Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Date: Mon, Nov 16, 2020 2:44 PM Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at googlemail.com Tue Nov 24 12:16:55 2020 From: luke.raimbach at googlemail.com (Luke Raimbach) Date: Tue, 24 Nov 2020 12:16:55 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: Hi Rob, Some things to think about from experiences a year or so ago... If you intend to perform any HPC workload (writing / updating / deleting files) inside a cache, then appropriately specified gateway nodes will be your friend: 1. When creating, updating or deleting files in the cache, each operation requires acknowledgement from the gateway handling that particular cache, before returning ACK to the application. This will add a latency overhead to the workload - if your storage is IB connected to the compute cluster and using verbsRdmaSend for example, this will increase your happiness. Connecting low-spec gateway nodes over 10GbE with the expectation that they will "drain down" over time was a sore learning experience in the early days of AFM for me. 2. AFM queues can quickly eat up memory. I think around 350bytes of memory is consumed for each operation in the AFM queue, so if you have huge file churn inside a cache then the queue will grow very quickly. If you run out of memory, the node dies and you enter cache recovery when it comes back up (or another node takes over). 
This can end up cycling the node as it tries to revalidate a cache and keep up with any other queues. Get more memory! I've not used AFM for a while now and I think the latter enormity has some mitigation against create / delete cycles (i.e. the create operation is expunged from the queue instead of two operations being played back to the home). I expect IBM experts will tell you more about those improvements. Also, several smaller caches are better than one large one (parallel execution of queues helps utilise the available bandwidth and you have a better failover spread if you have multiple gateways, for example). Independent Writer mode comes with some small danger (user error or impatience mainly) inasmuch as whoever updates a file last will win; e.g. home user A writes a file, then cache user B updates the file after reading it and tells user A the update is complete, when really the gateway queue is long and the change is waiting to go back home. User A uses the file expecting the changes are made, then updates it with some results. Meanwhile the AFM queue drains down and user B's change arrives after user A has completed their changes. The interim version of the file user B modified will persist at home and user A's latest changes are lost. Some careful thought about workflow (or good user training about eventual consistency) will save some potential misery on this front. Hope this helps, Luke On Mon, 23 Nov 2020 at 15:19, Robert Horton wrote: > Hi all, > > We're thinking about deploying AFM and would be interested in hearing > from anyone who has used it in anger - particularly independent writer. > > Our scenario is we have a relatively large but slow (mainly because it > is stretched over two sites with a 10G link) cluster for long/medium- > term storage and a smaller but faster cluster for scratch storage in > our HPC system. What we're thinking of doing is using some/all of the > scratch capacity as an IW cache of some/all of the main cluster, the > idea to reduce the need for people to manually move data between the > two. > > It seems to generally work as expected in a small test environment, > although we have a few concerns: > > - Quota management on the home cluster - we need a way of ensuring > people don't write data to the cache which can't be accomodated on > home. Probably not insurmountable but needs a bit of thought... > > - It seems inodes on the cache only get freed when they are deleted on > the cache cluster - not if they get deleted from the home cluster or > when the blocks are evicted from the cache. Does this become an issue > in time? > > If anyone has done anything similar I'd be interested to hear how you > got on. It would be intresting to know if you created a cache fileset > for each home fileset or just one for the whole lot, as well as any > other pearls of wisdom you may have to offer. > > Thanks! > Rob > > -- > Robert Horton | Research Data Storage Lead > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > Twitter @ICR_London > Facebook: www.facebook.com/theinstituteofcancerresearch > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable > Company Limited by Guarantee, Registered in England under Company No. > 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > This e-mail message is confidential and for use by the addressee only. 
If > the message is received by anyone other than the addressee, please return > the message to the sender by replying to it and then delete the message > from your computer and network. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yeep at robust.my Tue Nov 24 14:09:34 2020 From: yeep at robust.my (T.A. Yeep) Date: Tue, 24 Nov 2020 22:09:34 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Venkat, If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? On Tue, Nov 24, 2020 at 10:41 AM Venkateswara R Puvvada wrote: > AFM provides near zero downtime for migration. As of today, AFM > migration does not support ACLs or other EAs migration from non scale > (GPFS) source. > > > https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm > > ~Venkat (vpuvvada at in.ibm.com) > > > > From: "Frederick Stock" > To: gpfsug-discuss at spectrumscale.org > Cc: gpfsug-discuss at spectrumscale.org > Date: 11/17/2020 03:14 AM > Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data > from Isilon to Scale over NFS? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Have you considered using the AFM feature of Spectrum Scale? I doubt it > will provide any speed improvement but it would allow for data to be > accessed as it was being migrated. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 > stockf at us.ibm.com > > > ----- Original message ----- > From: Andi Christiansen > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: "gpfsug-discuss at spectrumscale.org" > Cc: > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon > to Scale over NFS? > Date: Mon, Nov 16, 2020 2:44 PM > > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS from > two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching from > 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > scale nodes and 20Gbits from isilon so we should be able to reach just > under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Best regards *T.A. 
Yeep*Mobile: +6-016-719 8506 | Tel: +6-03-7628 0526 | www.robusthpc.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Nov 24 09:39:47 2020 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Tue, 24 Nov 2020 09:39:47 +0000 Subject: [gpfsug-discuss] SSUG::Digital with CIUK Message-ID: <> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: meeting.ics Type: text/calendar Size: 2623 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 3499622 bytes Desc: not available URL: From prasad.surampudi at theatsgroup.com Tue Nov 24 16:05:19 2020 From: prasad.surampudi at theatsgroup.com (Prasad Surampudi) Date: Tue, 24 Nov 2020 16:05:19 +0000 Subject: [gpfsug-discuss] mmhealth reports fserrinvalid errors on CNFS servers Message-ID: We are seeing fserrinvalid error on couple of filesystems in Spectrum Scale cluster. These errors are reported but mmhealth only couple of nodes (CNFS servers) in the cluster, but mmhealth on other nodes shows no issues. Any idea what this error means? And why its reported on CNFS servers and not on other nodes? What need to be done to fix this issue? sudo /usr/lpp/mmfs/bin/mmhealth node show FILESYSTEM -v Node name: cnfs05-gpfs Component Status Reasons ------------------------------------------------------------------- FILESYSTEM DEGRADED fserrinvalid(vol) argus HEALTHY - dytech HEALTHY - enlnt_E HEALTHY - enlnt_Es HEALTHY - haaforfs HEALTHY - haaforfs2 HEALTHY - historical HEALTHY - prcfs HEALTHY - qmtfs HEALTHY - research HEALTHY - research2 HEALTHY - schon_raw HEALTHY - uhdb_vol1 HEALTHY - vol DEGRADED fserrinvalid(vol) Event Parameter Severity Event Message ---------------------------------------------------------------------------------------------------------- fserrinvalid vol ERROR FS=vol,ErrNo=1124,Unknown error=0464000000010000000180A108BC000079B4000000000000003400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 -------------- next part -------------- An HTML attachment was scrubbed... URL: From NSCHULD at de.ibm.com Tue Nov 24 16:44:35 2020 From: NSCHULD at de.ibm.com (Norbert Schuld) Date: Tue, 24 Nov 2020 17:44:35 +0100 Subject: [gpfsug-discuss] =?utf-8?q?mmhealth_reports_fserrinvalid_errors_o?= =?utf-8?q?n_CNFS=09servers?= In-Reply-To: References: Message-ID: To get an explanation for any event one can ask the system: # mmhealth event show fserrinvalid Event Name: fserrinvalid Event ID: 999338 Description: Unrecognized FSSTRUCT error received. Check documentation Cause: A filesystem corruption detected User Action: Check error message for details and the mmfs.log.latest log for further details. See the topic Checking and repairing a file system in the IBM Spectrum Scale documentation: Administering. Managing file systems. If the file system is severely damaged, the best course of action is to follow the procedures in section: Additional information to collect for file system corruption or MMFS_FSSTRUCT errors Severity: ERROR State: DEGRADED The event is triggered by a callback which may not fire on all nodes, that is why only a subset of nodes have the information. 
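If it helps, the same component can also be checked cluster-wide rather than node by node, for example (a sketch, assuming a recent 5.0.x mmhealth and that mmdsh/adminMode allow it):

  mmhealth cluster show FILESYSTEM
  # or query every node at once
  mmdsh -N all mmhealth node show FILESYSTEM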
Depending on the version of scale the procedure to remove the event varies: For newer release please use # mmhealth event resolve Missing arguments. Usage: mmhealth event resolve {EventName} [Identifier] For older releases it is described here: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_fsstruc.htm mmsysmonc event filesystem fsstruct_fixed Mit freundlichen Gr??en / Kind regards Norbert Schuld M925:IBM Spectrum Scale Software Development Phone: +49-160 70 70 335 IBM Deutschland Research & Development GmbH Email: nschuld at de.ibm.com Am Weiher 24 65451 Kelsterbach Knowing is not enough; we must apply. Willing is not enough; we must do. IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Prasad Surampudi To: "gpfsug-discuss at spectrumscale.org" Date: 24.11.2020 17:05 Subject: [EXTERNAL] [gpfsug-discuss] mmhealth reports fserrinvalid errors on CNFS servers Sent by: gpfsug-discuss-bounces at spectrumscale.org We are seeing fserrinvalid error on couple of filesystems in Spectrum Scale cluster. These errors are reported but mmhealth only couple of nodes (CNFS servers) in the cluster, but mmhealth on other nodes shows no issues. Any idea what this error means? And why its reported on CNFS servers and not on other nodes? What need to be done to fix this issue? sudo /usr/lpp/mmfs/bin/mmhealth node show FILESYSTEM -v Node name: cnfs05-gpfs Component Status Reasons ------------------------------------------------------------------- FILESYSTEM DEGRADED fserrinvalid(vol) argus HEALTHY - dytech HEALTHY - enlnt_E HEALTHY - enlnt_Es HEALTHY - haaforfs HEALTHY - haaforfs2 HEALTHY - historical HEALTHY - prcfs HEALTHY - qmtfs HEALTHY - research HEALTHY - research2 HEALTHY - schon_raw HEALTHY - uhdb_vol1 HEALTHY - vol DEGRADED fserrinvalid(vol) Event Parameter Severity Event Message ---------------------------------------------------------------------------------------------------------- fserrinvalid vol ERROR FS=vol,ErrNo=1124,Unknown error=0464000000010000000180A108BC000079B4000000000000003400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1D963707.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From jake.carroll at uq.edu.au Wed Nov 25 21:29:24 2020 From: jake.carroll at uq.edu.au (Jake Carroll) Date: Wed, 25 Nov 2020 21:29:24 +0000 Subject: [gpfsug-discuss] IB routers in ESS configuration + 3 different subnets - valid config? Message-ID: Hi. I am just in the process of sanity-checking a potential future configuration. 
Let's say I have an ESS 5000 and an ESS 3000 placed on the data centre floor to form the basis of a new scratch array. Let's then suppose that I have three existing supercomputers in that same location. Each of those supercomputers has a separate IB subnet and their networks are unrelated to each other, IB-wise. My understanding is that it is valid and possible to use MLNX EDR IB *routers* in order to be able to transport NSD communications back and forth across those separate subnets, back to the ESS (which lives on its own unique subnet). So at this point, I've got four unique subnets - one for the ESS, one for each super. As I understand it, there is an upper limit of *SIX* unique subnets on those EDR IB routers. As I understand it - for IPoIB transport, I'd also need some "gateway" boxes more or less - essentially some decent servers which I put EDR/HDR cards in as dog legs that act as an IPoIB gateway interface to each subnet. I appreciate that there is devil in the detail - but what I'm asking is if it is valid to "route" NSD with IB Routers (not switches) this way to separate subnets. Colleagues at IBM have all said "yeah....should work....we've not done it....but should be fine?" Colleagues at Mellanox (uhhh...nvidia...) say "Yes, this is valid and does exactly as the IB Router should and there is nothing unusual about this". If someone has experience doing this or could call out any oddity/weirdness/gotchas, I'd be very appreciative. I'm fairly sure this is all very low risk - but given nobody locally could tell me "Yeah, all certified and valid!" I'd like the wisdom of the wider crowd. Thank you. --jc -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Fri Nov 27 11:46:05 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Fri, 27 Nov 2020 17:16:05 +0530 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Yeep, >If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? rsync can be used to fix just the metadata, such as ACLs and EAs. AFM does not revalidate files with the source system if rsync changes the ACLs on them, so ACLs can only be fixed during or after the cutover. If that is sufficient, ACL inheritance can also be used by setting ACLs on the required parent directories up front; one user migrated to Scale with this method.
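A rough sketch of such a metadata-only fix-up pass (the paths are hypothetical, and this assumes the source ACLs are visible as POSIX ACLs on the NFS mount; NFSv4 ACLs would need a different approach):

  # -A/-X carry POSIX ACLs and extended attributes; files whose data was already
  # moved by AFM or rsync are not re-transferred, only their metadata is updated
  rsync -aAX --itemize-changes /mnt/isilon_export/ /gpfs/fs1/migrated/
  # spot-check the result on the Scale side
  mmgetacl /gpfs/fs1/migrated/some/dir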
https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm ~Venkat (vpuvvada at in.ibm.com) From: "Frederick Stock" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 11/17/2020 03:14 AM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Have you considered using the AFM feature of Spectrum Scale? I doubt it will provide any speed improvement but it would allow for data to be accessed as it was being migrated. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: Andi Christiansen Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Date: Mon, Nov 16, 2020 2:44 PM Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Best regards T.A. Yeep Mobile: +6-016-719 8506 | Tel: +6-03-7628 0526 | www.robusthpc.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Mon Nov 30 13:49:12 2020 From: carlz at us.ibm.com (Carl Zetie - carlz@us.ibm.com) Date: Mon, 30 Nov 2020 13:49:12 +0000 Subject: [gpfsug-discuss] Licensing costs for data lakes (SSUG follow-up) Message-ID: I am seeking some help on a topic I know many of you care deeply about: licensing costs I am trying to gather some more information about a request that has come up a couple of times, pricing for ?data lakes?. I would like to understand better what people are looking for here. - Is it as simple as ?much steeper discounts for very large deployments?? Or is a ?data lake? something specific, e.g. a large deployment that is not performance/latency sensitive; a storage pool that is [primarily] HDD; a tier that has specific read/write patterns such as moving entire large datasets in or out; or something else? 
Bear in mind that if we have special licensing for data lakes, we need a rigorous definition so that both you and we know whether your use of that licensing is compliant. Nobody likes ambiguity in licensing! - Are you expecting pricing to get very flat/discounting to get steep for large deployments? Or a different price tier/structure for ?data lakes? if we can rigorously define what one means? Do you agree or disagree with the proposition that if you keep adding storage hardware/capacity, that the software licensing cost should rise in proportion (even if that proportion is much smaller for a ?data lake? than for a performance tier)? - Feel free to be creative and imaginative. For example, would you be interested in a low-cost pricing model for storage that is an AFM Home and is _only_ accessed by using AFM to move data in and out of an AFM Cache (probably on the performance tier)? This would be conceptually similar to the way you can now (5.1) use AFM-Object to park data in a cheap object store. - Also feel free to answer questions I didn?t ask? If you prefer to discuss this in Slack rather than email, I started a discussion there a little while ago (please thread your comments!): https://ssug-poweraiug.slack.com/archives/CEVVCEE8M/p1605815075188800 Carl Zetie Program Director Offering Management Spectrum Scale ---- (919) 473 3318 ][ Research Triangle Park carlz at us.ibm.com [signature_1545794140] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 69558 bytes Desc: image001.png URL: From david_johnson at brown.edu Mon Nov 30 21:41:30 2020 From: david_johnson at brown.edu (David Johnson) Date: Mon, 30 Nov 2020 16:41:30 -0500 Subject: [gpfsug-discuss] internal details on GPFS inode expansion Message-ID: When GPFS needs to add inodes to the filesystem, it seems to pre-create about 4 million of them. Judging by the logs, it seems it only takes a few (13 maybe) seconds to do this. However we are suspecting that this might only be to request the additional inodes and that there is some background activity for some time afterwards. Would someone who has knowledge of the actual internals be willing to confirm or deny this, and if there is background activity, is it on all nodes in the cluster, NSD nodes, "default worker nodes"? Thanks, -- ddj Dave Johnson ddj at brown.edu
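For anyone who wants to watch the allocation counters while such an expansion happens, a quick sketch (the file system name gpfs01 is just a placeholder):

  mmlsfileset gpfs01 -L    # MaxInodes / AllocInodes per inode space
  mmdf gpfs01              # the Inode Information section shows used/free/allocated/maximum inodes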
But I already know, that the new S3 policies won't be available, only the "legacy" S3 ACLs. We also tried MinIO but deemed that it's not "production ready". It's fine for quickly setting up a S3 service for development, but they release too often and with breaking changes, and documentation is lacking all aspects regarding maintenance. Regards, Christian Am 27.10.20 um 12:46 schrieb Andi Christiansen: > Hi all, > > We have over a longer period used the S3 API within spectrum Scale.. > And that has shown that it does not support very many applications > because of limitations of the API.. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmtick at us.ibm.com Tue Nov 3 00:21:43 2020 From: jmtick at us.ibm.com (Jacob M Tick) Date: Tue, 3 Nov 2020 00:21:43 +0000 Subject: [gpfsug-discuss] Use cases for file audit logging and clustered watch folder Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 3 17:00:54 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 3 Nov 2020 17:00:54 +0000 Subject: [gpfsug-discuss] SSUG::Digital Scalable multi-node training for AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale Message-ID: Apologies, looks like the calendar invite for this week?s SSUG::Digital didn?t get sent! Nvidia and IBM did a complex proof-of-concept to demonstrate the scaling of AI workload using Nvidia DGX, Red Hat OpenShift and IBM Spectrum Scale at the example of ResNet-50 and the segmentation of images using the Audi A2D2 dataset. The project team published an IBM Redpaper with all the technical details and will present the key learnings and results. >>> Join Here <<< This episode will start 15 minutes later as usual. * San Francisco, USA at 08:15 PST * New York, USA at 11:15 EST * London, United Kingdom at 16:15 GMT * Frankfurt, Germany at 17:15 CET * Pune, India at 21:45 IST -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/calendar Size: 2488 bytes Desc: not available URL: From andi at christiansen.xxx Wed Nov 4 07:14:41 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Wed, 4 Nov 2020 08:14:41 +0100 (CET) Subject: [gpfsug-discuss] Alternative to Scale S3 API. In-Reply-To: <1dffa509-1bcc-0d5a-bf79-fa82746dca07@1und1.de> References: <1109480230.484366.1603799162955@privateemail.com> <1dffa509-1bcc-0d5a-bf79-fa82746dca07@1und1.de> Message-ID: <1512108314.679947.1604474081488@privateemail.com> Hi Christian, Thanks for the information! My question also triggered IBM to tell me the same so i think we will stay on S3 with Scale and hoping the same with the new release.. Yes, MinIO is really lacking some good documentation.. but definatly a cool software package that i will keep an eye on in the future... Best Regards Andi Christiansen > On 11/02/2020 2:44 PM Christian Vieser wrote: > > > > Hi Andi, > > we suffer from the same issue. IBM support told me that Spectrum Scale 5.1 will come with a new release of the underlying Openstack components, so we still hope that some/most of limitations will vanish then. But I already know, that the new S3 policies won't be available, only the "legacy" S3 ACLs. > > We also tried MinIO but deemed that it's not "production ready". 
It's fine for quickly setting up a S3 service for development, but they release too often and with breaking changes, and documentation is lacking all aspects regarding maintenance. > > Regards, > > Christian > > Am 27.10.20 um 12:46 schrieb Andi Christiansen: > > > > Hi all, > > > > We have over a longer period used the S3 API within spectrum Scale.. And that has shown that it does not support very many applications because of limitations of the API.. > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From joe at excelero.com Wed Nov 4 12:19:07 2020 From: joe at excelero.com (joe at excelero.com) Date: Wed, 4 Nov 2020 06:19:07 -0600 Subject: [gpfsug-discuss] Accepted: gpfsug-discuss Digest, Vol 106, Issue 3 Message-ID: <924bb673-0b2a-420a-8ce2-be24c5e6e4e8@Spark> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: reply.ics Type: application/ics Size: 0 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Wed Nov 4 16:05:50 2020 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Wed, 4 Nov 2020 16:05:50 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 106, Issue 3 In-Reply-To: References: Message-ID: Could someone share the password for the event today? Thanks! Thanks, Oluwasijibomi (Siji) Saula HPC Systems Administrator / Information Technology Research 2 Building 220B / Fargo ND 58108-6050 p: 701.231.7749 / www.ndsu.edu [cid:image001.gif at 01D57DE0.91C300C0] ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Wednesday, November 4, 2020 6:00 AM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 106, Issue 3 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. SSUG::Digital Scalable multi-node training for AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale (Simon Thompson) 2. Re: Alternative to Scale S3 API. (Andi Christiansen) ---------------------------------------------------------------------- Message: 1 Date: Tue, 3 Nov 2020 17:00:54 +0000 From: Simon Thompson To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] SSUG::Digital Scalable multi-node training for AI workloads on NVIDIA DGX, Red Hat OpenShift and IBM Spectrum Scale Message-ID: Content-Type: text/plain; charset="utf-8" Apologies, looks like the calendar invite for this week?s SSUG::Digital didn?t get sent! Nvidia and IBM did a complex proof-of-concept to demonstrate the scaling of AI workload using Nvidia DGX, Red Hat OpenShift and IBM Spectrum Scale at the example of ResNet-50 and the segmentation of images using the Audi A2D2 dataset. The project team published an IBM Redpaper with all the technical details and will present the key learnings and results. 
>>> Join Here <<< This episode will start 15 minutes later as usual. * San Francisco, USA at 08:15 PST * New York, USA at 11:15 EST * London, United Kingdom at 16:15 GMT * Frankfurt, Germany at 17:15 CET * Pune, India at 21:45 IST -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: text/calendar Size: 2488 bytes Desc: not available URL: ------------------------------ Message: 2 Date: Wed, 4 Nov 2020 08:14:41 +0100 (CET) From: Andi Christiansen To: gpfsug main discussion list , Christian Vieser Subject: Re: [gpfsug-discuss] Alternative to Scale S3 API. Message-ID: <1512108314.679947.1604474081488 at privateemail.com> Content-Type: text/plain; charset="utf-8" Hi Christian, Thanks for the information! My question also triggered IBM to tell me the same so i think we will stay on S3 with Scale and hoping the same with the new release.. Yes, MinIO is really lacking some good documentation.. but definatly a cool software package that i will keep an eye on in the future... Best Regards Andi Christiansen > On 11/02/2020 2:44 PM Christian Vieser wrote: > > > > Hi Andi, > > we suffer from the same issue. IBM support told me that Spectrum Scale 5.1 will come with a new release of the underlying Openstack components, so we still hope that some/most of limitations will vanish then. But I already know, that the new S3 policies won't be available, only the "legacy" S3 ACLs. > > We also tried MinIO but deemed that it's not "production ready". It's fine for quickly setting up a S3 service for development, but they release too often and with breaking changes, and documentation is lacking all aspects regarding maintenance. > > Regards, > > Christian > > Am 27.10.20 um 12:46 schrieb Andi Christiansen: > > > > Hi all, > > > > We have over a longer period used the S3 API within spectrum Scale.. And that has shown that it does not support very many applications because of limitations of the API.. > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 106, Issue 3 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From herrmann at sprintmail.com Sat Nov 7 21:10:36 2020 From: herrmann at sprintmail.com (Ron H) Date: Sat, 7 Nov 2020 16:10:36 -0500 Subject: [gpfsug-discuss] Use cases for file audit logging and clusteredwatch folder In-Reply-To: References: Message-ID: <8F771847BDEB4447919D30A16FE48FAB@rone8PC> Hi Jacob, Can you point me to a good overview of each of these features? I know File Audit and Watch is part of the DME Scale edition license, but I can?t seem to find a good explanation of what these features can offer. 
Thanks Ron From: Jacob M Tick Sent: Monday, November 02, 2020 7:21 PM To: gpfsug-discuss at spectrumscale.org Cc: April Brown Subject: [gpfsug-discuss] Use cases for file audit logging and clusteredwatch folder Hi All, I am reaching out on behalf of the Spectrum Scale development team to get some insight on how our customers are using the file audit logging and the clustered watch folder features. If you have it enabled in your test or production environment, could you please elaborate on how and why you are using the feature? Also, knowing how you have the function configured (ie: watching or auditing for certain events, only enabling on certain filesets, ect..) would help us out. Please respond back to April, John (both on CC), and I with any info you are willing to provide. Thanks in advance! Regards, Jake Tick Manager Spectrum Scale - Scalable Data Interfaces IBM Systems Group Email:jmtick at us.ibm.com IBM -------------------------------------------------------------------------------- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jmtick at us.ibm.com Mon Nov 9 17:31:00 2020 From: jmtick at us.ibm.com (Jacob M Tick) Date: Mon, 9 Nov 2020 17:31:00 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Use_cases_for_file_audit_logging_and?= =?utf-8?q?=09clusteredwatch_folder?= In-Reply-To: <8F771847BDEB4447919D30A16FE48FAB@rone8PC> References: <8F771847BDEB4447919D30A16FE48FAB@rone8PC>, Message-ID: An HTML attachment was scrubbed... URL: From Kamil.Czauz at Squarepoint-Capital.com Wed Nov 11 22:29:31 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Wed, 11 Nov 2020 22:29:31 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Message-ID: We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Thu Nov 12 01:56:46 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 12 Nov 2020 02:56:46 +0100 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first things to look at are the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and the current waiters (that is, the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.* (usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility, tsfindinode, to translate that into the file path. You need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make, then run ./tsfindinode -i For the IO trace analysis there is an older tool: /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README). Hope that helps a bit. Mit freundlichen Grüßen / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Geschäftsführung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like an ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu.
The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes are being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From luis.bolinches at fi.ibm.com Thu Nov 12 13:19:05 2020 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Thu, 12 Nov 2020 13:19:05 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From jyyum at kr.ibm.com Thu Nov 12 14:10:17 2020 From: jyyum at kr.ibm.com (Jae Yoon Yum) Date: Thu, 12 Nov 2020 14:10:17 +0000 Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163322.png Type: image/png Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163323.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: From Eric.Wendel at ibm.com Thu Nov 12 15:43:46 2020 From: Eric.Wendel at ibm.com (Eric Wendel - Eric.Wendel@ibm.com) Date: Thu, 12 Nov 2020 15:43:46 +0000 Subject: [gpfsug-discuss] Problems reading emails to the mailing list Message-ID: <31233620a4324240885aed7ad18a729a@ibm.com> Hi Folks, As you are no doubt aware, Lotus Notes and its ecosystem are virtually extinct. For those of us who have moved on to more modern email clients (including an increasing number of IBMers like me), the email links we receive from SSUG (for example 'OF0433B7F4.580A7B75-ON0025861E.004DD432-0025861E.004DD8A4 at notes.na.collabserv.com') are useless because they can only be read if you have the Notes client installed. This is especially problematic for Linux users as the Linux client for Notes is discontinued. It would be very helpful if the SSUG could move to a modern email platform.
Thanks, Eric Wendel eric.wendel at ibm.com -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of gpfsug-discuss-request at spectrumscale.org Sent: Thursday, November 12, 2020 8:10 AM To: gpfsug-discuss at spectrumscale.org Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 106, Issue 8 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Poor client performance with high cpu usage of mmfsd process (Luis Bolinches) 2. Question about the Clearing Spectrum Scale GUI event (Jae Yoon Yum) ---------------------------------------------------------------------- Message: 1 Date: Thu, 12 Nov 2020 13:19:05 +0000 From: "Luis Bolinches" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Thu, 12 Nov 2020 14:10:17 +0000 From: "Jae Yoon Yum" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Message-ID: Content-Type: text/plain; charset="us-ascii" An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163322.png Type: image/png Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.14713274163323.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 106, Issue 8 ********************************************** From stefan.roth at de.ibm.com Thu Nov 12 17:13:38 2020 From: stefan.roth at de.ibm.com (Stefan Roth) Date: Thu, 12 Nov 2020 18:13:38 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Question_about_the_Clearing_Spectrum_S?= =?utf-8?q?cale_GUI=09event?= In-Reply-To: References: Message-ID: Hello Jay, as long as those errors are still shown by "mmhealth node show" CLI command, they will again appear in the GUI. In the GUI events table you can show an "Event Type" column which is hidden by default. Events that have event type "Notice" can be cleared by the "Mark as Read" action. Events that have event type "State" can not be cleared by the "Mark as Read" action. They have to disappear by solving the problem. If a problem is solved the error should disappear from "mmhealth node show" and after that it will disappear from the GUI as well. 
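To illustrate, this is the kind of CLI cross-check that can be run next to the GUI (just a sketch, not an official procedure; output layout and event names vary between releases):

# health state per component on this node - the state the GUI events view reflects
mmhealth node show --verbose

# the same view across the whole cluster
mmhealth cluster show

# history of events recorded on this node
mmhealth node eventlog

As long as an unhealthy state is still reported there, the corresponding entry will come back in the GUI; the GUI-side reset mentioned elsewhere in this thread (/usr/lpp/mmfs/gui/cli/lshealth --reset) only clears what the GUI itself has recorded, as far as I know.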
Mit freundlichen Gr??en / Kind regards Stefan Roth Spectrum Scale Developement Phone: +49 162 4159934 IBM Deutschland Research & Development GmbH Email: stefan.roth at de.ibm.com Am Weiher 24 65451 Kelsterbach IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Jae Yoon Yum" To: gpfsug-discuss at spectrumscale.org Date: 12.11.2020 15:10 Subject: [EXTERNAL] [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? events on the Scale GUI. As you know, there is ?mark as read? for ?warning? messages but there isn?t for ?ERROR?. (In fact, the ?mark as read? button is exist but it does not work.) So I sent him to run this command on cli. /usr/lpp/mmfs/gui/cli/lshealth --reset On my test VM, all of the error messages has been cleared when I run the command?. But, for the client?s system, client said that ?All of the error / warning messages had been appeared again include the one which I had delete by clicking ?mark as read?.? Does anyone who has similar experience like this? and How Could I solve this problem? Or, Is there any way to clear the event one by one? * I sent the same message to the Slack 'scale-help' channel. Thanks. Jay. Best Regards, JaeYoon(Jay) IBM Korea, Three IFC, Yum 10 Gukjegeumyung-ro, Yeongdeungpo-gu, IBM Systems Seoul, Korea Hardware, Storage Technical Sales Mobile : +82-10-4995-4814 07326 e-mail: jyyum at kr.ibm.com ? ??? ??? ??, ??? ??, ???? ???? ???? ?????. ??? ??? ??? ??? ????, ????? ??? ?? ?????,??IBM ??? ? ?? ??? ????, ????? ???? ????. (If you don't wish to receive e-mail from sender, please send e-mail directly. For IBM e-mail, please click here). ??? ???? ??,??, ?? ??? ???????(??: 02-3781-7800,? ?? mktg at kr.ibm.com )? ?? ?? ???? ?? ???? ? ????. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E506389.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E764757.gif Type: image/gif Size: 262 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1E982001.jpg Type: image/jpeg Size: 2457 bytes Desc: not available URL: From arc at b4restore.com Thu Nov 12 17:33:01 2020 From: arc at b4restore.com (=?utf-8?B?QW5kaSBOw7hyIENocmlzdGlhbnNlbg==?=) Date: Thu, 12 Nov 2020 17:33:01 +0000 Subject: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event In-Reply-To: References: Message-ID: Hi Jay, First of you need to make sure your system is actually healthy. Events that are not fixed will reappear. I have had a lot of ?stale? 
entries happening over the last years and more often than not ?/usr/lpp/mmfs/gui/cli/lshealth ?reset? clears the entries if they are not actual faults.. As Stefan says if the errors/warnings are shown in ?mmhealth node show or mmhealth cluster show? they will reappear as they should. (I have sometimes seen stale entries there aswell) When I have encountered stale entries which wasn?t cleared with ?lshealth ?reset? I could clear them with ?mmsysmoncontrol restart?. I think I actually run that command maybe once or twice every month because of stale entries in the GUI og mmhealth itself.. don?t know why they happen but they seem to appear more frequently for me atleast.. I have high hopes for the 5.1.0.0/5.1.0.1 release as I have heard there should be some new things for the GUI as well.. not sure what they are yet though 😊 Hope this helps. Cheers A. Christiansen Fra: gpfsug-discuss-bounces at spectrumscale.org P? vegne af Stefan Roth Sendt: Thursday, November 12, 2020 6:14 PM Til: gpfsug main discussion list Emne: Re: [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Hello Jay, as long as those errors are still shown by "mmhealth node show" CLI command, they will again appear in the GUI. In the GUI events table you can show an "Event Type" column which is hidden by default. Events that have event type "Notice" can be cleared by the "Mark as Read" action. Events that have event type "State" can not be cleared by the "Mark as Read" action. They have to disappear by solving the problem. If a problem is solved the error should disappear from "mmhealth node show" and after that it will disappear from the GUI as well. Mit freundlichen Gr??en / Kind regards Stefan Roth Spectrum Scale Developement ________________________________ Phone: +49 162 4159934 IBM Deutschland Research & Development GmbH [cid:image002.gif at 01D6B922.3FE99E70] Email: stefan.roth at de.ibm.com Am Weiher 24 65451 Kelsterbach ________________________________ IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 [cid:image003.gif at 01D6B922.3FE99E70]"Jae Yoon Yum" ---12.11.2020 15:10:35---Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? ev From: "Jae Yoon Yum" > To: gpfsug-discuss at spectrumscale.org Date: 12.11.2020 15:10 Subject: [EXTERNAL] [gpfsug-discuss] Question about the Clearing Spectrum Scale GUI event Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Team, I hope you all stay safe from COVID 19, One of my client wants to clear their ?ERROR? events on the Scale GUI. As you know, there is ?mark as read? for ?warning? messages but there isn?t for ?ERROR?. (In fact, the ?mark as read? button is exist but it does not work.) So I sent him to run this command on cli. /usr/lpp/mmfs/gui/cli/lshealth --reset On my test VM, all of the error messages has been cleared when I run the command?. But, for the client?s system, client said that ?All of the error / warning messages had been appeared again include the one which I had delete by clicking ?mark as read?.? Does anyone who has similar experience like this? and How Could I solve this problem? Or, Is there any way to clear the event one by one? * I sent the same message to the Slack 'scale-help' channel. Thanks. Jay. 
Best Regards, JaeYoon(Jay) Yum IBM Korea, Three IFC, [cid:image005.jpg at 01D6B922.3FE99E70] 10 Gukjegeumyung-ro, Yeongdeungpo-gu, IBM Systems Hardware, Storage Technical Sales Seoul, Korea Mobile : +82-10-4995-4814 07326 e-mail: jyyum at kr.ibm.com ? ??? ??? ??, ??? ??, ???? ???? ???? ?????. ??? ??? ??? ??? ????, ????? ??? ?? ?????,??IBM ??? ??? ??? ????, ????? ???? ????. (If you don't wish to receive e-mail from sender, please send e-mail directly. For IBM e-mail, please click here). ??? ???? ??,??, ?? ??? ???????(??: 02-3781-7800,??? mktg at kr.ibm.com )? ?? ?? ???? ?? ???? ? ????. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 1851 bytes Desc: image002.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 105 bytes Desc: image003.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image005.jpg Type: image/jpeg Size: 2457 bytes Desc: image005.jpg URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image006.png Type: image/png Size: 166 bytes Desc: image006.png URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image007.png Type: image/png Size: 616 bytes Desc: image007.png URL: From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 13 02:33:17 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 13 Nov 2020 02:33:17 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? 
Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 
61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 
0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Fri Nov 13 09:21:17 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 13 Nov 2020 10:21:17 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu_?= =?utf-8?q?usage=09of=09mmfsd_process?= In-Reply-To: References: Message-ID: Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? 
The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 
0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 
2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Fri Nov 13 09:37:04 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 13 Nov 2020 10:37:04 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu_?= =?utf-8?q?usage=09of=09mmfsd_process?= In-Reply-To: References: Message-ID: Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? 
Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 
61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 
0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. 
you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. 
Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 13 13:31:21 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 13 Nov 2020 13:31:21 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Hi Uwe - Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s). We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? 
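Two quick looks that can help answer that, sketched under a couple of assumptions: the gpfs_i_lookup / gpfs_v_lookup record names are only a guess echoed later in this thread, and the one-liner over the summary assumes the trsum.awk column layout shown in these captures.

    TRCRPT=/tmp/mmfs/trcrpt.latest   # illustrative path to the ASCII trace report
    TRSUM=trsum.out                  # output of trsum.awk for that report

    # Which lookup trace records appear, and for which threads/inodes?
    grep -E 'gpfs_i_lookup|gpfs_v_lookup' "$TRCRPT" | head -20

    # Rank the VFS operations in the trsum.awk summary by total time, largest first:
    awk '/^Operations stats/ {on=1}
         /^User thread stats/ {on=0}
         on && $2 ~ /^[0-9]+\.[0-9]+$/ {printf "%14.9f  %s\n", $2, $1}' "$TRSUM" |
      sort -rn | head

Neither step replaces trsum.awk; the ranking only confirms which VFS operation dominates, and the grep against the raw report narrows down which lookups are the slow ones.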
Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 
4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Fri Nov 13 13:38:48 2020 From: stockf at us.ibm.com (Frederick Stock) Date: Fri, 13 Nov 2020 13:38:48 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Poor_client_performance_with_high_cpu?= =?utf-8?q?=09usage=09of=09mmfsd_process?= In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From kkr at lbl.gov Fri Nov 13 21:11:16 2020 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Fri, 13 Nov 2020 13:11:16 -0800 Subject: [gpfsug-discuss] REMINDER - SC20 Sessions - Monday Nov. 16 and Wednesday Nov. 18 Message-ID: <7B85E526-88D4-44AE-B034-4EC5A61E524C@lbl.gov> Hi all, A Reminder to attend and also submit any panel questions for the Wednesday session. So far, there are 3 questions around these topics: 1) excessive prefetch when reading small fractions of many large files 2) improved the integration between TSM and GPFS 3) number of security vulnerabilities in GPFS, the GUI, ESS, or something else related Bring on your tough questions and make it interesting. 
Cheers,
Kristy

--- original email ---
The Spectrum Scale User Group will be hosting two 90 minute sessions at SC20 this year and we hope you can join us. The first one is: "Storage for AI" and will be held Monday, Nov. 16th, from 11:00-12:30 EST, and the second one is "What's new in Spectrum Scale 5.1?" and will be held Wednesday, Nov. 18th from 11:00-12:30 EST. Please see the calendar at https://www.spectrumscaleug.org/eventslist/2020-11/ and register by clicking on a session on the calendar and then the "Please register here to join the session" link.

Best,
Kristy

Kristy Kallback-Rose
Senior HPC Storage Systems Analyst
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory

From UWEFALKE at de.ibm.com  Mon Nov 16 13:45:57 2020
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Mon, 16 Nov 2020 14:45:57 +0100
Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process
In-Reply-To:
References:
Message-ID:

Hi, while the other nodes can well block the local one, as Frederick suggests, there should at least be something visible locally waiting for these other nodes. Looking at all waiters might be a good thing, but this case looks strange in other ways. Mind the statement that there are almost no local waiters and that none of them gets older than 10 ms. I am no developer nor do I have the code, so don't expect too much. Can you tell what lookups you see (check in the trcrpt file; the records could be like gpfs_i_lookup or gpfs_v_lookup)? Lookups are metadata ops; do you have a separate pool for your metadata? How is that pool set up (down to the physical block devices)? Your trsum output revealed 36 lookups, each one on average taking >30 ms. That is a lot (albeit the respective waiters won't show up at first glance as suspicious ...). So, which waiters did you see (hope you saved them; if not, do it next time)? What are the node you see this on, and the whole cluster, used for? What is the MaxFilesToCache setting (for that node and for others)? What HW is that, and how big are your nodes (memory, CPU)? To check the unreasonably short trace capture time: how large are the trcrpt files you obtain?

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services
+49 175 575 2877 Mobile
Rathausstr. 7, 09111 Chemnitz, Germany
uwefalke at de.ibm.com

IBM Services
IBM Data Privacy Statement
IBM Deutschland Business & Technology Services GmbH
Geschäftsführung: Sven Schooss, Stefan Hierl
Sitz der Gesellschaft: Ehningen
Registergericht: Amtsgericht Stuttgart, HRB 17122
We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 
Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. 
That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
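A minimal sketch of that first look on the hanging client, using only the commands above (the output file name is illustrative):

    OUT=/tmp/gpfs-firstlook.$(date +%s)
    {
      echo "== recent GPFS log ==";   tail -n 100 /var/adm/ras/mmfs.log.latest
      echo "== current waiters ==";   /usr/lpp/mmfs/bin/mmdiag --waiters
      echo "== recent IO history =="; /usr/lpp/mmfs/bin/mmdiag --iohist
    } > "$OUT" 2>&1
    echo "collected in $OUT"
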
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From andi at christiansen.xxx Mon Nov 16 19:44:14 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Mon, 16 Nov 2020 20:44:14 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Message-ID: <1388247256.209171.1605555854969@privateemail.com> Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! 
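For concreteness, the fan-out we run today is roughly of this shape, one rsync per top-level directory of the NFS-mounted export (paths, mount points and the per-node job count here are purely illustrative):

    SRC=/mnt/isilon_export    # Isilon NFS export mounted on this Scale node
    DST=/gpfs/fs1/target      # destination on the Scale file system
    # one rsync per top-level directory, at most 8 in flight on this node
    find "$SRC" -mindepth 1 -maxdepth 1 -type d -print0 |
      xargs -0 -P8 -I{} rsync -aHAX --numeric-ids {} "$DST"/
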
Thanks in advance Andi Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Mon Nov 16 21:44:30 2020 From: stockf at us.ibm.com (Frederick Stock) Date: Mon, 16 Nov 2020 21:44:30 +0000 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: An HTML attachment was scrubbed... URL: From skylar2 at uw.edu Mon Nov 16 21:58:19 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Mon, 16 Nov 2020 13:58:19 -0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <20201116215819.wda6nophekamzs3v@thargelion> When we did a similar (though larger, at ~2.5PB) migration, we used rsync as well, but ran one rsync process per Isilon node, and made sure the NFS clients were hitting separate Isilon nodes for their reads. We also didn't have more than one rsync process running per client, as the Linux NFS client (at least in CentOS 6) was terrible when it came to concurrent access. Whatever method you end up using, I can guarantee you will be much happier once you are on GPFS. :) On Mon, Nov 16, 2020 at 08:44:14PM +0100, Andi Christiansen wrote: > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... > > as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From jonathan.buzzard at strath.ac.uk Mon Nov 16 22:58:49 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 16 Nov 2020 22:58:49 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <4de1fa02-a074-0901-cf12-31be9e843f5f@strath.ac.uk> On 16/11/2020 19:44, Andi Christiansen wrote: > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS > from two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching > from 2.. > > does anyone know of a way to speed it up? 
right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit > from scale nodes and 20Gbits from isilon so we should be able to reach > just under 20Gbit... > > > if anyone have any ideas they are welcome! > My biggest recommendation when doing this is to use a sqlite database to keep track of what is going on. The main issue is that you are almost certainly going to need to do more than one rsync pass unless your source Isilon system has no user activity, and with 700TB to move that seems unlikely. Typically you do an initial rsync to move the bulk of the data while the users are still live, then shutdown user access to the source system and do the final rsync which hopefully has a significantly smaller amount of data to actually move. So this is what I have done on a number of occasions now. I create a very simple sqlite DB with a list of source and destination folders and a status code. Initially the status code is set to -1. Then I have a perl script which looks at the sqlite DB, picks a row with a status code of -1, and sets the status code to -2, aka that directory is in progress. It then proceeds to run the rsync and when it finishes it updates the status code to the exit code of the rsync process. As long as all the rsync processes have access to the same copy of the sqlite DB (simplest to put it on either the source or destination file system) then all is good. You can fire off multiple rsync's on multiple nodes and they will all keep churning away till there is no more work to be done. The advantage is you can easily interrogate the DB to find out the state of play. That is how many of your transfers have completed, how many are yet to be done, which ones are currently being transferred etc. without logging onto multiple nodes. *MOST* importantly you can see if any of the rsync's had an error, by simply looking for status codes greater than zero. I cannot stress how important this is. Noting that if the source is still active you will see errors down to files being deleted on the source file system before rsync has a chance to copy them. However this has a specific exit code (24) so is easy to spot and not worry about. Finally it is also very simple to set the status codes to -1 again and set the process away again. So the final run is easier to do. If you want to mail me off list I can dig out a copy of the perl code I used if your interested. There are several version as I have tended to tailor to each transfer. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From jonathan.buzzard at strath.ac.uk Mon Nov 16 23:12:47 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 16 Nov 2020 23:12:47 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <20201116215819.wda6nophekamzs3v@thargelion> References: <1388247256.209171.1605555854969@privateemail.com> <20201116215819.wda6nophekamzs3v@thargelion> Message-ID: <8d4d2987-77dd-e3e1-1c98-a635f1b96ddd@strath.ac.uk> On 16/11/2020 21:58, Skylar Thompson wrote: > When we did a similar (though larger, at ~2.5PB) migration, we used rsync > as well, but ran one rsync process per Isilon node, and made sure the NFS > clients were hitting separate Isilon nodes for their reads. 
We also didn't > have more than one rsync process running per client, as the Linux NFS > client (at least in CentOS 6) was terrible when it came to concurrent access. > The million dollar question IMHO is the number of files and their sizes. Basically if you have a million 1KB files to move it is going to take much longer than a 100 1GB files. That is the overhead of dealing with each file is a real bitch and kills your attainable transfer speed stone dead. One option I have used in the past is to use your last backup and restore to the new system, then rsync in the changes. That way you don't impact the source file system which is live. Another option I have used is to inform users in advance that data will be transferred based on a metric of how many files and how much data they have. So the less data and fewer files the quicker you will get access to the new system once access to the old system is turned off. It is amazing how much users clear up junk under this scenario. Last time I did this a single user went from over 17 million files to 11 thousand! In total many many TB of data just vanished from the system (around half of the data when puff) as users actually got around to some house keeping LOL. Moving less data and files is always less painful. > Whatever method you end up using, I can guarantee you will be much happier > once you are on GPFS. :) > Goes without saying :-) JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From UWEFALKE at de.ibm.com Tue Nov 17 08:50:56 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 17 Nov 2020 09:50:56 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: <1388247256.209171.1605555854969@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Andi, what about leaving NFS completeley out and using rsync (multiple rsyncs in parallel, of course) directly between your source and target servers? I am not sure how many TCP connections (suppose it is NFS4) in parallel are opened between client and server, using a 2x bonded interface well requires at least two. That combined with the DB approach suggested by Jonathan to control the activity of the rsync streams would be my best guess. If you have many small files, the overhead might still kill you. Tarring them up into larger aggregates for transfer would help a lot, but then you must be sure they won't change or you need to implement your own version control for that class of files. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Andi Christiansen To: "gpfsug-discuss at spectrumscale.org" Date: 16/11/2020 20:44 Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... 
as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Tue Nov 17 08:57:07 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 17 Nov 2020 09:57:07 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but it is over two nodes, so no bonding. But still, I'd expect to open several TCP connections in parallel per source-target pair (like with several rsyncs per source node) would bear an advantage (and still I thing NFS doesn't do that, but I can be wrong). If more nodes have access to the Isilon data they could also participate (and don't need NFS exports for that). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 17/11/2020 09:50 Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Hi Andi, what about leaving NFS completeley out and using rsync (multiple rsyncs in parallel, of course) directly between your source and target servers? I am not sure how many TCP connections (suppose it is NFS4) in parallel are opened between client and server, using a 2x bonded interface well requires at least two. That combined with the DB approach suggested by Jonathan to control the activity of the rsync streams would be my best guess. If you have many small files, the overhead might still kill you. Tarring them up into larger aggregates for transfer would help a lot, but then you must be sure they won't change or you need to implement your own version control for that class of files. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Andi Christiansen To: "gpfsug-discuss at spectrumscale.org" Date: 16/11/2020 20:44 Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From andi at christiansen.xxx Tue Nov 17 11:51:58 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 12:51:58 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: <616234716.258600.1605613918767@privateemail.com> Hi all, thanks for all the information, there was some interesting things amount it.. I kept on going with rsync and ended up making a file with all top level user directories and splitting them into chunks of 347 per rsync session(total 42000 ish folders). yesterday we had only 14 sessions with 3000 folders in each and that was too much work for one rsync session.. i divided them out among all GPFS nodes to have them fetch an area each and actually doing that 3 times on each node and that has now boosted the bandwidth usage from 3Gbit to around 16Gbit in total.. all nodes have been seing doing work above 7Gbit individual which is actually near to what i was expecting without any modifications to the NFS server or TCP tuning.. CPU is around 30-50% on each server and mostly below or around 30% so it seems like it could have handled abit more sessions.. Small files are really a killer but with all 96+ sessions we have now its not often all sessions are handling small files at the same time so we have an average of about 10-12Gbit bandwidth usage. Thanks all! ill keep you in mind if for some reason we see it slowing down again but for now i think we will try to see if it will go the last mile with a bit more sessions on each :) Best Regards Andi Christiansen > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but > it is over two nodes, so no bonding. But still, I'd expect to open several > TCP connections in parallel per source-target pair (like with several > rsyncs per source node) would bear an advantage (and still I thing NFS > doesn't do that, but I can be wrong). > If more nodes have access to the Isilon data they could also participate > (and don't need NFS exports for that). > > Mit freundlichen Gr??en / Kind regards > > Dr. Uwe Falke > IT Specialist > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > Services > +49 175 575 2877 Mobile > Rathausstr. 
7, 09111 Chemnitz, Germany > uwefalke at de.ibm.com > > IBM Services > > IBM Data Privacy Statement > > IBM Deutschland Business & Technology Services GmbH > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > Sitz der Gesellschaft: Ehningen > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > From: Uwe Falke/Germany/IBM > To: gpfsug main discussion list > Date: 17/11/2020 09:50 > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > from Isilon to Scale over NFS? > > > Hi Andi, > > what about leaving NFS completeley out and using rsync (multiple rsyncs > in parallel, of course) directly between your source and target servers? > I am not sure how many TCP connections (suppose it is NFS4) in parallel > are opened between client and server, using a 2x bonded interface well > requires at least two. That combined with the DB approach suggested by > Jonathan to control the activity of the rsync streams would be my best > guess. > If you have many small files, the overhead might still kill you. Tarring > them up into larger aggregates for transfer would help a lot, but then you > must be sure they won't change or you need to implement your own version > control for that class of files. > > Mit freundlichen Gr??en / Kind regards > > Dr. Uwe Falke > IT Specialist > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > Services > +49 175 575 2877 Mobile > Rathausstr. 7, 09111 Chemnitz, Germany > uwefalke at de.ibm.com > > IBM Services > > IBM Data Privacy Statement > > IBM Deutschland Business & Technology Services GmbH > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > Sitz der Gesellschaft: Ehningen > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > From: Andi Christiansen > To: "gpfsug-discuss at spectrumscale.org" > > Date: 16/11/2020 20:44 > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > Isilon to Scale over NFS? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS from > two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching from > 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > scale nodes and 20Gbits from isilon so we should be able to reach just > under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Tue Nov 17 12:07:30 2020 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 17 Nov 2020 13:07:30 +0100 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <616234716.258600.1605613918767@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: Nice to see it working well! But, what about ACLs? 
Does you rsync pull in all needed metadata, or do you also need to sync ACLs ? Any plans for how to solve that ? On Tue, Nov 17, 2020 at 12:52 PM Andi Christiansen wrote: > Hi all, > > thanks for all the information, there was some interesting things amount > it.. > > I kept on going with rsync and ended up making a file with all top level > user directories and splitting them into chunks of 347 per rsync > session(total 42000 ish folders). yesterday we had only 14 sessions with > 3000 folders in each and that was too much work for one rsync session.. > > i divided them out among all GPFS nodes to have them fetch an area each > and actually doing that 3 times on each node and that has now boosted the > bandwidth usage from 3Gbit to around 16Gbit in total.. > > all nodes have been seing doing work above 7Gbit individual which is > actually near to what i was expecting without any modifications to the NFS > server or TCP tuning.. > > CPU is around 30-50% on each server and mostly below or around 30% so it > seems like it could have handled abit more sessions.. > > Small files are really a killer but with all 96+ sessions we have now its > not often all sessions are handling small files at the same time so we have > an average of about 10-12Gbit bandwidth usage. > > Thanks all! ill keep you in mind if for some reason we see it slowing down > again but for now i think we will try to see if it will go the last mile > with a bit more sessions on each :) > > Best Regards > Andi Christiansen > > > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, > but > > it is over two nodes, so no bonding. But still, I'd expect to open > several > > TCP connections in parallel per source-target pair (like with several > > rsyncs per source node) would bear an advantage (and still I thing NFS > > doesn't do that, but I can be wrong). > > If more nodes have access to the Isilon data they could also participate > > (and don't need NFS exports for that). > > > > Mit freundlichen Gr??en / Kind regards > > > > Dr. Uwe Falke > > IT Specialist > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > Services > > +49 175 575 2877 Mobile > > Rathausstr. 7, 09111 Chemnitz, Germany > > uwefalke at de.ibm.com > > > > IBM Services > > > > IBM Data Privacy Statement > > > > IBM Deutschland Business & Technology Services GmbH > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > Sitz der Gesellschaft: Ehningen > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > From: Uwe Falke/Germany/IBM > > To: gpfsug main discussion list > > Date: 17/11/2020 09:50 > > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > > from Isilon to Scale over NFS? > > > > > > Hi Andi, > > > > what about leaving NFS completeley out and using rsync (multiple rsyncs > > in parallel, of course) directly between your source and target servers? > > I am not sure how many TCP connections (suppose it is NFS4) in parallel > > are opened between client and server, using a 2x bonded interface well > > requires at least two. That combined with the DB approach suggested by > > Jonathan to control the activity of the rsync streams would be my best > > guess. > > If you have many small files, the overhead might still kill you. Tarring > > them up into larger aggregates for transfer would help a lot, but then > you > > must be sure they won't change or you need to implement your own version > > control for that class of files. 
> > > > Mit freundlichen Gr??en / Kind regards > > > > Dr. Uwe Falke > > IT Specialist > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > Services > > +49 175 575 2877 Mobile > > Rathausstr. 7, 09111 Chemnitz, Germany > > uwefalke at de.ibm.com > > > > IBM Services > > > > IBM Data Privacy Statement > > > > IBM Deutschland Business & Technology Services GmbH > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > Sitz der Gesellschaft: Ehningen > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > From: Andi Christiansen > > To: "gpfsug-discuss at spectrumscale.org" > > > > Date: 16/11/2020 20:44 > > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > > Isilon to Scale over NFS? > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > Hi all, > > > > i have got a case where a customer wants 700TB migrated from isilon to > > Scale and the only way for him is exporting the same directory on NFS > from > > two different nodes... > > > > as of now we are using multiple rsync processes on different parts of > > folders within the main directory. this is really slow and will take > > forever.. right now 14 rsync processes spread across 3 nodes fetching > from > > 2.. > > > > does anyone know of a way to speed it up? right now we see from 1Gbit to > > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit > from > > scale nodes and 20Gbits from isilon so we should be able to reach just > > under 20Gbit... > > > > > > if anyone have any ideas they are welcome! > > > > > > Thanks in advance > > Andi Christiansen _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From andi at christiansen.xxx Tue Nov 17 12:24:22 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 13:24:22 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: <1023406427.259407.1605615862969@privateemail.com> Hi Jan, We are syncing ACLs, groups, owners and timestamps aswell :) /Andi Christiansen > On 11/17/2020 1:07 PM Jan-Frode Myklebust wrote: > > > Nice to see it working well! > > But, what about ACLs? Does you rsync pull in all needed metadata, or do you also need to sync ACLs ? Any plans for how to solve that ? > > On Tue, Nov 17, 2020 at 12:52 PM Andi Christiansen wrote: > > > > Hi all, > > > > thanks for all the information, there was some interesting things amount it.. > > > > I kept on going with rsync and ended up making a file with all top level user directories and splitting them into chunks of 347 per rsync session(total 42000 ish folders). yesterday we had only 14 sessions with 3000 folders in each and that was too much work for one rsync session.. 
> > > > i divided them out among all GPFS nodes to have them fetch an area each and actually doing that 3 times on each node and that has now boosted the bandwidth usage from 3Gbit to around 16Gbit in total.. > > > > all nodes have been seing doing work above 7Gbit individual which is actually near to what i was expecting without any modifications to the NFS server or TCP tuning.. > > > > CPU is around 30-50% on each server and mostly below or around 30% so it seems like it could have handled abit more sessions.. > > > > Small files are really a killer but with all 96+ sessions we have now its not often all sessions are handling small files at the same time so we have an average of about 10-12Gbit bandwidth usage. > > > > Thanks all! ill keep you in mind if for some reason we see it slowing down again but for now i think we will try to see if it will go the last mile with a bit more sessions on each :) > > > > Best Regards > > Andi Christiansen > > > > > On 11/17/2020 9:57 AM Uwe Falke wrote: > > > > > > > > > Hi, Andi, sorry I just took your 20Gbit for the sign of 2x10Gbps bons, but > > > it is over two nodes, so no bonding. But still, I'd expect to open several > > > TCP connections in parallel per source-target pair (like with several > > > rsyncs per source node) would bear an advantage (and still I thing NFS > > > doesn't do that, but I can be wrong). > > > If more nodes have access to the Isilon data they could also participate > > > (and don't need NFS exports for that). > > > > > > Mit freundlichen Gr??en / Kind regards > > > > > > Dr. Uwe Falke > > > IT Specialist > > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > > Services > > > +49 175 575 2877 Mobile > > > Rathausstr. 7, 09111 Chemnitz, Germany > > > uwefalke at de.ibm.com mailto:uwefalke at de.ibm.com > > > > > > IBM Services > > > > > > IBM Data Privacy Statement > > > > > > IBM Deutschland Business & Technology Services GmbH > > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > > Sitz der Gesellschaft: Ehningen > > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > > > From: Uwe Falke/Germany/IBM > > > To: gpfsug main discussion list > > > Date: 17/11/2020 09:50 > > > Subject: Re: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data > > > from Isilon to Scale over NFS? > > > > > > > > > Hi Andi, > > > > > > what about leaving NFS completeley out and using rsync (multiple rsyncs > > > in parallel, of course) directly between your source and target servers? > > > I am not sure how many TCP connections (suppose it is NFS4) in parallel > > > are opened between client and server, using a 2x bonded interface well > > > requires at least two. That combined with the DB approach suggested by > > > Jonathan to control the activity of the rsync streams would be my best > > > guess. > > > If you have many small files, the overhead might still kill you. Tarring > > > them up into larger aggregates for transfer would help a lot, but then you > > > must be sure they won't change or you need to implement your own version > > > control for that class of files. > > > > > > Mit freundlichen Gr??en / Kind regards > > > > > > Dr. Uwe Falke > > > IT Specialist > > > Hybrid Cloud Infrastructure / Technology Consulting & Implementation > > > Services > > > +49 175 575 2877 Mobile > > > Rathausstr. 
7, 09111 Chemnitz, Germany > > > uwefalke at de.ibm.com mailto:uwefalke at de.ibm.com > > > > > > IBM Services > > > > > > IBM Data Privacy Statement > > > > > > IBM Deutschland Business & Technology Services GmbH > > > Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl > > > Sitz der Gesellschaft: Ehningen > > > Registergericht: Amtsgericht Stuttgart, HRB 17122 > > > > > > > > > > > > > > > From: Andi Christiansen > > > To: "gpfsug-discuss at spectrumscale.org mailto:gpfsug-discuss at spectrumscale.org " > > > > > > Date: 16/11/2020 20:44 > > > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from > > > Isilon to Scale over NFS? > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org mailto:gpfsug-discuss-bounces at spectrumscale.org > > > > > > > > > > > > Hi all, > > > > > > i have got a case where a customer wants 700TB migrated from isilon to > > > Scale and the only way for him is exporting the same directory on NFS from > > > two different nodes... > > > > > > as of now we are using multiple rsync processes on different parts of > > > folders within the main directory. this is really slow and will take > > > forever.. right now 14 rsync processes spread across 3 nodes fetching from > > > 2.. > > > > > > does anyone know of a way to speed it up? right now we see from 1Gbit to > > > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > > > scale nodes and 20Gbits from isilon so we should be able to reach just > > > under 20Gbit... > > > > > > > > > if anyone have any ideas they are welcome! > > > > > > > > > Thanks in advance > > > Andi Christiansen _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss athttp://spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss athttp://spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss athttp://spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Tue Nov 17 13:53:43 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 17 Nov 2020 13:53:43 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <616234716.258600.1605613918767@privateemail.com> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> Message-ID: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> On 17/11/2020 11:51, Andi Christiansen wrote: > Hi all, > > thanks for all the information, there was some interesting things > amount it.. > > I kept on going with rsync and ended up making a file with all top > level user directories and splitting them into chunks of 347 per > rsync session(total 42000 ish folders). yesterday we had only 14 > sessions with 3000 folders in each and that was too much work for one > rsync session.. 
Unless you use something similar to my DB suggestion it is almost inevitable that some of those rsync sessions are going to have issues and you will have no way to track it or even know it has happened unless you do a single final giant catchup/check rsync. I should add that a copy of the sqlite DB is cover your backside protection when a user pops up claiming that you failed to transfer one of their vitally important files six months down the line and the old system is turned off and scrapped. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From skylar2 at uw.edu Tue Nov 17 14:59:43 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Tue, 17 Nov 2020 06:59:43 -0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> Message-ID: <20201117145943.5cxyfpfyrk7udmn4@thargelion> On Tue, Nov 17, 2020 at 01:53:43PM +0000, Jonathan Buzzard wrote: > On 17/11/2020 11:51, Andi Christiansen wrote: > > Hi all, > > > > thanks for all the information, there was some interesting things > > amount it.. > > > > I kept on going with rsync and ended up making a file with all top > > level user directories and splitting them into chunks of 347 per > > rsync session(total 42000 ish folders). yesterday we had only 14 > > sessions with 3000 folders in each and that was too much work for one > > rsync session.. > > Unless you use something similar to my DB suggestion it is almost inevitable > that some of those rsync sessions are going to have issues and you will have > no way to track it or even know it has happened unless you do a single final > giant catchup/check rsync. > > I should add that a copy of the sqlite DB is cover your backside protection > when a user pops up claiming that you failed to transfer one of their > vitally important files six months down the line and the old system is > turned off and scrapped. That's not a bad idea, and I like it more than the method I setup where we captured the output of find from both sides of the transfer and preserved it for posterity, but obviously did require a hard-stop date on the source. Fortunately, we seem committed to GPFS so it might be we never have to do another bulk transfer outside of the filesystem... -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From S.J.Thompson at bham.ac.uk Tue Nov 17 15:55:41 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 17 Nov 2020 15:55:41 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <20201117145943.5cxyfpfyrk7udmn4@thargelion> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> <20201117145943.5cxyfpfyrk7udmn4@thargelion> Message-ID: <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> > Fortunately, we seem committed to GPFS so it might be we never have to do > another bulk transfer outside of the filesystem... Until you want to move a v3 or v4 created file-system to v5 block sizes __ I hopes we won't be doing that sort of thing again... 
Simon From jonathan.buzzard at strath.ac.uk Tue Nov 17 19:45:29 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 17 Nov 2020 19:45:29 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> <20201117145943.5cxyfpfyrk7udmn4@thargelion> <55E3401C-2F59-4B47-A176-CDF7BCACBE2E@bham.ac.uk> Message-ID: <1a1be12b-a4f2-f2b3-4cdf-e34bc5eace24@strath.ac.uk> On 17/11/2020 15:55, Simon Thompson wrote: > >> Fortunately, we seem committed to GPFS so it might be we never have to do >> another bulk transfer outside of the filesystem... > > Until you want to move a v3 or v4 created file-system to v5 block sizes __ You forget the v2 to v3 for more than two billion files switch. Either that or you where not using it back then. Then there was the v3.2 if you ever want to mount it on Windows. > > I hopes we won't be doing that sort of thing again... > Yep, going to be recycling my scripts in the coming week for a v4 to v5 with capacity upgrade on our DSS-G. That basically involves a trashing of the file system and a restore from backup. Going to be doing the your data will be restored based on a metric of how many files and how much data you have ploy again :-) I too hope that will be the last time I have to do anything similar but my experience of the last couple of decades says that is likely to be a forlorn hope :-( I speculate that one day the 10,000 file set limit will be lifted, but only if you reformat your file system... JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From andi at christiansen.xxx Tue Nov 17 20:40:39 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Tue, 17 Nov 2020 21:40:39 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> References: <1388247256.209171.1605555854969@privateemail.com> <616234716.258600.1605613918767@privateemail.com> <12435be3-3eb4-00b9-5939-e83eb1649168@strath.ac.uk> Message-ID: <82434297.276248.1605645639435@privateemail.com> Hi Jonathan, yes you are correct! but we plan to resync this once or twice every week for the next 3-4months to be sure everything is as it should be. Right now we are focused on getting them synced up and then we will run scheduled resyncs/checks once or twice a week depending on the data growth :) Thanks Andi Christiansen > On 11/17/2020 2:53 PM Jonathan Buzzard wrote: > > > On 17/11/2020 11:51, Andi Christiansen wrote: > > Hi all, > > > > thanks for all the information, there was some interesting things > > amount it.. > > > > I kept on going with rsync and ended up making a file with all top > > level user directories and splitting them into chunks of 347 per > > rsync session(total 42000 ish folders). yesterday we had only 14 > > sessions with 3000 folders in each and that was too much work for one > > rsync session.. > > Unless you use something similar to my DB suggestion it is almost > inevitable that some of those rsync sessions are going to have issues > and you will have no way to track it or even know it has happened unless > you do a single final giant catchup/check rsync. 
> > I should add that a copy of the sqlite DB is cover your backside > protection when a user pops up claiming that you failed to transfer one > of their vitally important files six months down the line and the old > system is turned off and scrapped. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From chris.schlipalius at pawsey.org.au Tue Nov 17 23:17:18 2020 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Wed, 18 Nov 2020 07:17:18 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Message-ID: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> So at my last job we used to rsync data between isilons across campus, and isilon to Windows File Cluster (and back). I recommend using dry run to generate a list of files and then use this to run with rysnc. This allows you also to be able to break up the transfer into batches, and check if files have changed before sync (say if your isilon files are not RO. Also ensure you have a recent version of rsync that preserves extended attributes and check your ACLS. A dry run example: https://unix.stackexchange.com/a/261372 I always felt more comfortable having a list of files before a sync?. Regards, Chris Schlipalius Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey Supercomputing Centre (CSIRO) 1 Bryce Avenue Kensington WA 6151 Australia Tel +61 8 6436 8815 Email chris.schlipalius at pawsey.org.au Web www.pawsey.org.au -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Nov 18 11:48:52 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 18 Nov 2020 11:48:52 +0000 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> Message-ID: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> On 17/11/2020 23:17, Chris Schlipalius wrote: > So at my last job we used to rsync data between isilons across campus, > and isilon to Windows File Cluster (and back). > > I recommend using dry run to generate a list of files and then use this > to run with rysnc. > > This allows you also to be able to break up the transfer into batches, > and check if files have changed before sync (say if your isilon files > are not RO. > > Also ensure you have a recent version of rsync that preserves extended > attributes and check your ACLS. > > A dry run example: > > https://unix.stackexchange.com/a/261372 > > I always felt more comfortable having a list of files before a sync?. > I would counsel in the strongest possible terms against that approach. Basically you have to be assured that none of your file names have "wacky" characters in them, because handling "wacky" characters in file names is exceedingly difficult. I cannot stress how hard it is and the above example does not handle all "wacky" characters in file names. So what do I mean by "wacky" characters. 
Well remember a file name can have just about anything in it on Linux with the exception of '/', and users especially when using a GUI, and even more so if they are Mac users can and do use what I will call "wacky" characters in their file names. The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. Then there is the use of the wildcard characters, especially '?' but also '*'. Not too difficult to handle you might say. Right now deal with a file name with a newline character in it :-) Don't ask me how or why you even do that but let me assure you that I have seen them on more than one occasion. And now your dry run list is broken... Not only that if you have a few hundred million files to move a list just becomes unwieldy anyway. One thing I didn't mention is that I would run anything with in a screen (or tmux if that is your poison) and turn on logging. For those interested I am in the process of cleaning up the script a bit and will post it somewhere in due course. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From andi at christiansen.xxx Wed Nov 18 11:54:47 2020 From: andi at christiansen.xxx (Andi Christiansen) Date: Wed, 18 Nov 2020 12:54:47 +0100 (CET) Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> Message-ID: <1947408989.293430.1605700487095@privateemail.com> Hi Jonathan, i would be very interested in seeing your scripts when they are posted. Let me know where to get them! Thanks a bunch! Andi Christiansen > On 11/18/2020 12:48 PM Jonathan Buzzard wrote: > > > On 17/11/2020 23:17, Chris Schlipalius wrote: > > So at my last job we used to rsync data between isilons across campus, > > and isilon to Windows File Cluster (and back). > > > > I recommend using dry run to generate a list of files and then use this > > to run with rysnc. > > > > This allows you also to be able to break up the transfer into batches, > > and check if files have changed before sync (say if your isilon files > > are not RO. > > > > Also ensure you have a recent version of rsync that preserves extended > > attributes and check your ACLS. > > > > A dry run example: > > > > https://unix.stackexchange.com/a/261372 > > > > I always felt more comfortable having a list of files before a sync?. > > > > I would counsel in the strongest possible terms against that approach. > > Basically you have to be assured that none of your file names have > "wacky" characters in them, because handling "wacky" characters in file > names is exceedingly difficult. I cannot stress how hard it is and the > above example does not handle all "wacky" characters in file names. > > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and > users especially when using a GUI, and even more so if they are Mac > users can and do use what I will call "wacky" characters in their file > names. > > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. > > Not too difficult to handle you might say. 
Right now deal with a file > name with a newline character in it :-) Don't ask me how or why you even > do that but let me assure you that I have seen them on more than one > occasion. And now your dry run list is broken... > > Not only that if you have a few hundred million files to move a list > just becomes unwieldy anyway. > > One thing I didn't mention is that I would run anything with in a screen > (or tmux if that is your poison) and turn on logging. > > For those interested I am in the process of cleaning up the script a bit > and will post it somewhere in due course. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From cal.sawyer at framestore.com Wed Nov 18 12:18:57 2020 From: cal.sawyer at framestore.com (Cal Sawyer) Date: Wed, 18 Nov 2020 12:18:57 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 106, Issue 21 In-Reply-To: References: Message-ID: Hello Not a Scale user per se (we run a 3rdparty offshoot of Scale). In a past life managing Nexenta with OpenSolaris DR storage, I used nc/netcat for bulk data sync, which is far more efficient than rsync. With a bit of planning and analysis of directory structure on the target, nc runs could be parallelised as well, although not quite in the same way as running rsync via parallels. Of course, nc has to be available on Isilon but i have no experience with that platform. The only caveat in using nc is the amount of change to the target data as copying progresses (is the target datastore static or still seeing changes?). nc has to be followed with rsync to apply any changes and/or verify the integrity of the bulk copy. https://nakkaya.com/2009/04/15/using-netcat-for-file-transfers/ Are your Isilon and Scale systems located in the same network space? I'd also suggest that if possible, add a quad-port 10GbE (or larger: 25/100GbE) NIC to your servers to gain a wider data path and conduct your copy operations on those interfaces regards [image: Framestore] Cal Sawyer ? Senior Systems Engineer London ? New York ? Los Angeles ? Chicago ? Montr?al ? Mumbai 28 Chancery Lane London WC2A 1LB [T] +44 (0)20 7344 8000 W3W: warm.soil.patio On Wed, 18 Nov 2020 at 12:00, wrote: > Send gpfsug-discuss mailing list submissions to > gpfsug-discuss at spectrumscale.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > or, via email, send a message with subject or body 'help' to > gpfsug-discuss-request at spectrumscale.org > > You can reach the person managing the list at > gpfsug-discuss-owner at spectrumscale.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of gpfsug-discuss digest..." > > > Today's Topics: > > 1. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Chris Schlipalius) > 2. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Jonathan Buzzard) > 3. Re: Migrate/syncronize data from Isilon to Scale over NFS? > (Andi Christiansen) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 18 Nov 2020 07:17:18 +0800 > From: Chris Schlipalius > To: > Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to > Scale over NFS? 
> Message-ID: <578BE691-DEE4-43AC-97D2-546AC406E14A at pawsey.org.au> > Content-Type: text/plain; charset="utf-8" > > So at my last job we used to rsync data between isilons across campus, and > isilon to Windows File Cluster (and back). > > I recommend using dry run to generate a list of files and then use this to > run with rysnc. > > This allows you also to be able to break up the transfer into batches, and > check if files have changed before sync (say if your isilon files are not > RO. > > Also ensure you have a recent version of rsync that preserves extended > attributes and check your ACLS. > > > > A dry run example: > > https://unix.stackexchange.com/a/261372 > > > > I always felt more comfortable having a list of files before a sync?. > > > > > > > > Regards, > > Chris Schlipalius > > > > Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey > Supercomputing Centre (CSIRO) > > 1 Bryce Avenue > > Kensington WA 6151 > > Australia > > > > Tel +61 8 6436 8815 > > Email chris.schlipalius at pawsey.org.au > > Web www.pawsey.org.au > > > > > > -------------- next part -------------- > An HTML attachment was scrubbed... > URL: < > http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20201118/c99c2fb1/attachment-0001.html > > > > ------------------------------ > > Message: 2 > Date: Wed, 18 Nov 2020 11:48:52 +0000 > From: Jonathan Buzzard > To: gpfsug-discuss at spectrumscale.org > Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to > Scale over NFS? > Message-ID: <7983810e-f51c-8cf7-a750-5c3285870bd4 at strath.ac.uk> > Content-Type: text/plain; charset=utf-8; format=flowed > > On 17/11/2020 23:17, Chris Schlipalius wrote: > > So at my last job we used to rsync data between isilons across campus, > > and isilon to Windows File Cluster (and back). > > > > I recommend using dry run to generate a list of files and then use this > > to run with rysnc. > > > > This allows you also to be able to break up the transfer into batches, > > and check if files have changed before sync (say if your isilon files > > are not RO. > > > > Also ensure you have a recent version of rsync that preserves extended > > attributes and check your ACLS. > > > > A dry run example: > > > > https://unix.stackexchange.com/a/261372 > > > > I always felt more comfortable having a list of files before a sync?. > > > > I would counsel in the strongest possible terms against that approach. > > Basically you have to be assured that none of your file names have > "wacky" characters in them, because handling "wacky" characters in file > names is exceedingly difficult. I cannot stress how hard it is and the > above example does not handle all "wacky" characters in file names. > > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and > users especially when using a GUI, and even more so if they are Mac > users can and do use what I will call "wacky" characters in their file > names. > > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. > > Not too difficult to handle you might say. Right now deal with a file > name with a newline character in it :-) Don't ask me how or why you even > do that but let me assure you that I have seen them on more than one > occasion. And now your dry run list is broken... 
> > Not only that if you have a few hundred million files to move a list > just becomes unwieldy anyway. > > One thing I didn't mention is that I would run anything with in a screen > (or tmux if that is your poison) and turn on logging. > > For those interested I am in the process of cleaning up the script a bit > and will post it somewhere in due course. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > ------------------------------ > > Message: 3 > Date: Wed, 18 Nov 2020 12:54:47 +0100 (CET) > From: Andi Christiansen > To: gpfsug main discussion list , > Jonathan Buzzard > Subject: Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to > Scale over NFS? > Message-ID: <1947408989.293430.1605700487095 at privateemail.com> > Content-Type: text/plain; charset=UTF-8 > > Hi Jonathan, > > i would be very interested in seeing your scripts when they are posted. > Let me know where to get them! > > Thanks a bunch! > Andi Christiansen > > > On 11/18/2020 12:48 PM Jonathan Buzzard > wrote: > > > > > > On 17/11/2020 23:17, Chris Schlipalius wrote: > > > So at my last job we used to rsync data between isilons across campus, > > > and isilon to Windows File Cluster (and back). > > > > > > I recommend using dry run to generate a list of files and then use > this > > > to run with rysnc. > > > > > > This allows you also to be able to break up the transfer into batches, > > > and check if files have changed before sync (say if your isilon files > > > are not RO. > > > > > > Also ensure you have a recent version of rsync that preserves extended > > > attributes and check your ACLS. > > > > > > A dry run example: > > > > > > https://unix.stackexchange.com/a/261372 > > > > > > I always felt more comfortable having a list of files before a sync?. > > > > > > > I would counsel in the strongest possible terms against that approach. > > > > Basically you have to be assured that none of your file names have > > "wacky" characters in them, because handling "wacky" characters in file > > names is exceedingly difficult. I cannot stress how hard it is and the > > above example does not handle all "wacky" characters in file names. > > > > So what do I mean by "wacky" characters. Well remember a file name can > > have just about anything in it on Linux with the exception of '/', and > > users especially when using a GUI, and even more so if they are Mac > > users can and do use what I will call "wacky" characters in their file > > names. > > > > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > > Then there is the use of the wildcard characters, especially '?' but > > also '*'. > > > > Not too difficult to handle you might say. Right now deal with a file > > name with a newline character in it :-) Don't ask me how or why you even > > do that but let me assure you that I have seen them on more than one > > occasion. And now your dry run list is broken... > > > > Not only that if you have a few hundred million files to move a list > > just becomes unwieldy anyway. > > > > One thing I didn't mention is that I would run anything with in a screen > > (or tmux if that is your poison) and turn on logging. > > > > For those interested I am in the process of cleaning up the script a bit > > and will post it somewhere in due course. > > > > > > JAB. > > > > -- > > Jonathan A. Buzzard Tel: +44141-5483420 > > HPC System Administrator, ARCHIE-WeSt. 
> > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > ------------------------------ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > End of gpfsug-discuss Digest, Vol 106, Issue 21 > *********************************************** > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Wed Nov 18 23:05:40 2020 From: valdis.kletnieks at vt.edu (Valdis Kl=?utf-8?Q?=c4=93?=tnieks) Date: Wed, 18 Nov 2020 18:05:40 -0500 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> References: <578BE691-DEE4-43AC-97D2-546AC406E14A@pawsey.org.au> <7983810e-f51c-8cf7-a750-5c3285870bd4@strath.ac.uk> Message-ID: <39863.1605740740@turing-police> On Wed, 18 Nov 2020 11:48:52 +0000, Jonathan Buzzard said: > So what do I mean by "wacky" characters. Well remember a file name can > have just about anything in it on Linux with the exception of '/', and You want to see some fireworks? At least at one time, it was possible to use a file system debugger that's all too trusting of hexadecimal input and create a directory entry of '../'. Let's just say that fs/namei.c was also far too trusting, and fsck was more than happy to make *different* errors than the kernel was.... > The obvious ones are spaces, but it's not just ASCII 0x20, but tabs too. > Then there is the use of the wildcard characters, especially '?' but > also '*'. Don't forget ESC, CR, LF, backticks, forward ticks, semicolons, and pretty much anything else that will give a shell indigestion. SQL isn't the only thing prone to injection attacks.. :) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 832 bytes Desc: not available URL: From chris.schlipalius at pawsey.org.au Wed Nov 18 23:57:26 2020 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Thu, 19 Nov 2020 07:57:26 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: Message-ID: <6288DF78-A9DF-4BE9-B166-4478EF8C2A20@pawsey.org.au> ? I would counsel in the strongest possible terms against that approach. ? Basically you have to be assured that none of your file names have "wacky" characters in them, because handling "wacky" characters in file ? names is exceedingly difficult. I cannot stress how hard it is and the above example does not handle all "wacky" characters in file names. Well that?s indeed another kettle of fish if you have irregular/special naming of files, no I didn?t cover that and if you have millions of files, yes a list would be unwieldy, then I would be tarring up dirs. before moving? and then untarring on GPFS ?or breaking up the list into sets or sub lists. If you have these wacky types of file names well there are fixes as in the rsync manpages? yes not easy but possible.. Ie 1. -s, --protect-args 2. As per usual you can escape the spaces, or substitute for spaces. rsync -avuz user at server1.com:"${remote_path// /\\ }" . 3. Single quote the file name and path inside double quotes. ? 
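A NUL-terminated file list sidesteps the quoting game altogether, since the names never pass through a shell. A rough sketch, with /mnt/isilon, /gpfs/projects and /tmp/filelist.nul as placeholder names:

    cd /mnt/isilon
    find . -print0 > /tmp/filelist.nul      # NUL-separated, survives spaces, quotes and newlines
    rsync -aH --from0 --files-from=/tmp/filelist.nul . /gpfs/projects/
    # add -A/-X for ACLs and xattrs if the NFS mount actually exposes them

The -s/--protect-args option above is still worth having when a remote path has to go on the rsync command line itself.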
One thing I didn't mention is that I would run anything with in a screen (or tmux if that is your poison) and turn on logging. Absolutely agree? ? For those interested I am in the process of cleaning up the script a bit and will post it somewhere in due course. ? JAB. Would be interesting to see?. I?ve also had success on GPFS with DCP and possibly this would be another option Regards, Chris Schlipalius Team Lead, Data Storage Infrastructure, Supercomputing Platforms, Pawsey Supercomputing Centre (CSIRO) 1 Bryce Avenue Kensington WA 6151 Australia -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 15:34:39 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 15:34:39 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Message-ID: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Nov 19 15:49:30 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 15:49:30 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Message-ID: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. 
JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From spectrumscale at kiranghag.com Thu Nov 19 16:40:47 2020 From: spectrumscale at kiranghag.com (KG) Date: Thu, 19 Nov 2020 22:10:47 +0530 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: You can also set mount priority on filesystems so that gpfs can try to mount them in order...parent first On Thu, Nov 19, 2020, 21:19 Jonathan Buzzard wrote: > On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > > Hi, > > > > > > I have a filesystem holding many projects (i.e., mounted under > > /projects), each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects > > of it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > Technically, mounting a filesystem on top of an existing filesystem > > should be possible, but, is this discouraged for any reason? How GPFS > > would behave with that and is there a technical reason for avoiding this > > setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > This has all the hallmarks of either a Windows admin or a newbie > Linux/Unix admin :-) > > Simply put /projects is mounted on top of whatever file system is > providing the root file system in the first place LOL. > > Linux/Unix and/or GPFS does not give a monkeys about mounting another > file system *ANYWHERE* in it period because there is no other way of > doing it. > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Nov 19 16:42:07 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 19 Nov 2020 16:42:07 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> Message-ID: <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> If it is a remote cluster mount from your clients (hopefully!), you might want to look at priority to order mounting of the file-systems. I don?t know what would happen if the overmounted file-system went away, you would likely want to test. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 15:39 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). 
Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 16:48:07 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 16:48:07 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch>, <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: Hi Jonathan, thanks for sharing your opinions. In the sentence "Technically, mounting a filesystem on top of an existing filesystem should be possible" , I guess I was referring to that... I was concerned about other technical reasons, such like how would this would affect GPFS policies, or how to properly proceed with proper mounting, or any other technical reasons to consider. For the GPFS policies, I usually applied some of the existing GPFS policies based on directories, but after checking I realized that one can manage via device (never used policies in that way, at least for the simple but necessary use cases I have on the existing filesystems). Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Jonathan Buzzard Sent: Thursday, November 19, 2020 4:49:30 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. 
Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From marc.caubet at psi.ch Thu Nov 19 17:01:37 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 17:01:37 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch>, <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> Message-ID: <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Hi Simon, that's a very good point, thanks a lot :) I have it remotely mounted on a client cluster, so I will consider priorities when mounting the filesystems with remote cluster mount. That's very useful. Also, as far as I saw, same approach can be also applied to local mounts (via mmchfs) during daemon startup with the same option --mount-priority. Thanks a lot for the hints, these are very useful. I'll test that. Cheers, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson Sent: Thursday, November 19, 2020 5:42:07 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem If it is a remote cluster mount from your clients (hopefully!), you might want to look at priority to order mounting of the file-systems. I don?t know what would happen if the overmounted file-system went away, you would likely want to test. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 15:39 To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi, I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. 
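For reference, a rough sketch of what that ordering could look like (untested; projectsfs and newprojectfs are made-up device names, and this goes from memory that file systems with lower non-zero priority values are mounted before higher ones, with zero meaning no priority):

    mmchfs projectsfs    --mount-priority 1    # parent, mounted at /projects
    mmchfs newprojectfs  --mount-priority 2    # child, mounted at /projects/newproject
    mmlsfs projectsfs                          # lists all attributes, including the mount priority

Priorities only influence the order of the automatic mounts; what happens if the parent file system later goes away still needs testing, as Simon says.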
Thanks a lot, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Thu Nov 19 17:34:05 2020 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 19 Nov 2020 18:34:05 +0100 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: I would not mount a GPFS filesystem within a GPFS filesystem. Technically it should work, but I?d expect it to cause surprises if ever the lower filesystem experienced problems. Alone, a filesystem might recover automatically by remounting. But if there?s another filesystem mounted within, I expect it will be a problem.. Much better to use symlinks. -jf tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < marc.caubet at psi.ch>: > Hi Simon, > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > client cluster, so I will consider priorities when mounting the filesystems > with remote cluster mount. That's very useful. > > Also, as far as I saw, same approach can be also applied to local mounts > (via mmchfs) during daemon startup with the same option --mount-priority. > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > Cheers, > > Marc > _________________________________________________________ > Paul Scherrer Institut > High Performance Computing & Emerging Technologies > Marc Caubet Serrabou > Building/Room: OHSA/014 > Forschungsstrasse, 111 > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > ------------------------------ > *From:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > S.J.Thompson at bham.ac.uk> > *Sent:* Thursday, November 19, 2020 5:42:07 PM > *To:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > filesystem > > > If it is a remote cluster mount from your clients (hopefully!), you might > want to look at priority to order mounting of the file-systems. I don?t > know what would happen if the overmounted file-system went away, you would > likely want to test. > > > > Simon > > > > *From: * on behalf of " > marc.caubet at psi.ch" > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Thursday, 19 November 2020 at 15:39 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > filesystem > > > > Hi, > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects of > it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > > Technically, mounting a filesystem on top of an existing filesystem should > be possible, but, is this discouraged for any reason? 
How GPFS would behave > with that and is there a technical reason for avoiding this setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. > > > > Thanks a lot, > > Marc > > _________________________________________________________ > Paul Scherrer Institut > High Performance Computing & Emerging Technologies > Marc Caubet Serrabou > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > Switzerland > > Telephone: +41 56 310 46 67 > E-Mail: marc.caubet at psi.ch > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar2 at uw.edu Thu Nov 19 17:38:07 2020 From: skylar2 at uw.edu (Skylar Thompson) Date: Thu, 19 Nov 2020 09:38:07 -0800 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: <20201119173807.kormirvbweqs3un6@thargelion> Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. > > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. 
> > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:08:13 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:08:13 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> Message-ID: On 19/11/2020 17:34, Jan-Frode Myklebust wrote: > > I would not mount a GPFS filesystem within a GPFS filesystem. > Technically it should work, but I?d expect it to cause surprises if ever > the lower filesystem experienced problems. Alone, a filesystem might > recover automatically by remounting. But if there?s another filesystem > mounted within, I expect it will be a problem.. > > Much better to use symlinks. > Think about that for a minute... I guess if you are worried about /projects going away (which would suggest something really bad has happened anyway) would be to mount the GPFS file system that is currently holding /projects somewhere else and then bind mount everything into /projects At this point I would note that bind mounts are much better than symlinks which suck for this sort of application. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:12:03 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:12:03 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: <2f789d09-3704-2d41-ef2a-953de178dce2@strath.ac.uk> On 19/11/2020 16:40, KG wrote: > You can also set mount priority on filesystems so that gpfs can try to > mount them in order...parent first > One of the things that systemd brings to the table https://github.com/systemd/systemd/commit/3519d230c8bafe834b2dac26ace49fcfba139823 JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From marc.caubet at psi.ch Thu Nov 19 18:13:08 2020 From: marc.caubet at psi.ch (Caubet Serrabou Marc (PSI)) Date: Thu, 19 Nov 2020 18:13:08 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <20201119173807.kormirvbweqs3un6@thargelion> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> , <20201119173807.kormirvbweqs3un6@thargelion> Message-ID: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Hi all, thanks a lot for your comments. Agreed, I better avoid it for now. I was concerned about how GPFS would behave in such case. For production I will take the safe route, but, just out of curiosity, I'll give it a try on a couple of test filesystems. Thanks a lot for your help, it was very helpful, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Skylar Thompson Sent: Thursday, November 19, 2020 6:38:07 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. > > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. 
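For completeness, a sketch of the bind-mount variant mentioned above (the device name newprojectfs and all paths are placeholders): let GPFS mount the new file system at its own independent mount point and graft it into /projects with a bind mount, rather than stacking one GPFS mount inside another:

    mmmount newprojectfs                       # assumed default mount point /gpfs/newproject
    mkdir -p /projects/newproject
    mount --bind /gpfs/newproject /projects/newproject

    # persistent variant, one line in /etc/fstab:
    # /gpfs/newproject   /projects/newproject   none   bind   0 0

That keeps /projects itself on the parent file system while the new project is grafted in on each client that needs it.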
> > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. > > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. > > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonathan.buzzard at strath.ac.uk Thu Nov 19 18:32:39 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 19 Nov 2020 18:32:39 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> <20201119173807.kormirvbweqs3un6@thargelion> <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Message-ID: <5b8edf06-a4ab-a39e-5a02-86fd7565b90a@strath.ac.uk> On 19/11/2020 18:13, Caubet Serrabou Marc (PSI) wrote: > > Hi all, > > > thanks a lot for your comments. Agreed, I?better avoid it for now. I was > concerned about how GPFS would behave in such case. For production I > will take the safe route, but, just out of curiosity, I'll give it a try > on a couple of test filesystems. > Don't use symlinks there is a range of applications that will break and you will confuse the hell out of your users as the fact you are not under /projects/new but /random/new is not hidden. Besides which if the symlink goes away because /projects goes away then it is all a bust anyway. If you are worried about /projects going away then the best plan is to mount the GPFS file systems somewhere else and then bind mount the directories into /projects on all the machines where they are mounted. GPFS is quite happy with this. We bind mount /gpfs/users into /users and /gpfs/software into /opt/software by default. In the past I have bind mounted random paths for every user (hundred plus) into /home JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From novosirj at rutgers.edu Thu Nov 19 18:34:09 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 19 Nov 2020 18:34:09 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: > On Nov 19, 2020, at 10:49 AM, Jonathan Buzzard wrote: > > On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: >> Hi, >> I have a filesystem holding many projects (i.e., mounted under /projects), each project is managed with filesets. >> I have a new big project which should be placed on a separate filesystem (blocksize, replication policy, etc. will be different, and subprojects of it will be managed with filesets). Ideally, this filesystem should be mounted in /projects/newproject. >> Technically, mounting a filesystem on top of an existing filesystem should be possible, but, is this discouraged for any reason? How GPFS would behave with that and is there a technical reason for avoiding this setup? >> Another alternative would be independent mount point + symlink, but I really would prefer to avoid symlinks. > > This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) > > Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. > > Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. Some others have said, but I disagree. It wasn?t that long ago that GPFS acted really screwy with systemd because it did something in a way other than Linux expected. 
As it is now, their devices are not /dev/whatever or server:/wherever like just about every other filesystem type. Not unreasonable to believe it would ?act funny? compared to other FS. I like GPFS a lot, but this is not one of my favorite characteristics of it. -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From UWEFALKE at de.ibm.com Thu Nov 19 19:18:41 2020 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 19 Nov 2020 20:18:41 +0100 Subject: [gpfsug-discuss] =?utf-8?q?Mounting_filesystem_on_top_of_an_exist?= =?utf-8?q?ing=09filesystem?= In-Reply-To: References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch><0b1c1d7f-580c-b5f4-1f36-351130adf412@strath.ac.uk> Message-ID: Just the risk your parent system dies which will block your access to the child file system mounted on a mount point within. If that is not bothering , go ahead mount stacks . As for the symling though : it is also gone if the parent dies :-). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: KG To: gpfsug main discussion list Date: 19/11/2020 17:41 Subject: [EXTERNAL] Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Sent by: gpfsug-discuss-bounces at spectrumscale.org You can also set mount priority on filesystems so that gpfs can try to mount them in order...parent first On Thu, Nov 19, 2020, 21:19 Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: On 19/11/2020 15:34, Caubet Serrabou Marc (PSI) wrote: > Hi, > > > I have a filesystem holding many projects (i.e., mounted under > /projects), each project is managed with filesets. > > I have a new big project which should be placed on a separate filesystem > (blocksize, replication policy, etc. will be different, and subprojects > of it will be managed with filesets). Ideally, this filesystem should be > mounted in /projects/newproject. > > > Technically, mounting a filesystem on top of an existing filesystem > should be possible, but, is this discouraged for any reason? How GPFS > would behave with that and is there a technical reason for avoiding this > setup? > > Another alternative would be independent mount point + symlink, but I > really would prefer to avoid symlinks. This has all the hallmarks of either a Windows admin or a newbie Linux/Unix admin :-) Simply put /projects is mounted on top of whatever file system is providing the root file system in the first place LOL. Linux/Unix and/or GPFS does not give a monkeys about mounting another file system *ANYWHERE* in it period because there is no other way of doing it. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Thu Nov 19 19:37:52 2020 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Thu, 19 Nov 2020 19:37:52 +0000 Subject: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem In-Reply-To: <0963457f2dfd418eabf8e1681ef2f801@psi.ch> References: <15ad70a37e3e4ec7a5907480a9318b93@psi.ch> <2593D6A2-D0D9-4DB1-8A27-5163B8BF34A6@bham.ac.uk> <22fcf84afa274cfa8aaa8fc4d4be62bb@psi.ch> <20201119173807.kormirvbweqs3un6@thargelion> <0963457f2dfd418eabf8e1681ef2f801@psi.ch> Message-ID: <738D41AC-6A07-453E-A2D1-C1882BE52EDC@bham.ac.uk> My understanding was that this was perfectly acceptable in a GPFS system. i.e. mounting parts of file-systems in others. It has been suggested to us as a way of using different vendor GPFS systems (e.g. an ESS with someone elses) as a way of working round the licensing rules about ESS and anything else, but still giving a single user ?name space?. We didn?t go that route, and of course I might have misunderstood what was being suggested. Simon From: on behalf of "marc.caubet at psi.ch" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Thursday, 19 November 2020 at 18:13 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Hi all, thanks a lot for your comments. Agreed, I better avoid it for now. I was concerned about how GPFS would behave in such case. For production I will take the safe route, but, just out of curiosity, I'll give it a try on a couple of test filesystems. Thanks a lot for your help, it was very helpful, Marc _________________________________________________________ Paul Scherrer Institut High Performance Computing & Emerging Technologies Marc Caubet Serrabou Building/Room: OHSA/014 Forschungsstrasse, 111 5232 Villigen PSI Switzerland Telephone: +41 56 310 46 67 E-Mail: marc.caubet at psi.ch ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Skylar Thompson Sent: Thursday, November 19, 2020 6:38:07 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Mounting filesystem on top of an existing filesystem Agreed, not sure how the GPFS tools would react. An alternative to symlinks would be bind mounts, if for some reason a tool doesn't behave properly with a symlink in the path. On Thu, Nov 19, 2020 at 06:34:05PM +0100, Jan-Frode Myklebust wrote: > I would not mount a GPFS filesystem within a GPFS filesystem. Technically > it should work, but I???d expect it to cause surprises if ever the lower > filesystem experienced problems. Alone, a filesystem might recover > automatically by remounting. But if there???s another filesystem mounted > within, I expect it will be a problem.. > > Much better to use symlinks. > > > > -jf > > tor. 19. nov. 2020 kl. 18:01 skrev Caubet Serrabou Marc (PSI) < > marc.caubet at psi.ch>: > > > Hi Simon, > > > > > > that's a very good point, thanks a lot :) I have it remotely mounted on a > > client cluster, so I will consider priorities when mounting the filesystems > > with remote cluster mount. That's very useful. 
> > > > Also, as far as I saw, same approach can be also applied to local mounts > > (via mmchfs) during daemon startup with the same option --mount-priority. > > > > > > Thanks a lot for the hints, these are very useful. I'll test that. > > > > > > Cheers, > > > > Marc > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > Forschungsstrasse, 111 > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > ------------------------------ > > *From:* gpfsug-discuss-bounces at spectrumscale.org < > > gpfsug-discuss-bounces at spectrumscale.org> on behalf of Simon Thompson < > > S.J.Thompson at bham.ac.uk> > > *Sent:* Thursday, November 19, 2020 5:42:07 PM > > *To:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > If it is a remote cluster mount from your clients (hopefully!), you might > > want to look at priority to order mounting of the file-systems. I don???t > > know what would happen if the overmounted file-system went away, you would > > likely want to test. > > > > > > > > Simon > > > > > > > > *From: * on behalf of " > > marc.caubet at psi.ch" > > *Reply to: *"gpfsug-discuss at spectrumscale.org" < > > gpfsug-discuss at spectrumscale.org> > > *Date: *Thursday, 19 November 2020 at 15:39 > > *To: *"gpfsug-discuss at spectrumscale.org" > > > > *Subject: *[gpfsug-discuss] Mounting filesystem on top of an existing > > filesystem > > > > > > > > Hi, > > > > > > > > I have a filesystem holding many projects (i.e., mounted under /projects), > > each project is managed with filesets. > > > > I have a new big project which should be placed on a separate filesystem > > (blocksize, replication policy, etc. will be different, and subprojects of > > it will be managed with filesets). Ideally, this filesystem should be > > mounted in /projects/newproject. > > > > > > > > Technically, mounting a filesystem on top of an existing filesystem should > > be possible, but, is this discouraged for any reason? How GPFS would behave > > with that and is there a technical reason for avoiding this setup? > > > > Another alternative would be independent mount point + symlink, but I > > really would prefer to avoid symlinks. 
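(For reference, the --mount-priority approach mentioned further up amounts to something like the following; the filesystem device names "projects" and "newproject" are only placeholders, and lower non-zero priorities are mounted first:)

/usr/lpp/mmfs/bin/mmchfs projects --mount-priority 1      # parent filesystem, mounted first at daemon startup
/usr/lpp/mmfs/bin/mmchfs newproject --mount-priority 2    # filesystem mounted at /projects/newproject, mounted after the parent
/usr/lpp/mmfs/bin/mmlsfs newproject                       # the full attribute listing includes the mount priority

That way the parent is up before GPFS tries to mount the filesystem sitting inside it.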
> > > > > > > > Thanks a lot, > > > > Marc > > > > _________________________________________________________ > > Paul Scherrer Institut > > High Performance Computing & Emerging Technologies > > Marc Caubet Serrabou > > Building/Room: OHSA/014 > > > > Forschungsstrasse, 111 > > > > 5232 Villigen PSI > > Switzerland > > > > Telephone: +41 56 310 46 67 > > E-Mail: marc.caubet at psi.ch > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department (UW Medicine), System Administrator -- Foege Building S046, (206)-685-7354 -- Pronouns: He/Him/His _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kamil.Czauz at Squarepoint-Capital.com Fri Nov 20 19:13:41 2020 From: Kamil.Czauz at Squarepoint-Capital.com (Czauz, Kamil) Date: Fri, 20 Nov 2020 19:13:41 +0000 Subject: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process In-Reply-To: References: Message-ID: Here is the output of waiters on 2 hosts that were having the issue today: HOST 1 [2020-11-20 09:07:53 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === Waiting 0.0035 sec since 09:08:07, monitored, thread 135497 FileBlockReadFetchHandlerThread: on ThCond 0x7F615C152468 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.180 Waiting 0.0036 sec since 09:08:07, monitored, thread 139228 PrefetchWorkerThread: on ThCond 0x7F627000D5D8 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.181 [2020-11-20 09:08:07 root at nyzls149m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === HOST 2 [2020-11-20 09:08:49 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === Waiting 0.0034 sec since 09:08:50, monitored, thread 345318 SharedHashTabFetchHandlerThread: on ThCond 0x7F049C001F08 (MsgRecordCondvar), reason 'RPC wait' for NSD I/O completion on node 10.64.44.133 [2020-11-20 09:08:50 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === [2020-11-20 09:08:52 root at nyzls150m ~]# /usr/lpp/mmfs/bin/mmdiag --waiters === mmdiag: waiters === You can see the waiters go from 0 to 1-2 , but they are hardly blocking. Yes there are separate pools for metadata for all of the filesystems here. 
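If it helps, the pool layout can be listed with the standard commands below (the filesystem name fs1 is just a placeholder for ours):

/usr/lpp/mmfs/bin/mmlsdisk fs1    # per-disk view: holds metadata / holds data, storage pool, status, availability
/usr/lpp/mmfs/bin/mmdf fs1        # capacity and free space per storage pool, metadata and data shown separately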
I did another trace today when the problem was happening - this time I was able to get a longer trace using the following command: /usr/lpp/mmfs/bin/mmtracectl --start --trace=io --trace-file-size=512M --tracedev-write-mode=blocking --tracedev-buffer-size=64M -N nyzls149m This is what the trsum output looks like: Elapsed trace time: 62.412092000 seconds Elapsed trace time from first VFS call to last: 62.412091999 Time idle between VFS calls: 0.002913000 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 0.003487000 9 387.444 rdwr 0.273721000 183 1495.743 read_inode2 0.007304000 325 22.474 follow_link 0.013952000 58 240.552 pagein 0.025974000 66 393.545 getattr 0.002792000 26 107.385 revalidate 0.009406000 2172 4.331 create 66.194479000 3 22064826.333 open 1.725505000 88 19608.011 unlink 18.685099000 1 18685099.000 setattr 0.011627000 14 830.500 lookup 2379.215514000 502 4739473.135 delete_inode 0.015553000 328 47.418 rename 98.099073000 5 19619814.600 release 0.050574000 89 568.247 permission 0.007454000 73 102.110 getxattr 0.002346000 32 73.312 statfs 0.000081000 6 13.500 mmap 0.049809000 18 2767.167 removexattr 0.000827000 14 59.071 llseek 0.000441000 47 9.383 readdir 0.002667000 34 78.441 Ops 4093 Secs 62.409178999 Ops/Sec 65.583 MaxFilesToCache is set to 12000 : [common] maxFilesToCache 12000 I only see gpfs_i_lookup in the tracefile, no gpfs_v_lookups # grep gpfs_i_lookup trcrpt.2020-11-20_09.20.38.283986.nyzls149m |wc -l 1097 They mostly look like this - 62.346560 238895 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922178971A40 iNum 21980113 (0x14F63D1) cnP 0xFFFF922178971C88 retP 0x0 code 0 rc 0 62.346955 238895 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF91A8A4991E00 dentryP 0xFFFF92C545A93500 name '20170323.txt' d_flags 0x80 d_count 1 unhashed 1 62.367701 218442 TRACE_VNODE: gpfs_i_lookup exit: new inode 0xFFFF922071300000 iNum 29629892 (0x1C41DC4) cnP 0xFFFF922071300248 retP 0x0 code 0 rc 0 62.367734 218444 TRACE_VNODE: gpfs_i_lookup enter: diP 0xFFFF9193CF457800 dentryP 0xFFFF9229527A89C0 name 'node.py' d_flags 0x80 d_count 1 unhashed 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Monday, November 16, 2020 8:46 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, while the other nodes can well block the local one, as Frederick suggests, there should at least be something visible locally waiting for these other nodes. Looking at all waiters might be a good thing, but this case looks strange in other ways. Mind statement there are almost no local waiters and none of them gets older than 10 ms. I am no developer nor do I have the code, so don't expect too much. Can you tell what lookups you see (check in the trcrpt file, could be like gpfs_i_lookup or gpfs_v_lookup)? Lookups are metadata ops, do you have a separate pool for your metadata? How is that pool set up (doen to the physical block devices)? Your trcsum down revealed 36 lookups, each one on avg taking >30ms. That is a lot (albeit the respective waiters won't show up at first glance as suspicious ...). So, which waiters did you see (hope you saved them, if not, do it next time). What are the node you see this on and the whole cluster used for? What is the MaxFilesToCache setting (for that node and for others)? what HW is that, how big are your nodes (memory,CPU)? 
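For reference, the MaxFilesToCache value can be checked quickly with:

/usr/lpp/mmfs/bin/mmlsconfig maxFilesToCache                   # configured value, including any per-node overrides
/usr/lpp/mmfs/bin/mmdiag --config | grep -i maxFilesToCache    # value actually in effect on the local node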
To check the unreasonably short trace capture time: how large are the trcrpt files you obtain? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 14:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - Regarding your previous message - waiters were coming / going with just 1-2 waiters when I ran the mmdiag command, with very low wait times (<0.01s). We are running version 4.2.3 I did another capture today while the client is functioning normally and this was the header result: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 25.996957 seconds and 67592121252 cycles Measured cycle count update rate to be 2600001271 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Fri Nov 13 08:20:01.800558000 2020 (TOD 1605273601.800558, cycles 20807897445779444) daemon trace enabled Fri Nov 13 08:20:01.910017000 2020 (TOD 1605273601.910017, cycles 20807897730372442) all streams included Fri Nov 13 08:20:26.423085049 2020 (TOD 1605273626.423085, cycles 20807961464381068) <---- useful part of trace extends from here trace quiesced Fri Nov 13 08:20:27.797515000 2020 (TOD 1605273627.000797, cycles 20807965037900696) <---- to here Approximate number of times the trace buffer was filled: 14.631 Still a very small capture (1.3s), but the trsum.awk output was not filled with lookup commands / large lookup times. Can you help debug what those long lookup operations mean? 
Unfinished operations: 27967 ***************** pagein ************** 1.362382116 27967 ***************** readpage ************** 1.362381516 139130 1.362448448 ********* Unfinished IO: buffer/disk 3002F670000 20:107498951168^\archive_data_16 104686 1.362022068 ********* Unfinished IO: buffer/disk 50011878000 1:47169618944^\archive_data_1 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFE 341710 1.362423815 ********* Unfinished IO: buffer/disk 20022218000 19:107498951680^\archive_data_15 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 18:3452986648^\FFFFFFFF 139150 1.361122006 ********* Unfinished IO: buffer/disk 50012018000 2:47169622016^\archive_data_2 0 0.000000000 ********* Unfinished IO: buffer/disk 5003CEB8000 4:23073390592^\00000000FFFFFFFF 95782 1.361112791 ********* Unfinished IO: buffer/disk 40016300000 20:107498950656^\archive_data_16 0 0.000000000 ********* Unfinished IO: buffer/disk 2000EE78000 5:47631127040^\00000000FFFFFFFF 271076 1.361579585 ********* Unfinished IO: buffer/disk 20023DB8000 4:47169606656^\archive_data_4 341676 1.362018599 ********* Unfinished IO: buffer/disk 40038140000 5:47169614336^\archive_data_5 139150 1.361131599 MSG FSnd: nsdMsgReadExt msg_id 2930654492 Sduration 13292.382 + us 341676 1.362027104 MSG FSnd: nsdMsgReadExt msg_id 2930654495 Sduration 12396.877 + us 95782 1.361124739 MSG FSnd: nsdMsgReadExt msg_id 2930654491 Sduration 13299.242 + us 271076 1.361587653 MSG FSnd: nsdMsgReadExt msg_id 2930654493 Sduration 12836.328 + us 92182 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 341710 1.362429643 MSG FSnd: nsdMsgReadExt msg_id 2930654497 Sduration 11994.338 + us 341662 0.000000000 MSG FSnd: msg_id 0 Sduration 0.000 + us 139130 1.362458376 MSG FSnd: nsdMsgReadExt msg_id 2930654498 Sduration 11965.605 + us 104686 1.362028772 MSG FSnd: nsdMsgReadExt msg_id 2930654496 Sduration 12395.209 + us 412373 0.775676657 MSG FRep: nsdMsgReadExt msg_id 304915249 Rduration 598747.324 us Rlen 262144 Hduration 598752.112 + us 341770 0.589739579 MSG FRep: nsdMsgReadExt msg_id 338079050 Rduration 784684.402 us Rlen 4 Hduration 784692.651 + us 143315 0.536252844 MSG FRep: nsdMsgReadExt msg_id 631945522 Rduration 838171.137 us Rlen 233472 Hduration 838174.299 + us 341878 0.134331812 MSG FRep: nsdMsgReadExt msg_id 338079023 Rduration 1240092.169 us Rlen 262144 Hduration 1240094.403 + us 175478 0.587353287 MSG FRep: nsdMsgReadExt msg_id 338079047 Rduration 787070.694 us Rlen 262144 Hduration 787073.990 + us 139558 0.633517347 MSG FRep: nsdMsgReadExt msg_id 631945538 Rduration 740906.634 us Rlen 102400 Hduration 740910.172 + us 143308 0.958832110 MSG FRep: nsdMsgReadExt msg_id 631945542 Rduration 415591.871 us Rlen 262144 Hduration 415597.056 + us Elapsed trace time: 1.374423981 seconds Elapsed trace time from first VFS call to last: 1.374423980 Time idle between VFS calls: 0.001603738 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs readpage 1.151660085 1874 614.546 rdwr 0.431456904 581 742.611 read_inode2 0.001180648 934 1.264 follow_link 0.000029502 7 
4.215 getattr 0.000048413 9 5.379 revalidate 0.000007080 67 0.106 pagein 1.149699537 1877 612.520 create 0.007664829 9 851.648 open 0.001032657 19 54.350 unlink 0.002563726 14 183.123 delete_inode 0.000764598 826 0.926 lookup 0.312847947 953 328.277 setattr 0.020651226 824 25.062 permission 0.000015018 1 15.018 rename 0.000529023 4 132.256 release 0.001613800 22 73.355 getxattr 0.000030494 6 5.082 mmap 0.000054767 1 54.767 llseek 0.000001130 4 0.283 readdir 0.000033947 2 16.973 removexattr 0.002119736 820 2.585 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 42625 0.000000138 0.000031017 0.44% 99.56% 3 42378 0.000586959 0.011596801 4.82% 95.18% 32 42627 0.000000272 0.000013421 1.99% 98.01% 2 42641 0.003284590 0.012593594 20.69% 79.31% 35 42628 0.001522335 0.000002748 99.82% 0.18% 2 25464 0.003462795 0.500281914 0.69% 99.31% 12 301420 0.000016711 0.052848218 0.03% 99.97% 38 95103 0.000000544 0.000000000 100.00% 0.00% 1 145858 0.000000659 0.000794896 0.08% 99.92% 2 42221 0.000011484 0.000039445 22.55% 77.45% 5 371718 0.000000707 0.001805425 0.04% 99.96% 2 95109 0.000000880 0.008998763 0.01% 99.99% 2 95337 0.000010330 0.503057866 0.00% 100.00% 8 42700 0.002442175 0.012504429 16.34% 83.66% 35 189680 0.003466450 0.500128627 0.69% 99.31% 9 42681 0.006685396 0.000391575 94.47% 5.53% 16 42702 0.000048203 0.000000500 98.97% 1.03% 2 42703 0.000033280 0.140102087 0.02% 99.98% 9 224423 0.000000195 0.000000000 100.00% 0.00% 1 42706 0.000541098 0.000014713 97.35% 2.65% 3 106275 0.000000456 0.000000000 100.00% 0.00% 1 42721 0.000372857 0.000000000 100.00% 0.00% 1 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Friday, November 13, 2020 4:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi Kamil, in my mail just a few minutes ago I'd overlooked that the buffer size in your trace was indeed 128M (I suppose the trace file is adapting that size if not set in particular). That is very strange, even under high load, the trace should then capture some longer time than 10 secs, and , most of all, it should contain much more activities than just these few you had. That is very mysterious. I am out of ideas for the moment, and a bit short of time to dig here. To check your tracing, you could run a trace like before but when everything is normal and check that out - you should see many records, the trcsum.awk should list just a small portion of unfinished ops at the end, ... If that is fine, then the tracing itself is affected by your crritical condition (never experienced that before - rather GPFS grinds to a halt than the trace is abandoned), and that might well be worth a support ticket. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 
7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 13/11/2020 10:21 Subject: Re: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, looks your tracefile setting has been too low: all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here means you effectively captured a period of about 5ms only ... you can't see much from that. I'd assumed the default trace file size would be sufficient here but it doesn't seem to. try running with something like mmtracectl --start --trace-file-size=512M --trace=io --tracedev-write-mode=overwrite -N . However, if you say "no major waiter" - how many waiters did you see at any time? what kind of waiters were the oldest, how long they'd waited? it could indeed well be that some job is just creating a killer workload. The very short cyle time of the trace points, OTOH, to high activity, OTOH the trace file setting appears quite low (trace=io doesnt' collect many trace infos, just basic IO stuff). If I might ask: what version of GPFS are you running? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: gpfsug main discussion list Date: 13/11/2020 03:33 Subject: [EXTERNAL] Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Uwe - I hit the issue again today, no major waiters, and nothing useful from the iohist report. Nothing interesting in the logs either. I was able to get a trace today while the issue was happening. I took 2 traces a few min apart. 
The beginning of the traces look something like this: Overwrite trace parameters: buffer size: 134217728 64 kernel trace streams, indices 0-63 (selected by low bits of processor ID) 128 daemon trace streams, indices 64-191 (selected by low bits of thread ID) Interval for calibrating clock rate was 100.019054 seconds and 260049296314 cycles Measured cycle count update rate to be 2599997559 per second <---- using this value OS reported cycle count update rate as 2599999000 per second Trace milestones: kernel trace enabled Thu Nov 12 20:56:40.114080000 2020 (TOD 1605232600.114080, cycles 20701293141385220) daemon trace enabled Thu Nov 12 20:56:40.247430000 2020 (TOD 1605232600.247430, cycles 20701293488095152) all streams included Thu Nov 12 20:58:19.950515266 2020 (TOD 1605232699.950515, cycles 20701552715873212) <---- useful part of trace extends from here trace quiesced Thu Nov 12 20:58:20.133134000 2020 (TOD 1605232700.000133, cycles 20701553190681534) <---- to here Approximate number of times the trace buffer was filled: 553.529 Here is the output of trsum.awk details=0 I'm not quite sure what to make of it, can you help me decipher it? The 'lookup' operations are taking a hell of a long time, what does that mean? Capture 1 Unfinished operations: 21234 ***************** lookup ************** 0.165851604 290020 ***************** lookup ************** 0.151032241 302757 ***************** lookup ************** 0.168723402 301677 ***************** lookup ************** 0.070016530 230983 ***************** lookup ************** 0.127699082 21233 ***************** lookup ************** 0.060357257 309046 ***************** lookup ************** 0.157124551 301643 ***************** lookup ************** 0.165543982 304042 ***************** lookup ************** 0.172513838 167794 ***************** lookup ************** 0.056056815 189680 ***************** lookup ************** 0.062022237 362216 ***************** lookup ************** 0.072063619 406314 ***************** lookup ************** 0.114121838 167776 ***************** lookup ************** 0.114899642 303016 ***************** lookup ************** 0.144491120 290021 ***************** lookup ************** 0.142311603 167762 ***************** lookup ************** 0.144240366 248530 ***************** lookup ************** 0.168728131 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 30018014000 14:48493092752^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 2000006B000 2:6058336^\FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFE 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\00000000FFFFFFFF 0 0.000000000 ********* Unfinished IO: buffer/disk 3000002E000 2:1917676744^\FFFFFFFF Elapsed trace time: 0.182617894 seconds Elapsed trace time from first VFS call to last: 0.182617893 Time idle between VFS calls: 0.000006317 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.012021696 35 343.477 read_inode2 0.000100787 43 2.344 follow_link 0.000050609 8 6.326 pagein 0.000097806 10 9.781 revalidate 0.000010884 156 0.070 open 0.001001824 18 55.657 lookup 1.152449696 36 
32012.492 delete_inode 0.000036816 38 0.969 permission 0.000080574 14 5.755 release 0.000470096 18 26.116 mmap 0.000340095 9 37.788 llseek 0.000001903 9 0.211 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 221919 0.000000244 0.050064080 0.00% 100.00% 4 167794 0.000011891 0.000069707 14.57% 85.43% 4 309046 0.147664569 0.000074663 99.95% 0.05% 9 349767 0.000000070 0.000000000 100.00% 0.00% 1 301677 0.017638372 0.048741086 26.57% 73.43% 12 84407 0.000010448 0.000016977 38.10% 61.90% 3 406314 0.000002279 0.000122367 1.83% 98.17% 7 25464 0.043270937 0.000006200 99.99% 0.01% 2 362216 0.000005617 0.000017498 24.30% 75.70% 2 379982 0.000000626 0.000000000 100.00% 0.00% 1 230983 0.123947465 0.000056796 99.95% 0.05% 6 21233 0.047877661 0.004887113 90.74% 9.26% 17 302757 0.154486003 0.010695642 93.52% 6.48% 24 248530 0.000006763 0.000035442 16.02% 83.98% 3 303016 0.014678039 0.000013098 99.91% 0.09% 2 301643 0.088025575 0.054036566 61.96% 38.04% 33 3339 0.000034997 0.178199426 0.02% 99.98% 35 21234 0.164240073 0.000262711 99.84% 0.16% 39 167762 0.000011886 0.000041865 22.11% 77.89% 3 336006 0.000001246 0.100519562 0.00% 100.00% 16 304042 0.121322325 0.019218406 86.33% 13.67% 33 301644 0.054325242 0.087715613 38.25% 61.75% 37 301680 0.000015005 0.020838281 0.07% 99.93% 9 290020 0.147713357 0.000121422 99.92% 0.08% 19 290021 0.000476072 0.000085833 84.72% 15.28% 10 44777 0.040819757 0.000010957 99.97% 0.03% 3 189680 0.000000044 0.000002376 1.82% 98.18% 1 241759 0.000000698 0.000000000 100.00% 0.00% 1 184839 0.000001621 0.150341986 0.00% 100.00% 28 362220 0.000010818 0.000020949 34.05% 65.95% 2 104687 0.000000495 0.000000000 100.00% 0.00% 1 # total App-read/write = 45 Average duration = 0.000269322 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 34 0.755556 0.755556 34 0 32889 0 0.001000 10 0.222222 0.977778 10 0 108136 0 0.004000 1 0.022222 1.000000 1 0 8 0 # max concurrant App-read/write = 2 # conc count % %ile 1 38 0.844444 0.844444 2 7 0.155556 1.000000 Capture 2 Unfinished operations: 335096 ***************** lookup ************** 0.289127895 334691 ***************** lookup ************** 0.225380797 362246 ***************** lookup ************** 0.052106493 334694 ***************** lookup ************** 0.048567769 362220 ***************** lookup ************** 0.054825580 333972 ***************** lookup ************** 0.275355791 406314 ***************** lookup ************** 0.283219905 334686 ***************** lookup ************** 0.285973208 289606 ***************** lookup ************** 0.064608288 21233 ***************** lookup ************** 0.074923689 189680 ***************** lookup ************** 0.089702578 335100 ***************** lookup ************** 0.151553955 334685 ***************** lookup ************** 0.117808430 167700 ***************** lookup ************** 0.119441314 336813 ***************** lookup ************** 0.120572137 334684 ***************** lookup ************** 0.124718126 21234 ***************** lookup ************** 0.131124745 84407 ***************** lookup ************** 0.132442945 334696 ***************** lookup ************** 0.140938740 335094 ***************** lookup ************** 0.201637910 167735 ***************** lookup ************** 0.164059859 334687 ***************** lookup ************** 0.252930745 334695 ***************** lookup ************** 0.278037098 341818 0.291815990 ********* Unfinished IO: buffer/disk 50000015000 3:439888512^\scratch_metadata_5 341818 0.291822084 MSG FSnd: nsdMsgReadExt msg_id 
2894690129 Sduration 199.688 + us 100041 0.025061905 MSG FRep: nsdMsgReadExt msg_id 1012644746 Rduration 266959.867 us Rlen 0 Hduration 266963.954 + us Elapsed trace time: 0.292021772 seconds Elapsed trace time from first VFS call to last: 0.292021771 Time idle between VFS calls: 0.001436519 seconds Operations stats: total time(s) count avg-usecs wait-time(s) avg-usecs rdwr 0.000831801 4 207.950 read_inode2 0.000082347 31 2.656 pagein 0.000033905 3 11.302 revalidate 0.000013109 156 0.084 open 0.000237969 22 10.817 lookup 1.233407280 10 123340.728 delete_inode 0.000013877 33 0.421 permission 0.000046486 8 5.811 release 0.000172456 21 8.212 mmap 0.000064411 2 32.206 llseek 0.000000391 2 0.196 readdir 0.000213657 36 5.935 User thread stats: GPFS-time(sec) Appl-time GPFS-% Appl-% Ops 335094 0.053506265 0.000170270 99.68% 0.32% 16 167700 0.000008522 0.000027547 23.63% 76.37% 2 167776 0.000008293 0.000019462 29.88% 70.12% 2 334684 0.000023562 0.000160872 12.78% 87.22% 8 349767 0.000000467 0.250029787 0.00% 100.00% 5 84407 0.000000230 0.000017947 1.27% 98.73% 2 334685 0.000028543 0.000094147 23.26% 76.74% 8 406314 0.221755229 0.000009720 100.00% 0.00% 2 334694 0.000024913 0.000125229 16.59% 83.41% 10 335096 0.254359005 0.000240785 99.91% 0.09% 18 334695 0.000028966 0.000127823 18.47% 81.53% 10 334686 0.223770082 0.000267271 99.88% 0.12% 24 334687 0.000031265 0.000132905 19.04% 80.96% 9 334696 0.000033808 0.000131131 20.50% 79.50% 9 129075 0.000000102 0.000000000 100.00% 0.00% 1 341842 0.000000318 0.000000000 100.00% 0.00% 1 335100 0.059518133 0.000287934 99.52% 0.48% 19 224423 0.000000471 0.000000000 100.00% 0.00% 1 336812 0.000042720 0.000193294 18.10% 81.90% 10 21233 0.000556984 0.000083399 86.98% 13.02% 11 289606 0.000000088 0.000018043 0.49% 99.51% 2 362246 0.014440188 0.000046516 99.68% 0.32% 4 21234 0.000524848 0.000162353 76.37% 23.63% 13 336813 0.000046426 0.000175666 20.90% 79.10% 9 3339 0.000011816 0.272396876 0.00% 100.00% 29 341818 0.000000778 0.000000000 100.00% 0.00% 1 167735 0.000007866 0.000049468 13.72% 86.28% 3 175480 0.000000278 0.000000000 100.00% 0.00% 1 336006 0.000001170 0.250020470 0.00% 100.00% 16 44777 0.000000367 0.250149757 0.00% 100.00% 6 189680 0.000002717 0.000006518 29.42% 70.58% 1 184839 0.000003001 0.250144214 0.00% 100.00% 35 145858 0.000000687 0.000000000 100.00% 0.00% 1 333972 0.218656404 0.000043897 99.98% 0.02% 4 334691 0.187695040 0.000295117 99.84% 0.16% 25 # total App-read/write = 7 Average duration = 0.000123672 sec # time(sec) count % %ile read write avgBytesR avgBytesW 0.000500 7 1.000000 1.000000 7 0 1172 0 -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Uwe Falke Sent: Wednesday, November 11, 2020 8:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Hi, Kamil, I suppose you'd rather not see such an issue than pursue the ugly work-around to kill off processes. In such situations, the first looks should be for the GPFS log (on the client, on the cluster manager, and maybe on the file system manager) and for the current waiters (that is the list of currently waiting threads) on the hanging client. -> /var/adm/ras/mmfs.log.latest mmdiag --waiters That might give you a first idea what is taking long and which components are involved. Also, mmdiag --iohist shows you the last IOs and some stats (service time, size) for them. 
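For example, on the affected client (the paths below are the defaults of a standard install):

tail -n 50 /var/adm/ras/mmfs.log.latest      # recent GPFS log entries on that node
/usr/lpp/mmfs/bin/mmdiag --waiters           # threads currently waiting, with age and reason
/usr/lpp/mmfs/bin/mmdiag --iohist            # recent IOs with size and service time
/usr/lpp/mmfs/bin/mmlsmgr                    # shows the file system manager nodes, worth repeating the checks there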
Either that clue is already sufficient, or you go on (if you see DIO somewhere, direct IO is used which might slow down things, for example). GPFS has a nice tracing which you can configure or just run the default trace. Running a dedicated (low-level) io trace can be achieved by mmtracectl --start --trace=io --tracedev-write-mode=overwrite -N then, when the issue is seen, stop the trace by mmtracectl --stop -N Do not wait to stop the trace once you've seen the issue, the trace file cyclically overwrites its output. If the issue lasts some time you could also start the trace while you see it, run the trace for say 20 secs and stop again. On stopping the trace, the output gets converted into an ASCII trace file named trcrpt.*(usually in /tmp/mmfs, check the command output). There you should see lines with FIO which carry the inode of the related file after the "tag" keyword. example: 0.000745100 25123 TRACE_IO: FIO: read data tag 248415 43466 ioVecSize 8 1st buf 0x299E89BC000 disk 8D0 da 154:2083875440 nSectors 128 err 0 finishTime 1563473283.135212150 -> inode is 248415 there is a utility , tsfindinode, to translate that into the file path. you need to build this first if not yet done: cd /usr/lpp/mmfs/samples/util ; make , then run ./tsfindinode -i For the IO trace analysis there is an older tool : /usr/lpp/mmfs/samples/debugtools/trsum.awk. Then there is some new stuff I've not yet used in /usr/lpp/mmfs/samples/traceanz/ (always check the README) Hope that halps a bit. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist Hybrid Cloud Infrastructure / Technology Consulting & Implementation Services +49 175 575 2877 Mobile Rathausstr. 7, 09111 Chemnitz, Germany uwefalke at de.ibm.com IBM Services IBM Data Privacy Statement IBM Deutschland Business & Technology Services GmbH Gesch?ftsf?hrung: Sven Schooss, Stefan Hierl Sitz der Gesellschaft: Ehningen Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Czauz, Kamil" To: "gpfsug-discuss at spectrumscale.org" Date: 11/11/2020 23:36 Subject: [EXTERNAL] [gpfsug-discuss] Poor client performance with high cpu usage of mmfsd process Sent by: gpfsug-discuss-bounces at spectrumscale.org We regularly run into performance issues on our clients where the client seems to hang when accessing any gpfs mount, even something simple like a ls could take a few minutes to complete. This affects every gpfs mount on the client, but other clients are working just fine. Also the mmfsd process at this point is spinning at something like 300-500% cpu. The only way I have found to solve this is by killing processes that may be doing heavy i/o to the gpfs mounts - but this is more of an art than a science. I often end up killing many processes before finding the offending one. My question is really about finding the offending process easier. Is there something similar to iotop or a trace that I can enable that can tell me what files/processes and being heavily used by the mmfsd process on the client? -Kamil Confidentiality Note: This e-mail and any attachments are confidential and may be protected by legal privilege. If you are not the intended recipient, be aware that any disclosure, copying, distribution or use of this e-mail or any attachment is prohibited. If you have received this e-mail in error, please notify us immediately by returning it to the sender and delete this copy from your system. 
We will use any personal information you give to us in accordance with our Privacy Policy which can be found in the Data Protection section on our corporate website http://www.squarepoint-capital.com. Please note that e-mails may be monitored for regulatory and compliance purposes. Thank you for your cooperation. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 

URL: From hooft at natlab.research.philips.com Sat Nov 21 00:37:01 2020 From: hooft at natlab.research.philips.com (Peter van Hooft) Date: Sat, 21 Nov 2020 01:37:01 +0100 Subject: [gpfsug-discuss] mmchdisk /dev/fs start -a progress Message-ID: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> Hello, Is it possible to find out the progress of the 'mmchdisk /dev/fs start -a' command when the controlling terminal had been lost? We can see the task running on the fs manager node with 'mmdiag --commands' with attributes 'hold PIT/disk waitTime 0' We are starting to worry the mmchdisk is taking too long, and see continuously waiters like Waiting 3.1946 sec since 01:28:23, ignored, thread 22092 TSCHDISKCmdThread: on ThCond 0x180267573D0 (SGManagementMgrDataCondvar), reason 'waiting for stripe group to recover' Thanks for any hints. Peter van Hooft Philips Research From jonathan.buzzard at strath.ac.uk Sat Nov 21 10:13:42 2020 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Sat, 21 Nov 2020 10:13:42 +0000 Subject: [gpfsug-discuss] mmchdisk /dev/fs start -a progress In-Reply-To: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> References: <20201121003701.GA32509@pc67340132.natlab.research.philips.com> Message-ID: On 21/11/2020 00:37, Peter van Hooft wrote: > > Hello, > > Is it possible to find out the progress of the 'mmchdisk /dev/fs start -a' > command when the controlling terminal had been lost? > I don't think so. You are lucky it is still running > We can see the task running on the fs manager node with 'mmdiag --commands' with > attributes 'hold PIT/disk waitTime 0' > We are starting to worry the mmchdisk is taking too long, and see continuously waiters like > Waiting 3.1946 sec since 01:28:23, ignored, thread 22092 TSCHDISKCmdThread: on ThCond 0x180267573D0 (SGManagementMgrDataCondvar), reason 'waiting for stripe group to recover' > > Thanks for any hints. > Not that this is going to help this time, but it is why you should *ALWAYS* without exception run these sorts of commands within a screen/tmux session so when you loose the connection to the server you can just reconnect and pick it up again. This is introductory system administration 101. No critical or long running command should ever be dependant on a remote controlling terminal. If you can't run them locally then run them in a screen or tmux session. There are plenty of good howto's for both screen and tmux on the internet. Depending on which distribution you use I would note that RedHat have very annoyingly and for completely specious reasons removed screen from RHEL8 and left tmux. So if you are starting from scratch tmux is the one to learn :-( JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From robert.horton at icr.ac.uk Mon Nov 23 15:06:05 2020 From: robert.horton at icr.ac.uk (Robert Horton) Date: Mon, 23 Nov 2020 15:06:05 +0000 Subject: [gpfsug-discuss] AFM experiences? Message-ID: Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. 
What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. From novosirj at rutgers.edu Mon Nov 23 15:30:47 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 23 Nov 2020 15:30:47 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that?s how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I?ve seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can?t think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dean.flanders at fmi.ch Mon Nov 23 17:58:12 2020 From: dean.flanders at fmi.ch (Flanders, Dean) Date: Mon, 23 Nov 2020 17:58:12 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> References: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> Message-ID: Hello Rob, We looked at AFM years ago for DR, but after reading the bug reports, we avoided it, and also have had seen a case where it had to be removed from one customer, so we have kept things simple. Now looking again a few years later there are still issues, IBM Spectrum Scale Active File Management (AFM) issues which may result in undetected data corruption, and that was just my first google hit. We have kept it simple, and use a parallel rsync process with policy engine and can hit wire speed for copying of millions of small files in order to have isolation between the sites at GB/s. I am not saying it is bad, just that it needs an appropriate risk/reward ratio to implement as it increases overall complexity. 
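Very roughly, that looks like the sketch below (the rule name, paths, remote target and split factor are made up for illustration, and the exact columns of the generated list file should be checked before relying on the path extraction):

cat > /tmp/list.rules <<'EOF'
RULE 'listall' LIST 'allfiles'
EOF
# let the policy engine enumerate the files instead of walking the directory tree
/usr/lpp/mmfs/bin/mmapplypolicy /gpfs/home -P /tmp/list.rules -I defer -f /tmp/sync
# the generated list (typically /tmp/sync.list.allfiles) has bookkeeping fields before the path;
# strip them, split the result into 8 chunks and run 8 rsyncs in parallel
sed 's/^.* -- //' /tmp/sync.list.allfiles | split -n l/8 -d - /tmp/sync.part.
ls /tmp/sync.part.* | xargs -P 8 -I{} rsync -a --files-from={} / remote:/gpfs/backup/

With the lists pre-built by the policy engine the rsync processes spend their time copying rather than scanning, which is where the wire-speed numbers come from.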
Kind regards, Dean From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Ryan Novosielski Sent: Monday, November 23, 2020 4:31 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM experiences? We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that?s how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I?ve seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can?t think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton > wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 
534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Nov 23 21:54:39 2020 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 23 Nov 2020 21:54:39 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: Message-ID: Rob, Talk to Jake Carroll from the University of Queensland, he has done a number of presentations at Scale User Groups of UQ?s MeDiCI data fabric which is based on Spectrum Scale and does very aggressive use of AFM. Their use of AFM is not only on campus, but to remote Storage clusters between 30km and 1500km away from their Home cluster. They have also tested AFM between Australia, Japan, and USA Sent from my iPhone > On 24 Nov 2020, at 01:20, Robert Horton wrote: > > ?Hi all, > > We're thinking about deploying AFM and would be interested in hearing > from anyone who has used it in anger - particularly independent writer. > > Our scenario is we have a relatively large but slow (mainly because it > is stretched over two sites with a 10G link) cluster for long/medium- > term storage and a smaller but faster cluster for scratch storage in > our HPC system. What we're thinking of doing is using some/all of the > scratch capacity as an IW cache of some/all of the main cluster, the > idea to reduce the need for people to manually move data between the > two. > > It seems to generally work as expected in a small test environment, > although we have a few concerns: > > - Quota management on the home cluster - we need a way of ensuring > people don't write data to the cache which can't be accomodated on > home. Probably not insurmountable but needs a bit of thought... > > - It seems inodes on the cache only get freed when they are deleted on > the cache cluster - not if they get deleted from the home cluster or > when the blocks are evicted from the cache. Does this become an issue > in time? > > If anyone has done anything similar I'd be interested to hear how you > got on. It would be intresting to know if you created a cache fileset > for each home fileset or just one for the whole lot, as well as any > other pearls of wisdom you may have to offer. > > Thanks! > Rob > > -- > Robert Horton | Research Data Storage Lead > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > Twitter @ICR_London > Facebook: www.facebook.com/theinstituteofcancerresearch > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. 
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Mon Nov 23 23:14:08 2020 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Mon, 23 Nov 2020 23:14:08 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: <2C7317A6-B9DF-450A-92A6-AE156396204A@rutgers.edu> Ours are about 50 and 100 km from the home cluster, but it?s over 100Gb fiber. > On Nov 23, 2020, at 4:54 PM, Andrew Beattie wrote: > > Rob, > > Talk to Jake Carroll from the University of Queensland, he has done a number of presentations at Scale User Groups of UQ?s MeDiCI data fabric which is based on Spectrum Scale and does very aggressive use of AFM. > > Their use of AFM is not only on campus, but to remote Storage clusters between 30km and 1500km away from their Home cluster. They have also tested AFM between Australia, Japan, and USA > > Sent from my iPhone > > > On 24 Nov 2020, at 01:20, Robert Horton wrote: > > > > ?Hi all, > > > > We're thinking about deploying AFM and would be interested in hearing > > from anyone who has used it in anger - particularly independent writer. > > > > Our scenario is we have a relatively large but slow (mainly because it > > is stretched over two sites with a 10G link) cluster for long/medium- > > term storage and a smaller but faster cluster for scratch storage in > > our HPC system. What we're thinking of doing is using some/all of the > > scratch capacity as an IW cache of some/all of the main cluster, the > > idea to reduce the need for people to manually move data between the > > two. > > > > It seems to generally work as expected in a small test environment, > > although we have a few concerns: > > > > - Quota management on the home cluster - we need a way of ensuring > > people don't write data to the cache which can't be accomodated on > > home. Probably not insurmountable but needs a bit of thought... > > > > - It seems inodes on the cache only get freed when they are deleted on > > the cache cluster - not if they get deleted from the home cluster or > > when the blocks are evicted from the cache. Does this become an issue > > in time? > > > > If anyone has done anything similar I'd be interested to hear how you > > got on. It would be intresting to know if you created a cache fileset > > for each home fileset or just one for the whole lot, as well as any > > other pearls of wisdom you may have to offer. > > > > Thanks! > > Rob > > > > -- > > Robert Horton | Research Data Storage Lead > > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > > Twitter @ICR_London > > Facebook: www.facebook.com/theinstituteofcancerresearch > > > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > > > This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. 
-- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From vpuvvada at in.ibm.com Tue Nov 24 02:32:01 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:02:01 +0530 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: >- Quota management on the home cluster - we need a way of ensuring >people don't write data to the cache which can't be accomodated on >home. Probably not insurmountable but needs a bit of thought... You could set same quotas between cache and home clusters. AFM does not support replication of filesystem metadata like quotas, fileset configuration etc... >- It seems inodes on the cache only get freed when they are deleted on >the cache cluster - not if they get deleted from the home cluster or >when the blocks are evicted from the cache. Does this become an issue >in time? AFM periodically revalidates with home cluster. If the files/dirs were already deleted at home cluster, AFM moves them to /.ptrash directory at cache cluster during the revalidation. These files can be removed manually by user or auto eviction process. If the .ptrash directory is not cleaned up on time, it might result into quota issues at cache cluster. ~Venkat (vpuvvada at in.ibm.com) From: Robert Horton To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2020 08:51 PM Subject: [EXTERNAL] [gpfsug-discuss] AFM experiences? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. 
This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Tue Nov 24 02:37:18 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:07:18 +0530 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: <440D8981-58C7-4CF1-BF06-BC11F27A47BC@rutgers.edu> Message-ID: Dean, This is one of the corner case which is associated with sparse files at the home cluster. You could try with latest versions of scale, AFM indepedent-writer mode have many performance/functional improvements in newer releases. ~Venkat (vpuvvada at in.ibm.com) From: "Flanders, Dean" To: gpfsug main discussion list Date: 11/23/2020 11:44 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] AFM experiences? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello Rob, We looked at AFM years ago for DR, but after reading the bug reports, we avoided it, and also have had seen a case where it had to be removed from one customer, so we have kept things simple. Now looking again a few years later there are still issues, IBM Spectrum Scale Active File Management (AFM) issues which may result in undetected data corruption, and that was just my first google hit. We have kept it simple, and use a parallel rsync process with policy engine and can hit wire speed for copying of millions of small files in order to have isolation between the sites at GB/s. I am not saying it is bad, just that it needs an appropriate risk/reward ratio to implement as it increases overall complexity. Kind regards, Dean From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Ryan Novosielski Sent: Monday, November 23, 2020 4:31 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM experiences? We use it similar to how you describe it. We now run 5.0.4.1 on the client side (I mean actual client nodes, not the home or cache clusters). Before that, we had reliability problems (failure to cache libraries of programs that were executing, etc.). The storage clusters in our case are 5.0.3-2.3. We also got bit by the quotas thing. You have to set them the same on both sides, or you will have problems. It seems a little silly that they are not kept in sync by GPFS, but that?s how it is. If memory serves, the result looked like an AFM failure (queue not being cleared), but it turned out to be that the files just could not be written at the home cluster because the user was over quota there. I also think I?ve seen load average increase due to this sort of thing, but I may be mixing that up with another problem scenario. We monitor via Nagios which I believe monitors using mmafmctl commands. Really can?t think of a single time, apart from the other day, where the queue backed up. The instance the other day only lasted a few minutes (if you suddenly create many small files, like installing new software, it may not catch up instantly). -- #BlackLivesMatter ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. 
Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' On Nov 23, 2020, at 10:19, Robert Horton wrote: ?Hi all, We're thinking about deploying AFM and would be interested in hearing from anyone who has used it in anger - particularly independent writer. Our scenario is we have a relatively large but slow (mainly because it is stretched over two sites with a 10G link) cluster for long/medium- term storage and a smaller but faster cluster for scratch storage in our HPC system. What we're thinking of doing is using some/all of the scratch capacity as an IW cache of some/all of the main cluster, the idea to reduce the need for people to manually move data between the two. It seems to generally work as expected in a small test environment, although we have a few concerns: - Quota management on the home cluster - we need a way of ensuring people don't write data to the cache which can't be accomodated on home. Probably not insurmountable but needs a bit of thought... - It seems inodes on the cache only get freed when they are deleted on the cache cluster - not if they get deleted from the home cluster or when the blocks are evicted from the cache. Does this become an issue in time? If anyone has done anything similar I'd be interested to hear how you got on. It would be intresting to know if you created a cache fileset for each home fileset or just one for the whole lot, as well as any other pearls of wisdom you may have to offer. Thanks! Rob -- Robert Horton | Research Data Storage Lead The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | Twitter @ICR_London Facebook: www.facebook.com/theinstituteofcancerresearch The Institute of Cancer Research: Royal Cancer Hospital, a charitable Company Limited by Guarantee, Registered in England under Company No. 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. This e-mail message is confidential and for use by the addressee only. If the message is received by anyone other than the addressee, please return the message to the sender by replying to it and then delete the message from your computer and network. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Tue Nov 24 02:41:21 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 24 Nov 2020 08:11:21 +0530 Subject: [gpfsug-discuss] =?utf-8?q?Migrate/syncronize_data_from_Isilon_to?= =?utf-8?q?_Scale_over=09NFS=3F?= In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: AFM provides near zero downtime for migration. As of today, AFM migration does not support ACLs or other EAs migration from non scale (GPFS) source. 
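For anyone following the same path, a minimal sketch of the cache-side setup is below. The file system, fileset, and export names are invented, and the right AFM mode and prefetch options for a given release should be confirmed against the migration documentation linked underneath:

    # Create an AFM fileset whose target is the NFS export on the source system,
    # link it into the namespace, then pull data across ahead of the cutover.
    mmcrfileset fs1 migr1 --inode-space new -p afmTarget=nfs://source-nfs/ifs/export1,afmMode=ro
    mmlinkfileset fs1 migr1 -J /gpfs/fs1/migr1
    mmafmctl fs1 prefetch -j migr1 --directory /gpfs/fs1/migr1

ACLs and other extended attributes then need a separate pass after (or during) the cutover, as noted above.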
https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm ~Venkat (vpuvvada at in.ibm.com) From: "Frederick Stock" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 11/17/2020 03:14 AM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Have you considered using the AFM feature of Spectrum Scale? I doubt it will provide any speed improvement but it would allow for data to be accessed as it was being migrated. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: Andi Christiansen Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Date: Mon, Nov 16, 2020 2:44 PM Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at googlemail.com Tue Nov 24 12:16:55 2020 From: luke.raimbach at googlemail.com (Luke Raimbach) Date: Tue, 24 Nov 2020 12:16:55 +0000 Subject: [gpfsug-discuss] AFM experiences? In-Reply-To: References: Message-ID: Hi Rob, Some things to think about from experiences a year or so ago... If you intend to perform any HPC workload (writing / updating / deleting files) inside a cache, then appropriately specified gateway nodes will be your friend: 1. When creating, updating or deleting files in the cache, each operation requires acknowledgement from the gateway handling that particular cache, before returning ACK to the application. This will add a latency overhead to the workload - if your storage is IB connected to the compute cluster and using verbsRdmaSend for example, this will increase your happiness. Connecting low-spec gateway nodes over 10GbE with the expectation that they will "drain down" over time was a sore learning experience in the early days of AFM for me. 2. AFM queues can quickly eat up memory. I think around 350bytes of memory is consumed for each operation in the AFM queue, so if you have huge file churn inside a cache then the queue will grow very quickly. If you run out of memory, the node dies and you enter cache recovery when it comes back up (or another node takes over). 
This can end up cycling the node as it tries to revalidate a cache and keep up with any other queues. Get more memory! I've not used AFM for a while now and I think the latter enormity has some mitigation against create / delete cycles (i.e. the create operation is expunged from the queue instead of two operations being played back to the home). I expect IBM experts will tell you more about those improvements. Also, several smaller caches are better than one large one (parallel execution of queues helps utilise the available bandwidth and you have a better failover spread if you have multiple gateways, for example). Independent Writer mode comes with some small danger (user error or impatience mainly) inasmuch as whoever updates a file last will win; e.g. home user A writes a file, then cache user B updates the file after reading it and tells user A the update is complete, when really the gateway queue is long and the change is waiting to go back home. User A uses the file expecting the changes are made, then updates it with some results. Meanwhile the AFM queue drains down and user B's change arrives after user A has completed their changes. The interim version of the file user B modified will persist at home and user A's latest changes are lost. Some careful thought about workflow (or good user training about eventual consistency) will save some potential misery on this front. Hope this helps, Luke On Mon, 23 Nov 2020 at 15:19, Robert Horton wrote: > Hi all, > > We're thinking about deploying AFM and would be interested in hearing > from anyone who has used it in anger - particularly independent writer. > > Our scenario is we have a relatively large but slow (mainly because it > is stretched over two sites with a 10G link) cluster for long/medium- > term storage and a smaller but faster cluster for scratch storage in > our HPC system. What we're thinking of doing is using some/all of the > scratch capacity as an IW cache of some/all of the main cluster, the > idea to reduce the need for people to manually move data between the > two. > > It seems to generally work as expected in a small test environment, > although we have a few concerns: > > - Quota management on the home cluster - we need a way of ensuring > people don't write data to the cache which can't be accomodated on > home. Probably not insurmountable but needs a bit of thought... > > - It seems inodes on the cache only get freed when they are deleted on > the cache cluster - not if they get deleted from the home cluster or > when the blocks are evicted from the cache. Does this become an issue > in time? > > If anyone has done anything similar I'd be interested to hear how you > got on. It would be intresting to know if you created a cache fileset > for each home fileset or just one for the whole lot, as well as any > other pearls of wisdom you may have to offer. > > Thanks! > Rob > > -- > Robert Horton | Research Data Storage Lead > The Institute of Cancer Research | 237 Fulham Road | London | SW3 6JB > T +44 (0)20 7153 5350 | E robert.horton at icr.ac.uk | W www.icr.ac.uk | > Twitter @ICR_London > Facebook: www.facebook.com/theinstituteofcancerresearch > > The Institute of Cancer Research: Royal Cancer Hospital, a charitable > Company Limited by Guarantee, Registered in England under Company No. > 534147 with its Registered Office at 123 Old Brompton Road, London SW7 3RP. > > This e-mail message is confidential and for use by the addressee only. 
If > the message is received by anyone other than the addressee, please return > the message to the sender by replying to it and then delete the message > from your computer and network. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From yeep at robust.my Tue Nov 24 14:09:34 2020 From: yeep at robust.my (T.A. Yeep) Date: Tue, 24 Nov 2020 22:09:34 +0800 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Venkat, If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? On Tue, Nov 24, 2020 at 10:41 AM Venkateswara R Puvvada wrote: > AFM provides near zero downtime for migration. As of today, AFM > migration does not support ACLs or other EAs migration from non scale > (GPFS) source. > > > https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm > > ~Venkat (vpuvvada at in.ibm.com) > > > > From: "Frederick Stock" > To: gpfsug-discuss at spectrumscale.org > Cc: gpfsug-discuss at spectrumscale.org > Date: 11/17/2020 03:14 AM > Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data > from Isilon to Scale over NFS? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Have you considered using the AFM feature of Spectrum Scale? I doubt it > will provide any speed improvement but it would allow for data to be > accessed as it was being migrated. > > Fred > __________________________________________________ > Fred Stock | IBM Pittsburgh Lab | 720-430-8821 > stockf at us.ibm.com > > > ----- Original message ----- > From: Andi Christiansen > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: "gpfsug-discuss at spectrumscale.org" > Cc: > Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon > to Scale over NFS? > Date: Mon, Nov 16, 2020 2:44 PM > > Hi all, > > i have got a case where a customer wants 700TB migrated from isilon to > Scale and the only way for him is exporting the same directory on NFS from > two different nodes... > > as of now we are using multiple rsync processes on different parts of > folders within the main directory. this is really slow and will take > forever.. right now 14 rsync processes spread across 3 nodes fetching from > 2.. > > does anyone know of a way to speed it up? right now we see from 1Gbit to > 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from > scale nodes and 20Gbits from isilon so we should be able to reach just > under 20Gbit... > > > if anyone have any ideas they are welcome! > > > Thanks in advance > Andi Christiansen > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Best regards *T.A. 
Yeep*Mobile: +6-016-719 8506 | Tel: +6-03-7628 0526 | www.robusthpc.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Nov 24 09:39:47 2020 From: chair at spectrumscale.org (Simon Thompson (Spectrum Scale User Group Chair)) Date: Tue, 24 Nov 2020 09:39:47 +0000 Subject: [gpfsug-discuss] SSUG::Digital with CIUK Message-ID: <> An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: meeting.ics Type: text/calendar Size: 2623 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 3499622 bytes Desc: not available URL: From prasad.surampudi at theatsgroup.com Tue Nov 24 16:05:19 2020 From: prasad.surampudi at theatsgroup.com (Prasad Surampudi) Date: Tue, 24 Nov 2020 16:05:19 +0000 Subject: [gpfsug-discuss] mmhealth reports fserrinvalid errors on CNFS servers Message-ID: We are seeing fserrinvalid error on couple of filesystems in Spectrum Scale cluster. These errors are reported but mmhealth only couple of nodes (CNFS servers) in the cluster, but mmhealth on other nodes shows no issues. Any idea what this error means? And why its reported on CNFS servers and not on other nodes? What need to be done to fix this issue? sudo /usr/lpp/mmfs/bin/mmhealth node show FILESYSTEM -v Node name: cnfs05-gpfs Component Status Reasons ------------------------------------------------------------------- FILESYSTEM DEGRADED fserrinvalid(vol) argus HEALTHY - dytech HEALTHY - enlnt_E HEALTHY - enlnt_Es HEALTHY - haaforfs HEALTHY - haaforfs2 HEALTHY - historical HEALTHY - prcfs HEALTHY - qmtfs HEALTHY - research HEALTHY - research2 HEALTHY - schon_raw HEALTHY - uhdb_vol1 HEALTHY - vol DEGRADED fserrinvalid(vol) Event Parameter Severity Event Message ---------------------------------------------------------------------------------------------------------- fserrinvalid vol ERROR FS=vol,ErrNo=1124,Unknown error=0464000000010000000180A108BC000079B4000000000000003400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 -------------- next part -------------- An HTML attachment was scrubbed... URL: From NSCHULD at de.ibm.com Tue Nov 24 16:44:35 2020 From: NSCHULD at de.ibm.com (Norbert Schuld) Date: Tue, 24 Nov 2020 17:44:35 +0100 Subject: [gpfsug-discuss] =?utf-8?q?mmhealth_reports_fserrinvalid_errors_o?= =?utf-8?q?n_CNFS=09servers?= In-Reply-To: References: Message-ID: To get an explanation for any event one can ask the system: # mmhealth event show fserrinvalid Event Name: fserrinvalid Event ID: 999338 Description: Unrecognized FSSTRUCT error received. Check documentation Cause: A filesystem corruption detected User Action: Check error message for details and the mmfs.log.latest log for further details. See the topic Checking and repairing a file system in the IBM Spectrum Scale documentation: Administering. Managing file systems. If the file system is severely damaged, the best course of action is to follow the procedures in section: Additional information to collect for file system corruption or MMFS_FSSTRUCT errors Severity: ERROR State: DEGRADED The event is triggered by a callback which may not fire on all nodes, that is why only a subset of nodes have the information. 
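Before resolving it, it can help to confirm whether the underlying FSSTRUCT records are still being written on the nodes that raised the event; a rough check along these lines (mmdsh is the helper shipped with Scale, but any parallel shell will do, and the grep patterns are only examples):

    # Look for the raw FSSTRUCT entries behind the event on a reporting node
    grep -i fsstruct /var/adm/ras/mmfs.log.latest
    # Fan the health query out to see which nodes still flag the file system
    /usr/lpp/mmfs/bin/mmdsh -N all "/usr/lpp/mmfs/bin/mmhealth node show FILESYSTEM | grep -i fserr"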
Depending on the version of scale the procedure to remove the event varies: For newer release please use # mmhealth event resolve Missing arguments. Usage: mmhealth event resolve {EventName} [Identifier] For older releases it is described here: https://www.ibm.com/support/knowledgecenter/STXKQY_5.0.5/com.ibm.spectrum.scale.v5r05.doc/bl1pdg_fsstruc.htm mmsysmonc event filesystem fsstruct_fixed Mit freundlichen Gr??en / Kind regards Norbert Schuld M925:IBM Spectrum Scale Software Development Phone: +49-160 70 70 335 IBM Deutschland Research & Development GmbH Email: nschuld at de.ibm.com Am Weiher 24 65451 Kelsterbach Knowing is not enough; we must apply. Willing is not enough; we must do. IBM Data Privacy Statement IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Gregor Pillen Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Prasad Surampudi To: "gpfsug-discuss at spectrumscale.org" Date: 24.11.2020 17:05 Subject: [EXTERNAL] [gpfsug-discuss] mmhealth reports fserrinvalid errors on CNFS servers Sent by: gpfsug-discuss-bounces at spectrumscale.org We are seeing fserrinvalid error on couple of filesystems in Spectrum Scale cluster. These errors are reported but mmhealth only couple of nodes (CNFS servers) in the cluster, but mmhealth on other nodes shows no issues. Any idea what this error means? And why its reported on CNFS servers and not on other nodes? What need to be done to fix this issue? sudo /usr/lpp/mmfs/bin/mmhealth node show FILESYSTEM -v Node name: cnfs05-gpfs Component Status Reasons ------------------------------------------------------------------- FILESYSTEM DEGRADED fserrinvalid(vol) argus HEALTHY - dytech HEALTHY - enlnt_E HEALTHY - enlnt_Es HEALTHY - haaforfs HEALTHY - haaforfs2 HEALTHY - historical HEALTHY - prcfs HEALTHY - qmtfs HEALTHY - research HEALTHY - research2 HEALTHY - schon_raw HEALTHY - uhdb_vol1 HEALTHY - vol DEGRADED fserrinvalid(vol) Event Parameter Severity Event Message ---------------------------------------------------------------------------------------------------------- fserrinvalid vol ERROR FS=vol,ErrNo=1124,Unknown error=0464000000010000000180A108BC000079B4000000000000003400000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1D963707.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From jake.carroll at uq.edu.au Wed Nov 25 21:29:24 2020 From: jake.carroll at uq.edu.au (Jake Carroll) Date: Wed, 25 Nov 2020 21:29:24 +0000 Subject: [gpfsug-discuss] IB routers in ESS configuration + 3 different subnets - valid config? Message-ID: Hi. I am just in the process of sanity-checking a potential future configuration. 
Let's say I have an ESS 5000 and an ESS 3000 placed on the data centre floor to form the basis of a new scratch array. Let's then suppose that I have three existing supercomputers in that same location. Each of those supercomputers has a separate IB subnet and their networks are unrelated to each other, IB-wise. My understanding is that it is valid and possible to use MLNX EDR IB *routers* in order to be able to transport NSD communications back and forth across those separate subnets, back to the ESS (which lives on its own unique subnet). So at this point, I've got four unique subnets - one for the ESS, one for each super. As I understand it, there is an upper limit of *SIX* unique subnets on those EDR IB routers. As I understand it - for IPoIB transport, I'd also need some "gateway" boxes more or less - essentially some decent servers which I put EDR/HDR cards in as dog legs that act as an IPoIB gateway interface to each subnet. I appreciate that there is devil in the detail - but what I'm asking is if it is valid to "route" NSD with IB Routers (not switches) this way to separate subnets. Colleagues at IBM have all said "yeah....should work....we've not done it....but should be fine?" Colleagues at Mellanox (uhhh...nvidia...) say "Yes, this is valid and does exactly as the IB Router should and there is nothing unusual about this". If someone has experience doing this or could call out any oddity/weirdness/gotchas, I'd be very appreciative. I'm fairly sure this is all very low risk - but given nobody locally could tell me "Yeah, all certified and valid!" I'd like the wisdom of the wider crowd. Thank you. --jc -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Fri Nov 27 11:46:05 2020 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Fri, 27 Nov 2020 17:16:05 +0530 Subject: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? In-Reply-To: References: <1388247256.209171.1605555854969@privateemail.com> Message-ID: Hi Yeep, >If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? rsync can be used to just fix metadata like ACLs and EAs. AFM does not revalidate the files with source system if rsync changes the ACLs on them. So ACLs can only be fixed after or during the cutover. ACL inheritance may be used by setting on ACLs on required parent dirs upfront if this option is sufficient, there was an user who migrated to scale using this method. ~Venkat (vpuvvada at in.ibm.com) From: "T.A. Yeep" To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 11/24/2020 07:40 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Venkat, If ACLs and other EAs migration from non scale is not supported by AFM, is there any 3rd party tool that could complement that when paired with AFM? On Tue, Nov 24, 2020 at 10:41 AM Venkateswara R Puvvada < vpuvvada at in.ibm.com> wrote: AFM provides near zero downtime for migration. As of today, AFM migration does not support ACLs or other EAs migration from non scale (GPFS) source. 
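A rough sketch of that metadata-only pass is below; the paths are invented, and it assumes the source ACLs are exposed as POSIX ACLs and xattrs on the mount being read, which is worth verifying on a small subtree first:

    # Re-apply ACLs and extended attributes onto files AFM has already migrated.
    # --existing stops rsync from creating anything new; -A/-X carry ACLs and xattrs.
    rsync -aAX --existing /mnt/source_export/ /gpfs/fs1/migr1/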
https://www.ibm.com/support/knowledgecenter/STXKQY_5.1.0/com.ibm.spectrum.scale.v5r10.doc/bl1ins_uc_migrationusingafmmigrationenhancements.htm ~Venkat (vpuvvada at in.ibm.com) From: "Frederick Stock" To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Date: 11/17/2020 03:14 AM Subject: [EXTERNAL] Re: [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Sent by: gpfsug-discuss-bounces at spectrumscale.org Have you considered using the AFM feature of Spectrum Scale? I doubt it will provide any speed improvement but it would allow for data to be accessed as it was being migrated. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: Andi Christiansen Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [EXTERNAL] [gpfsug-discuss] Migrate/syncronize data from Isilon to Scale over NFS? Date: Mon, Nov 16, 2020 2:44 PM Hi all, i have got a case where a customer wants 700TB migrated from isilon to Scale and the only way for him is exporting the same directory on NFS from two different nodes... as of now we are using multiple rsync processes on different parts of folders within the main directory. this is really slow and will take forever.. right now 14 rsync processes spread across 3 nodes fetching from 2.. does anyone know of a way to speed it up? right now we see from 1Gbit to 3Gbit if we are lucky(total bandwidth) and there is a total of 30Gbit from scale nodes and 20Gbits from isilon so we should be able to reach just under 20Gbit... if anyone have any ideas they are welcome! Thanks in advance Andi Christiansen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Best regards T.A. Yeep Mobile: +6-016-719 8506 | Tel: +6-03-7628 0526 | www.robusthpc.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Mon Nov 30 13:49:12 2020 From: carlz at us.ibm.com (Carl Zetie - carlz@us.ibm.com) Date: Mon, 30 Nov 2020 13:49:12 +0000 Subject: [gpfsug-discuss] Licensing costs for data lakes (SSUG follow-up) Message-ID: I am seeking some help on a topic I know many of you care deeply about: licensing costs I am trying to gather some more information about a request that has come up a couple of times, pricing for ?data lakes?. I would like to understand better what people are looking for here. - Is it as simple as ?much steeper discounts for very large deployments?? Or is a ?data lake? something specific, e.g. a large deployment that is not performance/latency sensitive; a storage pool that is [primarily] HDD; a tier that has specific read/write patterns such as moving entire large datasets in or out; or something else? 
Bear in mind that if we have special licensing for data lakes, we need a rigorous definition so that both you and we know whether your use of that licensing is compliant. Nobody likes ambiguity in licensing! - Are you expecting pricing to get very flat/discounting to get steep for large deployments? Or a different price tier/structure for ?data lakes? if we can rigorously define what one means? Do you agree or disagree with the proposition that if you keep adding storage hardware/capacity, that the software licensing cost should rise in proportion (even if that proportion is much smaller for a ?data lake? than for a performance tier)? - Feel free to be creative and imaginative. For example, would you be interested in a low-cost pricing model for storage that is an AFM Home and is _only_ accessed by using AFM to move data in and out of an AFM Cache (probably on the performance tier)? This would be conceptually similar to the way you can now (5.1) use AFM-Object to park data in a cheap object store. - Also feel free to answer questions I didn?t ask? If you prefer to discuss this in Slack rather than email, I started a discussion there a little while ago (please thread your comments!): https://ssug-poweraiug.slack.com/archives/CEVVCEE8M/p1605815075188800 Carl Zetie Program Director Offering Management Spectrum Scale ---- (919) 473 3318 ][ Research Triangle Park carlz at us.ibm.com [signature_1545794140] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 69558 bytes Desc: image001.png URL: From david_johnson at brown.edu Mon Nov 30 21:41:30 2020 From: david_johnson at brown.edu (David Johnson) Date: Mon, 30 Nov 2020 16:41:30 -0500 Subject: [gpfsug-discuss] internal details on GPFS inode expansion Message-ID: When GPFS needs to add inodes to the filesystem, it seems to pre-create about 4 million of them. Judging by the logs, it seems it only takes a few (13 maybe) seconds to do this. However we are suspecting that this might only be to request the additional inodes and that there is some background activity for some time afterwards. Would someone who has knowledge of the actual internals be willing to confirm or deny this, and if there is background activity, is it on all nodes in the cluster, NSD nodes, "default worker nodes"? Thanks, -- ddj Dave Johnson ddj at brown.edu