From j.ouwehand at vumc.nl Mon Oct 2 14:35:23 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Mon, 2 Oct 2017 13:35:23 +0000 Subject: [gpfsug-discuss] number of SMBD processes Message-ID: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> Hello, Since we use new "IBM Spectrum Scale SMB CES" nodes, we see that that the number of SMBD processes has increased significantly from ~ 4,000 to ~ 7,500. We also see that the SMBD processes are not closed. This is likely because the Samba global-parameter "deadtime" is missing. ------------ https://www.samba.org/samba/docs/using_samba/ch11.html This global option sets the number of minutes that Samba will wait for an inactive client before closing its session with the Samba server. A client is considered inactive when it has no open files and no data is being sent from it. The default value for this option is 0, which means that Samba never closes any connection, regardless of how long they have been inactive. This can lead to unnecessary consumption of the server's resources by inactive clients. We recommend that you override the default as follows: [global] deadtime = 10 ------------ Is this Samba parameter "deadtime" supported by IBM? Kindly regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [VUmc_logo_samen_kiezen_voor_beter] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 6431 bytes Desc: image001.gif URL: From bbanister at jumptrading.com Mon Oct 2 15:10:24 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 2 Oct 2017 14:10:24 +0000 Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 In-Reply-To: References: Message-ID: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> Thanks for posting this Sandeep! As Doris mentioned during the Spectrum Scale User Group meeting, IBM is doing a lot of good work to put this data out there, and many of us didn't know that this site was an available resource. Bookmarking is good, but there unfortunately is not a way to "watch" the space to get automatic notifications when new posts are available. I request that IBM announce these posts to this Spectrum Scale User Group list going forward to get the most impact from these offerings. Or post them to a space that can be watched so that we can get automatic notifications. Thanks again, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sandeep Ramesh Sent: Friday, September 29, 2017 11:02 PM To: gpfsug-discuss at spectrumscale.org Cc: Theodore Hoover Jr ; Doris Conti Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 Note: External Email ________________________________ Hi Folks I was asked by Doris Conti to send the below to our Spectrum Scale User group. Below is a consolidated link that list all the enablement on Spectrum Scale/ESS that was done in 1H 2017 - which have blogs and videos from development and offering management. https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/White%20Papers%20%26%20Media Do note, Spectrum Scale developers keep blogging on the below site which is worth bookmarking: https://developer.ibm.com/storage/blog/ (as recent as 4 new blogs in Sept) Thanks Sandeep Linkedin: https://www.linkedin.com/in/sandeeprpatil Spectrum Scale Dev. 
________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Mon Oct 2 15:13:52 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 2 Oct 2017 14:13:52 +0000 Subject: [gpfsug-discuss] User Meeting & SPXXL in NYC In-Reply-To: References: Message-ID: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> Hi Kristy, I lost half of my notes from the meeting and need to recreate them while the memories are still somewhat fresh. Would you please get and post the slides from the Spectrum Scale User Group meeting as soon as possible? Thanks for any help here! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Thursday, September 21, 2017 1:49 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC Note: External Email ________________________________ Registration space is getting tight. We decided on a room reconfiguration today to make a little more room. So if you tried to register and were told it was full try again. If it fills up again and you want to register, but can?t drop me an email and I?ll see what we can do. Best, Kristy On Sep 20, 2017, at 9:00 AM, Kristy Kallback-Rose > wrote: Thanks Doug. If you plan to go, *do register*. GPFS Day is free, but we need to know how many will attend. Register using the link on the HPCXXL event page below. Cheers, Kristy On Sep 20, 2017, at 1:28 AM, Douglas O'flaherty > wrote: Reminder that the SPXXL day on IBM Spectrum Scale in New York is open to all. It is Thursday the 28th. There is also a Power day on Wednesday. For more information http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ Doug Mobile _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Oct 2 15:23:25 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 2 Oct 2017 14:23:25 +0000 Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 In-Reply-To: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> References: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> Message-ID: <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> I?d agree with Bryan here ? Depending on me to passively check this for new material isn?t a great mechanism. If it?s on the mailing list, I will probably see it in a timely manner. On a related note, IBM has too many places to look for ?timely? Scale information/tips/how-to?s. It really should be in central repository and then use links to this location. I?m not sure I have a great idea here, but I have a list of 6-8 placed bookmarked that I go looking for articles. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of Bryan Banister Reply-To: gpfsug main discussion list Date: Monday, October 2, 2017 at 9:11 AM To: gpfsug main discussion list Cc: Theodore Hoover Jr , Doris Conti Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 Thanks for posting this Sandeep! As Doris mentioned during the Spectrum Scale User Group meeting, IBM is doing a lot of good work to put this data out there, and many of us didn?t know that this site was an available resource. Bookmarking is good, but there unfortunately is not a way to ?watch? the space to get automatic notifications when new posts are available. I request that IBM announce these posts to this Spectrum Scale User Group list going forward to get the most impact from these offerings. Or post them to a space that can be watched so that we can get automatic notifications. Thanks again, -Bryan -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Mon Oct 2 15:31:32 2017 From: ulmer at ulmer.org (Stephen Ulmer) Date: Mon, 2 Oct 2017 10:31:32 -0400 Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 In-Reply-To: <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> References: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> Message-ID: <8A33571E-905B-41D8-A934-C984A90EF6F9@ulmer.org> I?ve been told in the past that the Spectrum Scale Wiki is the place to watch for the most timely information, and there is a way to "follow" the wiki so you get notified of updates. That being said, I?ve not gotten "following" it to work yet so I don?t know what that actually *means*. I?d love to get a daily digest of all of the changes to that Wiki ? or even just a URL I would watch with IFTTT that would actually show me links to all of the updates. -- Stephen > On Oct 2, 2017, at 10:23 AM, Oesterlin, Robert > wrote: > > I?d agree with Bryan here ? Depending on me to passively check this for new material isn?t a great mechanism. If it?s on the mailing list, I will probably see it in a timely manner. > > On a related note, IBM has too many places to look for ?timely? Scale information/tips/how-to?s. It really should be in central repository and then use links to this location. 
I?m not sure I have a great idea here, but I have a list of 6-8 placed bookmarked that I go looking for articles. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > 507-269-0413 > > > From: > on behalf of Bryan Banister > > Reply-To: gpfsug main discussion list > > Date: Monday, October 2, 2017 at 9:11 AM > To: gpfsug main discussion list > > Cc: Theodore Hoover Jr >, Doris Conti > > Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 > > Thanks for posting this Sandeep! > > As Doris mentioned during the Spectrum Scale User Group meeting, IBM is doing a lot of good work to put this data out there, and many of us didn?t know that this site was an available resource. > > Bookmarking is good, but there unfortunately is not a way to ?watch? the space to get automatic notifications when new posts are available. I request that IBM announce these posts to this Spectrum Scale User Group list going forward to get the most impact from these offerings. Or post them to a space that can be watched so that we can get automatic notifications. > > Thanks again, > -Bryan > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Mon Oct 2 18:12:37 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 2 Oct 2017 17:12:37 +0000 Subject: [gpfsug-discuss] number of SMBD processes In-Reply-To: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> References: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.image001.gif at 01D33B90.D2CAECC0.gif Type: image/gif Size: 6431 bytes Desc: not available URL: From ckerner at illinois.edu Mon Oct 2 19:20:39 2017 From: ckerner at illinois.edu (Chad Kerner) Date: Mon, 2 Oct 2017 13:20:39 -0500 Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image Message-ID: Has anyone tried building the GPL layer on the latest CentOS 7.4 image with the 3.10.0-693.2.2.el7.x86_64 kernel? This is trying to build 4.2.0.4 on Centos 7.4-1708. protector -Wformat=0 -Wno-format-security -I/usr/lpp/mmfs/src/gpl-linux -c kdump.c cc kdump.o kdump-kern.o kdump-kern-dwarfs.o -o kdump -lpthread kdump-kern.o: In function `GetOffset': kdump-kern.c:(.text+0x9): undefined reference to `page_offset_base' kdump-kern.o: In function `KernInit': kdump-kern.c:(.text+0x58): undefined reference to `page_offset_base' collect2: error: ld returned 1 exit status make[1]: *** [modules] Error 1 make[1]: Leaving directory `/usr/lpp/mmfs/src/gpl-linux' make: *** [Modules] Error 1 -------------------------------------------------------- Thanks, Chad -- Chad Kerner, Senior Storage Engineer Storage Enabling Technologies National Center for Supercomputing Applications University of Illinois, Urbana-Champaign -------------- next part -------------- An HTML attachment was scrubbed... URL: From JRLang at uwyo.edu Mon Oct 2 20:31:59 2017 From: JRLang at uwyo.edu (Jeffrey R. Lang) Date: Mon, 2 Oct 2017 19:31:59 +0000 Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image In-Reply-To: References: Message-ID: Chad I asked this same question last week. 
The answer is to upgrade to Scpectrum 4.2.3.4 jeff From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Chad Kerner Sent: Monday, October 2, 2017 1:21 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image Has anyone tried building the GPL layer on the latest CentOS 7.4 image with the 3.10.0-693.2.2.el7.x86_64 kernel? This is trying to build 4.2.0.4 on Centos 7.4-1708. protector -Wformat=0 -Wno-format-security -I/usr/lpp/mmfs/src/gpl-linux -c kdump.c cc kdump.o kdump-kern.o kdump-kern-dwarfs.o -o kdump -lpthread kdump-kern.o: In function `GetOffset': kdump-kern.c:(.text+0x9): undefined reference to `page_offset_base' kdump-kern.o: In function `KernInit': kdump-kern.c:(.text+0x58): undefined reference to `page_offset_base' collect2: error: ld returned 1 exit status make[1]: *** [modules] Error 1 make[1]: Leaving directory `/usr/lpp/mmfs/src/gpl-linux' make: *** [Modules] Error 1 -------------------------------------------------------- Thanks, Chad -- Chad Kerner, Senior Storage Engineer Storage Enabling Technologies National Center for Supercomputing Applications University of Illinois, Urbana-Champaign -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Mon Oct 2 22:24:43 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 2 Oct 2017 14:24:43 -0700 Subject: [gpfsug-discuss] User Meeting & SPXXL in NYC In-Reply-To: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> References: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> Message-ID: Trying to get details on availability. More when I hear back. -Kristy > On Oct 2, 2017, at 7:13 AM, Bryan Banister wrote: > > Hi Kristy, > > I lost half of my notes from the meeting and need to recreate them while the memories are still somewhat fresh. Would you please get and post the slides from the Spectrum Scale User Group meeting as soon as possible? > > Thanks for any help here! > -Bryan > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org ] On Behalf Of Kristy Kallback-Rose > Sent: Thursday, September 21, 2017 1:49 PM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC > > Note: External Email > Registration space is getting tight. We decided on a room reconfiguration today to make a little more room. So if you tried to register and were told it was full try again. If it fills up again and you want to register, but can?t drop me an email and I?ll see what we can do. > > Best, > Kristy > > On Sep 20, 2017, at 9:00 AM, Kristy Kallback-Rose > wrote: > > Thanks Doug. > > If you plan to go, *do register*. GPFS Day is free, but we need to know how many will attend. Register using the link on the HPCXXL event page below. > > Cheers, > Kristy > > On Sep 20, 2017, at 1:28 AM, Douglas O'flaherty > wrote: > > > Reminder that the SPXXL day on IBM Spectrum Scale in New York is open to all. It is Thursday the 28th. There is also a Power day on Wednesday. 
> > > For more information > http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ > > Doug > > Mobile > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Mon Oct 2 22:26:57 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 2 Oct 2017 21:26:57 +0000 Subject: [gpfsug-discuss] User Meeting & SPXXL in NYC In-Reply-To: References: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> Message-ID: Kristy, Thanks for the quick response. I did reach out to Karthik about the File System Corruption (MMFSCK) presentation, which was really what I lost. I?m sure he?ll get me the presentation, so please don?t rush at this point on my account! Sorry for the fire drill, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Monday, October 02, 2017 4:25 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC Note: External Email ________________________________ Trying to get details on availability. More when I hear back. -Kristy On Oct 2, 2017, at 7:13 AM, Bryan Banister > wrote: Hi Kristy, I lost half of my notes from the meeting and need to recreate them while the memories are still somewhat fresh. Would you please get and post the slides from the Spectrum Scale User Group meeting as soon as possible? Thanks for any help here! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Thursday, September 21, 2017 1:49 PM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC Note: External Email ________________________________ Registration space is getting tight. We decided on a room reconfiguration today to make a little more room. So if you tried to register and were told it was full try again. If it fills up again and you want to register, but can?t drop me an email and I?ll see what we can do. Best, Kristy On Sep 20, 2017, at 9:00 AM, Kristy Kallback-Rose > wrote: Thanks Doug. If you plan to go, *do register*. 
GPFS Day is free, but we need to know how many will attend. Register using the link on the HPCXXL event page below. Cheers, Kristy On Sep 20, 2017, at 1:28 AM, Douglas O'flaherty > wrote: Reminder that the SPXXL day on IBM Spectrum Scale in New York is open to all. It is Thursday the 28th. There is also a Power day on Wednesday. For more information http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ Doug Mobile _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From leslie.james.elliott at gmail.com Tue Oct 3 12:32:56 2017 From: leslie.james.elliott at gmail.com (leslie elliott) Date: Tue, 3 Oct 2017 21:32:56 +1000 Subject: [gpfsug-discuss] transparent cloud tiering Message-ID: hi I am trying to change the account for the cloud tier but am having some problems any hints would be appreciated I am not interested in the data locally or migrated but do not seem to be able to recall this so would just like to repurpose it with the new account I can see in the logs 2017-10-03_15:38:49.226+1000: [W] Snapshot quiesce of SG cloud01 snap -1/0 doing 'mmcrsnapshot :MCST.scan.6' timed out on node . Retrying if possible. 
which is no doubt the reason for the following mmcloudgateway account delete --cloud-nodeclass TCTNodeClass --cloud-name gpfscloud1234 mmcloudgateway: Sending the command to the first successful node starting with gpfs-dev02 mmcloudgateway: This may take a while... mmcloudgateway: Error detected on node gpfs-dev02 The return code is 94. The error is: MCSTG00084E: Command Failed with following reason: Unable to create snapshot for file system /gpfs/itscloud01, [Ljava.lang.String;@3353303e failed with: com.ibm.gpfsconnector.messages.GpfsConnectorException: Command [/usr/lpp/mmfs/bin/mmcrsnapshot, cloud01, MCST.scan.4] failed with the following return code: 78.. mmcloudgateway: Sending the command to the next node gpfs-dev04 mmcloudgateway: Error detected on node gpfs-dev04 The return code is 94. The error is: MCSTG00084E: Command Failed with following reason: Unable to create snapshot for file system /gpfs/cloud01, [Ljava.lang.String;@90a887ad failed with: com.ibm.gpfsconnector.messages.GpfsConnectorException: Command [/usr/lpp/mmfs/bin/mmcrsnapshot, cloud01, MCST.scan.6] failed with the following return code: 78.. mmcloudgateway: Command failed. Examine previous error messages to determine cause. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Oct 3 12:57:21 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 3 Oct 2017 07:57:21 -0400 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f@nasa.gov> Hi All, At the SSUG in NY there was mention of operating systems as well as feature deprecations that would occur in the lifecycle of the next major release of GPFS. I'm not sure if this is public knowledge yet so I haven't mentioned specifics but given the proposed release time frame of the next major release I thought customers may appreciate having access to this information so they could provide feedback about the potential impact to their environment if these deprecations do occur. Any chance someone from IBM could provide specifics here so folks can chime in? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From j.ouwehand at vumc.nl Wed Oct 4 12:59:45 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Wed, 4 Oct 2017 11:59:45 +0000 Subject: [gpfsug-discuss] number of SMBD processes In-Reply-To: References: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> Message-ID: <5594921EA5B3674AB44AD9276126AAF40170E4185E@sp-mx-mbx4> Hello Christof, Thank you very much for the explanation. You have point us in the right direction. Vriendelijke groet, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [VUmc_logo_samen_kiezen_voor_beter] Van: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Christof Schmitt Verzonden: maandag 2 oktober 2017 19:13 Aan: gpfsug-discuss at spectrumscale.org CC: gpfsug-discuss at spectrumscale.org Onderwerp: Re: [gpfsug-discuss] number of SMBD processes Hello, the short answer is that the "deadtime" parameter is not a supported parameter in Spectrum Scale. The longer answer is that setting "deadtime" likely does not solve any issue. "deadtime" was introduced in Samba mainly for older protocol versions. While it is implemented independent of protocol versions, not the statement about "no open files" for a connection to be closed. Spectrum Scale only supports SMB versions 2 and 3. 
Basically everything there is based on an open file handle. Most SMB 2/3 clients open at least the root directory of the export and register for change notifications there and the client then can wait for any time for changes. That is a valid case, and the open directory handle prevents the connection from being affected by any setting of the "deadtime" parameter. Clients that are no longer active and have not properly closed the connection are detected on the TCP level: # mmsmb config list | grep sock socket options TCP_NODELAY SO_KEEPALIVE TCP_KEEPCNT=4 TCP_KEEPIDLE=240 TCP_KEEPINTVL=15 Every client that no longer responds for 5 minutes will have the connection dropped (240s + 4x15s). On the other hand, if the SMB clients are still responding to TCP keep-alive packets, then the connection is considered valid. It might be interesting to look into the unwanted connections and possibly capture a network trace or look into the client systems to better understand the situation. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: "Ouwehand, JJ" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" > Cc: Subject: [gpfsug-discuss] number of SMBD processes Date: Mon, Oct 2, 2017 6:35 AM Hello, Since we use new ?IBM Spectrum Scale SMB CES? nodes, we see that that the number of SMBD processes has increased significantly from ~ 4,000 to ~ 7,500. We also see that the SMBD processes are not closed. This is likely because the Samba global-parameter ?deadtime? is missing. ------------ https://www.samba.org/samba/docs/using_samba/ch11.html This global option sets the number of minutes that Samba will wait for an inactive client before closing its session with the Samba server. A client is considered inactive when it has no open files and no data is being sent from it. The default value for this option is 0, which means that Samba never closes any connection, regardless of how long they have been inactive. This can lead to unnecessary consumption of the server's resources by inactive clients. We recommend that you override the default as follows: [global] deadtime = 10 ------------ Is this Samba parameter ?deadtime? supported by IBM? Kindly regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [VUmc_logo_samen_kiezen_voor_beter] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=LCAKWPxQj5PMUf5YKTH3Z0zW9cDW--1AO_mljWE3ni8&s=y0FjQ5P-9Q7YjxyvuNNa4kdzHZKfrsjW81pGDLMNuig&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 6431 bytes Desc: image001.gif URL: From heiner.billich at psi.ch Wed Oct 4 18:26:03 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Wed, 4 Oct 2017 17:26:03 +0000 Subject: [gpfsug-discuss] AFM - prefetch of many small files - tuning - storage latency required to increase max socket buffer size ... Message-ID: <0A9C5A40-221C-46B5-B7E3-72A9D5A6D483@psi.ch> Hello, A while ago I asked the list for advice on how to tune AFM to speed-up the prefetch of small files (~1MB). 
In the meantime, we got some results which I want to share. We had to increase the maximum socket buffer sizes to very high values of 40-80MB. Consider that we use IP over Infiniband and the bandwidth-delay-product is about 5MB (1-10us latency). How do we explain this? The reads on the nfs server have a latency of about 20ms. This is physics of disks. Hence a single client can get up to 50 requests/s. Each request is 1MB. To get 1GB/s we need 20 clients in parallel. At all times we have about 20 requests pending. Looks like the server does allocate the socket buffer space before it asks for the data. Hence it allocates/blocks about 20MB at all times. Surprisingly it?s storage latency and not network latency that required us to increase the max. socket buffer size. For large files prefetch works and reduces the latency of reads drastically and no special tuning is required. We did test with kernel-nfs and gpfs 4.2.3 on RHEL7. Whether ganesha shows a similar pattern would be interesting to know. Once we fixed the nfs issues afm did show a nice parallel prefetch up to ~1GB/s with 1MB sized files without any tuning. Still much below the 4GB/s measured with iperf between the two nodes ?. Kind regards, Heiner -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From kkr at lbl.gov Wed Oct 4 22:44:10 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Wed, 4 Oct 2017 14:44:10 -0700 Subject: [gpfsug-discuss] NYC Meeting Presos (Workaround) Message-ID: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> Hi, I?m having some trouble getting links added to the SS/GPFS UG page, but I want to share the presos I have so far, a couple more are coming soon. So, as a workaround (as storage people we can appreciate workarounds, right?!), here are the links to the slides I have thus far: Spectrum Scale Object at CSCS: http://files.gpfsug.org/presentations/2017/NYC/Day4-ss_object_cscs.pdf File System Corruptions & Best Practices: http://files.gpfsug.org/presentations/2017/NYC/Day4-hpcxxl2017_fsck_karthik.pdf Spectrum Scale Cloud Enablement: http://files.gpfsug.org/presentations/2017/NYC/Day4-SpectrumScaleCloudEnablement%20Sept28.pdf IBM Spectrum Scale 4.2.3 Security Overview: http://files.gpfsug.org/presentations/2017/NYC/Day4-Spectrum%20Scale%20Security-%20User%20Group%20-%20NYCSept2017.pdf What?s New in Spectrum Scale: http://files.gpfsug.org/presentations/2017/NYC/Day4-NYC%20User%20Group%202017%20What's%20New%20v2.pdf Cheers, Kristy -------------- next part -------------- An HTML attachment was scrubbed... URL: From jez.tucker at gpfsug.org Thu Oct 5 11:11:53 2017 From: jez.tucker at gpfsug.org (Jez Tucker) Date: Thu, 5 Oct 2017 11:11:53 +0100 Subject: [gpfsug-discuss] NYC Meeting Presos (Workaround) In-Reply-To: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> References: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> Message-ID: *waves hands*? - I can help here if you have issues.? Same for anyone else. ping me 1::1 On 04/10/17 22:44, Kristy Kallback-Rose wrote: > Hi, > > I?m having some trouble getting links added to the SS/GPFS UG page, > but I want to share the presos I have so far, a couple more are coming > soon. 
So, as a workaround (as storage people we can appreciate > workarounds, right?!), here are the links to the slides I have thus far: > > Spectrum Scale Object at CSCS: > http://files.gpfsug.org/presentations/2017/NYC/Day4-ss_object_cscs.pdf > > File System Corruptions & Best Practices: > http://files.gpfsug.org/presentations/2017/NYC/Day4-hpcxxl2017_fsck_karthik.pdf > > Spectrum Scale Cloud Enablement: > http://files.gpfsug.org/presentations/2017/NYC/Day4-SpectrumScaleCloudEnablement%20Sept28.pdf > > IBM Spectrum Scale 4.2.3 Security Overview: > http://files.gpfsug.org/presentations/2017/NYC/Day4-Spectrum%20Scale%20Security-%20User%20Group%20-%20NYCSept2017.pdf > > What?s New in Spectrum Scale: > http://files.gpfsug.org/presentations/2017/NYC/Day4-NYC%20User%20Group%202017%20What's%20New%20v2.pdf > > > Cheers, > Kristy > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From listymclistfaces at gmail.com Fri Oct 6 13:56:04 2017 From: listymclistfaces at gmail.com (listy mclistface) Date: Fri, 6 Oct 2017 13:56:04 +0100 Subject: [gpfsug-discuss] Client power failure Message-ID: Hi, Although our NSD nodes are on UPS etc, we have some clients which aren't. Do we run the risk of FS corruption if we drop client nodes mid write? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jez.tucker at gpfsug.org Fri Oct 6 14:14:59 2017 From: jez.tucker at gpfsug.org (Jez Tucker) Date: Fri, 6 Oct 2017 14:14:59 +0100 Subject: [gpfsug-discuss] Client power failure In-Reply-To: References: Message-ID: <61604124-ec28-c930-7ea3-a20a6223b779@gpfsug.org> Hi ? Can we please refrain from completely anonymous emails ListyMcListFaces ;-) Ta ListMasterMcListAdmin On 06/10/17 13:56, listy mclistface wrote: > Hi, > > Although our NSD nodes are on UPS etc, we have some clients which > aren't.? ?Do we run the risk of FS corruption if we drop client nodes > mid write? > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Oct 6 14:24:11 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 6 Oct 2017 13:24:11 +0000 Subject: [gpfsug-discuss] Client power failure Message-ID: I agree ? anonymous ones should be dropped from the list. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of Jez Tucker Reply-To: "jez.tucker at gpfsug.org" , gpfsug main discussion list Date: Friday, October 6, 2017 at 8:17 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Client power failure Can we please refrain from completely anonymous emails ListyMcListFaces ;-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Fri Oct 6 14:45:38 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Fri, 6 Oct 2017 13:45:38 +0000 Subject: [gpfsug-discuss] Client power failure In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... 
URL: From carlz at us.ibm.com Fri Oct 6 21:39:28 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 6 Oct 2017 20:39:28 +0000 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release In-Reply-To: References: Message-ID: Hi Aaron, I appreciate your care with this. The user group are the first users to be briefed on this. We're not quite ready to put more in writing just yet, however I will be at SC17 and hope to be able to do so at that time. (I'll also take any other questions that people want to ask, including "where's my RFE?"...) I also want to add one note about the meaning of feature deprecation, because it's not well understood even within IBM: If we deprecate a feature with the next major release it does NOT mean we are dropping support there and then. It means we are announcing the INTENTION to drop support in some future release, and encourage you to (a) start making plans on migration to a supported alternative, and (b) chime in on what you need in order to be able to satisfactorily migrate if our proposed alternative is not adequate. regards, Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com ------------------------------ Message: 2 Date: Tue, 3 Oct 2017 07:57:21 -0400 From: Aaron Knister To: gpfsug main discussion list Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f at nasa.gov> Content-Type: text/plain; charset="utf-8"; format=flowed Hi All, At the SSUG in NY there was mention of operating systems as well as feature deprecations that would occur in the lifecycle of the next major release of GPFS. I'm not sure if this is public knowledge yet so I haven't mentioned specifics but given the proposed release time frame of the next major release I thought customers may appreciate having access to this information so they could provide feedback about the potential impact to their environment if these deprecations do occur. Any chance someone from IBM could provide specifics here so folks can chime in? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ------------------------------ From aaron.s.knister at nasa.gov Fri Oct 6 23:30:05 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 6 Oct 2017 18:30:05 -0400 Subject: [gpfsug-discuss] changing default configuration values Message-ID: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From scale at us.ibm.com Sat Oct 7 04:06:41 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 6 Oct 2017 23:06:41 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Hi Aaron, The default value applies to all nodes in the cluster. Thus changing it will change all nodes in the cluster. You need to run mmchconfig to customize the node override again. 
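As an illustration only (a sketch reusing the hypothetical parameter foo and the nodes n001,n002 from the question, not output from a real cluster), the sequence would look like this:

# set the new cluster-wide default; as noted above, this changes all nodes, including n001 and n002
mmchconfig foo=qux
# then re-apply the override for the special nodes
mmchconfig foo=baz -N n001,n002
# verify the common value and the per-node overrides
mmlsconfig foo

mmlsconfig shows the common value together with any per-node or per-nodeclass overrides, so it is an easy way to confirm that the override has been restored.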
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 10/06/2017 06:30 PM Subject: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From john.hearns at asml.com Mon Oct 9 09:38:29 2017 From: john.hearns at asml.com (John Hearns) Date: Mon, 9 Oct 2017 08:38:29 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Aaron, The reply you just got her is absolutely the correct one. However, its worth contributing something here. I have recently bene dealing with the parameter verbsPorts - which is a list of the interfaces which verbs should use. I found on our cluyster it was set to use dual ports for all nodes, including servers, when only our servers have dual ports. I will follow the advice below and make a global change, then change back the configuration for the server. It is worth looking though at mmllnodeclass -all There is a rather rich set of nodeclasses, including clientNodes managerNodes nonNsdNodes nonQuorumNodes So if you want to make changes to a certain type of node in your cluster you will be able to achieve it using nodeclasses. Bond, James Bond commander.bond at mi6.gov.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of IBM Spectrum Scale Sent: Saturday, October 07, 2017 5:07 AM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Hi Aaron, The default value applies to all nodes in the cluster. 
Thus changing it will change all nodes in the cluster. You need to run mmchconfig to customize the node override again. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. [Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is there a way to change the default value of a configurati]Aaron Knister ---10/06/2017 06:30:20 PM---Is there a way to change the default value of a configuration option without overriding any overrid From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM Subject: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From john.hearns at asml.com Mon Oct 9 09:44:28 2017 From: john.hearns at asml.com (John Hearns) Date: Mon, 9 Oct 2017 08:44:28 +0000 Subject: [gpfsug-discuss] Setting fo verbsRdmaSend Message-ID: We have a GPFS setup which is completely Infiniband connected. Version 4.2.3.4 I see that verbsRdmaCm is set to Disabled. Reading up about this, I am inclined to leave this disabled. Can anyone comment on the likely effects of changing it, and if there are any real benefits in performance? commander.bond at mi6.gov.uk -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.ouwehand at vumc.nl Mon Oct 9 10:13:07 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Mon, 9 Oct 2017 09:13:07 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus Message-ID: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Hello, Currently we are urgently looking for an on-access antivirus solution for our IBM Spectrum Scale SMB CES Cluster. Unfortunately IBM has no such solution. Does anyone have a good supported solution? Kind regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [cid:image003.png at 01D340EF.9527A0C0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.png Type: image/png Size: 8437 bytes Desc: image003.png URL: From r.sobey at imperial.ac.uk Mon Oct 9 10:16:35 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 9 Oct 2017 09:16:35 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus In-Reply-To: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> References: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Message-ID: According to one of the presentations posted on this list a few days ago, there is "bulk antivirus scanning with Symantec AV" "coming soon". From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ouwehand, JJ Sent: 09 October 2017 10:13 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus Hello, Currently we are urgently looking for an on-access antivirus solution for our IBM Spectrum Scale SMB CES Cluster. Unfortunately IBM has no such solution. Does anyone have a good supported solution? Kind regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [cid:image001.png at 01D340E7.AF732BA0] -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From daniel.kidger at uk.ibm.com  Mon Oct  9 10:27:57 2017
From: daniel.kidger at uk.ibm.com (Daniel Kidger)
Date: Mon, 9 Oct 2017 09:27:57 +0000
Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus
In-Reply-To: 
References: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4>
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From a.khiredine at meteo.dz  Mon Oct  9 13:47:09 2017
From: a.khiredine at meteo.dz (atmane khiredine)
Date: Mon, 9 Oct 2017 12:47:09 +0000
Subject: [gpfsug-discuss] how gpfs work when disk fail
Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz>

Dear all,

How does GPFS behave when a disk fails? This is an example scenario of a disk failure:

1 server
2 disks of 100GB, directly attached to the local node

mmlscluster

GPFS cluster information
========================
GPFS cluster name:         test.gpfs
GPFS cluster id:           174397273000001824
GPFS UID domain:           test.gpfs
Remote shell command:      /usr/bin/ssh
Remote file copy command:  /usr/bin/scp
Repository type:           server-based

GPFS cluster configuration servers:
-----------------------------------
Primary server:    gpfs
Secondary server:  (none)

Node  Daemon node name  IP address    Admin node name  Designation
-------------------------------------------------------------------
1     gpfs              192.168.1.10  gpfs             quorum-manager

cat disk

%nsd:
device=/dev/sdb
nsd=nsda
servers=gpfs
usage=dataAndMetadata
pool=system

%nsd:
device=/dev/sdc
nsd=nsdb
servers=gpfs
usage=dataAndMetadata
pool=system

mmcrnsd -F disk.txt

mmlsnsd -X

Disk name    NSD volume ID     Device    Devtype  Node name  Remarks
---------------------------------------------------------------------------
nsdsdbgpfsa  C0A8000F59DB69E2  /dev/sdb  generic  gpfsa-ib   server node
nsdsdcgpfsa  C0A8000F59DB69E3  /dev/sdc  generic  gpfsa-ib   server node

mmcrfs gpfs -F disk.txt -B 1M -L 32M -T /gpfs -A no -m 2 -M 3 -r 2 -R 3

mmmount gpfs

df -h

gpfs  200G  3,8G  197G  2%  /gpfs  <-- the file system shows 200GB in total

My question is the following: if I write 180 GB of data in /gpfs and the disk /dev/sdb fails, how can the remaining disk and/or GPFS still hold all my data?

Thanks

Atmane Khiredine
HPC System Administrator | Office National de la Météorologie
Tél : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz

From S.J.Thompson at bham.ac.uk  Mon Oct  9 13:57:08 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support))
Date: Mon, 9 Oct 2017 12:57:08 +0000
Subject: [gpfsug-discuss] AFM fun (more!)
Message-ID: 

Hi All,

We're having fun (OK, not fun ...) with AFM.

We have a file-set where the queue length isn't shortening: watching it over 5-second periods, the queue length increases by ~600-1000 items and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ...
This is on one particular fileset, e.g.:

mmafmctl rds-cache getstate
Mon Oct 9 08:43:58 2017

Fileset Name           Fileset Target                 Cache State  Gateway Node  Queue Length  Queue numExec
------------           --------------                 -----------  ------------  ------------  -------------
rds-projects-facility  gpfs:///rds/projects/facility  Dirty        bber-afmgw01  3068953       520504
rds-projects-2015      gpfs:///rds/projects/2015      Active       bber-afmgw01  0             3
rds-projects-2016      gpfs:///rds/projects/2016      Dirty        bber-afmgw01  1482          70
rds-projects-2017      gpfs:///rds/projects/2017      Dirty        bber-afmgw01  713           9104
bear-apps              gpfs:///rds/bear-apps          Dirty        bber-afmgw02  3             2472770871
user-homes             gpfs:///rds/homes              Active       bber-afmgw02  0             19
bear-sysapps           gpfs:///rds/bear-sysapps       Active       bber-afmgw02  0             4

This is having the effect that other filesets on the same "Gateway" are not getting their queues processed.

Question 1. Can we force the gateway node for the other file-sets to our "02" node, i.e. so that we can get the queue services for the other filesets?

Question 2. How can we make AFM actually work for the "facility" file-set? If we shut down GPFS on the node, then on the secondary node we'll see log entries like:

2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations...

So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entries like:

2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5

Anyone any suggestions?

Thanks

Simon

From janfrode at tanso.net  Mon Oct  9 14:45:32 2017
From: janfrode at tanso.net (Jan-Frode Myklebust)
Date: Mon, 9 Oct 2017 15:45:32 +0200
Subject: [gpfsug-discuss] how gpfs work when disk fail
In-Reply-To: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz>
References: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz>
Message-ID: 

You don't have room to write 180GB of file data, only ~100GB. When you write e.g. 90 GB of file data, each filesystem block will get one copy written to each of your disks, occupying 180 GB of total disk space. So you can always read it from the other disk if one should fail.

This is controlled by your "-m 2 -r 2" settings, and the default failureGroup -1, since you didn't specify a failure group in your disk descriptor. Normally I would always specify a failure group when doing replication.
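For example, a variant of the stanza file from the original question (quoted below) with explicit failure groups could look like this; it is only a sketch, and the failure group numbers are arbitrary:

%nsd:
device=/dev/sdb
nsd=nsda
servers=gpfs
usage=dataAndMetadata
failureGroup=1
pool=system

%nsd:
device=/dev/sdc
nsd=nsdb
servers=gpfs
usage=dataAndMetadata
failureGroup=2
pool=system

With "-m 2 -r 2" the two copies of each block are then placed in different failure groups, so the failure of either disk still leaves a complete copy of data and metadata on the other.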
-jf On Mon, Oct 9, 2017 at 2:47 PM, atmane khiredine wrote: > dear all > > how gpfs work when disk fail > > this is a example scenario when disk fail > > 1 Server > > 2 Disk directly attached to the local node 100GB > > mmlscluster > > GPFS cluster information > ======================== > GPFS cluster name: test.gpfs > GPFS cluster id: 174397273000001824 > GPFS UID domain: test.gpfs > Remote shell command: /usr/bin/ssh > Remote file copy command: /usr/bin/scp > Repository type: server-based > > GPFS cluster configuration servers: > ----------------------------------- > Primary server: gpfs > Secondary server: (none) > > Node Daemon node name IP address Admin node name Designation > ------------------------------------------------------------------- > 1 gpfs 192.168.1.10 gpfs quorum-manager > > cat disk > > %nsd: > device=/dev/sdb > nsd=nsda > servers=gpfs > usage=dataAndMetadata > pool=system > > %nsd: > device=/dev/sdc > nsd=nsdb > servers=gpfs > usage=dataAndMetadata > pool=system > > mmcrnsd -F disk.txt > > mmlsnsd -X > > Disk name NSD volume ID Device Devtype Node name Remarks > ------------------------------------------------------------ > --------------- > nsdsdbgpfsa C0A8000F59DB69E2 /dev/sdb generic gpfsa-ib server node > nsdsdcgpfsa C0A8000F59DB69E3 /dev/sdc generic gpfsa-ib server node > > > mmcrfs gpfs -F disk.txt -B 1M -L 32M -T /gpfs -A no -m 2 -M 3 -r 2 -R 3 > > mmmount gpfs > > df -h > > gpfs 200G 3,8G 197G 2% /gpfs <-- The Disk Have 200GB > > my question is the following ?? > > if I write 180 GB of data in /gpfs > and the disk /dev/sdb is fail > how the disk and/or GPFS continues to support all my data > > Thanks > > Atmane Khiredine > HPC System Administrator | Office National de la M?t?orologie > T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : > a.khiredine at meteo.dz > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Oct 9 15:38:15 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 9 Oct 2017 14:38:15 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.s.knister at nasa.gov Mon Oct 9 19:55:45 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 14:55:45 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Thanks John! Funnily enough playing with node classes is what sent me down this path. I had a bunch of nodes defined (just over 1000) with a lower pagepool than the default. I then started using nodeclasses to clean up the config and I noticed that if you define a parameter with a nodeclass it doesn't override any previously set values for nodes in the node class. What I mean by that is if you do this: - mmchconfig pagepool=256M -N n001 - add node n001 to nodeclass mynodeclass - mmchconfig pagepool=256M -N mynodeclass after the 2nd chconfig there is still a definition for pagepool=256M for node n001. I tried to clean things up by doing "mmchconfig pagepool=DEFAULT -N n001" however the default value of the pagepool in our config is 1024M not the "1G" mmchconfig expects as the defualt value so I wasn't able to remove the explicit definition of pagepool for n001. What I ended up doing was an "mmchconfig pagepool=1024M -N n001" and that removed the explicit definitions. -Aaron On 10/9/17 4:38 AM, John Hearns wrote: > Aaron, > > The reply you just got her is absolutely the correct one. > > However, its worth contributing something here. I have recently bene > dealing with the parameter verbsPorts ? which is a list of the > interfaces which verbs should use. I found on our cluyster it was set to > use dual ports for all nodes, including servers, when only our servers > have dual ports.? I will follow the advice below and make a global > change, then change back the configuration for the server. > > It is worth looking though at? mmllnodeclass ?all > > There is a rather rich set of nodeclasses, including?? clientNodes > ??managerNodes nonNsdNodes? nonQuorumNodes > > So if you want to make changes to a certain type of node in your cluster > you will be able to achieve it using nodeclasses. > > Bond, James Bond > > commander.bond at mi6.gov.uk > > *From:* gpfsug-discuss-bounces at spectrumscale.org > [mailto:gpfsug-discuss-bounces at spectrumscale.org] *On Behalf Of *IBM > Spectrum Scale > *Sent:* Saturday, October 07, 2017 5:07 AM > *To:* gpfsug main discussion list > *Cc:* gpfsug-discuss-bounces at spectrumscale.org > *Subject:* Re: [gpfsug-discuss] changing default configuration values > > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 > . > > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. 
> > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > > To: gpfsug main discussion list > > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > ------------------------------------------------------------------------ > > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > -- The information contained in this communication and any attachments > is confidential and may be privileged, and is for the sole use of the > intended recipient(s). Any unauthorized review, use, disclosure or > distribution is prohibited. Unless explicitly stated otherwise in the > body of this communication or the attachment thereto (if any), the > information is provided on an AS-IS basis without any express or implied > warranties or liabilities. To the extent you are relying on this > information, you are doing so at your own risk. If you are not the > intended recipient, please notify the sender immediately by replying to > this message and destroy all copies of this message and any attachments. > Neither the sender nor the company/group of companies he or she > represents shall be liable for the proper and complete transmission of > the information contained in this communication, or for any delay in its > receipt. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 19:56:25 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 14:56:25 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: <01c2a2bb-f332-e067-e7b5-6954df14c25d@nasa.gov> Thanks! Good to know. On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. 
> > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. > > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 20:00:02 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 15:00:02 -0400 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release In-Reply-To: References: Message-ID: <49283f9f-12b1-6381-6890-37d16aa87635@nasa.gov> Thanks Carl. Unfortunately I won't be at SC17 this year but thankfully a number of my colleagues will be so I'll send them with a list of questions on my behalf :) On 10/6/17 4:39 PM, Carl Zetie wrote: > Hi Aaron, > > I appreciate your care with this. The user group are the first users to be briefed on this. > > We're not quite ready to put more in writing just yet, however I will be at SC17 and hope > to be able to do so at that time. (I'll also take any other questions that people want to > ask, including "where's my RFE?"...) 
> > I also want to add one note about the meaning of feature deprecation, because it's not well > understood even within IBM: If we deprecate a feature with the next major release it does > NOT mean we are dropping support there and then. It means we are announcing the INTENTION > to drop support in some future release, and encourage you to (a) start making plans on > migration to a supported alternative, and (b) chime in on what you need in order to be > able to satisfactorily migrate if our proposed alternative is not adequate. > > regards, > > > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > > ------------------------------ > > Message: 2 > Date: Tue, 3 Oct 2017 07:57:21 -0400 > From: Aaron Knister > To: gpfsug main discussion list > Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major > release > Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f at nasa.gov> > Content-Type: text/plain; charset="utf-8"; format=flowed > > Hi All, > > At the SSUG in NY there was mention of operating systems as well as > feature deprecations that would occur in the lifecycle of the next major > release of GPFS. I'm not sure if this is public knowledge yet so I > haven't mentioned specifics but given the proposed release time frame of > the next major release I thought customers may appreciate having access > to this information so they could provide feedback about the potential > impact to their environment if these deprecations do occur. Any chance > someone from IBM could provide specifics here so folks can chime in? > > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 21:46:59 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 16:46:59 -0400 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> Message-ID: <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Hi Sven, Just wondering if you've had any additional thoughts/conversations about this. -Aaron On 9/8/17 5:21 PM, Sven Oehme wrote: > Hi, > > the code assumption is that the underlying device has no volatile write > cache, i was absolute sure we have that somewhere in the FAQ, but i > couldn't find it, so i will talk to somebody to correct this. > if i understand > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt?correct > one could enforce this by setting REQ_FUA, but thats not explicitly set > today, at least i can't see it. i will discuss this with one of our devs > who owns this code and come back. > > sven > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > wrote: > > Thanks Sven. I didn't think GPFS itself was caching anything on that > layer, but it's my understanding that O_DIRECT isn't sufficient to force > I/O to be flushed (e.g. the device itself might have a volatile caching > layer). Take someone using ZFS zvol's as NSDs. I can write() all day log > to that zvol (even with O_DIRECT) but there is absolutely no guarantee > those writes have been committed to stable storage and aren't just > sitting in RAM until an fsync() occurs (or some other bio function that > causes a flush). I also don't believe writing to a SATA drive with > O_DIRECT will force cache flushes of the drive's writeback cache.. > although I just tested that one and it seems to actually trigger a scsi > cache sync. Interesting. 
> > -Aaron > > On 9/7/17 10:55 PM, Sven Oehme wrote: > > I am not sure what exactly you are looking for but all > blockdevices are > > opened with O_DIRECT , we never cache anything on this layer . > > > > > > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > > > >> wrote: > > > >? ? ?Hi Everyone, > > > >? ? ?This is something that's come up in the past and has recently > resurfaced > >? ? ?with a project I've been working on, and that is-- it seems > to me as > >? ? ?though mmfsd never attempts to flush the cache of the block > devices its > >? ? ?writing to (looking at blktrace output seems to confirm > this). Is this > >? ? ?actually the case? I've looked at the gpl headers for linux > and I don't > >? ? ?see any sign of blkdev_fsync, blkdev_issue_flush, WRITE_FLUSH, or > >? ? ?REQ_FLUSH. I'm sure there's other ways to trigger this > behavior that > >? ? ?GPFS may very well be using that I've missed. That's why I'm > asking :) > > > >? ? ?I figure with FPO being pushed as an HDFS replacement using > commodity > >? ? ?drives this feature has *got* to be in the code somewhere. > > > >? ? ?-Aaron > > > >? ? ?-- > >? ? ?Aaron Knister > >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ?Goddard Space Flight Center > > (301) 286-2776 > >? ? ?_______________________________________________ > >? ? ?gpfsug-discuss mailing list > >? ? ?gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Mon Oct 9 22:07:10 2017 From: oehmes at gmail.com (Sven Oehme) Date: Mon, 09 Oct 2017 21:07:10 +0000 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Message-ID: Hi, yeah sorry i intended to reply back before my vacation and forgot about it the the vacation flushed it all away :-D so right now the assumption in Scale/GPFS is that the underlying storage doesn't have any form of enabled volatile write cache. the problem seems to be that even if we set REQ_FUA some stacks or devices may not have implemented that at all or correctly, so even we would set it there is no guarantee that it will do what you think it does. the benefit of adding the flag at least would allow us to blame everything on the underlying stack/device , but i am not sure that will make somebody happy if bad things happen, therefore the requirement of a non-volatile device will still be required at all times underneath Scale. so if you think we should do this, please open a PMR with the details of your test so it can go its regular support path. you can mention me in the PMR as a reference as we already looked at the places the request would have to be added. 
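as a side note, a quick and unofficial way to check whether a block device currently advertises a volatile write cache on a reasonably recent linux kernel (the sysfs path is an assumption, verify on your distro):

cat /sys/block/sdX/queue/write_cache   # "write back" = volatile cache present, "write through" = none visible to the kernel
hdparm -W /dev/sdX                     # for SATA/SAS drives, reports the on-drive write cache setting

if that reports "write back" you are exactly in the territory described above.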
Sven On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister wrote: > Hi Sven, > > Just wondering if you've had any additional thoughts/conversations about > this. > > -Aaron > > On 9/8/17 5:21 PM, Sven Oehme wrote: > > Hi, > > > > the code assumption is that the underlying device has no volatile write > > cache, i was absolute sure we have that somewhere in the FAQ, but i > > couldn't find it, so i will talk to somebody to correct this. > > if i understand > > > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt > correct > > one could enforce this by setting REQ_FUA, but thats not explicitly set > > today, at least i can't see it. i will discuss this with one of our devs > > who owns this code and come back. > > > > sven > > > > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > > wrote: > > > > Thanks Sven. I didn't think GPFS itself was caching anything on that > > layer, but it's my understanding that O_DIRECT isn't sufficient to > force > > I/O to be flushed (e.g. the device itself might have a volatile > caching > > layer). Take someone using ZFS zvol's as NSDs. I can write() all day > log > > to that zvol (even with O_DIRECT) but there is absolutely no > guarantee > > those writes have been committed to stable storage and aren't just > > sitting in RAM until an fsync() occurs (or some other bio function > that > > causes a flush). I also don't believe writing to a SATA drive with > > O_DIRECT will force cache flushes of the drive's writeback cache.. > > although I just tested that one and it seems to actually trigger a > scsi > > cache sync. Interesting. > > > > -Aaron > > > > On 9/7/17 10:55 PM, Sven Oehme wrote: > > > I am not sure what exactly you are looking for but all > > blockdevices are > > > opened with O_DIRECT , we never cache anything on this layer . > > > > > > > > > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > > > > > > >> wrote: > > > > > > Hi Everyone, > > > > > > This is something that's come up in the past and has recently > > resurfaced > > > with a project I've been working on, and that is-- it seems > > to me as > > > though mmfsd never attempts to flush the cache of the block > > devices its > > > writing to (looking at blktrace output seems to confirm > > this). Is this > > > actually the case? I've looked at the gpl headers for linux > > and I don't > > > see any sign of blkdev_fsync, blkdev_issue_flush, > WRITE_FLUSH, or > > > REQ_FLUSH. I'm sure there's other ways to trigger this > > behavior that > > > GPFS may very well be using that I've missed. That's why I'm > > asking :) > > > > > > I figure with FPO being pushed as an HDFS replacement using > > commodity > > > drives this feature has *got* to be in the code somewhere. 
> > > > > > -Aaron > > > > > > -- > > > Aaron Knister > > > NASA Center for Climate Simulation (Code 606.2) > > > Goddard Space Flight Center > > > (301) 286-2776 > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Oct 10 00:19:20 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 19:19:20 -0400 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Message-ID: <7090f583-d021-dd98-e55c-23eac83849ef@nasa.gov> Thanks, Sven. I think my goal was for the REQ_FUA flag to be used in alignment with the consistency expectations of the filesystem. Meaning if I was writing to a file on a filesystem (e.g. dd if=/dev/zero of=/gpfs/fs0/file1) that the write requests to the disk addresses containing data on the file wouldn't be issued with REQ_FUA. However, once the file was closed the close() wouldn't return until a disk buffer flush had occurred. For more important operations (e.g. metadata updates, log operations) I would expect/suspect REQ_FUA would be issued more frequently. The advantage here is it would allow GPFS to run ontop of block devices that don't perform well with the present synchronous workload of mmfsd (e.g. ZFS, and various other software-defined storage or hardware appliances) but that can perform well when only periodically (e.g. every few seconds) asked to flush pending data to disk. I also think this would be *really* important in an FPO environment where individual drives will probably have caches on by default and I'm not sure direct I/O is sufficient to force linux to issue scsi synchronize cache commands to those devices. I'm guessing that this is far from easy but I figured I'd ask. -Aaron On 10/9/17 5:07 PM, Sven Oehme wrote: > Hi, > > yeah sorry i intended to reply back before my vacation and forgot about > it the the vacation flushed it all away :-D > so right now the assumption in Scale/GPFS is that the underlying storage > doesn't have any form of enabled volatile write cache. the problem seems > to be that even if we set?REQ_FUA some stacks or devices may not have > implemented that at all or correctly, so even we would set it there is > no guarantee that it will do what you think it does. 
the benefit of > adding the flag at least would allow us to blame everything on the > underlying stack/device , but i am not sure that will make somebody > happy if bad things happen, therefore the requirement of a non-volatile > device will still be required at all times underneath Scale. > so if you think we should do this, please open a PMR with the details of > your test so it can go its regular support path. you can mention me in > the PMR as a reference as we already looked at the places the request > would have to be added.?? > > Sven > > > On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister > wrote: > > Hi Sven, > > Just wondering if you've had any additional thoughts/conversations about > this. > > -Aaron > > On 9/8/17 5:21 PM, Sven Oehme wrote: > > Hi, > > > > the code assumption is that the underlying device has no volatile > write > > cache, i was absolute sure we have that somewhere in the FAQ, but i > > couldn't find it, so i will talk to somebody to correct this. > > if i understand > > > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt?correct > > one could enforce this by setting REQ_FUA, but thats not > explicitly set > > today, at least i can't see it. i will discuss this with one of > our devs > > who owns this code and come back. > > > > sven > > > > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > > > >> wrote: > > > >? ? ?Thanks Sven. I didn't think GPFS itself was caching anything > on that > >? ? ?layer, but it's my understanding that O_DIRECT isn't > sufficient to force > >? ? ?I/O to be flushed (e.g. the device itself might have a > volatile caching > >? ? ?layer). Take someone using ZFS zvol's as NSDs. I can write() > all day log > >? ? ?to that zvol (even with O_DIRECT) but there is absolutely no > guarantee > >? ? ?those writes have been committed to stable storage and aren't just > >? ? ?sitting in RAM until an fsync() occurs (or some other bio > function that > >? ? ?causes a flush). I also don't believe writing to a SATA drive with > >? ? ?O_DIRECT will force cache flushes of the drive's writeback cache.. > >? ? ?although I just tested that one and it seems to actually > trigger a scsi > >? ? ?cache sync. Interesting. > > > >? ? ?-Aaron > > > >? ? ?On 9/7/17 10:55 PM, Sven Oehme wrote: > >? ? ? > I am not sure what exactly you are looking for but all > >? ? ?blockdevices are > >? ? ? > opened with O_DIRECT , we never cache anything on this layer . > >? ? ? > > >? ? ? > > >? ? ? > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > >? ? ? > > > >? ? ? > > >? ? ? >>> wrote: > >? ? ? > > >? ? ? >? ? ?Hi Everyone, > >? ? ? > > >? ? ? >? ? ?This is something that's come up in the past and has > recently > >? ? ?resurfaced > >? ? ? >? ? ?with a project I've been working on, and that is-- it seems > >? ? ?to me as > >? ? ? >? ? ?though mmfsd never attempts to flush the cache of the block > >? ? ?devices its > >? ? ? >? ? ?writing to (looking at blktrace output seems to confirm > >? ? ?this). Is this > >? ? ? >? ? ?actually the case? I've looked at the gpl headers for linux > >? ? ?and I don't > >? ? ? >? ? ?see any sign of blkdev_fsync, blkdev_issue_flush, > WRITE_FLUSH, or > >? ? ? >? ? ?REQ_FLUSH. I'm sure there's other ways to trigger this > >? ? ?behavior that > >? ? ? >? ? ?GPFS may very well be using that I've missed. That's > why I'm > >? ? ?asking :) > >? ? ? > > >? ? ? >? ? ?I figure with FPO being pushed as an HDFS replacement using > >? ? ?commodity > >? ? ? >? ? ?drives this feature has *got* to be in the code somewhere. > >? ? ? > > >? ? 
? >? ? ?-Aaron > >? ? ? > > >? ? ? >? ? ?-- > >? ? ? >? ? ?Aaron Knister > >? ? ? >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ? >? ? ?Goddard Space Flight Center > >? ? ? > (301) 286-2776 > >? ? ? >? ? ?_______________________________________________ > >? ? ? >? ? ?gpfsug-discuss mailing list > >? ? ? >? ? ?gpfsug-discuss at spectrumscale.org > > >? ? ? > >? ? ? > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? ? ? > > >? ? ? > > >? ? ? > > >? ? ? > _______________________________________________ > >? ? ? > gpfsug-discuss mailing list > >? ? ? > gpfsug-discuss at spectrumscale.org > > >? ? ? > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? ? ? > > > > >? ? ?-- > >? ? ?Aaron Knister > >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ?Goddard Space Flight Center > >? ? ?(301) 286-2776 > >? ? ?_______________________________________________ > >? ? ?gpfsug-discuss mailing list > >? ? ?gpfsug-discuss at spectrumscale.org > > >? ? ?http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From vpuvvada at in.ibm.com Tue Oct 10 05:56:21 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 10 Oct 2017 10:26:21 +0530 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. 
mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Tue Oct 10 08:47:23 2017 From: john.hearns at asml.com (John Hearns) Date: Tue, 10 Oct 2017 07:47:23 +0000 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: > The queues are steadily rising and we've seen them over 1000000 ... There is definitely a song here... I see you playing the blues guitar... I can't answer your question directly. As I recall you are at the latest version? We recently had to update to 4.2.3.4 due to an AFM issue - where if the home NFS share was disconnected, a read operation would finish early and not re-start. One thing I would do is look at where the 'real' NFS mount is being done (apology - I assume an NFS home). Log on to bber-afmgw01 and find where the home filesystem is being mounted, which is below /var/mmfs/afm Have a ferret around in there - do you still have that filesystem mounted? 
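Something along these lines, a rough sketch only - adjust to what you actually find on the gateway:

mount | grep /var/mmfs/afm    # is the home filesystem still mounted?
ls -l /var/mmfs/afm           # the AFM home mounts live below here

If that mount has gone away or gone stale, it would go some way to explaining a queue that never drains.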
-----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: Monday, October 09, 2017 2:57 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] AFM fun (more!) Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-appsgpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homesgpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearns%40asml.com%7Caa732d9965f64983c2e508d50f15424e%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=wVJhicLSj%2FWUjedvBKo6MG%2FYrtFAaWKxMeqiUrKRHfM%3D&reserved=0 -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. 
From john.hearns at asml.com Tue Oct 10 09:42:05 2017 From: john.hearns at asml.com (John Hearns) Date: Tue, 10 Oct 2017 08:42:05 +0000 Subject: [gpfsug-discuss] Recommended pagepool size on clients? Message-ID: May I ask how to size pagepool on clients? Somehow I hear an enormous tin can being opened behind me... and what sounds like lots of worms... Anyway, I currently have mmhealth reporting gpfs_pagepool_small. Pagepool is set to 1024M on clients, and I now note the documentation says you get this warning when pagepool is lower or equal to 1GB We did do some IOR benchmarking which shows better performance with an increased pagepool size. I am looking for some rules of thumb for sizing for an 128Gbyte RAM client. And yup, I know the answer will be 'depends on your workload' I agree though that 1024M is too low. Illya,kuryakin at uncle.int -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From scottg at emailhosting.com Tue Oct 10 10:49:54 2017 From: scottg at emailhosting.com (Scott Goldman) Date: Tue, 10 Oct 2017 05:49:54 -0400 Subject: [gpfsug-discuss] changing default configuration values Message-ID: So, I think brings up one of the slight frustrations I've always had with mmconfig.. If I have a cluster to which new nodes will eventually be added, OR, I have standard I always wish to apply, there is no way to say "all FUTURE" nodes need to have my defaults.. I just have to remember to extended the changes in as new nodes are brought into the cluster. Is there a way to accomplish this? Thanks ? Original Message ? From: aaron.s.knister at nasa.gov Sent: October 9, 2017 2:56 PM To: gpfsug-discuss at spectrumscale.org Reply-to: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Thanks! Good to know. On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. 
> > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Tue Oct 10 13:02:14 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:02:14 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: Use mmchconfig and change the defaults, and then have a node class for "not the defaults"? Apply settings to a node class and add all new clients to the node class? Note there was some version of Scale where node classes were enumerated when the config was set for the node class, but in (4.2.3 at least), this works as expected, I.e. The node class is not expanded when doing mmchconfig -N Simon On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So, I think brings up one of the slight frustrations I've always had with >mmconfig.. > >If I have a cluster to which new nodes will eventually be added, OR, I >have standard I always wish to apply, there is no way to say "all FUTURE" >nodes need to have my defaults.. I just have to remember to extended the >changes in as new nodes are brought into the cluster. > >Is there a way to accomplish this? 
>Thanks > > Original Message >From: aaron.s.knister at nasa.gov >Sent: October 9, 2017 2:56 PM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Thanks! Good to know. > >On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >> Hi Aaron, >> >> The default value applies to all nodes in the cluster. Thus changing it >> will change all nodes in the cluster. You need to run mmchconfig to >> customize the node override again. >> >> >> Regards, The Spectrum Scale (GPFS) team >> >> >>------------------------------------------------------------------------- >>----------------------------------------- >> If you feel that your question can benefit other users of Spectrum >>Scale >> (GPFS), then please post it to the public IBM developerWroks Forum at >> >>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111111 >>1-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please contact >> 1-800-237-5511 in the United States or your local IBM Service Center in >> other countries. >> >> The forum is informally monitored as time permits and should not be >>used >> for priority messages to the Spectrum Scale (GPFS) team. >> >> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >> there a way to change the default value of a configuratiAaron Knister >> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >> of a configuration option without overriding any overrid >> >> From: Aaron Knister >> To: gpfsug main discussion list >> Date: 10/06/2017 06:30 PM >> Subject: [gpfsug-discuss] changing default configuration values >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Is there a way to change the default value of a configuration option >> without overriding any overrides in place? >> >> Take the following situation: >> >> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >> n001,n002) >> >> Is there a way to then set the default value of foo to qux without >> changing the value of foo for nodes n001 and n002? 
>> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> >>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_li >>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sb >>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4 >>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scottg at emailhosting.com Tue Oct 10 13:04:30 2017 From: scottg at emailhosting.com (Scott Goldman) Date: Tue, 10 Oct 2017 08:04:30 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: Message-ID: So when a node is added to the node class, my defaults" will be applied? If so,excellent. Thanks ? Original Message ? From: S.J.Thompson at bham.ac.uk Sent: October 10, 2017 8:02 AM To: gpfsug-discuss at spectrumscale.org Reply-to: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Use mmchconfig and change the defaults, and then have a node class for "not the defaults"? Apply settings to a node class and add all new clients to the node class? Note there was some version of Scale where node classes were enumerated when the config was set for the node class, but in (4.2.3 at least), this works as expected, I.e. The node class is not expanded when doing mmchconfig -N Simon On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So, I think brings up one of the slight frustrations I've always had with >mmconfig.. > >If I have a cluster to which new nodes will eventually be added, OR, I >have standard I always wish to apply, there is no way to say "all FUTURE" >nodes need to have my defaults.. I just have to remember to extended the >changes in as new nodes are brought into the cluster. > >Is there a way to accomplish this? >Thanks > >? Original Message >From: aaron.s.knister at nasa.gov >Sent: October 9, 2017 2:56 PM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Thanks! Good to know. > >On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >> Hi Aaron, >> >> The default value applies to all nodes in the cluster. Thus changing it >> will change all nodes in the cluster. You need to run mmchconfig to >> customize the node override again. 
>> >> >> Regards, The Spectrum Scale (GPFS) team >> >> >>------------------------------------------------------------------------- >>----------------------------------------- >> If you feel that your question can benefit other users of Spectrum >>Scale >> (GPFS), then please post it to the public IBM developerWroks Forum at >> >>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111111 >>1-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please contact >> 1-800-237-5511 in the United States or your local IBM Service Center in >> other countries. >> >> The forum is informally monitored as time permits and should not be >>used >> for priority messages to the Spectrum Scale (GPFS) team. >> >> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >> there a way to change the default value of a configuratiAaron Knister >> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >> of a configuration option without overriding any overrid >> >> From: Aaron Knister >> To: gpfsug main discussion list >> Date: 10/06/2017 06:30 PM >> Subject: [gpfsug-discuss] changing default configuration values >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Is there a way to change the default value of a configuration option >> without overriding any overrides in place? >> >> Take the following situation: >> >> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >> n001,n002) >> >> Is there a way to then set the default value of foo to qux without >> changing the value of foo for nodes n001 and n002? >> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> >>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_li >>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sb >>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4 >>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Robert.Oesterlin at nuance.com Tue Oct 10 13:27:45 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 10 Oct 2017 12:27:45 +0000 Subject: [gpfsug-discuss] changing default configuration values Message-ID: <1BFF991D-4ABD-4C3A-B6FB-41CEABFCD4FB@nuance.com> Yes, this is exactly what we do for our LROC enabled nodes. 
Add them to the node class and you're all set. Bob Oesterlin Sr Principal Storage Engineer, Nuance ?On 10/10/17, 7:03 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: Apply settings to a node class and add all new clients to the node class? From S.J.Thompson at bham.ac.uk Tue Oct 10 13:30:57 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:30:57 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: Yes, but obviously only when you recycle mmfsd on the node after adding it to the node class, e.g. page pool cannot be changed online. We do this all the time, e.g. We have nodes with different IB fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have classes for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different DCs etc) Simon On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So when a node is added to the node class, my defaults" will be applied? >If so,excellent. Thanks > > > Original Message >From: S.J.Thompson at bham.ac.uk >Sent: October 10, 2017 8:02 AM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Use mmchconfig and change the defaults, and then have a node class for >"not the defaults"? > >Apply settings to a node class and add all new clients to the node class? > >Note there was some version of Scale where node classes were enumerated >when the config was set for the node class, but in (4.2.3 at least), this >works as expected, I.e. The node class is not expanded when doing >mmchconfig -N > >Simon > >On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf >of scottg at emailhosting.com" behalf of scottg at emailhosting.com> wrote: > >>So, I think brings up one of the slight frustrations I've always had with >>mmconfig.. >> >>If I have a cluster to which new nodes will eventually be added, OR, I >>have standard I always wish to apply, there is no way to say "all FUTURE" >>nodes need to have my defaults.. I just have to remember to extended the >>changes in as new nodes are brought into the cluster. >> >>Is there a way to accomplish this? >>Thanks >> >> Original Message >>From: aaron.s.knister at nasa.gov >>Sent: October 9, 2017 2:56 PM >>To: gpfsug-discuss at spectrumscale.org >>Reply-to: gpfsug-discuss at spectrumscale.org >>Subject: Re: [gpfsug-discuss] changing default configuration values >> >>Thanks! Good to know. >> >>On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>> Hi Aaron, >>> >>> The default value applies to all nodes in the cluster. Thus changing it >>> will change all nodes in the cluster. You need to run mmchconfig to >>> customize the node override again. >>> >>> >>> Regards, The Spectrum Scale (GPFS) team >>> >>> >>>------------------------------------------------------------------------ >>>- >>>----------------------------------------- >>> If you feel that your question can benefit other users of Spectrum >>>Scale >>> (GPFS), then please post it to the public IBM developerWroks Forum at >>> >>>https://www.ibm.com/developerworks/community/forums/html/forum?id=111111 >>>1 >>>1-0000-0000-0000-000000000479. 
>>> >>> >>> If your query concerns a potential software error in Spectrum Scale >>> (GPFS) and you have an IBM software maintenance contract please contact >>> 1-800-237-5511 in the United States or your local IBM Service Center in >>> other countries. >>> >>> The forum is informally monitored as time permits and should not be >>>used >>> for priority messages to the Spectrum Scale (GPFS) team. >>> >>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >>> there a way to change the default value of a configuratiAaron Knister >>> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >>> of a configuration option without overriding any overrid >>> >>> From: Aaron Knister >>> To: gpfsug main discussion list >>> Date: 10/06/2017 06:30 PM >>> Subject: [gpfsug-discuss] changing default configuration values >>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>> >>> >>>------------------------------------------------------------------------ >>> >>> >>> >>> Is there a way to change the default value of a configuration option >>> without overriding any overrides in place? >>> >>> Take the following situation: >>> >>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>> n001,n002) >>> >>> Is there a way to then set the default value of foo to qux without >>> changing the value of foo for nodes n001 and n002? >>> >>> -Aaron >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> >>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_l >>>i >>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2S >>>b >>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_ >>>4 >>>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >>-- >>Aaron Knister >>NASA Center for Climate Simulation (Code 606.2) >>Goddard Space Flight Center >>(301) 286-2776 >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Tue Oct 10 13:32:25 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 10 Oct 2017 08:32:25 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Simon, Does that mean node classes don't work the way individual node names do with the "-i/-I" options? 
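To make that concrete, with a placeholder parameter and class name, the question is whether the second form takes immediate effect on current class members in the same way the first form does for explicitly listed nodes:

mmchconfig pagepool=2G -i -N n001,n002      # -i: apply immediately as well as persistently
mmchconfig pagepool=2G -i -N myNodeClass    # same request, but -N names a node class
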
-Aaron On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: > Yes, but obviously only when you recycle mmfsd on the node after adding it > to the node class, e.g. page pool cannot be changed online. > > We do this all the time, e.g. We have nodes with different IB > fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have classes > for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different > DCs etc) > > Simon > > On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of scottg at emailhosting.com" behalf of scottg at emailhosting.com> wrote: > >> So when a node is added to the node class, my defaults" will be applied? >> If so,excellent. Thanks >> >> >> Original Message >> From: S.J.Thompson at bham.ac.uk >> Sent: October 10, 2017 8:02 AM >> To: gpfsug-discuss at spectrumscale.org >> Reply-to: gpfsug-discuss at spectrumscale.org >> Subject: Re: [gpfsug-discuss] changing default configuration values >> >> Use mmchconfig and change the defaults, and then have a node class for >> "not the defaults"? >> >> Apply settings to a node class and add all new clients to the node class? >> >> Note there was some version of Scale where node classes were enumerated >> when the config was set for the node class, but in (4.2.3 at least), this >> works as expected, I.e. The node class is not expanded when doing >> mmchconfig -N >> >> Simon >> >> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So, I think brings up one of the slight frustrations I've always had with >>> mmconfig.. >>> >>> If I have a cluster to which new nodes will eventually be added, OR, I >>> have standard I always wish to apply, there is no way to say "all FUTURE" >>> nodes need to have my defaults.. I just have to remember to extended the >>> changes in as new nodes are brought into the cluster. >>> >>> Is there a way to accomplish this? >>> Thanks >>> >>> Original Message >>> From: aaron.s.knister at nasa.gov >>> Sent: October 9, 2017 2:56 PM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Thanks! Good to know. >>> >>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>> Hi Aaron, >>>> >>>> The default value applies to all nodes in the cluster. Thus changing it >>>> will change all nodes in the cluster. You need to run mmchconfig to >>>> customize the node override again. >>>> >>>> >>>> Regards, The Spectrum Scale (GPFS) team >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> - >>>> ----------------------------------------- >>>> If you feel that your question can benefit other users of Spectrum >>>> Scale >>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>> >>>> https://www.ibm.com/developerworks/community/forums/html/forum?id=111111 >>>> 1 >>>> 1-0000-0000-0000-000000000479. >>>> >>>> >>>> If your query concerns a potential software error in Spectrum Scale >>>> (GPFS) and you have an IBM software maintenance contract please contact >>>> 1-800-237-5511 in the United States or your local IBM Service Center in >>>> other countries. >>>> >>>> The forum is informally monitored as time permits and should not be >>>> used >>>> for priority messages to the Spectrum Scale (GPFS) team. 
>>>> >>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >>>> there a way to change the default value of a configuratiAaron Knister >>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >>>> of a configuration option without overriding any overrid >>>> >>>> From: Aaron Knister >>>> To: gpfsug main discussion list >>>> Date: 10/06/2017 06:30 PM >>>> Subject: [gpfsug-discuss] changing default configuration values >>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> >>>> Is there a way to change the default value of a configuration option >>>> without overriding any overrides in place? >>>> >>>> Take the following situation: >>>> >>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>> n001,n002) >>>> >>>> Is there a way to then set the default value of foo to qux without >>>> changing the value of foo for nodes n001 and n002? >>>> >>>> -Aaron >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_l >>>> i >>>> stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2S >>>> b >>>> on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_ >>>> 4 >>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Tue Oct 10 13:36:14 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:36:14 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> References: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... 
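One way to find out would be to add a running node to an existing class and compare what the stored configuration says with what the daemon on that node reports. Class and node names below are made up, and pagepool is just an example parameter:

mmchnodeclass myNodeClass add -N n003              # add a node that is already up to the class
mmlsconfig pagepool                                # stored values, including any class-scoped override
mmdsh -N n003 mmdiag --config | grep "pagepool "   # value the running daemon on n003 is actually using

If the stored override and the running value disagree, presumably the override only lands once mmfsd on that node is recycled.
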
Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? >>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. 
>>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. >>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? >>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center 
>(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Tue Oct 10 15:45:32 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 10:45:32 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: It's always helpful to check and confirm that you get what you expected. mmlsconfig shows the value in the configuration and "mmfsadm dump config" shows the value in the GPFS daemon currently running. [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1073741824 c10f1n11.gpfs.net: pagepool 1073741824 [root at c10f1n11 gitr]# mmchconfig pagepool=1500M -i -N c69bc2xn03 mmchconfig: Command successfully completed mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. [root at c10f1n11 gitr]# Tue Oct 10 10:36:49 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation started [root at c10f1n11 gitr]# Tue Oct 10 10:36:52 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0 [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G pagepool 1500M [c69bc2xn03] [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1572864000 c10f1n11.gpfs.net: pagepool 1073741824 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/10/2017 08:36 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. 
(different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? >>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. 
>>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? >>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> _______________________________________________ >>> gpfsug-discuss mailing 
list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From scale at us.ibm.com Tue Oct 10 15:51:37 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 10:51:37 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: For a customer production system, "mmdiag --config" rather than "mmfsadm dump config" should be used. The mmdiag command is meant for end users while the "mmfsadm dump" command is a service aid that carries greater risks. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: IBM Spectrum Scale/Poughkeepsie/IBM To: gpfsug main discussion list Date: 10/10/2017 10:48 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: Enci Zhong It's always helpful to check and confirm that you get what you expected. mmlsconfig shows the value in the configuration and "mmfsadm dump config" shows the value in the GPFS daemon currently running. 
[root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1073741824 c10f1n11.gpfs.net: pagepool 1073741824 [root at c10f1n11 gitr]# mmchconfig pagepool=1500M -i -N c69bc2xn03 mmchconfig: Command successfully completed mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. [root at c10f1n11 gitr]# Tue Oct 10 10:36:49 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation started [root at c10f1n11 gitr]# Tue Oct 10 10:36:52 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0 [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G pagepool 1500M [c69bc2xn03] [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1572864000 c10f1n11.gpfs.net: pagepool 1073741824 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/10/2017 08:36 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? 
>>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. >>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? 
>>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 
286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From scale at us.ibm.com Tue Oct 10 16:09:20 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 11:09:20 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. 
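To check whether any node in a cluster has already logged such an event, something like the following can be used; the log path assumes the default location under /var/adm/ras:

mmdsh -N all "grep 'Reconnected to' /var/adm/ras/mmfs.log.latest /var/adm/ras/mmfs.log.previous"

A match only shows that the reconnect path has been exercised on that node; as noted above, that by itself does not imply corruption.
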
Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Leo.Earl at uea.ac.uk Tue Oct 10 16:29:47 2017 From: Leo.Earl at uea.ac.uk (Leo Earl (ITCS - Staff)) Date: Tue, 10 Oct 2017 15:29:47 +0000 Subject: [gpfsug-discuss] AFM fun (more!) 
In-Reply-To: References: Message-ID: Hi Simon, (My first ever post - queue being shot down in flames) Whilst this doesn't answer any of your questions (directly) One thing we do tend to look at when we see (what appears) to be a static "Queue Length" value, is the data which is actually inflight from an AFM perspective, so that we can ascertain whether the reason is something like, a user writing huge files to the cache, which take time to sync with home, and thus remain in the queue, providing a static "Queue Length" [root at server ~]# mmafmctl gpfs getstate | awk ' $6 >=50 ' Fileset Name Fileset Target Fileset State Gateway Node Queue State Queue Length Queue numExec afmcancergenetics :/leo/res_hpc/afm-leo Dirty csgpfs01 Active 60 126157822 [root at server ~]# So for instance, by using the tsfindinode command, to have a look at the size of the file which is currently "inflight" from an AFM perspective: [root at server ~]# mmfsadm dump afm | more Fileset: afm-leo 12 (gpfs) mode: independent-writer queue: Normal myFileset MDS: home: proto: nfs port: 2049 lastCmd: 6 handler: Mounted Dirty refCount: 5 queue: delay 300 QLen 0+9 flushThds 4 maxFlushThds 4 numExec 436 qfs 0 err 0 i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 1 i/o: pReadGWs 0 (All) pReadChunkSize 134217728 pReadThresh: -2 >> Disabled << i/o: prefetchThresh 0 (Prefetch) iw: afmIwTakeoverTime 0 Priority Queue: (listed by execution order) (state: Active) Write [601414711.601379492] inflight (377743888481 @ 0) chunks 0 bytes 0 0 thread_id 7630 Write [601717612.601465868] inflight (462997479227 @ 0) chunks 0 bytes 0 1 thread_id 10200 Write [601717612.601465870] inflight (391663667550 @ 0) chunks 0 bytes 0 2 thread_id 10287 Write [601717612.601465871] inflight (377743888481 @ 0) chunks 0 bytes 0 3 thread_id 10333 Write [601717612.601573418] queued (387002794104 @ 0) chunks 0 bytes 0 4 Write [601414711.601650195] queued (342305480538 @ 0) chunks 0 bytes 0 5 ResetDirty [538455296.-1] queued etype normal normal 19061213 ResetDirty [601623334.-1] queued etype normal normal 19061213 RecoveryMarker [-1.-1] queued etype normal normal 19061213 Normal Queue: Empty (state: Active) Fileset: afmdata 20 (gpfs) #Use the file inode ID to determine the actual file which is inflight between cache and home [root at server cancergenetics]# tsfindinode -i 601379492 /gpfs/afm/cancergenetics > inode.out [root at server ~]# cat /root/inode.out 601379492 0 0xCCB6 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server ~]# ls -l /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq -rw-r--r-- 1 bbn16cku MED_pg 377743888481 Mar 25 05:48 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server I am not sure if that helps and you probably already know about it inflight checking... Kind Regards, Leo Leo Earl | Head of Research & Specialist Computing Room ITCS 01.16, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ +44 (0) 1603 593856 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 October 2017 05:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM fun (more!) Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. 
This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? 
Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Oct 10 17:03:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 16:03:35 +0000 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: So as you might expect, we've been poking at this all day. We'd typically get to ~1000 entries in the queue having taken access to the FS away from users (yeah its that bad), but the remaining items would stay for ever as far as we could see. By copying the file, removing and then moving the copied file, we're able to get it back into a clean state. But then we ran a sample user job, and instantly the next job hung up the queue (we're talking like <100MB files here). Interestingly we looked at the queue to see what was going on (with saferdump, always use saferdump!!!) Normal Queue: (listed by execution order) (state: Active) 95 Write [6060026.6060026] inflight (18 @ 0) thread_id 44812 96 Write [13808655.13808655] queued (18 @ 0) 97 Truncate [6060026] queued 98 Truncate [13808655] queued 124 Write [6060000.6060000] inflight (18 @ 0) thread_id 44835 125 Truncate [6060000] queued 159 Write [6060013.6060013] inflight (18 @ 0) thread_id 21329 160 Truncate [6060013] queued 171 Write [5953611.5953611] inflight (18 @ 0) thread_id 44837 172 Truncate [5953611] queued Note that each inode that is inflight is followed by a queued Truncate... We are running efix2, because there is an issue with truncate not working prior to this (it doesn't get sent to home), so this smells like an AFM bug to me. We have a PMR open... Simon From: > on behalf of "Leo Earl (ITCS - Staff)" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 10 October 2017 at 16:29 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] AFM fun (more!) Hi Simon, (My first ever post ? queue being shot down in flames) Whilst this doesn?t answer any of your questions (directly) One thing we do tend to look at when we see (what appears) to be a static ?Queue Length? value, is the data which is actually inflight from an AFM perspective, so that we can ascertain whether the reason is something like, a user writing huge files to the cache, which take time to sync with home, and thus remain in the queue, providing a static ?Queue Length? [root at server ~]# mmafmctl gpfs getstate | awk ' $6 >=50 ' Fileset Name Fileset Target Fileset State Gateway Node Queue State Queue Length Queue numExec afmcancergenetics :/leo/res_hpc/afm-leo Dirty csgpfs01 Active 60 126157822 [root at server ~]# So for instance, by using the tsfindinode command, to have a look at the size of the file which is currently ?inflight? 
from an AFM perspective: [root at server ~]# mmfsadm dump afm | more Fileset: afm-leo 12 (gpfs) mode: independent-writer queue: Normal myFileset MDS: home: proto: nfs port: 2049 lastCmd: 6 handler: Mounted Dirty refCount: 5 queue: delay 300 QLen 0+9 flushThds 4 maxFlushThds 4 numExec 436 qfs 0 err 0 i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 1 i/o: pReadGWs 0 (All) pReadChunkSize 134217728 pReadThresh: -2 >> Disabled << i/o: prefetchThresh 0 (Prefetch) iw: afmIwTakeoverTime 0 Priority Queue: (listed by execution order) (state: Active) Write [601414711.601379492] inflight (377743888481 @ 0) chunks 0 bytes 0 0 thread_id 7630 Write [601717612.601465868] inflight (462997479227 @ 0) chunks 0 bytes 0 1 thread_id 10200 Write [601717612.601465870] inflight (391663667550 @ 0) chunks 0 bytes 0 2 thread_id 10287 Write [601717612.601465871] inflight (377743888481 @ 0) chunks 0 bytes 0 3 thread_id 10333 Write [601717612.601573418] queued (387002794104 @ 0) chunks 0 bytes 0 4 Write [601414711.601650195] queued (342305480538 @ 0) chunks 0 bytes 0 5 ResetDirty [538455296.-1] queued etype normal normal 19061213 ResetDirty [601623334.-1] queued etype normal normal 19061213 RecoveryMarker [-1.-1] queued etype normal normal 19061213 Normal Queue: Empty (state: Active) Fileset: afmdata 20 (gpfs) #Use the file inode ID to determine the actual file which is inflight between cache and home [root at server cancergenetics]# tsfindinode -i 601379492 /gpfs/afm/cancergenetics > inode.out [root at server ~]# cat /root/inode.out 601379492 0 0xCCB6 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server ~]# ls -l /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq -rw-r--r-- 1 bbn16cku MED_pg 377743888481 Mar 25 05:48 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server I am not sure if that helps and you probably already know about it inflight checking? Kind Regards, Leo Leo Earl | Head of Research & Specialist Computing Room ITCS 01.16, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ +44 (0) 1603 593856 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 October 2017 05:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] AFM fun (more!) Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. 
mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Oct 10 19:00:55 2017 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 10 Oct 2017 18:00:55 +0000 Subject: [gpfsug-discuss] Recommended pagepool size on clients? 
In-Reply-To: References: Message-ID: if this is a new cluster and you use reasonable new HW, i probably would start with just the following settings on the clients : pagepool=4g,workerThreads=256,maxStatCache=0,maxFilesToCache=256k depending on what storage you use and what workload you have you may have to set a couple of other settings too, but that should be a good start. we plan to make this whole process significant easier in the future, The Next Major Scale release will eliminate the need for another ~20 parameters in special cases and we will simplify the communication setup a lot too. beyond that we started working on introducing tuning suggestions based on the running system environment but there is no release targeted for that yet. Sven On Tue, Oct 10, 2017 at 1:42 AM John Hearns wrote: > May I ask how to size pagepool on clients? Somehow I hear an enormous tin > can being opened behind me? and what sounds like lots of worms? > > > > Anyway, I currently have mmhealth reporting gpfs_pagepool_small. Pagepool > is set to 1024M on clients, > > and I now note the documentation says you get this warning when pagepool > is lower or equal to 1GB > > We did do some IOR benchmarking which shows better performance with an > increased pagepool size. > > > > I am looking for some rules of thumb for sizing for an 128Gbyte RAM client. > > And yup, I know the answer will be ?depends on your workload? > > I agree though that 1024M is too low. > > > > Illya,kuryakin at uncle.int > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdeluca at gmail.com Tue Oct 10 19:51:28 2017 From: bdeluca at gmail.com (Ben De Luca) Date: Tue, 10 Oct 2017 20:51:28 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: > Bob, > > The problem may occur when the TCP connection is broken between two nodes. 
> While in the vast majority of the cases when data stops flowing through the > connection, the result is one of the nodes getting expelled, there are > cases where the TCP connection simply breaks -- that is relatively rare but > happens on occasion. There is logic in the mmfsd daemon to detect the > disconnection and attempt to reconnect to the destination in question. If > the reconnect is successful then steps are taken to recover the state kept > by the daemons, and that includes resending some RPCs that were in flight > when the disconnection took place. > > As the flash describes, a problem in the logic to resend some RPCs was > causing one of the RPC headers to be omitted, resulting in the RPC data to > be interpreted as the (missing) header. Normally the result is an assert on > the receiving end, like the "logAssertFailed: !"Request and queue size > mismatch" assert described in the flash. However, it's at least > conceivable (though expected to very rare) that the content of the RPC data > could be interpreted as a valid RPC header. In the case of an RPC which > involves data transfer between an NSD client and NSD server, that might > result in incorrect data being written to some NSD device. > > Disconnect/reconnect scenarios appear to be uncommon. An entry like > > [N] Reconnected to xxx.xxx.xxx.xxx nodename > > in mmfs.log would be an indication that a reconnect has occurred. By > itself, the reconnect will not imply that data or the file system was > corrupted, since that will depend on what RPCs were pending when the > connection happened. In the case the assert above is hit, no corruption is > expected, since the daemon will go down before incorrect data gets written. > > Reconnects involving an NSD server are those which present the highest > risk, given that NSD-related RPCs are used to write data into NSDs > > Even on clusters that have not been subjected to disconnects/reconnects > before, such events might still happen in the future in case of network > glitches. It's then recommended that an efix for the problem be applied in > a timely fashion. > > > Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 > > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------ > ------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/ > forums/html/forum?id=11111111-0000-0000-0000-000000000479. > > If your query concerns a potential software error in Spectrum Scale (GPFS) > and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > > > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM > Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale > (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file > system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Can anyone from the Scale team comment? > > Anytime I see ?may result in file system corruption or undetected file > data corruption? it gets my attention. 
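As a quick sanity check, the reconnect events described above can be searched for in the GPFS daemon log on each node. A sketch only, assuming the usual default log location:

# sketch: look for the "[N] Reconnected to ..." entries mentioned above in the daemon logs
grep -H "Reconnected to" /var/adm/ras/mmfs.log.latest /var/adm/ras/mmfs.log.previous 2>/dev/null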
> > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > *Storage * > IBM My Notifications > Check out the *IBM Electronic Support* > > > > IBM Spectrum Scale > *: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect > function may result in file system corruption or undetected file data > corruption* > > IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 > levels, in which resending an NSD RPC after a network reconnect function > may result in file system corruption or undetected file data corruption. > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r= > IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=- > AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Tue Oct 10 23:15:11 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 00:15:11 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. 
While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. 
Bob Oesterlin
Sr Principal Storage Engineer, Nuance

Storage
IBM My Notifications
Check out the IBM Electronic Support

IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption

IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From bdeluca at gmail.com Wed Oct 11 05:40:21 2017
From: bdeluca at gmail.com (Ben De Luca)
Date: Wed, 11 Oct 2017 06:40:21 +0200
Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)
In-Reply-To: 
References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112>
Message-ID: 

Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2?

It seems that the likelihood of this event is very low but of course you encourage people to update asap.

On 11 October 2017 at 00:15, Uwe Falke wrote:
> Hi, I understood the failure to occur requires that the RPC payload of
> the RPC resent without actual header can be mistaken for a valid RPC
> header. The resend mechanism is probably not considering what the actual
> content/target the RPC has.
> So, in principle, the RPC could be to update a data block, or a metadata
> block - so it may hit just a single data file or corrupt your entire file
> system.
> However, I think the likelihood that the RPC content can go as valid RPC
> header is very low.
>
>
> Mit freundlichen Grüßen / Kind regards
>
>
> Dr. Uwe Falke
>
> IT Specialist
> High Performance Computing Services / Integrated Technology Services /
> Data Center Services
> -------------------------------------------------------------------------------------------------------------------------------------------
> IBM Deutschland
> Rathausstr. 7
> 09111 Chemnitz
> Phone: +49 371 6978 2165
> Mobile: +49 175 575 2877
> E-Mail: uwefalke at de.ibm.com
> -------------------------------------------------------------------------------------------------------------------------------------------
> IBM Deutschland Business & Technology Services GmbH / Geschäftsführung:
> Thomas Wolter, Sven Schooß
> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org > Date: 10/10/2017 08:52 PM > Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum > Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in > file system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > does this corrupt the entire filesystem or just the open files that are > being written too? > > One is horrific and the other is just mildly bad. > > On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: > Bob, > > The problem may occur when the TCP connection is broken between two nodes. > While in the vast majority of the cases when data stops flowing through > the connection, the result is one of the nodes getting expelled, there are > cases where the TCP connection simply breaks -- that is relatively rare > but happens on occasion. There is logic in the mmfsd daemon to detect the > disconnection and attempt to reconnect to the destination in question. If > the reconnect is successful then steps are taken to recover the state kept > by the daemons, and that includes resending some RPCs that were in flight > when the disconnection took place. > > As the flash describes, a problem in the logic to resend some RPCs was > causing one of the RPC headers to be omitted, resulting in the RPC data to > be interpreted as the (missing) header. Normally the result is an assert > on the receiving end, like the "logAssertFailed: !"Request and queue size > mismatch" assert described in the flash. However, it's at least > conceivable (though expected to very rare) that the content of the RPC > data could be interpreted as a valid RPC header. In the case of an RPC > which involves data transfer between an NSD client and NSD server, that > might result in incorrect data being written to some NSD device. > > Disconnect/reconnect scenarios appear to be uncommon. An entry like > > [N] Reconnected to xxx.xxx.xxx.xxx nodename > > in mmfs.log would be an indication that a reconnect has occurred. By > itself, the reconnect will not imply that data or the file system was > corrupted, since that will depend on what RPCs were pending when the > connection happened. In the case the assert above is hit, no corruption is > expected, since the daemon will go down before incorrect data gets > written. > > Reconnects involving an NSD server are those which present the highest > risk, given that NSD-related RPCs are used to write data into NSDs > > Even on clusters that have not been subjected to disconnects/reconnects > before, such events might still happen in the future in case of network > glitches. It's then recommended that an efix for the problem be applied in > a timely fashion. > > > Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 > > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------ > ------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/ > forums/html/forum?id=11111111-0000-0000-0000-000000000479 > . 
> > If your query concerns a potential software error in Spectrum Scale (GPFS) > and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > > > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM > Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale > (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file > system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Can anyone from the Scale team comment? > > Anytime I see ?may result in file system corruption or undetected file > data corruption? it gets my attention. > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > > > > > > > > Storage > IBM My Notifications > Check out the IBM Electronic Support > > > > > > > > IBM Spectrum Scale > > > > : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect > function may result in file system corruption or undetected file data > corruption > > > > IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 > levels, in which resending an NSD RPC after a network reconnect function > may result in file system corruption or undetected file data corruption. > > > > > > > > > > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r= > IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=- > AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Wed Oct 11 07:08:33 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 11 Oct 2017 06:08:33 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? 
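For what it's worth, one way to see whether a given node is actually doing its disk I/O straight over the SAN or via NSD servers (and therefore whether it sits in the NSD RPC path at all) is sketched below; 'gpfs0' is a placeholder for the real file system device name:

# sketch: show, per disk, whether I/O from this node is performed locally (direct SAN access)
# or routed through an NSD server
mmlsdisk gpfs0 -m
# "localhost" in the "IO performed on node" column means direct access from this node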
Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. 
However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. 
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From arc at b4restore.com Wed Oct 11 08:46:03 2017
From: arc at b4restore.com (Andi Rhod Christiansen)
Date: Wed, 11 Oct 2017 07:46:03 +0000
Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.
Message-ID: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com>

Hi,

Does anyone know how to change the ips on all the nodes within a cluster when gpfs and the interfaces are down?

Right now the cluster has been shut down and all ports disconnected (the ports have been shut down on the new switch).

The problem is that when I try to execute any mmchnode command (as the IBM documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the daemon knowing..

But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access.

It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch.

Hope you understand, relatively new to gpfs/spectrum scale

Venlig hilsen / Best Regards

Andi R. Christiansen

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From jonathan.buzzard at strath.ac.uk Wed Oct 11 09:01:47 2017
From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard)
Date: Wed, 11 Oct 2017 09:01:47 +0100
Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.
In-Reply-To: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com>
References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com>
Message-ID: <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk>

On 11/10/17 08:46, Andi Rhod Christiansen wrote:

[SNIP]

> It is not possible to turn on the switch ports as the cluster has the
> same ips right now as another cluster on the new switch.
>

Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up.

Basically any switch on which you can remotely down the ports is going to support VLANs. Even the crappy 16 port GbE switch I have at home supports them.

JAB.

-- 
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow.
G4 0NG

From arc at b4restore.com Wed Oct 11 09:18:01 2017
From: arc at b4restore.com (Andi Rhod Christiansen)
Date: Wed, 11 Oct 2017 08:18:01 +0000
Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.
In-Reply-To: <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk>
References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk>
Message-ID: <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com>

Hi Jonathan,

Yes, I thought about that, but the system is located at a customer site and they are not willing to do that, unfortunately. That's why I was hoping there was a way around it.

Andi R. Christiansen

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard
Sent: 11. oktober 2017 10:02
To: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.

On 11/10/17 08:46, Andi Rhod Christiansen wrote:

[SNIP]

> It is not possible to turn on the switch ports as the cluster has the
> same ips right now as another cluster on the new switch.
>

Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up.

Basically any switch on which you can remotely down the ports is going to support VLANs. Even the crappy 16 port GbE switch I have at home supports them.

JAB.

-- 
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From S.J.Thompson at bham.ac.uk Wed Oct 11 09:32:37 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support))
Date: Wed, 11 Oct 2017 08:32:37 +0000
Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.
In-Reply-To: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com>
References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com>
Message-ID: 

I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given that this is cluster configuration, I wouldn't be doing it on a cluster I cared about without explicit instruction from IBM support.

So I suggest logging a ticket with IBM.

Simon

From: > on behalf of "arc at b4restore.com" >
Reply-To: "gpfsug-discuss at spectrumscale.org" >
Date: Wednesday, 11 October 2017 at 08:46
To: "gpfsug-discuss at spectrumscale.org" >
Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.

Hi,

Does anyone know how to change the ips on all the nodes within a cluster when gpfs and the interfaces are down?

Right now the cluster has been shut down and all ports disconnected (the ports have been shut down on the new switch).

The problem is that when I try to execute any mmchnode command (as the IBM documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the daemon knowing..
But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Oct 11 09:46:46 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 08:46:46 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors Message-ID: I'm just wondering if anyone could share any views on checking a file-system for errors. For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode? And then were does mmrestripefs -c fit into this? "-c Scans the file system and compares replicas of metadata and data for conflicts. When conflicts are found, the -c option attempts to fix the replicas. " Which sorta sounds like fix things in the file-system, so how does that intersect (if at all) with mmfsck? Thanks Simon From jonathan.buzzard at strath.ac.uk Wed Oct 11 09:53:34 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 11 Oct 2017 09:53:34 +0100 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> Message-ID: <1507712014.9906.5.camel@strath.ac.uk> On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote: > Hi Jonathan, > > Yes I thought about that but the system is located at a customer site > and they are not willing to do that, unfortunately. > > That's why I was hoping there was a way around it > I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From arc at b4restore.com Wed Oct 11 09:59:20 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 08:59:20 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <1507712014.9906.5.camel@strath.ac.uk> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> <1507712014.9906.5.camel@strath.ac.uk> Message-ID: Yes i think my last resort might be to go to customer with a separate switch and do the reconfiguration. Thanks ? -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard Sent: 11. 
oktober 2017 10:54 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote: > Hi Jonathan, > > Yes I thought about that but the system is located at a customer site > and they are not willing to do that, unfortunately. > > That's why I was hoping there was a way around it > I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From arc at b4restore.com Wed Oct 11 10:02:08 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 09:02:08 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Message-ID: <674e2c9b6c3f450b8f85b2d36a504597@B4RWEX01.internal.b4restore.com> Hi Simon, I will do that before I go to the customer with a separate switch as a last resort :) Thanks Venlig hilsen / Best Regards Andi Rhod Christiansen From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: 11. oktober 2017 10:33 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given its cluster configuration, I wouldn't be doing this on a cluster I cared about with explicit instruction from IBM support. So I suggest log a ticket with IBM. Simon From: > on behalf of "arc at b4restore.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 11 October 2017 at 08:46 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From UWEFALKE at de.ibm.com Wed Oct 11 11:19:13 2017
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Wed, 11 Oct 2017 12:19:13 +0200
Subject: [gpfsug-discuss] Checking a file-system for errors
In-Reply-To: 
References: 
Message-ID: 

Hm, mmfsck will return not very reliable results in online mode; especially, it will report many issues which are just due to the transient states in a file system in operation. It should however not find fewer issues than in off-line mode.

mmrestripefs -c does not do any logical checks, it just checks for differences of multiple replicas of the same data/metadata. File system errors can be caused by such discrepancies (if an odd/corrupt replica is used by GPFS), but can also be caused (probably more likely) by logical errors / bugs when metadata were modified in the file system. In those cases, all the replicas are identical, nevertheless corrupt (and cannot be found by mmrestripefs).

So, mmrestripefs -c is like scrubbing for silent data corruption (on its own, it cannot decide which is the correct replica!), while mmfsck checks the filesystem structure for logical consistency. If the contents of the replicas of a data block differ, mmfsck won't see any problem (as long as the fs metadata are consistent), but mmrestripefs -c will.

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke

IT Specialist
High Performance Computing Services / Integrated Technology Services / Data Center Services
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
Rathausstr. 7
09111 Chemnitz
Phone: +49 371 6978 2165
Mobile: +49 175 575 2877
E-Mail: uwefalke at de.ibm.com
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Business & Technology Services GmbH / Geschäftsführung: Thomas Wolter, Sven Schooß
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122

From: "Simon Thompson (IT Research Support)"
To: "gpfsug-discuss at spectrumscale.org"
Date: 10/11/2017 10:47 AM
Subject: [gpfsug-discuss] Checking a file-system for errors
Sent by: gpfsug-discuss-bounces at spectrumscale.org

I'm just wondering if anyone could share any views on checking a file-system for errors.

For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode?

And then where does mmrestripefs -c fit into this?

"-c
 Scans the file system and compares replicas of
 metadata and data for conflicts. When conflicts
 are found, the -c option attempts to fix
 the replicas.
"

Which sorta sounds like fix things in the file-system, so how does that intersect (if at all) with mmfsck?

Thanks

Simon

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From S.J.Thompson at bham.ac.uk Wed Oct 11 11:31:53 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support))
Date: Wed, 11 Oct 2017 10:31:53 +0000
Subject: [gpfsug-discuss] Checking a file-system for errors
In-Reply-To: 
References: 
Message-ID: 

OK thanks,

So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode?

But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy?
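For reference, the checks being weighed up here look roughly like this (a sketch only; 'fs0' stands in for the real device name, and the offline run needs the file system unmounted everywhere first):

# sketch: online check, report-only (-o = online, -n = report but do not repair)
mmfsck fs0 -o -n
# compare replicas of metadata and data, repairing mismatches where found
mmrestripefs fs0 -c
# a full offline check needs an unmount first; review a -n run before repairing with -y
# mmumount fs0 -a
# mmfsck fs0 -n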
Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? > >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Wed Oct 11 11:58:52 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 12:58:52 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: If you do both, you are on the safe side. 
I am not sure wether mmfsck reads both replica of the metadata (if it it does, than one could spare the mmrestripefs -c WRT metadata, but I don't think so), if not, one could still have luckily checked using valid metadata where maybe one (or more) MD block has (have) an invalid replica which might come up another time ... But the mmfsrestripefs -c is not only ensuring the sanity of the FS but also of the data stored within (which is not necessarily the same). Mostly, however, filesystem checks are only done if fs issues are indicated by errors in the logs. Do you have reason to assume your fs has probs? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 12:32 PM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 
7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? > >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 12:22:26 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 11:22:26 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Yes I get we should only be doing this if we think we have a problem. And the answer is, right now, we're not entirely clear. We have a couple of issues our users are reporting to us, and its not clear to us if they are related, an FS problem or ACLs getting in the way. We do have users who are trying to work on files getting IO error, and we have an AFM sync issue. The disks are all online, I poked the FS with tsdbfs and the files look OK - (small files, but content of the block matches). Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error reported to user when they access a file even if its not an offline file??) We have a PMR open with IBM on this already. But there's a wanting to be sure in our own minds that we don't have an underlying FS problem. I.e. I have confidence that I can tell my users, yes I know you are seeing weird stuff, but we have run checks and are not introducing data corruption. Simon On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Mostly, however, filesystem checks are only done if fs issues are >indicated by errors in the logs. Do you have reason to assume your fs has >probs? 
From stockf at us.ibm.com Wed Oct 11 12:55:18 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 11 Oct 2017 07:55:18 -0400 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Generally you should not run mmfsck unless you see MMFS_FSSTRUCT errors in your system logs. To my knowledge online mmfsck only checks for a subset of problems, notably lost blocks, but that situation does not indicate any problems with the file system. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 06:32 AM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr????en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch??ftsf??hrung: >Thomas Wolter, Sven Schoo?? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? 
> >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Wed Oct 11 13:30:49 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 11 Oct 2017 08:30:49 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? 
Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. 
However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. 
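As a quick check for whether any reconnects have already occurred on a given node, something along the following lines can be used (assuming the default log location under /var/adm/ras; adjust the path if your logs are kept elsewhere):

   grep "Reconnected to" /var/adm/ras/mmfs.log.latest
   grep "Reconnected to" /var/adm/ras/mmfs.log.previous

As noted above, a match does not by itself mean that anything was corrupted; it only shows that a disconnect/reconnect has taken place and that applying the efix should be prioritized.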
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Wed Oct 11 15:01:54 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 16:01:54 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Usually, IO errors point to some basic problem reading/writing data . if there are repoducible errors, it's IMHO always a nice thing to trace GPFS for such an access. Often that reveals already the area where the cause lies and maybe even the details of it. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 01:22 PM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org Yes I get we should only be doing this if we think we have a problem. And the answer is, right now, we're not entirely clear. We have a couple of issues our users are reporting to us, and its not clear to us if they are related, an FS problem or ACLs getting in the way. We do have users who are trying to work on files getting IO error, and we have an AFM sync issue. The disks are all online, I poked the FS with tsdbfs and the files look OK - (small files, but content of the block matches). Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error reported to user when they access a file even if its not an offline file??) We have a PMR open with IBM on this already. But there's a wanting to be sure in our own minds that we don't have an underlying FS problem. I.e. 
I have confidence that I can tell my users, yes I know you are seeing weird stuff, but we have run checks and are not introducing data corruption.
Simon
On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Mostly, however, filesystem checks are only done if fs issues are >indicated by errors in the logs. Do you have reason to assume your fs has >probs? 
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 15:13:03 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 14:13:03 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: 
So with the help of IBM support and Venkat (thanks guys!), we think it's a problem with DMAPI.
As we initially saw this as an issue with AFM replication, we had traces from there, and had entries like: gpfsWrite exit: failed err 688
Now apparently err 688 relates to "DMAPI disposition"; once we had this, we were able to get someone to take a look at the HSM dsmrecalld; it was running, but had failed over to a node that wasn't able to service requests properly. (multiple NSD servers with different file-systems each running dsmrecalld, but I don't think you can scope nodes XYZ to filesystem ABC but not DEF).
Anyway once we got that fixed, a bunch of stuff in the AFM cache popped out (and a little poke for some stuff that hadn't updated metadata cache probably). So hopefully it's now also solved for our other users.
What is complicated here is that a DMAPI issue was giving intermittent IO errors, people could write into new folders, but not existing files, though I could (some sort of Schrödinger's cat IO issue??).
So hopefully we are fixed...
Simon
On 11/10/2017, 15:01, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Usually, IO errors point to some basic problem reading/writing data. >if there are reproducible errors, it's IMHO always a nice thing to trace >GPFS for such an access. Often that reveals already the area where the >cause lies and maybe even the details of it. > > > > >Mit freundlichen Grüßen / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Geschäftsführung: >Thomas Wolter, Sven Schooß >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: gpfsug main discussion list >Date: 10/11/2017 01:22 PM >Subject: Re: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >Yes I get we should only be doing this if we think we have a problem. > >And the answer is, right now, we're not entirely clear. 
> >We have a couple of issues our users are reporting to us, and its not >clear to us if they are related, an FS problem or ACLs getting in the way. > >We do have users who are trying to work on files getting IO error, and we >have an AFM sync issue. The disks are all online, I poked the FS with >tsdbfs and the files look OK - (small files, but content of the block >matches). > >Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error >reported to user when they access a file even if its not an offline >file??) > >We have a PMR open with IBM on this already. > >But there's a wanting to be sure in our own minds that we don't have an >underlying FS problem. I.e. I have confidence that I can tell my users, >yes I know you are seeing weird stuff, but we have run checks and are not >introducing data corruption. > >Simon > >On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf >of UWEFALKE at de.ibm.com" behalf of UWEFALKE at de.ibm.com> wrote: > >>Mostly, however, filesystem checks are only done if fs issues are >>indicated by errors in the logs. Do you have reason to assume your fs has >>probs? > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From truongv at us.ibm.com Wed Oct 11 17:14:21 2017 From: truongv at us.ibm.com (Truong Vu) Date: Wed, 11 Oct 2017 12:14:21 -0400 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to the network In-Reply-To: References: Message-ID: What you can do is create network alias to the old IP. Run mmchnode to change hostname/IP for non-quorum nodes first. Make one (or more) of the nodes you just change a quorum node. Change all of the quorum nodes that still on old IPs to non-quorum. Then change IPs on them. Thanks, Tru. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 10/11/2017 04:53 AM Subject: gpfsug-discuss Digest, Vol 69, Issue 26 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=21OH1KjxVbfDBz9Kdr0USitreLsyXEbP9rHC7Vxmhw0&e= or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Changing ip on spectrum scale cluster with every node down and not connected to network. (Andi Rhod Christiansen) 2. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Jonathan Buzzard) 3. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Andi Rhod Christiansen) 4. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Simon Thompson (IT Research Support)) 5. 
Checking a file-system for errors (Simon Thompson (IT Research Support)) 6. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Jonathan Buzzard) ---------------------------------------------------------------------- Message: 1 Date: Wed, 11 Oct 2017 07:46:03 +0000 From: Andi Rhod Christiansen To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <3e6e1727224143ac9b8488d16f40fcb3 at B4RWEX01.internal.b4restore.com> Content-Type: text/plain; charset="us-ascii" Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: < https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_pipermail_gpfsug-2Ddiscuss_attachments_20171011_820adb01_attachment-2D0001.html&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=NrezaW_ayd5u-bE6ppJ6p3FBluuDTtv6KHqb4TwaGsY&e= > ------------------------------ Message: 2 Date: Wed, 11 Oct 2017 09:01:47 +0100 From: Jonathan Buzzard To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <8b9180bf-0bef-4e42-020b-28a9610012a1 at strath.ac.uk> Content-Type: text/plain; charset=windows-1252; format=flowed On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ------------------------------ Message: 3 Date: Wed, 11 Oct 2017 08:18:01 +0000 From: Andi Rhod Christiansen To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <9fcbdf3fa2df4df5bd25f4e93d2a3e79 at B4RWEX01.internal.b4restore.com> Content-Type: text/plain; charset="us-ascii" Hi Jonathan, Yes I thought about that but the system is located at a customer site and they are not willing to do that, unfortunately. That's why I was hoping there was a way around it Andi R. 
Christiansen -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard Sent: 11. oktober 2017 10:02 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=21OH1KjxVbfDBz9Kdr0USitreLsyXEbP9rHC7Vxmhw0&e= ------------------------------ Message: 4 Date: Wed, 11 Oct 2017 08:32:37 +0000 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: Content-Type: text/plain; charset="us-ascii" I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given its cluster configuration, I wouldn't be doing this on a cluster I cared about with explicit instruction from IBM support. So I suggest log a ticket with IBM. Simon From: > on behalf of "arc at b4restore.com" > Reply-To: "gpfsug-discuss at spectrumscale.org< mailto:gpfsug-discuss at spectrumscale.org>" > Date: Wednesday, 11 October 2017 at 08:46 To: "gpfsug-discuss at spectrumscale.org< mailto:gpfsug-discuss at spectrumscale.org>" > Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... 
URL: < https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_pipermail_gpfsug-2Ddiscuss_attachments_20171011_cd962e6b_attachment-2D0001.html&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=Iy6NQR-GJD1Hkc0A0C96Jkesrs6h-6HpOnnw3MOQmi4&e= > ------------------------------ Message: 5 Date: Wed, 11 Oct 2017 08:46:46 +0000 From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Checking a file-system for errors Message-ID: Content-Type: text/plain; charset="us-ascii" I'm just wondering if anyone could share any views on checking a file-system for errors. For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode? And then were does mmrestripefs -c fit into this? "-c Scans the file system and compares replicas of metadata and data for conflicts. When conflicts are found, the -c option attempts to fix the replicas. " Which sorta sounds like fix things in the file-system, so how does that intersect (if at all) with mmfsck? Thanks Simon ------------------------------ Message: 6 Date: Wed, 11 Oct 2017 09:53:34 +0100 From: Jonathan Buzzard To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <1507712014.9906.5.camel at strath.ac.uk> Content-Type: text/plain; charset="UTF-8" On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote: > Hi Jonathan, > > Yes I thought about that but the system is located at a customer site > and they are not willing to do that, unfortunately. > > That's why I was hoping there was a way around it > I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=21OH1KjxVbfDBz9Kdr0USitreLsyXEbP9rHC7Vxmhw0&e= End of gpfsug-discuss Digest, Vol 69, Issue 26 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Thu Oct 12 18:41:49 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 12 Oct 2017 17:41:49 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> SC17 is only 1 month away! Here are the GPFS/Scale User group meeting details. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. 
NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Look for signs when you arrive Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeep.patil at in.ibm.com Fri Oct 13 09:20:56 2017 From: sandeep.patil at in.ibm.com (Sandeep Ramesh) Date: Fri, 13 Oct 2017 13:50:56 +0530 Subject: [gpfsug-discuss] New Redpapers on Spectrum Scale/ESS GUI Published Message-ID: Dear Spectrum Scale User Group Members, New Redpapers on Spectrum Scale GUI and ESS GUI has been published yesterday. To help keep the community informed. Monitoring and Managing IBM Spectrum Scale Using the GUI http://www.redbooks.ibm.com/redpieces/abstracts/redp5458.html?Open Monitoring and Managing the IBM Elastic Storage Server Using the GUI http://www.redbooks.ibm.com/redpieces/abstracts/redp5471.html?Open thx Spectrum Scale Dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Fri Oct 13 10:47:39 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 13 Oct 2017 09:47:39 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> References: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> Message-ID: I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: 12 October 2017 18:42 To: gpfsug main discussion list Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details SC17 is only 1 month away! Here are the GPFS/Scale User group meeting details. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 
6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Look for signs when you arrive Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 13 13:12:59 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 12:12:59 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: >I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? I believe (hope?) that's just a placeholder, and we'll actually use the time for something more engaging... Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 09:47:39 +0000 From: "Sobey, Richard A" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="utf-8" I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? From r.sobey at imperial.ac.uk Fri Oct 13 13:45:43 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 13 Oct 2017 12:45:43 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: Actually, I was being 100% serious :) Although it's a boring topic, it's nonetheless fairly crucial and I'd like to see more about it. I won't be at SC17 unless you're livestreaming it anyway. Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie Sent: 13 October 2017 13:13 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details >I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? I believe (hope?) that's just a placeholder, and we'll actually use the time for something more engaging... 
Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 09:47:39 +0000 From: "Sobey, Richard A" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="utf-8" I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From john.hearns at asml.com Fri Oct 13 13:56:18 2017 From: john.hearns at asml.com (John Hearns) Date: Fri, 13 Oct 2017 12:56:18 +0000 Subject: [gpfsug-discuss] How to simulate an NSD failure? Message-ID: I have set up a small testbed, consisting of three nodes. Two of the nodes have a disk which is being used as an NSD. This is being done for some preparation for fun and games with some whizzy new servers. The testbed has spinning drives. I have created two NSDs and have set the data replication to 1 (this is deliberate). I am trying to fail an NSD and find which files have parts on the failed NSD. A first test with 'mmdeldisk' didn't have much effect as SpectrumScale is smart enough to copy the data off the drive. I now take the drive offline and delete it by echo offline > /sys/block/sda/device/state echo 1 > /sys/block/sda/delete Short of going to the data centre and physically pulling the drive that's a pretty final way of stopping access to a drive. I then wrote 100 files to the filesystem, the node with the NSD did log "rejecting I/O to offline device" However mmlsdisk says that this disk is status 'ready' I am going to stop that NSD and run an mmdeldisk - at which point I do expect things to go south rapidly. I just am not understanding at what point a failed write would be detected? Or once a write fails are all the subsequent writes Routed off to the active NSD(s) ?? Sorry if I am asking an idiot question. Inspector.clouseau at surete.fr -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From S.J.Thompson at bham.ac.uk Fri Oct 13 14:38:26 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Fri, 13 Oct 2017 13:38:26 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: The slides from the Manchester meeting are at: http://files.gpfsug.org/presentations/2017/Manchester/09_licensing-update.p df We moved all of our socket licenses to per TB DME earlier this year, and then also have DME per drive for our Lenovo DSS-G system, which for various reasons is in a different cluster There are certainly people in IBM UK who understand this process if that was something you wanted to look at. Simon On 13/10/2017, 13:45, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Sobey, Richard A" wrote: >Actually, I was being 100% serious :) Although it's a boring topic, it's >nonetheless fairly crucial and I'd like to see more about it. I won't be >at SC17 unless you're livestreaming it anyway. > >Richard > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie >Sent: 13 October 2017 13:13 >To: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration >and program details > >>I *need* to see the presentation from the licensing session ? everyone?s >>favourite topic ? > > >I believe (hope?) that's just a placeholder, and we'll actually use the >time for something more engaging... > > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > >Message: 3 >Date: Fri, 13 Oct 2017 09:47:39 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > utlook.com> > >Content-Type: text/plain; charset="utf-8" > >I *need* to see the presentation from the licensing session ? everyone?s >favourite topic ? > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From heiner.billich at psi.ch Fri Oct 13 15:15:53 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Fri, 13 Oct 2017 14:15:53 +0000 Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home Message-ID: <94041E4C-3978-4D39-86EA-79629FC17AB8@psi.ch> Hello, Running an AFM IW cache we noticed that AFM starts the flushing of data from cache to home rather slow, say at 20MB/s, and only slowly increases to several 100MB/s after a few minutes. As soon as the pending queue gets no longer filled the data rate drops, again. I assume that this is a good behavior for WAN traffic where you don?t want to use the full bandwidth from the beginning but only if really needed. For our local setup with dedicated links I would prefer a much more aggressive behavior to get data transferred asap to home. Am I right, does AFM implement such a ?slow startup?, and is there a way to change this behavior? We did increase afmNumFlushThreads to 128. Currently we measure with many small files (1MB). 
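For reference, we applied the flush-thread change per cache fileset along these lines, with fs1 and cache1 standing in for our real file system and fileset names:

   mmchfileset fs1 cache1 -p afmNumFlushThreads=128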
For large files the behavior is different, we get a stable data rate from the beginning, but I did not yet try with a continuous write on the cache to see whether I see an increase after a while, too. Thank you, Heiner Billich -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From carlz at us.ibm.com Fri Oct 13 15:46:47 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 14:46:47 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 -Registrationandprogram details In-Reply-To: References: Message-ID: Hi Richard, I'm always happy to have a separate conversation if you have any questions about licensing. Ping me on my email address below. Same goes for anybody else who won't be at SC17. Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com >------------------------------ > >Message: 2 >Date: Fri, 13 Oct 2017 12:45:43 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > rod.outlook.com> > >Content-Type: text/plain; charset="us-ascii" > >Actually, I was being 100% serious :) Although it's a boring topic, >it's nonetheless fairly crucial and I'd like to see more about it. I >won't be at SC17 unless you're livestreaming it anyway. > >Richard > >won't be >>at SC17 unless you're livestreaming it anyway. >> >>Richard >> From sfadden at us.ibm.com Fri Oct 13 16:56:56 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Fri, 13 Oct 2017 15:56:56 +0000 Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Fri Oct 13 17:32:35 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Fri, 13 Oct 2017 16:32:35 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From alex at calicolabs.com Fri Oct 13 17:53:40 2017 From: alex at calicolabs.com (Alex Chekholko) Date: Fri, 13 Oct 2017 09:53:40 -0700 Subject: [gpfsug-discuss] How to simulate an NSD failure? In-Reply-To: References: Message-ID: John, I think a "philosophical" difference between GPFS code and newer filesystems which were written later, in the age of "commodity hardware", is that GPFS expects the underlying hardware to be very reliable. So "disks" are typically RAID arrays available via multiple paths. And network links should have no errors, and be highly reliable, etc. GPFS does not detect these things well as it does not expect them to fail. That's why you see some discussions around "improving network diagnostics" and "improving troubleshooting tools" and things like that. Having a failed NSD is highly unusual for a GPFS system and you should design your system so that situation does not happen. In your example here, if data is striped across two NSDs and one of them becomes inaccessible, when a client tries to write, it should get an I/O error, and perhaps even unmount the filesystem (depending on where you metadata lives). Regards, Alex On Fri, Oct 13, 2017 at 5:56 AM, John Hearns wrote: > I have set up a small testbed, consisting of three nodes. Two of the nodes > have a disk which is being used as an NSD. > > This is being done for some preparation for fun and games with some whizzy > new servers. 
The testbed has spinning drives. > > I have created two NSDs and have set the data replication to 1 (this is > deliberate). > > I am trying to fail an NSD and find which files have parts on the failed > NSD. > > A first test with ?mmdeldisk? didn?t have much effect as SpectrumScale is > smart enough to copy the data off the drive. > > > > I now take the drive offline and delete it by > > echo offline > /sys/block/sda/device/state > > echo 1 > /sys/block/sda/delete > > > > Short of going to the data centre and physically pulling the drive that?s > a pretty final way of stopping access to a drive. > > I then wrote 100 files to the filesystem, the node with the NSD did log > ?rejecting I/O to offline device? > > However mmlsdisk says that this disk is status ?ready? > > > > I am going to stop that NSD and run an mmdeldisk ? at which point I do > expect things to go south rapidly. > > I just am not understanding at what point a failed write would be > detected? Or once a write fails are all the subsequent writes > > Routed off to the active NSD(s) ?? > > > > Sorry if I am asking an idiot question. > > > > Inspector.clouseau at surete.fr > > > > > > > > > > > > > > > > > > > > > > > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mhabib73 at gmail.com Fri Oct 13 18:48:57 2017 From: mhabib73 at gmail.com (Muhammad Habib) Date: Fri, 13 Oct 2017 13:48:57 -0400 Subject: [gpfsug-discuss] How to simulate an NSD failure? In-Reply-To: References: Message-ID: If your devices/disks are multipath , make sure you remove all paths in order for disk to go offline. Also following line does not see correct: echo 1 > /sys/block/sda/delete , it should rather be echo 1 > /sys/block/sda/device/delete Further after you removed the disks , did you run the fdisk -l , to make sure its completely gone , also if the /var/log/messages confirms the disk is offline. Once all this confirmed then GPFS should take disks down and logs should tell you as well. Thanks M.Habib On Fri, Oct 13, 2017 at 8:56 AM, John Hearns wrote: > I have set up a small testbed, consisting of three nodes. Two of the nodes > have a disk which is being used as an NSD. > > This is being done for some preparation for fun and games with some whizzy > new servers. The testbed has spinning drives. > > I have created two NSDs and have set the data replication to 1 (this is > deliberate). 
> > I am trying to fail an NSD and find which files have parts on the failed > NSD. > > A first test with ?mmdeldisk? didn?t have much effect as SpectrumScale is > smart enough to copy the data off the drive. > > > > I now take the drive offline and delete it by > > echo offline > /sys/block/sda/device/state > > echo 1 > /sys/block/sda/delete > > > > Short of going to the data centre and physically pulling the drive that?s > a pretty final way of stopping access to a drive. > > I then wrote 100 files to the filesystem, the node with the NSD did log > ?rejecting I/O to offline device? > > However mmlsdisk says that this disk is status ?ready? > > > > I am going to stop that NSD and run an mmdeldisk ? at which point I do > expect things to go south rapidly. > > I just am not understanding at what point a failed write would be > detected? Or once a write fails are all the subsequent writes > > Routed off to the active NSD(s) ?? > > > > Sorry if I am asking an idiot question. > > > > Inspector.clouseau at surete.fr > > > > > > > > > > > > > > > > > > > > > > > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- This communication contains confidential information intended only for the persons to whom it is addressed. Any other distribution, copying or disclosure is strictly prohibited. If you have received this communication in error, please notify the sender and delete this e-mail message immediately. Le pr?sent message contient des renseignements de nature confidentielle r?serv?s uniquement ? l'usage du destinataire. Toute diffusion, distribution, divulgation, utilisation ou reproduction de la pr?sente communication, et de tout fichier qui y est joint, est strictement interdite. Si vous avez re?u le pr?sent message ?lectronique par erreur, veuillez informer imm?diatement l'exp?diteur et supprimer le message de votre ordinateur et de votre serveur. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gcorneau at us.ibm.com Fri Oct 13 19:50:05 2017 From: gcorneau at us.ibm.com (Glen Corneau) Date: Fri, 13 Oct 2017 13:50:05 -0500 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: , Message-ID: The original announcement letter for Spectrum Scale Data Management Edition (US version reference below) re-defines a terabyte (nice eh?, should be a tebibyte). 
My math agrees with yours, assuming your file system size is actually 1PB versus 1PiB Terabyte Terabyte is a unit of measure by which the Program can be licensed. A terabyte is 2 to the 40th power bytes. Licensee must obtain an entitlement for each terabyte available to the Program. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=an&subtype=ca&appname=gpateam&supplier=897&letternum=ENUS216-158 ------------------ Glen Corneau Power Systems Washington Systems Center gcorneau at us.ibm.com From: "Daniel Kidger" To: gpfsug-discuss at spectrumscale.org Date: 10/13/2017 11:32 AM Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Sent by: gpfsug-discuss-bounces at spectrumscale.org All, For me the URL looks to have got mangled. Should be: http://files.gpfsug.org/presentations/2017/Manchester/09_licensing-update.pdf with the index page that points to it here: http://www.spectrumscale.org/presentations/ Also a personal note, I have found increasing confusion of decimal v. binary units for storage capacity. I understand that Spectrum Scale uses binary TiB, but say an 1TB drive is in decimal so c. 10% difference. So a 1 Petabyte filesystem needs only 909 TiB of Spectrum Scale licenses. Any comments from others? Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: "Simon Thompson (IT Research Support)" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Date: Fri, Oct 13, 2017 2:38 PM The slides from the Manchester meeting are at: https://urldefense.proofpoint.com/v2/url?u=http-3A__files.gpfsug.org_presentations_2017_Manchester_09-5Flicensing-2Dupdate.p&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=f6qsuSorl92LShV92TTaXNyG3KU0VvuFN4YhT_LTTFc&e= df We moved all of our socket licenses to per TB DME earlier this year, and then also have DME per drive for our Lenovo DSS-G system, which for various reasons is in a different cluster There are certainly people in IBM UK who understand this process if that was something you wanted to look at. Simon On 13/10/2017, 13:45, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Sobey, Richard A" wrote: >Actually, I was being 100% serious :) Although it's a boring topic, it's >nonetheless fairly crucial and I'd like to see more about it. I won't be >at SC17 unless you're livestreaming it anyway. > >Richard > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie >Sent: 13 October 2017 13:13 >To: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration >and program details > >>I *need* to see the presentation from the licensing session ? everyone?s >>favourite topic ? > > >I believe (hope?) that's just a placeholder, and we'll actually use the >time for something more engaging... 
> > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > >Message: 3 >Date: Fri, 13 Oct 2017 09:47:39 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > utlook.com> > >Content-Type: text/plain; charset="utf-8" > >I *need* to see the presentation from the licensing session ? everyone?s >favourite topic ? > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=d-vphLEe_UlGazP6RdYAyyAA3Qv5S9IRVNuO1i9vjJc&m=rOPfwzvHMD3_MRZy2WHgOGtmYQya-jWx5d_s92EeJRk&s=LkQ4lwnC-ATFnHjydppCXDasUDijS9DUh0p-cFaM0NM&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 13 20:10:56 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 19:10:56 +0000 Subject: [gpfsug-discuss] Scale per TB (was: User group Meeting at SC17 - Registration and program details) In-Reply-To: References: Message-ID: Yeah, I know... It's actually an IBM thing, not just a Scale thing. Some time in the distant past, IBM decided that too few people were familiar with the term "tebibyte" or its official abbreviation "TiB", so in the IBM licensing catalog there is the "Terabyte" (really a tebibyte) and the "Decimal Terabyte" (an actual terabyte). When we made the capacity license we had to decide which one to use, and we decided to err on the side of giving people the larger amount. 
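To put rough numbers on that: because the licensed unit is a binary terabyte (2 to the 40th power bytes), a 1 PB (10^15 byte) file system needs about 909 capacity units rather than 1000, which matches the 909 TiB figure quoted earlier in this thread. A quick back-of-the-envelope check from any shell (assuming bc is installed):

# licensed ("binary") terabytes needed for a 1 PB (decimal) file system
echo 'scale=2; 10^15 / 2^40' | bc     # -> 909.49
# same capacity expressed in decimal TB, for comparison
echo 'scale=2; 10^15 / 10^12' | bc    # -> 1000.00
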
Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 13:50:05 -0500 From: "Glen Corneau" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="us-ascii" The original announcement letter for Spectrum Scale Data Management Edition (US version reference below) re-defines a terabyte (nice eh?, should be a tebibyte). My math agrees with yours, assuming your file system size is actually 1PB versus 1PiB Terabyte Terabyte is a unit of measure by which the Program can be licensed. A terabyte is 2 to the 40th power bytes. Licensee must obtain an entitlement for each terabyte available to the Program. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=an&subtype=ca&appname=gpateam&supplier=897&letternum=ENUS216-158 ------------------ Glen Corneau Power Systems Washington Systems Center gcorneau at us.ibm.com From: "Daniel Kidger" To: gpfsug-discuss at spectrumscale.org Date: 10/13/2017 11:32 AM Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Sent by: gpfsug-discuss-bounces at spectrumscale.org All, For me the URL looks to have got mangled. Should be: https://urldefense.proofpoint.com/v2/url?u=http-3A__files.gpfsug.org_presentations_2017_Manchester_09-5Flicensing-2Dupdate.pdf&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=obB2s7QQTgU9QMn1708Vpg&m=6slj4_SM9ZLHQtguBkVK7Xg2UN1RlDFCjKJoGhWn2wU&s=NU2Hs398IPSytPh8bYplXjFChhaF9G21Pt4YoHvbrPY&e= with the index page that points to it here: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.spectrumscale.org_presentations_&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=obB2s7QQTgU9QMn1708Vpg&m=6slj4_SM9ZLHQtguBkVK7Xg2UN1RlDFCjKJoGhWn2wU&s=CLN7JkpjQsfPdvOapYPGX3o7gHZj8AOh7tYSusTZJPE&e= Also a personal note, I have found increasing confusion of decimal v. binary units for storage capacity. I understand that Spectrum Scale uses binary TiB, but say an 1TB drive is in decimal so c. 10% difference. So a 1 Petabyte filesystem needs only 909 TiB of Spectrum Scale licenses. Any comments from others? Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com From a.khiredine at meteo.dz Sun Oct 15 13:44:42 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Sun, 15 Oct 2017 12:44:42 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Dear All, Is there a way to save the GPS configuration? 
OR how backup all GSS no backup of data or metadata only configuration for disaster recovery for example: stanza vdisk pdisk RAID code recovery group array Thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz From skylar2 at u.washington.edu Mon Oct 16 14:29:33 2017 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Mon, 16 Oct 2017 13:29:33 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server In-Reply-To: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> References: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Message-ID: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu> I'm not familiar with GSS, but we have a script that executes the following before backing up a GPFS filesystem so that we have human-readable configuration information: mmlsconfig mmlsnsd mmlscluster mmlsnode mmlsdisk ${FS_NAME} -L mmlsfileset ${FS_NAME} -L mmlspool ${FS_NAME} all -L mmlslicense -L mmlspolicy ${FS_NAME} -L And then executes this for the benefit of GPFS: mmbackupconfig Of course there's quite a bit of overlap for clusters that have more than one filesystem, and even more for filesystems that we backup at the fileset level, but disk is cheap and the hope is it'll make a DR scenario a little bit less harrowing. On Sun, Oct 15, 2017 at 12:44:42PM +0000, atmane khiredine wrote: > Dear All, > > Is there a way to save the GPS configuration? > > OR how backup all GSS > > no backup of data or metadata only configuration for disaster recovery > > for example: > stanza > vdisk > pdisk > RAID code > recovery group > array > > Thank you > > Atmane Khiredine > HPC System Administrator | Office National de la M?t?orologie > T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From heiner.billich at psi.ch Mon Oct 16 14:36:09 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Mon, 16 Oct 2017 13:36:09 +0000 Subject: [gpfsug-discuss] slow startup of AFM flush to home Message-ID: Hello Scott, Thank you. I did set afmFlushThreadDelay = 1 and did get a much faster startup. Setting to 0 didn?t improve further. I?m not sure how much we?ll need this in production when most of the time the queue is full. But for benchmarking during setup it?s helps a lot. (we run 4.2.3-4 on RHEL7) Kind regards, Heiner Scott Fadden did write: When an AFM gateway is flushing data to the target (home) it starts flushing with a few threads (Don't remember the number) and ramps up to afmNumFlushThreads. How quickly this ramp up occurs is controlled by afmFlushThreadDealy. The default is 5 seconds. So flushing only adds threads once every 5 seconds. This was an experimental parameter so your milage may vary. 
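For anyone wanting to try the same tuning, a minimal sketch of how these two knobs might be applied (the file system name "fs1" and fileset name "cacheFileset" below are placeholders, and afmFlushThreadDelay is the experimental parameter mentioned above, so whether it is accepted cluster-wide via mmchconfig may vary by release -- verify on a test cluster first):

# raise the number of AFM flush threads on the cache fileset
mmchfileset fs1 cacheFileset -p afmNumFlushThreads=128
# shorten the ramp-up delay between adding flush threads (default is 5 seconds)
mmchconfig afmFlushThreadDelay=1 -i
# confirm the AFM attributes on the fileset
mmlsfileset fs1 cacheFileset --afm -L
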
Scott Fadden Spectrum Scale - Technical Marketing Phone: (503) 880-5833 sfadden at us.ibm.com http://www.ibm.com/systems/storage/spectrum/scale ----- Original message ----- From: "Billich Heinrich Rainer (PSI)" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home Date: Fri, Oct 13, 2017 10:16 AM Hello, Running an AFM IW cache we noticed that AFM starts the flushing of data from cache to home rather slow, say at 20MB/s, and only slowly increases to several 100MB/s after a few minutes. As soon as the pending queue gets no longer filled the data rate drops, again. I assume that this is a good behavior for WAN traffic where you don???t want to use the full bandwidth from the beginning but only if really needed. For our local setup with dedicated links I would prefer a much more aggressive behavior to get data transferred asap to home. Am I right, does AFM implement such a ???slow startup???, and is there a way to change this behavior? We did increase afmNumFlushThreads to 128. Currently we measure with many small files (1MB). For large files the behavior is different, we get a stable data rate from the beginning, but I did not yet try with a continuous write on the cache to see whether I see an increase after a while, too. Thank you, Heiner Billich -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From sfadden at us.ibm.com Mon Oct 16 16:34:33 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Mon, 16 Oct 2017 15:34:33 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server In-Reply-To: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu> References: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu>, <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Message-ID: An HTML attachment was scrubbed... URL: From er.a.ross at gmail.com Fri Oct 20 03:15:38 2017 From: er.a.ross at gmail.com (Eric Ross) Date: Thu, 19 Oct 2017 21:15:38 -0500 Subject: [gpfsug-discuss] file auditing capabilities Message-ID: I'm researching the file auditing capabilities possible with GPFS; I found this paper on the GPFS wiki: https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/f0cc9b82-a133-41b4-83fe-3f560e95b35a/attachment/0ab62645-e0ab-4377-81e7-abd11879bb75/media/Spectrum_Scale_Varonis_Audit_Logging.pdf I haven't found anything else on the subject, however. While I like the idea of being able to do this logging on the protocol node level, I'm also interested in the possibility of auditing files from native GPFS mounts. Additional digging uncovered references to Lightweight Events (LWE): http://files.gpfsug.org/presentations/2016/SC16/04_Scott_Fadden_Spectrum_Scale_Update.pdf Specifically, this references being able to use the policy engine to detect things like file opens, reads, and writes. Searching through the official GPFS documentation, I see references to these events in the transparent cloud tiering section: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_define_cloud_storage_tier.htm but, I don't see, or possibly have missed, the other section(s) defining what other EVENT parameters I can use. I'm curious to know more about these events, could anyone point me in the right direction? 
I'm wondering if I could use them to perform rudimentary auditing of the file system (e.g. a default policy in place to log a message of say user foo either wrote to and/or read from file bar). Thanks, -Eric From richardb+gpfsUG at ellexus.com Fri Oct 20 15:47:57 2017 From: richardb+gpfsUG at ellexus.com (Richard Booth) Date: Fri, 20 Oct 2017 15:47:57 +0100 Subject: [gpfsug-discuss] file auditing capabilities Message-ID: Hi Eric The company I work for could possibly help with this, Ellexus . Please feel free to get in touch if you need some help with this. Cheers Richard ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Thu, 19 Oct 2017 21:15:38 -0500 >> From: Eric Ross >> To: gpfsug-discuss at spectrumscale.org >> Subject: [gpfsug-discuss] file auditing capabilities >> Message-ID: >> > ail.com> >> Content-Type: text/plain; charset="UTF-8" >> >> I'm researching the file auditing capabilities possible with GPFS; I >> found this paper on the GPFS wiki: >> >> https://www.ibm.com/developerworks/community/wikis/form/anon >> ymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ >> f0cc9b82-a133-41b4-83fe-3f560e95b35a/attachment/0ab62645- >> e0ab-4377-81e7-abd11879bb75/media/Spectrum_Scale_Varonis_ >> Audit_Logging.pdf >> >> I haven't found anything else on the subject, however. >> >> While I like the idea of being able to do this logging on the protocol >> node level, I'm also interested in the possibility of auditing files >> from native GPFS mounts. >> >> Additional digging uncovered references to Lightweight Events (LWE): >> >> http://files.gpfsug.org/presentations/2016/SC16/04_Scott_Fad >> den_Spectrum_Scale_Update.pdf >> >> Specifically, this references being able to use the policy engine to >> detect things like file opens, reads, and writes. >> >> Searching through the official GPFS documentation, I see references to >> these events in the transparent cloud tiering section: >> >> https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/ >> com.ibm.spectrum.scale.v4r22.doc/bl1adm_define_cloud_storage_tier.htm >> >> but, I don't see, or possibly have missed, the other section(s) >> defining what other EVENT parameters I can use. >> >> I'm curious to know more about these events, could anyone point me in >> the right direction? >> >> I'm wondering if I could use them to perform rudimentary auditing of >> the file system (e.g. a default policy in place to log a message of >> say user foo either wrote to and/or read from file bar). >> >> Thanks, >> -Eric >> >> >> ------------------------------ >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> End of gpfsug-discuss Digest, Vol 69, Issue 38 >> ********************************************** >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 20 20:54:38 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 20 Oct 2017 19:54:38 +0000 Subject: [gpfsug-discuss] file auditing capabilities (Eric Ross) Message-ID: Disclaimer: all statements about future functionality are subject to change, and represent intentions only. That being said: Yes, we are working on File Audit Logging native to Spectrum Scale. 
The intention is to provide auditing capabilities in a protocol agnostic manner that will capture not only audit events that come through protocols but also GPFS/Scale native file system access events. The audit logs are written to a specified GPFS/Scale fileset in a format that is both human=-readable and easily parsable for automated consumption, reporting, or whatever else you might want to do with it. Currently, we intend to release this capability with Scale 5.0. The underlying technology for this is indeed LWE, which as some of you know is also underneath some other Scale features. The use of LWE allows us to do auditing very efficiently to minimize performance impact while also allowing scalability. We do not at this time have plans to expose LWE directly for end-user consumption -- it needs to be "packaged" in a more consumable way in order to be generally supportable. However, we do have intentions to expose other functionality on top of the LWE capability in the future. Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com From Stephan.Peinkofer at lrz.de Mon Oct 23 11:41:23 2017 From: Stephan.Peinkofer at lrz.de (Peinkofer, Stephan) Date: Mon, 23 Oct 2017 10:41:23 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management Message-ID: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Oct 23 12:00:50 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 23 Oct 2017 11:00:50 +0000 Subject: [gpfsug-discuss] el7.4 compatibility In-Reply-To: References: Message-ID: Just picking up this old thread, but... October updates: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html# linux 7.4 is now listed as supported with min scale version of 4.1.1.17 or 4.2.3.4 (incidentally 4.2.3.5 looks to have been released today). Simon On 27/09/2017, 09:16, "gpfsug-discuss-bounces at spectrumscale.org on behalf of kenneth.waegeman at ugent.be" wrote: >Hi, > >Is there already some information available of gpfs (and protocols) on >el7.4 ? > >Thanks! 
> >Kenneth > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Mon Oct 23 12:09:17 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 23 Oct 2017 13:09:17 +0200 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> References: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> Message-ID: You can lower LEASE_LIFETIME and GRACE_PERIOD to shorten the time it's in grace, to make it more bearable. Making export changes dynamic is something that's fixed in newer versions of nfs-ganesha than what's shipped with Scale: https://github.com/nfs-ganesha/nfs-ganesha/releases/tag/V2.4.0: "dynamic EXPORT configuration update (via dBus and SIGHUP)" Hopefully someone can comment on when we'll see nfs-ganesha v2.4+ included with Scale. -jf On Mon, Oct 23, 2017 at 12:41 PM, Peinkofer, Stephan < Stephan.Peinkofer at lrz.de> wrote: > Dear List, > > I?m currently working on a self service portal for managing NFS exports of > ISS. Basically something very similar to OpenStack Manila but tailored to > our specific needs. > While it was very easy to do this using the great REST API of ISS, I > stumbled across a fact that may be even a show stopper: According to the > documentation for mmnfs, each time we > create/change/delete a NFS export via mmnfs, ganesha service is restarted > on all nodes. > > I assume that this behaviour may cause problems (at least IO stalls) on > clients mounted the filesystem. So my question is, what is your experience > with CES NFS export management. > Do you see any problems when you add/change/delete exports and ganesha > gets restarted? > > Are there any (supported) workarounds for this problem? > > PS: As I think in 2017 CES Exports should be manageable without service > disruptions (and ganesha provides facilities to do so), I filed an RFE for > this: https://www.ibm.com/developerworks/rfe/execute? > use_case=viewRfe&CR_ID=111918 > > Many thanks in advance. > Best Regards, > Stephan Peinkofer > -- > Stephan Peinkofer > Dipl. Inf. (FH), M. Sc. (TUM) > > Leibniz Supercomputing Centre > Data and Storage Division > Boltzmannstra?e 1, 85748 Garching b. M?nchen > Tel: +49(0)89 35831-8715 <+49%2089%20358318715> Fax: +49(0)89 > 35831-9700 <+49%2089%20358319700> > URL: http://www.lrz.de > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Mon Oct 23 12:56:07 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Mon, 23 Oct 2017 17:26:07 +0530 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: Hi Stephan, I observed ganesha service getting restarted only after adding first nfs export. For rest of the operations (e.g. adding more nfs exports, changing nfs exports, removing nfs exports); ganesha service doesn't restart. My observations are based on following simple tests. I ran them against rhel7.3 test cluster having nfs-ganesha-2.5.2. tests: 1. created 1st nfs export - ganesha service was restarted 2. created 4 more nfs exports (mmnfs export add path) 3. changed 2 nfs exports (mmnfs export change path --nfschange); 4. removed all 5 exports one by one (mmnfs export remove path) 5. 
no nfs exports after step 4 on my test system. So, created a new nfs export (which will be the 1st nfs export). 6. change nfs export created in step 5 results observed: ganesha service restarted for test 1 and test 5. For rest tests (2,3,4,6); ganesha service didn't restart. Thanks, Chetan. From: "Peinkofer, Stephan" To: "gpfsug-discuss at spectrumscale.org" Date: 10/23/2017 04:11 PM Subject: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=uic-29lyJ5TCiTRi0FyznYhKJx5I7Vzu80WyYuZ4_iM&m=ghcZYswqgF3beYOogGGLsT1RyDRZrbLXdzp3Fbjmfrg&s=TUm7BM3sY75Nc20gOfhz9lvDgYJse0TM6-tIW8I1QiI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Mon Oct 23 13:16:17 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 23 Oct 2017 12:16:17 +0000 Subject: [gpfsug-discuss] Reminder: User group Meeting at SC17 - Registration and program details Message-ID: Reminder: Register for the SC17 User Group meeting if you are heading to SC17. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 
6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Centennial E Ballroom followed by reception in Centennial D Ballroom at 5:30pm Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Stephan.Peinkofer at lrz.de Mon Oct 23 13:20:47 2017 From: Stephan.Peinkofer at lrz.de (Peinkofer, Stephan) Date: Mon, 23 Oct 2017 12:20:47 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: <5BBED5D7-5E06-453F-B839-BC199EC74720@lrz.de> Dear Chetan, interesting. I?m running ISS 4.2.3-4 and it seems to ship with nfs-ganesha-2.3.2. So are you already using a future ISS version? Here is what I see: [root at datdsst102 pr74cu-dss-0002]# mmnfs export list Path Delegations Clients ---------------------------------------------------------- /dss/dsstestfs01/pr74cu-dss-0002 NONE 10.156.29.73 /dss/dsstestfs01/pr74cu-dss-0002 NONE 10.156.29.72 [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfschange "10.156.29.72(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfschange "10.156.29.72(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfsadd "10.156.29.74(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. 
[root at datdsst102 ~]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfsremove 10.156.29.74 datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de On 23. Oct 2017, at 13:56, Chetan R Kulkarni > wrote: Hi Stephan, I observed ganesha service getting restarted only after adding first nfs export. For rest of the operations (e.g. adding more nfs exports, changing nfs exports, removing nfs exports); ganesha service doesn't restart. My observations are based on following simple tests. I ran them against rhel7.3 test cluster having nfs-ganesha-2.5.2. tests: 1. created 1st nfs export - ganesha service was restarted 2. created 4 more nfs exports (mmnfs export add path) 3. changed 2 nfs exports (mmnfs export change path --nfschange); 4. removed all 5 exports one by one (mmnfs export remove path) 5. no nfs exports after step 4 on my test system. So, created a new nfs export (which will be the 1st nfs export). 6. change nfs export created in step 5 results observed: ganesha service restarted for test 1 and test 5. For rest tests (2,3,4,6); ganesha service didn't restart. Thanks, Chetan. "Peinkofer, Stephan" ---10/23/2017 04:11:33 PM---Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically From: "Peinkofer, Stephan" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/23/2017 04:11 PM Subject: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. 
M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=uic-29lyJ5TCiTRi0FyznYhKJx5I7Vzu80WyYuZ4_iM&m=ghcZYswqgF3beYOogGGLsT1RyDRZrbLXdzp3Fbjmfrg&s=TUm7BM3sY75Nc20gOfhz9lvDgYJse0TM6-tIW8I1QiI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Oct 23 14:42:51 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 23 Oct 2017 13:42:51 +0000 Subject: [gpfsug-discuss] Rainy days and Mondays and GPFS lying to me always get me down... Message-ID: Hi All, And I?m not really down, but it is a rainy Monday morning here and GPFS did give me a scare in the last hour, so I thought that was a funny subject line. So I have a >1 PB filesystem with 3 pools: 1) the system pool, which contains metadata only, 2) the data pool, which is where all I/O goes to by default, and 3) the capacity pool, which is where old crap gets migrated to. I logged on this morning to see an alert that my data pool was 100% full. I ran an mmdf from the cluster manager and, sure enough: (pool total) 509.3T 0 ( 0%) 0 ( 0%) I immediately tried copying a file to there and it worked, so I figured GPFS must be failing writes over to the capacity pool, but an mmlsattr on the file I copied showed it being in the data pool. Hmmm. I also noticed that ?df -h? said that the filesystem had 399 TB free, while mmdf said it only had 238 TB free. Hmmm. So after some fruitless poking around I decided that whatever was going to happen, I should kill the mmrestripefs I had running on the capacity pool ? let me emphasize that ? I had a restripe running on the capacity pool only (via the ?-P? option to mmrestripefs) but it was the data pool that said it was 100% full. I?m sure many of you have already figured out where this is going ? after killing the restripe I ran mmdf again and: (pool total) 509.3T 159T ( 31%) 1.483T ( 0%) I have never seen anything like this before ? any ideas, anyone? PMR time? Thanks! Kevin From valdis.kletnieks at vt.edu Mon Oct 23 19:13:05 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 23 Oct 2017 14:13:05 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: <32917.1508782385@turing-police.cc.vt.edu> On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From bbanister at jumptrading.com Mon Oct 23 19:23:33 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 23 Oct 2017 18:23:33 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: <32917.1508782385@turing-police.cc.vt.edu> References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: This becomes very disruptive when you have to add or remove many NFS exports. Is it possible to add and remove multiple entries at a time or is this YARFE time? -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of valdis.kletnieks at vt.edu Sent: Monday, October 23, 2017 1:13 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Experience with CES NFS export management On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From stefan.dietrich at desy.de Mon Oct 23 19:34:02 2017 From: stefan.dietrich at desy.de (Dietrich, Stefan) Date: Mon, 23 Oct 2017 20:34:02 +0200 (CEST) Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <2146307210.3678055.1508783642716.JavaMail.zimbra@desy.de> Hello Bryan, at least changing multiple entries at once is possible. You can copy /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf to e.g. /tmp, modify the export (remove/add nodes or options) and load the changed config via "mmnfs export load " That way, only a single restart is issued for Ganesha on the CES nodes. Adding/removing I did not try so far, to be honest for use-cases this is rather static. Regards, Stefan ----- Original Message ----- > From: "Bryan Banister" > To: "gpfsug main discussion list" > Sent: Monday, October 23, 2017 8:23:33 PM > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > This becomes very disruptive when you have to add or remove many NFS exports. 
> Is it possible to add and remove multiple entries at a time or is this YARFE > time? > -Bryan > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of > valdis.kletnieks at vt.edu > Sent: Monday, October 23, 2017 1:13 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > > On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > >> tests: >> 1. created 1st nfs export - ganesha service was restarted >> 2. created 4 more nfs exports (mmnfs export add path) >> 3. changed 2 nfs exports (mmnfs export change path --nfschange); >> 4. removed all 5 exports one by one (mmnfs export remove path) >> 5. no nfs exports after step 4 on my test system. So, created a new nfs >> export (which will be the 1st nfs export). >> 6. change nfs export created in step 5 > > mmnfs export change --nfsadd seems to generate a restart as well. > Particularly annoying when the currently running nfs.ganesha fails to > stop rpc.statd on the way down, and then bringing it back up fails because > the port is in use.... > > ________________________________ > > Note: This email is for the confidential use of the named addressee(s) only and > may contain proprietary, confidential or privileged information. If you are not > the intended recipient, you are hereby notified that any review, dissemination > or copying of this email is strictly prohibited, and to please notify the > sender immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The Company, > therefore, does not make any guarantees as to the completeness or accuracy of > this email or any attachments. This email is for informational purposes only > and does not constitute a recommendation, offer, request or solicitation of any > kind to buy, sell, subscribe, redeem or perform any type of transaction of a > financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From valdis.kletnieks at vt.edu Mon Oct 23 19:54:35 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 23 Oct 2017 14:54:35 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <53227.1508784875@turing-police.cc.vt.edu> On Mon, 23 Oct 2017 18:23:33 -0000, Bryan Banister said: > This becomes very disruptive when you have to add or remove many NFS exports. > Is it possible to add and remove multiple entries at a time or is this YARFE time? On the one hand, 'mmnfs export change [path] --nfsadd 'client1(options);client2(options);...)' is supported. On the other hand, after the initial install's rush of new NFS exports, the chances of having more than one client to change at a time are rather low. On the gripping hand, if a client later turns up an entire cluster that needs access, you can also say --nfsadd '172.28.40.0/23(options)' and get the whole cluster in one shot. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From oehmes at gmail.com Tue Oct 24 01:28:33 2017 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 24 Oct 2017 00:28:33 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: we can not commit on timelines on mailing lists, but this is a known issue and will be addressed in a future release. sven On Mon, Oct 23, 2017, 11:23 AM Bryan Banister wrote: > This becomes very disruptive when you have to add or remove many NFS > exports. Is it possible to add and remove multiple entries at a time or is > this YARFE time? > -Bryan > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto: > gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of > valdis.kletnieks at vt.edu > Sent: Monday, October 23, 2017 1:13 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > > On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > > > tests: > > 1. created 1st nfs export - ganesha service was restarted > > 2. created 4 more nfs exports (mmnfs export add path) > > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > > 4. removed all 5 exports one by one (mmnfs export remove path) > > 5. no nfs exports after step 4 on my test system. So, created a new nfs > > export (which will be the 1st nfs export). > > 6. change nfs export created in step 5 > > mmnfs export change --nfsadd seems to generate a restart as well. > Particularly annoying when the currently running nfs.ganesha fails to > stop rpc.statd on the way down, and then bringing it back up fails because > the port is in use.... > > ________________________________ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential or privileged information. > If you are not the intended recipient, you are hereby notified that any > review, dissemination or copying of this email is strictly prohibited, and > to please notify the sender immediately and destroy this email and any > attachments. Email transmission cannot be guaranteed to be secure or > error-free. The Company, therefore, does not make any guarantees as to the > completeness or accuracy of this email or any attachments. This email is > for informational purposes only and does not constitute a recommendation, > offer, request or solicitation of any kind to buy, sell, subscribe, redeem > or perform any type of transaction of a financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mnaineni at in.ibm.com Tue Oct 24 08:57:29 2017 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Tue, 24 Oct 2017 13:27:29 +0530 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: As others have answered, 4.2.3 spectrum can add or remove exports without restarting nfs-ganesha service. Changing an existing export does need nfs-ganesha restart though. If you want to change multiple existing exports, you could use undocumented option "--nfsnorestart" to mmnfs. 
This should add export changes to NFS configuration but it won't restart nfs-ganesha service, so you will not see immediate results of your changes in the running server. Whenever you want your changes reflected, you could manually restart the service using "mmces" command. Regards, Malahal. From: Bryan Banister To: gpfsug main discussion list Date: 10/23/2017 11:53 PM Subject: Re: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org This becomes very disruptive when you have to add or remove many NFS exports. Is it possible to add and remove multiple entries at a time or is this YARFE time? -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of valdis.kletnieks at vt.edu Sent: Monday, October 23, 2017 1:13 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Experience with CES NFS export management On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=dhIJJ5KI4U6ZUia7OPi_-AC3qBrYV9n93ww8Ffhl468&s=K4ii44lk1_auA_3g7SN-E1zmMZNtc1PqBSiQJVudc_w&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: 

From a.khiredine at meteo.dz Tue Oct 24 10:20:25 2017
From: a.khiredine at meteo.dz (atmane khiredine)
Date: Tue, 24 Oct 2017 09:20:25 +0000
Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk
Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6340C7B0@SDEB-EXC02.meteo.dz>

Dear All,

we run a GSS (GPFS Storage Server) native RAID solution for our HPC. Three days ago I noticed that one disk shows only a single path. My configuration is as follows:

GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions

If I search with fdisk I see 476 disks in GSS0 and GSS1. An older capture of the pdisk state (cat mmlspdisk.old) still shows two paths for this disk:

#####
replacementPriority = 1000
name = "e3d5s05"
device = "/dev/sdkt,/dev/sdob" << -
recoveryGroup = "BB1RGL"
declusteredArray = "DA2"
state = "ok"
userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5"
userCondition = "normal"
nPaths = 2 active 4 total << - at that time the disk still had both paths
#####

Both device nodes are still present:

ls /dev/sdob
/dev/sdob
ls /dev/sdkt
/dev/sdkt

But the current output (mmlspdisk all >> mmlspdisk.log; vi mmlspdisk.log) shows only one path:

replacementPriority = 1000
name = "e3d5s05"
device = "/dev/sdkt" << --- the disk now shows only 1 path
recoveryGroup = "BB1RGL"
declusteredArray = "DA2"
state = "ok"
userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5"
userCondition = "normal"
nPaths = 1 active 3 total

Here is the result from the log file on GSS1 (grep e3d5s05 /var/adm/ras/mmfs.log.latest):

################## START LOG GSS1 #####################
0 result
################# END LOG GSS 1 #####################

Here is the result from the log file on GSS0 (grep e3d5s05 /var/adm/ras/mmfs.log.latest):

################# START LOG GSS 0 #####################
Thu Sep 14 16:35:01.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4673959648 length 4112 err 5.
Thu Sep 14 16:35:01.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing.
Thu Sep 14 16:35:01.787 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received).
Thu Sep 14 16:35:01.788 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error.
Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok.
Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5.
Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing.
Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok.
Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5.
Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing.
Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5.
Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received).
Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error.
Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok.
Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz From valdis.kletnieks at vt.edu Tue Oct 24 15:36:46 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 24 Oct 2017 10:36:46 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <16412.1508855806@turing-police.cc.vt.edu> On Tue, 24 Oct 2017 13:27:29 +0530, "Malahal R Naineni" said: > If you want to change multiple existing exports, you could use > undocumented option "--nfsnorestart" to mmnfs. This should add export > changes to NFS configuration but it won't restart nfs-ganesha service, so > you will not see immediate results of your changes in the running server. > Whenever you want your changes reflected, you could manually restart the > service using "mmces" command. I owe you a beverage of your choice if we ever are in the same place at the same time - the fact that Ganesha got restarted on all nodes at once thus preventing a rolling restart and avoiding service interruption was the single biggest Ganesha wart we've encountered. :) -------------- next part -------------- A non-text attachment was scrubbed... 
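Returning to the single-path pdisk question earlier in this digest: one quick way to spot pdisks that have lost paths is to scan the mmlspdisk output for entries whose active path count is lower than expected. A hedged sketch, assuming (as in the mmlspdisk output shown above) that a healthy pdisk on this system reports at least two active paths and that the "nPaths = <active> active <total> total" line format is unchanged:

    # list pdisks reporting fewer than 2 active paths
    mmlspdisk all | awk '
        $1 == "name"   { pd = $3; gsub(/"/, "", pd) }     # remember the current pdisk name
        $1 == "nPaths" && $3 < 2 {                        # $3 = active paths, $5 = total paths
            printf("pdisk %s: %s of %s paths active\n", pd, $3, $5)
        }'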
Name: not available
Type: application/pgp-signature
Size: 486 bytes
Desc: not available
URL: 

From UWEFALKE at de.ibm.com Tue Oct 24 17:49:19 2017
From: UWEFALKE at de.ibm.com (Uwe Falke)
Date: Tue, 24 Oct 2017 18:49:19 +0200
Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes
Message-ID: 

Hi,
I am about to run nsdperf for testing the IB fabric in a new system comprising ESS (BE) and Intel-based nodes.
nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one test using RDMA:

client    server    RDMA
x86-64    ppc-64    on      crash
ppc-64    x86-64    on      crash
x86-64    ppc-64    off     success
x86-64    x86-64    on      success
ppc-64    ppc-64    on      success

That implies that the nsdperf RDMA test might struggle with BE vs LE. However, I learned from a talk given at a GPFS workshop in Germany in 2015 that RDMA works between Power-BE and Intel boxes. Has anyone made similar or contrary experiences? Is it an nsdperf issue or more general (I have not yet attempted any GPFS mount)?

Mit freundlichen Grüßen / Kind regards

Dr. Uwe Falke
IT Specialist
High Performance Computing Services / Integrated Technology Services / Data Center Services
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
Rathausstr. 7
09111 Chemnitz
Phone: +49 371 6978 2165
Mobile: +49 175 575 2877
E-Mail: uwefalke at de.ibm.com
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Business & Technology Services GmbH / Geschäftsführung: Thomas Wolter, Sven Schooß
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122

From olaf.weiser at de.ibm.com Tue Oct 24 20:31:06 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Tue, 24 Oct 2017 21:31:06 +0200
Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes
In-Reply-To: 
References: 
Message-ID: 

An HTML attachment was scrubbed...
URL: 

From sdenham at gmail.com Tue Oct 24 21:35:40 2017
From: sdenham at gmail.com (Scott D)
Date: Tue, 24 Oct 2017 15:35:40 -0500
Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes
In-Reply-To: 
References: 
Message-ID: 

I have run nsdperf with RDMA enabled against an ESS ppc-64 server without any problems. I don't have access to that system at the moment, and it was running a fairly old (4.1.x) version of GPFS, not that that should matter for nsdperf unless that source code has changed since 4.1.

Scott Denham
Staff Engineer
Cray, Inc

On Tue, Oct 24, 2017 at 11:49 AM, Uwe Falke wrote:

> Hi,
> I am about to run nsdperf for testing the IB fabric in a new system
> comprising ESS (BE) and Intel-based nodes.
> nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one
> test using RDMA:
>
> client server RDMA
> x86-64 ppc-64 on crash
> ppc-64 x86-64 on crash
> x86-64 ppc-64 off success
> x86-64 x86-64 on success
> ppc-64 ppc-64 on success
>
> That implies that the nsdperf RDMA test might struggle with BE vs LE.
> However, I learned from a talk given at a GPFS workshop in Germany in 2015
> that RDMA works between Power-BE and Intel boxes. Has anyone made similar
> or contrary experiences? Is it an nsdperf issue or more general (I have
> not yet attempted any GPFS mount)?
>
>
>
> Mit freundlichen Grüßen / Kind regards
>
>
> Dr. 
Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Thomas Wolter, Sven Schoo? > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Wed Oct 25 09:52:29 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 25 Oct 2017 10:52:29 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes In-Reply-To: References: Message-ID: Hi, Scott, thanks, good to hear that it worked for you. I can at least confirm that GPFS RDMA itself does work between x86-64 clients the ESS here, it appears just nsdperf has an issue in my particular environment. I'll see what IBM support can do for me as Olaf suggested. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Scott D To: gpfsug main discussion list Date: 10/24/2017 10:35 PM Subject: Re: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org I have run nsdperf with RDMA enabled against an ESS ppc-64 server without any problems. I don't have access to that system at the moment, and it was running a fairly old (4.1.x) version of GPFS, not that that should matter for nsdperf unless that source code has changed since 4.1. Scott Denham Staff Engineer Cray, Inc On Tue, Oct 24, 2017 at 11:49 AM, Uwe Falke wrote: Hi, I am about to run nsdperf for testing the IB fabric in a new system comprising ESS (BE) and Intel-based nodes. nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one test using RDMA: client server RDMA x86-64 ppc-64 on crash ppc-64 x86-64 on crash x86-64 ppc-64 off success x86-64 x86-64 on success ppc-64 ppc-64 on success That implies that the nsdperf RDMA test might struggle with BE vs LE. However, I learned from a talk given at a GPFS workshop in Germany in 2015 that RDMA works between Power-BE and Intel boxes. Has anyone made similar or contrary experiences? Is it an nsdperf issue or more general (I have not yet attempted any GPFS mount)? 
Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Tomasz.Wolski at ts.fujitsu.com Wed Oct 25 10:42:02 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 25 Oct 2017 09:42:02 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: <237580bb78cf4d9291c057926c90c265@R01UKEXCASM223.r01.fujitsu.local> Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" > To: gpfsug main discussion list > Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. 
There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. 
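As a practical note on the indicator described above: the reconnect entries can be searched for directly in the GPFS logs. A minimal sketch, using the same log path that appears in the GSS log excerpts earlier in this digest (adjust the path to your installation; the mmdsh line assumes that utility is available):

    # look for reconnect events on the local node
    grep "Reconnected to" /var/adm/ras/mmfs.log.latest

    # or check every node in the cluster
    mmdsh -N all 'grep "Reconnected to" /var/adm/ras/mmfs.log.latest'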
Bob Oesterlin
Sr Principal Storage Engineer, Nuance Storage

IBM My Notifications
Check out the IBM Electronic Support

IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption

IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From knop at us.ibm.com Wed Oct 25 14:09:27 2017
From: knop at us.ibm.com (Felipe Knop)
Date: Wed, 25 Oct 2017 09:09:27 -0400
Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)
In-Reply-To: 
References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112>
Message-ID: 

Tomasz,

The fix (APAR IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF.

Regards,

Felipe

----
Felipe Knop knop at us.ibm.com
GPFS Development and Security
IBM Systems
IBM Building 008
2455 South Rd, Poughkeepsie, NY 12601
(845) 433-9314 T/L 293-9314

From: "Tomasz.Wolski at ts.fujitsu.com"
To: IBM Spectrum Scale , gpfsug main discussion list
Cc: "gpfsug-discuss-bounces at spectrumscale.org"
Date: 10/25/2017 05:42 AM
Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Thank you for the information.

I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398:
https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555
This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. 
On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. 
Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Wed Oct 25 14:33:46 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 25 Oct 2017 13:33:46 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi Felipe On a related note, would everything in 4.2.3-4 efix2 be included in 4.2.3-5? In particular the previously discussed around defect 1020461? There was no APAR for this defect when I last looked on 30th August. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Felipe Knop Sent: 25 October 2017 14:09 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" > To: gpfsug main discussion list > Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? 
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Oct 25 16:23:42 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 25 Oct 2017 11:23:42 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Richard, I see that 4.2.3-4 efix2 has two defects, 1032655 (IV99796) and 1020461 (IV99675), and both these fixes are included in 4.2.3.5 . 
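For anyone tracking these levels, a small sketch of how one might confirm what is actually installed before and after moving to 4.2.3.5 (command availability and output format are assumptions based on common setups, so verify against your own cluster):

    # show the running GPFS build on the local node
    mmdiag --version

    # on RPM-based nodes, the installed package levels
    rpm -qa | grep '^gpfs'

    # compare levels across the cluster, assuming mmdsh is available
    mmdsh -N all mmdiag --version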
Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Sobey, Richard A" To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 09:34 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Felipe On a related note, would everything in 4.2.3-4 efix2 be included in 4.2.3-5? In particular the previously discussed around defect 1020461? There was no APAR for this defect when I last looked on 30th August. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Felipe Knop Sent: 25 October 2017 14:09 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. 
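For sites that want to check whether the disconnect/reconnect path has ever been exercised on their cluster, a minimal sketch along the following lines may help; the log message text is the one quoted earlier in this thread, mmdsh is the standard Scale remote-shell wrapper, and log rotation may mean older events only appear in the archived mmfs.log files:

   # Reconnect events on the local node
   grep "Reconnected to" /var/adm/ras/mmfs.log.latest

   # Sweep all nodes in the cluster (assumes passwordless remote shell between nodes)
   /usr/lpp/mmfs/bin/mmdsh -N all 'grep -H "Reconnected to" /var/adm/ras/mmfs.log.latest' 2>/dev/null

   # Also look for the assert named in the flash
   /usr/lpp/mmfs/bin/mmdsh -N all 'grep -H "Request and queue size mismatch" /var/adm/ras/mmfs.log.latest' 2>/dev/null

As described above, a reconnect entry on its own does not mean data was corrupted, but it does show the cluster has hit the code path in question, which makes applying the efix more pressing.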
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org[ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? 
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=dhKhKiNBptpaDmggHSa8diP48O90VK2uzr-xo9C44uI&s=SCeTu6NeyjHm9D8S4VZVUnrALgCvNksAYTF9rfwD50g&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Wed Oct 25 17:17:09 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 25 Oct 2017 18:17:09 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Message-ID: Dear all, through some gpfsperf tests against an ESS block (config as is) I am seeing lots of waiters like NSDThread: on ThCond 0x3FFA800670A0 (FreePTrackCondvar), reason 'wait for free PTrack' That is not on file creation but on writing to an already existing file. what ressource is the system short of here? IMHO it cannot be physical data tracks on pdisks (the test does not allocate any space, just rewrites an existing file)? The only shortage in threads i could see might be Total server worker threads: running 3042, desired 3072, forNSD 2, forGNR 3070, nsdBigBufferSize 16777216 nsdMultiQueue: 512, nsdMultiQueueType: 1, nsdMinWorkerThreads: 3072, nsdMaxWorkerThreads: 3072 where a difference of 30 is between desired and running number of worker threads (but that is only 1% and 30 more would not necessarily make a big difference). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From vanfalen at mx1.ibm.com Wed Oct 25 22:26:50 2017 From: vanfalen at mx1.ibm.com (Emmanuel Barajas Gonzalez) Date: Wed, 25 Oct 2017 21:26:50 +0000 Subject: [gpfsug-discuss] Fileset quotas enforcement Message-ID: An HTML attachment was scrubbed... URL: From pinto at scinet.utoronto.ca Wed Oct 25 23:18:29 2017 From: pinto at scinet.utoronto.ca (Jaime Pinto) Date: Wed, 25 Oct 2017 18:18:29 -0400 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: <20171025181829.90173xxmr17nklo5@support.scinet.utoronto.ca> Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? 
mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES http://www.scinethpc.ca/testimonials ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. From rohwedder at de.ibm.com Thu Oct 26 08:18:46 2017 From: rohwedder at de.ibm.com (Markus Rohwedder) Date: Thu, 26 Oct 2017 09:18:46 +0200 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: Please also note that when you write as root, you are not restricted by the quota limits. See example: Write as non root user and run into hard limit: [mr at home-11 limited]$ dd if=/dev/urandom of=testfile2 bs=1000 count=100000 dd: error writing ?testfile2?: Disk quota exceeded 29885+0 records in 29884+0 records out 29884000 bytes (30 MB) copied, 3.38491 s, 8.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 322176 214784 322304 128 7 days | 5 0 0 0 none Now write as root user and exceed hard limit: [root at home-11 limited]# dd if=/dev/urandom of=testfile2 bs=1000 count=100000 100000+0 records in 100000+0 records out 100000000 bytes (100 MB) copied, 12.8355 s, 7.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 390656 214784 322304 8939520 7 days | 5 0 0 40 none Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development Phone: +49 7034 6430190 IBM Deutschland E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Martina K?deritz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Jaime Pinto" To: "gpfsug main discussion list" , "Emmanuel Barajas Gonzalez" Cc: gpfsug-discuss at spectrumscale.org Date: 10/26/2017 12:18 AM Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. 
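To make the suggested sequence concrete, a rough sketch using the file system name from the example above (mmcheckquota can take a while on large file systems, so plan the run accordingly):

   # Reconcile the quota accounting for the file system
   mmcheckquota smfslv0

   # Then re-check the fileset limits; the in_doubt values should drop back
   # toward zero once the reconciliation has completed
   mmrepquota -j smfslv0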
Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES https://urldefense.proofpoint.com/v2/url?u=http-3A__www.scinethpc.ca_testimonials&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=mno73q7nVmqF-YTzo7pAF50B_45epo6ZzccTAwcroRU&e= ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=F6chd4olb-U3RlL5pcqBPagIccRSJd963K0FFS7auko&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 18932891.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From TOMP at il.ibm.com Thu Oct 26 10:09:56 2017 From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 26 Oct 2017 12:09:56 +0300 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: And this behavior can be changed using the enforceFilesetQuotaOnRoot options ( check mmchconfig man page) Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Markus Rohwedder" To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 26/10/2017 10:18 Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Please also note that when you write as root, you are not restricted by the quota limits. See example: Write as non root user and run into hard limit: [mr at home-11 limited]$ dd if=/dev/urandom of=testfile2 bs=1000 count=100000 dd: error writing ?testfile2?: Disk quota exceeded 29885+0 records in 29884+0 records out 29884000 bytes (30 MB) copied, 3.38491 s, 8.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 322176 214784 322304 128 7 days | 5 0 0 0 none Now write as root user and exceed hard limit: [root at home-11 limited]# dd if=/dev/urandom of=testfile2 bs=1000 count=100000 100000+0 records in 100000+0 records out 100000000 bytes (100 MB) copied, 12.8355 s, 7.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 390656 214784 322304 8939520 7 days | 5 0 0 40 none Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development Phone: +49 7034 6430190 IBM Deutschland E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Martina K?deritz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 "Jaime Pinto" ---10/26/2017 12:18:45 AM---Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the l From: "Jaime Pinto" To: "gpfsug main discussion list" , "Emmanuel Barajas Gonzalez" Cc: gpfsug-discuss at spectrumscale.org Date: 10/26/2017 12:18 AM Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). 
How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES https://urldefense.proofpoint.com/v2/url?u=http-3A__www.scinethpc.ca_testimonials&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=mno73q7nVmqF-YTzo7pAF50B_45epo6ZzccTAwcroRU&e= ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=F6chd4olb-U3RlL5pcqBPagIccRSJd963K0FFS7auko&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=RxLph-CHLj5Iq5-RYe9eqHId7vsI_uuX4W-Y145ETD8&s=3cgWIXnSFvb65_5JkJDygm3hnSOeeCfYnDnPJdX-hWY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Thu Oct 26 10:16:20 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 26 Oct 2017 09:16:20 +0000 Subject: [gpfsug-discuss] Windows [10] and Spectrum Scale Message-ID: Hi all In the FAQ I note that Windows 10 is not supported at all, and neither is encryption on Windows nodes generally. However the context here is Spectrum Scale v4. Can I take it to mean that this also applies to Scale 4.1/4.2/...? Thanks Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From vanfalen at mx1.ibm.com Thu Oct 26 14:50:05 2017 From: vanfalen at mx1.ibm.com (Emmanuel Barajas Gonzalez) Date: Thu, 26 Oct 2017 13:50:05 +0000 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46716A4E467141C003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4642530E4641FB0003256C7C22581C5.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E463FD50E463FAF8003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46402D0E4640078003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4641128E4640ED0003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46416A8E4641450003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4644278E4643FF0003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E460E9D8E466F160003256C7C22581C5.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Thu Oct 26 18:03:58 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 26 Oct 2017 17:03:58 +0000 Subject: [gpfsug-discuss] Gartner 2017 - Distributed File systems and Object Storage Message-ID: Interesting read: https://www.gartner.com/doc/reprints?id=1-4IE870C&ct=171017&st=sb Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Christian.Fey at sva.de Fri Oct 27 07:30:31 2017 From: Christian.Fey at sva.de (Fey, Christian) Date: Fri, 27 Oct 2017 06:30:31 +0000 Subject: [gpfsug-discuss] how to deal with custom samba options in ces Message-ID: <2ee0987ac21d409b948ae13fbe3d9e97@sva.de> Hi all, I'm just in the process of migration different samba clusters to ces and I recognized, that some clusters have options set like "strict locking = yes" and I'm not sure how to deal with this. From what I know, there is no "CES way" to set every samba option. It would be possible to set with "net" commands I think but probably this will lead to an unsupported state. Anyone came through this? 
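As a starting point, the CES-managed view of the SMB configuration is available through the mmsmb command, and anything pushed into the registry behind its back is what tends to end up in unsupported territory. A quick, illustrative way to see what is currently exposed (check the mmsmb man page on your level for the exact subcommand syntax before relying on it):

   # Global Samba options currently managed by CES
   /usr/lpp/mmfs/bin/mmsmb config list

   # Per-export (share) options
   /usr/lpp/mmfs/bin/mmsmb export list

   # Read-only look at what is stored in the clustered registry
   /usr/lpp/mmfs/bin/net conf list

Whether a particular option such as "strict locking" can be set through mmsmb, or only via the registry, is exactly the supported-vs-unsupported question raised here, so it is worth confirming with IBM before changing it outside mmsmb.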
Mit freundlichen Gr??en / Best Regards Christian Fey SVA System Vertrieb Alexander GmbH Borsigstra?e 14 65205 Wiesbaden Tel.: +49 6122 536-0 Fax: +49 6122 536-399 E-Mail: christian.fey at sva.de http://www.sva.de Gesch?ftsf?hrung: Philipp Alexander, Sven Eichelbaum Sitz der Gesellschaft: Wiesbaden Registergericht: Amtsgericht Wiesbaden, HRB 10315 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From sannaik2 at in.ibm.com Fri Oct 27 08:06:50 2017 From: sannaik2 at in.ibm.com (Sandeep Naik1) Date: Fri, 27 Oct 2017 12:36:50 +0530 Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk In-Reply-To: References: Message-ID: Hi Atmane, The missing path from old mmlspdisk (/dev/sdob) and the log file (/dev/sdge) do not match. This may be because server was booted after the old mmlspdisk was taken. The path name are not guarantied across reboot. The log is reporting problem with /dev/sdge. You should check if OS can see path /dev/sdge (use lsscsi). If the disk is accessible from other path than I don't believe it is problem with the disk. Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: atmane khiredine To: "gpfsug-discuss at spectrumscale.org" Date: 24/10/2017 02:50 PM Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear All we owning a solution for our HPC a GSS gpfs ??storage server native raid I noticed 3 days ago that a disk shows a single path my configuration is as follows GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions if I search with fdisk I have the following result 476 disk in GSS0 and GSS1 with an old file cat mmlspdisk.old ##### replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt,/dev/sdob" << - recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 2 activates 4 total << - while the disk contains the 2 paths ##### ls /dev/sdob /Dev/ sdob ls /dev/sdkt /Dev/sdkt mmlspdisk all >> mmlspdisk.log vi mmlspdisk.log replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt" << --- the disk contains 1 path recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 1 active 3 total here is the result of the log file in GSS1 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################## START LOG GSS1 ##################### 0 result ################# END LOG GSS 1 ##################### here is the result of the log file in GSS0 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################# START LOG GSS 0 ##################### Thu Sep 14 16:35:01.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4673959648 length 4112 err 5. Thu Sep 14 16:35:01.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 16:35:01.787 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 16:35:01.788 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. 
Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5. Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5. Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5. Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). 
Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? 
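Before calling it one way or the other, it is worth confirming at the OS level whether the second path is visible at all; a rough sketch of the kind of checks involved, using the device names from this thread (note that sd* names are not stable across reboots, so the old /dev/sdob may legitimately have become another name):

   # Does the OS still see the block devices in question?
   lsscsi | grep -E "sdkt|sdge|sdob"
   ls -l /dev/sdkt /dev/sdge /dev/sdob

   # What does GNR currently report for this pdisk?
   mmlspdisk all | grep -A 8 'name = "e3d5s05"'

   # How often has the suspect path logged errors recently?
   grep -c "e3d5s05.*sdge" /var/adm/ras/mmfs.log.latest

If the OS sees both devices but GNR keeps marking one path in error, the pattern of SCSI aborts in the log above points more toward a cable/expander/HBA path problem than toward the drive itself, but that is exactly what IBM support would need the logs to confirm.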
thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=DXkezTwrVXsEOfvoqY7_DLS86P5FtQszjm9zok6upRU&m=QsMCUxg_qSYCs6Joccb2Brey1phAF_tJFrEnVD6LNoc&s=eSulhfhE2jQnmMrmb9_eoomafxb5xI3KL5Y6n3rH5CE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Fri Oct 27 20:48:08 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 27 Oct 2017 19:48:08 +0000 Subject: [gpfsug-discuss] how to deal with custom samba options in ces In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From johnbent at gmail.com Sat Oct 28 05:15:59 2017 From: johnbent at gmail.com (John Bent) Date: Fri, 27 Oct 2017 22:15:59 -0600 Subject: [gpfsug-discuss] Announcing IO-500 and soliciting submissions Message-ID: Hello GPFS community, After BoFs at last year's SC and the last two ISC's, the IO-500 is formalized and is now accepting submissions in preparation for our first IO-500 list at this year's SC BoF: http://sc17.supercomputing.org/presentation/?id=bof108&sess=sess319 The goal of the IO-500 is simple: to improve parallel file systems by ensuring that sites publish results of both "hero" and "anti-hero" runs and by sharing the tuning and configuration they applied to achieve those results. After receiving feedback from a few trial users, the framework is significantly improved: > git clone https://github.com/VI4IO/io-500-dev > cd io-500-dev > ./utilities/prepare.sh > ./io500.sh > # tune and rerun > # email results to submit at io500.org This, perhaps with a bit of tweaking and please consult our 'doc' directory for troubleshooting, should get a very small toy problem up and running quickly. It then does become a bit challenging to tune the problem size as well as the underlying file system configuration (e.g. striping parameters) to get a valid, and impressive, result. The basic format of the benchmark is to run both a "hero" and "antihero" IOR test as well as a "hero" and "antihero" mdtest. The write/create phase of these tests must last for at least five minutes to ensure that the test is not measuring cache speeds. One of the more challenging aspects is that there is a requirement to search through the metadata of the files that this benchmark creates. Currently we provide a simple serial version of this test (i.e. the GNU find command) as well as a simple python MPI parallel tree walking program. Even with the MPI program, the find can take an extremely long amount of time to finish. You are encouraged to replace these provided tools with anything of your own devise that satisfies the required functionality. This is one area where we particularly hope to foster innovation as we have heard from many file system admins that metadata search in current parallel file systems can be painfully slow. Now is your chance to show the community just how awesome we all know GPFS to be. We are excited to introduce this benchmark and foster this community. We hope you give the benchmark a try and join our community if you haven't already. 
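As a purely illustrative example of the kind of replacement being invited for the serial find phase, the scan can be fanned out over top-level directories with standard tools; the directory variable below is a placeholder and the real matching rules are defined by the io-500 harness, so any actual replacement has to follow the documented requirements:

   # Hypothetical sketch: one find per top-level directory, 16 in parallel,
   # instead of a single serial find over the whole benchmark tree
   DATADIR=/path/to/io500/datafiles   # placeholder, not the harness variable
   ls -d "$DATADIR"/*/ | xargs -P 16 -I{} find {} -type f | wc -l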
Please let us know right away in any of our various communications channels (as described in our documentation) if you encounter any problems with the benchmark or have questions about tuning or have suggestions for others. We hope to see your results in email and to see you in person at the SC BoF. Thanks, IO 500 Committee John Bent, Julian Kunkle, Jay Lofstead -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.khiredine at meteo.dz Sat Oct 28 08:29:49 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Sat, 28 Oct 2017 07:29:49 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 69, Issue 54 In-Reply-To: References: Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6340D83B@SDEB-EXC02.meteo.dz> dear Sandeep Naik, Thank you for that answer the OS can see all the path but gss sees only one path for one disk lssci indicates that I have 238 disk 6 SSD and 232 HDD but the gss indicates that it sees only one path with the cmd mmlspdisk all I think it's a disk problem but he sees it with another path if these a problem of SAS cable logically all the disk connect with the cable shows a single path Do you have any ideas ?? GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz ________________________________________ De : gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] de la part de gpfsug-discuss-request at spectrumscale.org [gpfsug-discuss-request at spectrumscale.org] Envoy? : vendredi 27 octobre 2017 08:06 ? : gpfsug-discuss at spectrumscale.org Objet : gpfsug-discuss Digest, Vol 69, Issue 54 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Gartner 2017 - Distributed File systems and Object Storage (Oesterlin, Robert) 2. how to deal with custom samba options in ces (Fey, Christian) 3. Re: GSS GPFS Storage Server show one path for one Disk (Sandeep Naik1) ---------------------------------------------------------------------- Message: 1 Date: Thu, 26 Oct 2017 17:03:58 +0000 From: "Oesterlin, Robert" To: gpfsug main discussion list Subject: [gpfsug-discuss] Gartner 2017 - Distributed File systems and Object Storage Message-ID: Content-Type: text/plain; charset="utf-8" Interesting read: https://www.gartner.com/doc/reprints?id=1-4IE870C&ct=171017&st=sb Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ Message: 2 Date: Fri, 27 Oct 2017 06:30:31 +0000 From: "Fey, Christian" To: gpfsug main discussion list Subject: [gpfsug-discuss] how to deal with custom samba options in ces Message-ID: <2ee0987ac21d409b948ae13fbe3d9e97 at sva.de> Content-Type: text/plain; charset="iso-8859-1" Hi all, I'm just in the process of migration different samba clusters to ces and I recognized, that some clusters have options set like "strict locking = yes" and I'm not sure how to deal with this. From what I know, there is no "CES way" to set every samba option. It would be possible to set with "net" commands I think but probably this will lead to an unsupported state. Anyone came through this? Mit freundlichen Gr??en / Best Regards Christian Fey SVA System Vertrieb Alexander GmbH Borsigstra?e 14 65205 Wiesbaden Tel.: +49 6122 536-0 Fax: +49 6122 536-399 E-Mail: christian.fey at sva.de http://www.sva.de Gesch?ftsf?hrung: Philipp Alexander, Sven Eichelbaum Sitz der Gesellschaft: Wiesbaden Registergericht: Amtsgericht Wiesbaden, HRB 10315 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: ------------------------------ Message: 3 Date: Fri, 27 Oct 2017 12:36:50 +0530 From: "Sandeep Naik1" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Message-ID: Content-Type: text/plain; charset="utf-8" Hi Atmane, The missing path from old mmlspdisk (/dev/sdob) and the log file (/dev/sdge) do not match. This may be because server was booted after the old mmlspdisk was taken. The path name are not guarantied across reboot. The log is reporting problem with /dev/sdge. You should check if OS can see path /dev/sdge (use lsscsi). If the disk is accessible from other path than I don't believe it is problem with the disk. 
Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: atmane khiredine To: "gpfsug-discuss at spectrumscale.org" Date: 24/10/2017 02:50 PM Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear All we owning a solution for our HPC a GSS gpfs ??storage server native raid I noticed 3 days ago that a disk shows a single path my configuration is as follows GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions if I search with fdisk I have the following result 476 disk in GSS0 and GSS1 with an old file cat mmlspdisk.old ##### replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt,/dev/sdob" << - recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 2 activates 4 total << - while the disk contains the 2 paths ##### ls /dev/sdob /Dev/ sdob ls /dev/sdkt /Dev/sdkt mmlspdisk all >> mmlspdisk.log vi mmlspdisk.log replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt" << --- the disk contains 1 path recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 1 active 3 total here is the result of the log file in GSS1 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################## START LOG GSS1 ##################### 0 result ################# END LOG GSS 1 ##################### here is the result of the log file in GSS0 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################# START LOG GSS 0 ##################### Thu Sep 14 16:35:01.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4673959648 length 4112 err 5. Thu Sep 14 16:35:01.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 16:35:01.787 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 16:35:01.788 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5. Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5. Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5. Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=DXkezTwrVXsEOfvoqY7_DLS86P5FtQszjm9zok6upRU&m=QsMCUxg_qSYCs6Joccb2Brey1phAF_tJFrEnVD6LNoc&s=eSulhfhE2jQnmMrmb9_eoomafxb5xI3KL5Y6n3rH5CE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 69, Issue 54 ********************************************** From r.sobey at imperial.ac.uk Mon Oct 30 15:32:10 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 30 Oct 2017 15:32:10 +0000 Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 Message-ID: All, Since upgrading to Windows 10 build 1709 aka Autumn Creator's Update our Previous Versions is wonky... as in you just get a flat list of previous versions^^snapshots with no date associated with them. I am not able to test this on a server with the latest Samba release so at the moment I'm stuck reporting that GPFS 4.2.3-4 with smb 4.5.10 doesn't play nicely with Windows 10 1709. 
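In case it helps anyone compare against their own setup: the snapshots that feed Previous Versions can be listed from the cluster side to check that the @GMT-style names and creation times are all present (as far as I understand, that naming is what the SMB service keys on). Roughly, with a file system and fileset name that are only examples:

mmlssnapshot gpfs0 -j projects    # list the snapshots backing the share's fileset

If that looks sane, the flat, date-less list presumably comes from how the 1709 client interprets what the server returns rather than from the snapshots themselves. 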
Screenshot is attached for an example. Can anyone corroborate my findings? Thanks Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: prv-ver.png Type: image/png Size: 16452 bytes Desc: prv-ver.png URL: From christof.schmitt at us.ibm.com Mon Oct 30 20:25:26 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 30 Oct 2017 20:25:26 +0000 Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From peter.smith at framestore.com Tue Oct 31 13:10:47 2017 From: peter.smith at framestore.com (Peter Smith) Date: Tue, 31 Oct 2017 13:10:47 +0000 Subject: [gpfsug-discuss] FreeBSD client? Message-ID: Hi Does such a thing exist? :-) TIA -- [image: Framestore] Peter Smith ? Senior Systems Engineer London ? New York ? Los Angeles ? Chicago ? Montr?al T +44 (0)20 7344 8000 ? M +44 (0)7816 123009 <+44%20%280%297816%20123009> 19-23 Wells Street, London W1T 3PQ Twitter ? Facebook ? framestore.com [image: https://www.framestore.com/] -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Oct 31 14:20:27 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 31 Oct 2017 14:20:27 +0000 Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 In-Reply-To: References: Message-ID: Thanks Christof, will do. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: 30 October 2017 20:25 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 Richard, in a quick test with Windows 10 Pro 1709 connecting to gpfs.smb 4.5.10_gpfs_21 i do not see the problem from the screenshot. All files reported in "Previous Versions" have a date associated. For debugging the problem on your system, i would suggest to enable traces and recreate the problem. Replace the x.x.x.x with the IP address of the Windows 10 client: mmprotocoltrace start network -c x.x.x.x mmprotocoltrace start smb -c x.x.x.x (open the "Previous Versions" dialog) mmprotocoltrace stop smb mmprotocoltrace stop network The best way to track the analysis would be through a PMR. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: "Sobey, Richard A" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "'gpfsug-discuss at spectrumscale.org'" Cc: Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 Date: Mon, Oct 30, 2017 8:32 AM All, Since upgrading to Windows 10 build 1709 aka Autumn Creator?s Update our Previous Versions is wonky? as in you just get a flat list of previous versions^^snapshots with no date associated with them. I am not able to test this on a server with the latest Samba release so at the moment I?m stuck reporting that GPFS 4.2.3-4 with smb 4.5.10 doesn?t play nicely with Windows 10 1709. Screenshot is attached for an example. Can anyone corroborate my findings? 
Thanks Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=Bfd_a1yscUVzXzIRuwarah8UedH7U1Uln5AFFPQayR4&s=URMLuAJbrlEOj4xt3_7_Cm0Rj9DfFovuEUOGc4zQUUY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar2 at u.washington.edu Tue Oct 31 14:41:58 2017 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Tue, 31 Oct 2017 07:41:58 -0700 Subject: [gpfsug-discuss] FreeBSD client? In-Reply-To: References: Message-ID: <20171031144158.GC17659@illiuin> I doubt it, since IBM would need to tailor a kernel layer for FreeBSD (not the kind of thing you can run with the x86 Linux userspace emulation in FreeBSD), which would be a lot of work for not a lot of demand. On Tue, Oct 31, 2017 at 01:10:47PM +0000, Peter Smith wrote: > Hi > > Does such a thing exist? :-) -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine 
From Robert.Oesterlin at nuance.com Mon Oct 2 15:23:25 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 2 Oct 2017 14:23:25 +0000 Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 In-Reply-To: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> References: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> Message-ID: <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> I?d agree with Bryan here ? Depending on me to passively check this for new material isn?t a great mechanism. If it?s on the mailing list, I will probably see it in a timely manner. On a related note, IBM has too many places to look for ?timely? Scale information/tips/how-to?s. It really should be in central repository and then use links to this location. I?m not sure I have a great idea here, but I have a list of 6-8 placed bookmarked that I go looking for articles. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of Bryan Banister Reply-To: gpfsug main discussion list Date: Monday, October 2, 2017 at 9:11 AM To: gpfsug main discussion list Cc: Theodore Hoover Jr , Doris Conti Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 Thanks for posting this Sandeep! As Doris mentioned during the Spectrum Scale User Group meeting, IBM is doing a lot of good work to put this data out there, and many of us didn?t know that this site was an available resource. Bookmarking is good, but there unfortunately is not a way to ?watch? the space to get automatic notifications when new posts are available. I request that IBM announce these posts to this Spectrum Scale User Group list going forward to get the most impact from these offerings. Or post them to a space that can be watched so that we can get automatic notifications. Thanks again, -Bryan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ulmer at ulmer.org Mon Oct 2 15:31:32 2017 From: ulmer at ulmer.org (Stephen Ulmer) Date: Mon, 2 Oct 2017 10:31:32 -0400 Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 In-Reply-To: <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> References: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> Message-ID: <8A33571E-905B-41D8-A934-C984A90EF6F9@ulmer.org> I?ve been told in the past that the Spectrum Scale Wiki is the place to watch for the most timely information, and there is a way to "follow" the wiki so you get notified of updates. That being said, I?ve not gotten "following" it to work yet so I don?t know what that actually *means*. I?d love to get a daily digest of all of the changes to that Wiki ? or even just a URL I would watch with IFTTT that would actually show me links to all of the updates. -- Stephen > On Oct 2, 2017, at 10:23 AM, Oesterlin, Robert > wrote: > > I?d agree with Bryan here ? Depending on me to passively check this for new material isn?t a great mechanism. If it?s on the mailing list, I will probably see it in a timely manner. > > On a related note, IBM has too many places to look for ?timely? Scale information/tips/how-to?s. It really should be in central repository and then use links to this location. I?m not sure I have a great idea here, but I have a list of 6-8 placed bookmarked that I go looking for articles. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > 507-269-0413 > > > From: > on behalf of Bryan Banister > > Reply-To: gpfsug main discussion list > > Date: Monday, October 2, 2017 at 9:11 AM > To: gpfsug main discussion list > > Cc: Theodore Hoover Jr >, Doris Conti > > Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 > > Thanks for posting this Sandeep! > > As Doris mentioned during the Spectrum Scale User Group meeting, IBM is doing a lot of good work to put this data out there, and many of us didn?t know that this site was an available resource. > > Bookmarking is good, but there unfortunately is not a way to ?watch? the space to get automatic notifications when new posts are available. I request that IBM announce these posts to this Spectrum Scale User Group list going forward to get the most impact from these offerings. Or post them to a space that can be watched so that we can get automatic notifications. > > Thanks again, > -Bryan > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Mon Oct 2 18:12:37 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 2 Oct 2017 17:12:37 +0000 Subject: [gpfsug-discuss] number of SMBD processes In-Reply-To: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> References: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Image.image001.gif at 01D33B90.D2CAECC0.gif Type: image/gif Size: 6431 bytes Desc: not available URL: From ckerner at illinois.edu Mon Oct 2 19:20:39 2017 From: ckerner at illinois.edu (Chad Kerner) Date: Mon, 2 Oct 2017 13:20:39 -0500 Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image Message-ID: Has anyone tried building the GPL layer on the latest CentOS 7.4 image with the 3.10.0-693.2.2.el7.x86_64 kernel? This is trying to build 4.2.0.4 on Centos 7.4-1708. protector -Wformat=0 -Wno-format-security -I/usr/lpp/mmfs/src/gpl-linux -c kdump.c cc kdump.o kdump-kern.o kdump-kern-dwarfs.o -o kdump -lpthread kdump-kern.o: In function `GetOffset': kdump-kern.c:(.text+0x9): undefined reference to `page_offset_base' kdump-kern.o: In function `KernInit': kdump-kern.c:(.text+0x58): undefined reference to `page_offset_base' collect2: error: ld returned 1 exit status make[1]: *** [modules] Error 1 make[1]: Leaving directory `/usr/lpp/mmfs/src/gpl-linux' make: *** [Modules] Error 1 -------------------------------------------------------- Thanks, Chad -- Chad Kerner, Senior Storage Engineer Storage Enabling Technologies National Center for Supercomputing Applications University of Illinois, Urbana-Champaign -------------- next part -------------- An HTML attachment was scrubbed... URL: From JRLang at uwyo.edu Mon Oct 2 20:31:59 2017 From: JRLang at uwyo.edu (Jeffrey R. Lang) Date: Mon, 2 Oct 2017 19:31:59 +0000 Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image In-Reply-To: References: Message-ID: Chad I asked this same question last week. The answer is to upgrade to Scpectrum 4.2.3.4 jeff From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Chad Kerner Sent: Monday, October 2, 2017 1:21 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image Has anyone tried building the GPL layer on the latest CentOS 7.4 image with the 3.10.0-693.2.2.el7.x86_64 kernel? This is trying to build 4.2.0.4 on Centos 7.4-1708. protector -Wformat=0 -Wno-format-security -I/usr/lpp/mmfs/src/gpl-linux -c kdump.c cc kdump.o kdump-kern.o kdump-kern-dwarfs.o -o kdump -lpthread kdump-kern.o: In function `GetOffset': kdump-kern.c:(.text+0x9): undefined reference to `page_offset_base' kdump-kern.o: In function `KernInit': kdump-kern.c:(.text+0x58): undefined reference to `page_offset_base' collect2: error: ld returned 1 exit status make[1]: *** [modules] Error 1 make[1]: Leaving directory `/usr/lpp/mmfs/src/gpl-linux' make: *** [Modules] Error 1 -------------------------------------------------------- Thanks, Chad -- Chad Kerner, Senior Storage Engineer Storage Enabling Technologies National Center for Supercomputing Applications University of Illinois, Urbana-Champaign -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Mon Oct 2 22:24:43 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 2 Oct 2017 14:24:43 -0700 Subject: [gpfsug-discuss] User Meeting & SPXXL in NYC In-Reply-To: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> References: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> Message-ID: Trying to get details on availability. More when I hear back. -Kristy > On Oct 2, 2017, at 7:13 AM, Bryan Banister wrote: > > Hi Kristy, > > I lost half of my notes from the meeting and need to recreate them while the memories are still somewhat fresh. 
Would you please get and post the slides from the Spectrum Scale User Group meeting as soon as possible? > > Thanks for any help here! > -Bryan > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org ] On Behalf Of Kristy Kallback-Rose > Sent: Thursday, September 21, 2017 1:49 PM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC > > Note: External Email > Registration space is getting tight. We decided on a room reconfiguration today to make a little more room. So if you tried to register and were told it was full try again. If it fills up again and you want to register, but can?t drop me an email and I?ll see what we can do. > > Best, > Kristy > > On Sep 20, 2017, at 9:00 AM, Kristy Kallback-Rose > wrote: > > Thanks Doug. > > If you plan to go, *do register*. GPFS Day is free, but we need to know how many will attend. Register using the link on the HPCXXL event page below. > > Cheers, > Kristy > > On Sep 20, 2017, at 1:28 AM, Douglas O'flaherty > wrote: > > > Reminder that the SPXXL day on IBM Spectrum Scale in New York is open to all. It is Thursday the 28th. There is also a Power day on Wednesday. > > > For more information > http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ > > Doug > > Mobile > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Mon Oct 2 22:26:57 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 2 Oct 2017 21:26:57 +0000 Subject: [gpfsug-discuss] User Meeting & SPXXL in NYC In-Reply-To: References: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> Message-ID: Kristy, Thanks for the quick response. I did reach out to Karthik about the File System Corruption (MMFSCK) presentation, which was really what I lost. I?m sure he?ll get me the presentation, so please don?t rush at this point on my account! 
Sorry for the fire drill, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Monday, October 02, 2017 4:25 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC Note: External Email ________________________________ Trying to get details on availability. More when I hear back. -Kristy On Oct 2, 2017, at 7:13 AM, Bryan Banister > wrote: Hi Kristy, I lost half of my notes from the meeting and need to recreate them while the memories are still somewhat fresh. Would you please get and post the slides from the Spectrum Scale User Group meeting as soon as possible? Thanks for any help here! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Thursday, September 21, 2017 1:49 PM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC Note: External Email ________________________________ Registration space is getting tight. We decided on a room reconfiguration today to make a little more room. So if you tried to register and were told it was full try again. If it fills up again and you want to register, but can?t drop me an email and I?ll see what we can do. Best, Kristy On Sep 20, 2017, at 9:00 AM, Kristy Kallback-Rose > wrote: Thanks Doug. If you plan to go, *do register*. GPFS Day is free, but we need to know how many will attend. Register using the link on the HPCXXL event page below. Cheers, Kristy On Sep 20, 2017, at 1:28 AM, Douglas O'flaherty > wrote: Reminder that the SPXXL day on IBM Spectrum Scale in New York is open to all. It is Thursday the 28th. There is also a Power day on Wednesday. For more information http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ Doug Mobile _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From leslie.james.elliott at gmail.com Tue Oct 3 12:32:56 2017 From: leslie.james.elliott at gmail.com (leslie elliott) Date: Tue, 3 Oct 2017 21:32:56 +1000 Subject: [gpfsug-discuss] transparent cloud tiering Message-ID: hi I am trying to change the account for the cloud tier but am having some problems any hints would be appreciated I am not interested in the data locally or migrated but do not seem to be able to recall this so would just like to repurpose it with the new account I can see in the logs 2017-10-03_15:38:49.226+1000: [W] Snapshot quiesce of SG cloud01 snap -1/0 doing 'mmcrsnapshot :MCST.scan.6' timed out on node . Retrying if possible. which is no doubt the reason for the following mmcloudgateway account delete --cloud-nodeclass TCTNodeClass --cloud-name gpfscloud1234 mmcloudgateway: Sending the command to the first successful node starting with gpfs-dev02 mmcloudgateway: This may take a while... mmcloudgateway: Error detected on node gpfs-dev02 The return code is 94. The error is: MCSTG00084E: Command Failed with following reason: Unable to create snapshot for file system /gpfs/itscloud01, [Ljava.lang.String;@3353303e failed with: com.ibm.gpfsconnector.messages.GpfsConnectorException: Command [/usr/lpp/mmfs/bin/mmcrsnapshot, cloud01, MCST.scan.4] failed with the following return code: 78.. mmcloudgateway: Sending the command to the next node gpfs-dev04 mmcloudgateway: Error detected on node gpfs-dev04 The return code is 94. The error is: MCSTG00084E: Command Failed with following reason: Unable to create snapshot for file system /gpfs/cloud01, [Ljava.lang.String;@90a887ad failed with: com.ibm.gpfsconnector.messages.GpfsConnectorException: Command [/usr/lpp/mmfs/bin/mmcrsnapshot, cloud01, MCST.scan.6] failed with the following return code: 78.. mmcloudgateway: Command failed. Examine previous error messages to determine cause. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Oct 3 12:57:21 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 3 Oct 2017 07:57:21 -0400 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f@nasa.gov> Hi All, At the SSUG in NY there was mention of operating systems as well as feature deprecations that would occur in the lifecycle of the next major release of GPFS. I'm not sure if this is public knowledge yet so I haven't mentioned specifics but given the proposed release time frame of the next major release I thought customers may appreciate having access to this information so they could provide feedback about the potential impact to their environment if these deprecations do occur. Any chance someone from IBM could provide specifics here so folks can chime in? 
-Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From j.ouwehand at vumc.nl Wed Oct 4 12:59:45 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Wed, 4 Oct 2017 11:59:45 +0000 Subject: [gpfsug-discuss] number of SMBD processes In-Reply-To: References: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> Message-ID: <5594921EA5B3674AB44AD9276126AAF40170E4185E@sp-mx-mbx4> Hello Christof, Thank you very much for the explanation. You have pointed us in the right direction. Vriendelijke groet, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [VUmc_logo_samen_kiezen_voor_beter] Van: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Christof Schmitt Verzonden: maandag 2 oktober 2017 19:13 Aan: gpfsug-discuss at spectrumscale.org CC: gpfsug-discuss at spectrumscale.org Onderwerp: Re: [gpfsug-discuss] number of SMBD processes Hello, the short answer is that the "deadtime" parameter is not a supported parameter in Spectrum Scale. The longer answer is that setting "deadtime" likely does not solve any issue. "deadtime" was introduced in Samba mainly for older protocol versions. While it is implemented independently of protocol versions, note the statement that a client must have "no open files" before its connection is considered inactive and can be closed. Spectrum Scale only supports SMB versions 2 and 3. Basically everything there is based on an open file handle. Most SMB 2/3 clients open at least the root directory of the export and register for change notifications there, and the client can then wait indefinitely for changes. That is a valid case, and the open directory handle prevents the connection from being affected by any setting of the "deadtime" parameter. Clients that are no longer active and have not properly closed the connection are detected on the TCP level: # mmsmb config list | grep sock socket options TCP_NODELAY SO_KEEPALIVE TCP_KEEPCNT=4 TCP_KEEPIDLE=240 TCP_KEEPINTVL=15 Every client that no longer responds for 5 minutes will have the connection dropped (240s + 4x15s). On the other hand, if the SMB clients are still responding to TCP keep-alive packets, then the connection is considered valid. It might be interesting to look into the unwanted connections and possibly capture a network trace or look into the client systems to better understand the situation. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: "Ouwehand, JJ" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" > Cc: Subject: [gpfsug-discuss] number of SMBD processes Date: Mon, Oct 2, 2017 6:35 AM Hello, Since we use the new ?IBM Spectrum Scale SMB CES? nodes, we see that the number of SMBD processes has increased significantly from ~ 4,000 to ~ 7,500. We also see that the SMBD processes are not closed. This is likely because the Samba global-parameter ?deadtime? is missing. ------------ https://www.samba.org/samba/docs/using_samba/ch11.html This global option sets the number of minutes that Samba will wait for an inactive client before closing its session with the Samba server. A client is considered inactive when it has no open files and no data is being sent from it. The default value for this option is 0, which means that Samba never closes any connection, regardless of how long they have been inactive. 
This can lead to unnecessary consumption of the server's resources by inactive clients. We recommend that you override the default as follows: [global] deadtime = 10 ------------ Is this Samba parameter ?deadtime? supported by IBM? Kindly regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [VUmc_logo_samen_kiezen_voor_beter] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=LCAKWPxQj5PMUf5YKTH3Z0zW9cDW--1AO_mljWE3ni8&s=y0FjQ5P-9Q7YjxyvuNNa4kdzHZKfrsjW81pGDLMNuig&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 6431 bytes Desc: image001.gif URL: From heiner.billich at psi.ch Wed Oct 4 18:26:03 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Wed, 4 Oct 2017 17:26:03 +0000 Subject: [gpfsug-discuss] AFM - prefetch of many small files - tuning - storage latency required to increase max socket buffer size ... Message-ID: <0A9C5A40-221C-46B5-B7E3-72A9D5A6D483@psi.ch> Hello, A while ago I asked the list for advice on how to tune AFM to speed-up the prefetch of small files (~1MB). In the meantime, we got some results which I want to share. We had to increase the maximum socket buffer sizes to very high values of 40-80MB. Consider that we use IP over Infiniband and the bandwidth-delay-product is about 5MB (1-10us latency). How do we explain this? The reads on the nfs server have a latency of about 20ms. This is physics of disks. Hence a single client can get up to 50 requests/s. Each request is 1MB. To get 1GB/s we need 20 clients in parallel. At all times we have about 20 requests pending. Looks like the server does allocate the socket buffer space before it asks for the data. Hence it allocates/blocks about 20MB at all times. Surprisingly it?s storage latency and not network latency that required us to increase the max. socket buffer size. For large files prefetch works and reduces the latency of reads drastically and no special tuning is required. We did test with kernel-nfs and gpfs 4.2.3 on RHEL7. Whether ganesha shows a similar pattern would be interesting to know. Once we fixed the nfs issues afm did show a nice parallel prefetch up to ~1GB/s with 1MB sized files without any tuning. Still much below the 4GB/s measured with iperf between the two nodes ?. Kind regards, Heiner -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From kkr at lbl.gov Wed Oct 4 22:44:10 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Wed, 4 Oct 2017 14:44:10 -0700 Subject: [gpfsug-discuss] NYC Meeting Presos (Workaround) Message-ID: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> Hi, I?m having some trouble getting links added to the SS/GPFS UG page, but I want to share the presos I have so far, a couple more are coming soon. 
So, as a workaround (as storage people we can appreciate workarounds, right?!), here are the links to the slides I have thus far: Spectrum Scale Object at CSCS: http://files.gpfsug.org/presentations/2017/NYC/Day4-ss_object_cscs.pdf File System Corruptions & Best Practices: http://files.gpfsug.org/presentations/2017/NYC/Day4-hpcxxl2017_fsck_karthik.pdf Spectrum Scale Cloud Enablement: http://files.gpfsug.org/presentations/2017/NYC/Day4-SpectrumScaleCloudEnablement%20Sept28.pdf IBM Spectrum Scale 4.2.3 Security Overview: http://files.gpfsug.org/presentations/2017/NYC/Day4-Spectrum%20Scale%20Security-%20User%20Group%20-%20NYCSept2017.pdf What?s New in Spectrum Scale: http://files.gpfsug.org/presentations/2017/NYC/Day4-NYC%20User%20Group%202017%20What's%20New%20v2.pdf Cheers, Kristy -------------- next part -------------- An HTML attachment was scrubbed... URL: From jez.tucker at gpfsug.org Thu Oct 5 11:11:53 2017 From: jez.tucker at gpfsug.org (Jez Tucker) Date: Thu, 5 Oct 2017 11:11:53 +0100 Subject: [gpfsug-discuss] NYC Meeting Presos (Workaround) In-Reply-To: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> References: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> Message-ID: *waves hands*? - I can help here if you have issues.? Same for anyone else. ping me 1::1 On 04/10/17 22:44, Kristy Kallback-Rose wrote: > Hi, > > I?m having some trouble getting links added to the SS/GPFS UG page, > but I want to share the presos I have so far, a couple more are coming > soon. So, as a workaround (as storage people we can appreciate > workarounds, right?!), here are the links to the slides I have thus far: > > Spectrum Scale Object at CSCS: > http://files.gpfsug.org/presentations/2017/NYC/Day4-ss_object_cscs.pdf > > File System Corruptions & Best Practices: > http://files.gpfsug.org/presentations/2017/NYC/Day4-hpcxxl2017_fsck_karthik.pdf > > Spectrum Scale Cloud Enablement: > http://files.gpfsug.org/presentations/2017/NYC/Day4-SpectrumScaleCloudEnablement%20Sept28.pdf > > IBM Spectrum Scale 4.2.3 Security Overview: > http://files.gpfsug.org/presentations/2017/NYC/Day4-Spectrum%20Scale%20Security-%20User%20Group%20-%20NYCSept2017.pdf > > What?s New in Spectrum Scale: > http://files.gpfsug.org/presentations/2017/NYC/Day4-NYC%20User%20Group%202017%20What's%20New%20v2.pdf > > > Cheers, > Kristy > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From listymclistfaces at gmail.com Fri Oct 6 13:56:04 2017 From: listymclistfaces at gmail.com (listy mclistface) Date: Fri, 6 Oct 2017 13:56:04 +0100 Subject: [gpfsug-discuss] Client power failure Message-ID: Hi, Although our NSD nodes are on UPS etc, we have some clients which aren't. Do we run the risk of FS corruption if we drop client nodes mid write? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jez.tucker at gpfsug.org Fri Oct 6 14:14:59 2017 From: jez.tucker at gpfsug.org (Jez Tucker) Date: Fri, 6 Oct 2017 14:14:59 +0100 Subject: [gpfsug-discuss] Client power failure In-Reply-To: References: Message-ID: <61604124-ec28-c930-7ea3-a20a6223b779@gpfsug.org> Hi ? Can we please refrain from completely anonymous emails ListyMcListFaces ;-) Ta ListMasterMcListAdmin On 06/10/17 13:56, listy mclistface wrote: > Hi, > > Although our NSD nodes are on UPS etc, we have some clients which > aren't.? 
?Do we run the risk of FS corruption if we drop client nodes > mid write? > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Oct 6 14:24:11 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 6 Oct 2017 13:24:11 +0000 Subject: [gpfsug-discuss] Client power failure Message-ID: I agree ? anonymous ones should be dropped from the list. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of Jez Tucker Reply-To: "jez.tucker at gpfsug.org" , gpfsug main discussion list Date: Friday, October 6, 2017 at 8:17 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Client power failure Can we please refrain from completely anonymous emails ListyMcListFaces ;-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Fri Oct 6 14:45:38 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Fri, 6 Oct 2017 13:45:38 +0000 Subject: [gpfsug-discuss] Client power failure In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 6 21:39:28 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 6 Oct 2017 20:39:28 +0000 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release In-Reply-To: References: Message-ID: Hi Aaron, I appreciate your care with this. The user group are the first users to be briefed on this. We're not quite ready to put more in writing just yet, however I will be at SC17 and hope to be able to do so at that time. (I'll also take any other questions that people want to ask, including "where's my RFE?"...) I also want to add one note about the meaning of feature deprecation, because it's not well understood even within IBM: If we deprecate a feature with the next major release it does NOT mean we are dropping support there and then. It means we are announcing the INTENTION to drop support in some future release, and encourage you to (a) start making plans on migration to a supported alternative, and (b) chime in on what you need in order to be able to satisfactorily migrate if our proposed alternative is not adequate. regards, Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com ------------------------------ Message: 2 Date: Tue, 3 Oct 2017 07:57:21 -0400 From: Aaron Knister To: gpfsug main discussion list Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f at nasa.gov> Content-Type: text/plain; charset="utf-8"; format=flowed Hi All, At the SSUG in NY there was mention of operating systems as well as feature deprecations that would occur in the lifecycle of the next major release of GPFS. I'm not sure if this is public knowledge yet so I haven't mentioned specifics but given the proposed release time frame of the next major release I thought customers may appreciate having access to this information so they could provide feedback about the potential impact to their environment if these deprecations do occur. Any chance someone from IBM could provide specifics here so folks can chime in? 
-Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ------------------------------ From aaron.s.knister at nasa.gov Fri Oct 6 23:30:05 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 6 Oct 2017 18:30:05 -0400 Subject: [gpfsug-discuss] changing default configuration values Message-ID: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From scale at us.ibm.com Sat Oct 7 04:06:41 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 6 Oct 2017 23:06:41 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Hi Aaron, The default value applies to all nodes in the cluster. Thus changing it will change all nodes in the cluster. You need to run mmchconfig to customize the node override again. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 10/06/2017 06:30 PM Subject: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From john.hearns at asml.com Mon Oct 9 09:38:29 2017 From: john.hearns at asml.com (John Hearns) Date: Mon, 9 Oct 2017 08:38:29 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Aaron, The reply you just got here is absolutely the correct one. However, it's worth contributing something here. I have recently been dealing with the parameter verbsPorts - which is a list of the interfaces which verbs should use. I found on our cluster it was set to use dual ports for all nodes, including servers, when only our servers have dual ports. I will follow the advice below and make a global change, then change back the configuration for the server. It is worth looking, though, at mmlsnodeclass --all There is a rather rich set of nodeclasses, including clientNodes managerNodes nonNsdNodes nonQuorumNodes So if you want to make changes to a certain type of node in your cluster you will be able to achieve it using nodeclasses. Bond, James Bond commander.bond at mi6.gov.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of IBM Spectrum Scale Sent: Saturday, October 07, 2017 5:07 AM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Hi Aaron, The default value applies to all nodes in the cluster. Thus changing it will change all nodes in the cluster. You need to run mmchconfig to customize the node override again. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM Subject: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002?
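For reference, the sequence the Scale team describes above would look something like the sketch below - "foo", "baz" and "qux" are just the placeholder names from the example, not real parameters:

  mmchconfig foo=qux                 # set the new cluster-wide default; per the reply above this touches every node
  mmchconfig foo=baz -N n001,n002    # so the override has to be re-applied to the nodes that need it
  mmlsconfig foo                     # check the result; an override is normally shown with a bracketed node list

Whether mmlsconfig prints the override in exactly that form is worth confirming on a test cluster before relying on it.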
-Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From john.hearns at asml.com Mon Oct 9 09:44:28 2017 From: john.hearns at asml.com (John Hearns) Date: Mon, 9 Oct 2017 08:44:28 +0000 Subject: [gpfsug-discuss] Setting fo verbsRdmaSend Message-ID: We have a GPFS setup which is completely Infiniband connected. Version 4.2.3.4 I see that verbsRdmaCm is set to Disabled. Reading up about this, I am inclined to leave this disabled. Can anyone comment on the likely effects of changing it, and if there are any real benefits in performance? commander.bond at mi6.gov.uk -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.ouwehand at vumc.nl Mon Oct 9 10:13:07 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Mon, 9 Oct 2017 09:13:07 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus Message-ID: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Hello, Currently we are urgently looking for an on-access antivirus solution for our IBM Spectrum Scale SMB CES Cluster. Unfortunately IBM has no such solution. 
Does anyone have a good supported solution? Kind regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [cid:image003.png at 01D340EF.9527A0C0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.png Type: image/png Size: 8437 bytes Desc: image003.png URL: From r.sobey at imperial.ac.uk Mon Oct 9 10:16:35 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 9 Oct 2017 09:16:35 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus In-Reply-To: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> References: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Message-ID: According to one of the presentations posted on this list a few days ago, there is "bulk antivirus scanning with Symantec AV" "coming soon". From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ouwehand, JJ Sent: 09 October 2017 10:13 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus Hello, Currently we are urgently looking for an on-access antivirus solution for our IBM Spectrum Scale SMB CES Cluster. Unfortunately IBM has no such solution. Does anyone have a good supported solution? Kind regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [cid:image001.png at 01D340E7.AF732BA0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8437 bytes Desc: image001.png URL: From daniel.kidger at uk.ibm.com Mon Oct 9 10:27:57 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Mon, 9 Oct 2017 09:27:57 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus In-Reply-To: References: , <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Image.image001.png at 01D340E7.AF732BA0.png Type: image/png Size: 8437 bytes Desc: not available URL: From a.khiredine at meteo.dz Mon Oct 9 13:47:09 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Mon, 9 Oct 2017 12:47:09 +0000 Subject: [gpfsug-discuss] how gpfs work when disk fail Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz> dear all how gpfs work when disk fail this is a example scenario when disk fail 1 Server 2 Disk directly attached to the local node 100GB mmlscluster GPFS cluster information ======================== GPFS cluster name: test.gpfs GPFS cluster id: 174397273000001824 GPFS UID domain: test.gpfs Remote shell command: /usr/bin/ssh Remote file copy command: /usr/bin/scp Repository type: server-based GPFS cluster configuration servers: ----------------------------------- Primary server: gpfs Secondary server: (none) Node Daemon node name IP address Admin node name Designation ------------------------------------------------------------------- 1 gpfs 192.168.1.10 gpfs quorum-manager cat disk %nsd: device=/dev/sdb nsd=nsda servers=gpfs usage=dataAndMetadata pool=system %nsd: device=/dev/sdc nsd=nsdb servers=gpfs usage=dataAndMetadata pool=system mmcrnsd -F disk.txt mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------- nsdsdbgpfsa C0A8000F59DB69E2 /dev/sdb generic gpfsa-ib server node nsdsdcgpfsa C0A8000F59DB69E3 /dev/sdc generic gpfsa-ib server node mmcrfs gpfs -F disk.txt -B 1M -L 32M -T /gpfs -A no -m 2 -M 3 -r 2 -R 3 mmmount gpfs df -h gpfs 200G 3,8G 197G 2% /gpfs <-- The Disk Have 200GB my question is the following ?? if I write 180 GB of data in /gpfs and the disk /dev/sdb is fail how the disk and/or GPFS continues to support all my data Thanks Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz From S.J.Thompson at bham.ac.uk Mon Oct 9 13:57:08 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 9 Oct 2017 12:57:08 +0000 Subject: [gpfsug-discuss] AFM fun (more!) Message-ID: Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. 
How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon From janfrode at tanso.net Mon Oct 9 14:45:32 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 9 Oct 2017 15:45:32 +0200 Subject: [gpfsug-discuss] how gpfs work when disk fail In-Reply-To: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz> References: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz> Message-ID: You don't have room to write 180GB of file data, only ~100GB. When you write f.ex. 90 GB of file data, each filesystem block will get one copy written to each of your disks, occuppying 180 GB on total disk space. So you can always read if from the other disks if one should fail. This is controlled by your "-m 2 -r 2" settings, and the default failureGroup -1 since you didn't specify a failure group in your disk descriptor. Normally I would always specify a failure group when doing replication. -jf On Mon, Oct 9, 2017 at 2:47 PM, atmane khiredine wrote: > dear all > > how gpfs work when disk fail > > this is a example scenario when disk fail > > 1 Server > > 2 Disk directly attached to the local node 100GB > > mmlscluster > > GPFS cluster information > ======================== > GPFS cluster name: test.gpfs > GPFS cluster id: 174397273000001824 > GPFS UID domain: test.gpfs > Remote shell command: /usr/bin/ssh > Remote file copy command: /usr/bin/scp > Repository type: server-based > > GPFS cluster configuration servers: > ----------------------------------- > Primary server: gpfs > Secondary server: (none) > > Node Daemon node name IP address Admin node name Designation > ------------------------------------------------------------------- > 1 gpfs 192.168.1.10 gpfs quorum-manager > > cat disk > > %nsd: > device=/dev/sdb > nsd=nsda > servers=gpfs > usage=dataAndMetadata > pool=system > > %nsd: > device=/dev/sdc > nsd=nsdb > servers=gpfs > usage=dataAndMetadata > pool=system > > mmcrnsd -F disk.txt > > mmlsnsd -X > > Disk name NSD volume ID Device Devtype Node name Remarks > ------------------------------------------------------------ > --------------- > nsdsdbgpfsa C0A8000F59DB69E2 /dev/sdb generic gpfsa-ib server node > nsdsdcgpfsa C0A8000F59DB69E3 /dev/sdc generic gpfsa-ib server node > > > mmcrfs gpfs -F disk.txt -B 1M -L 32M -T /gpfs -A no -m 2 -M 3 -r 2 -R 3 > > mmmount gpfs > > df -h > > gpfs 200G 3,8G 197G 2% /gpfs <-- The Disk Have 200GB > > my question is the following ?? > > if I write 180 GB of data in /gpfs > and the disk /dev/sdb is fail > how the disk and/or GPFS continues to support all my data > > Thanks > > Atmane Khiredine > HPC System Administrator | Office National de la M?t?orologie > T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : > a.khiredine at meteo.dz > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Mon Oct 9 15:38:15 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 9 Oct 2017 14:38:15 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Oct 9 19:55:45 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 14:55:45 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Thanks John! Funnily enough playing with node classes is what sent me down this path. I had a bunch of nodes defined (just over 1000) with a lower pagepool than the default. I then started using nodeclasses to clean up the config and I noticed that if you define a parameter with a nodeclass it doesn't override any previously set values for nodes in the node class. What I mean by that is if you do this: - mmchconfig pagepool=256M -N n001 - add node n001 to nodeclass mynodeclass - mmchconfig pagepool=256M -N mynodeclass after the 2nd chconfig there is still a definition for pagepool=256M for node n001. I tried to clean things up by doing "mmchconfig pagepool=DEFAULT -N n001" however the default value of the pagepool in our config is 1024M not the "1G" mmchconfig expects as the defualt value so I wasn't able to remove the explicit definition of pagepool for n001. What I ended up doing was an "mmchconfig pagepool=1024M -N n001" and that removed the explicit definitions. -Aaron On 10/9/17 4:38 AM, John Hearns wrote: > Aaron, > > The reply you just got her is absolutely the correct one. > > However, its worth contributing something here. I have recently bene > dealing with the parameter verbsPorts ? which is a list of the > interfaces which verbs should use. I found on our cluyster it was set to > use dual ports for all nodes, including servers, when only our servers > have dual ports.? I will follow the advice below and make a global > change, then change back the configuration for the server. > > It is worth looking though at? mmllnodeclass ?all > > There is a rather rich set of nodeclasses, including?? clientNodes > ??managerNodes nonNsdNodes? nonQuorumNodes > > So if you want to make changes to a certain type of node in your cluster > you will be able to achieve it using nodeclasses. 
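To illustrate that last point from John, the kind of sequence he describes might look roughly like this. The verbs port names are invented for the sketch, it is worth checking on your release whether mmchconfig -N accepts the built-in classes directly or only user-defined ones, and as far as I know verbsPorts changes need a daemon restart to take effect:

  mmlsnodeclass --system                                  # should list the built-in classes (clientNodes, managerNodes, nsdNodes, ...)
  mmchconfig verbsPorts="mlx4_0/1"                        # global change: single port everywhere
  mmchconfig verbsPorts="mlx4_0/1 mlx4_0/2" -N nsdNodes   # then put the dual-port setting back on the servers only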
> > Bond, James Bond > > commander.bond at mi6.gov.uk > > *From:* gpfsug-discuss-bounces at spectrumscale.org > [mailto:gpfsug-discuss-bounces at spectrumscale.org] *On Behalf Of *IBM > Spectrum Scale > *Sent:* Saturday, October 07, 2017 5:07 AM > *To:* gpfsug main discussion list > *Cc:* gpfsug-discuss-bounces at spectrumscale.org > *Subject:* Re: [gpfsug-discuss] changing default configuration values > > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 > . > > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > > To: gpfsug main discussion list > > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > ------------------------------------------------------------------------ > > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > -- The information contained in this communication and any attachments > is confidential and may be privileged, and is for the sole use of the > intended recipient(s). Any unauthorized review, use, disclosure or > distribution is prohibited. Unless explicitly stated otherwise in the > body of this communication or the attachment thereto (if any), the > information is provided on an AS-IS basis without any express or implied > warranties or liabilities. To the extent you are relying on this > information, you are doing so at your own risk. 
If you are not the > intended recipient, please notify the sender immediately by replying to > this message and destroy all copies of this message and any attachments. > Neither the sender nor the company/group of companies he or she > represents shall be liable for the proper and complete transmission of > the information contained in this communication, or for any delay in its > receipt. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 19:56:25 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 14:56:25 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: <01c2a2bb-f332-e067-e7b5-6954df14c25d@nasa.gov> Thanks! Good to know. On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. > > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 20:00:02 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 15:00:02 -0400 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release In-Reply-To: References: Message-ID: <49283f9f-12b1-6381-6890-37d16aa87635@nasa.gov> Thanks Carl. Unfortunately I won't be at SC17 this year but thankfully a number of my colleagues will be so I'll send them with a list of questions on my behalf :) On 10/6/17 4:39 PM, Carl Zetie wrote: > Hi Aaron, > > I appreciate your care with this. The user group are the first users to be briefed on this. > > We're not quite ready to put more in writing just yet, however I will be at SC17 and hope > to be able to do so at that time. (I'll also take any other questions that people want to > ask, including "where's my RFE?"...) > > I also want to add one note about the meaning of feature deprecation, because it's not well > understood even within IBM: If we deprecate a feature with the next major release it does > NOT mean we are dropping support there and then. It means we are announcing the INTENTION > to drop support in some future release, and encourage you to (a) start making plans on > migration to a supported alternative, and (b) chime in on what you need in order to be > able to satisfactorily migrate if our proposed alternative is not adequate. > > regards, > > > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > > ------------------------------ > > Message: 2 > Date: Tue, 3 Oct 2017 07:57:21 -0400 > From: Aaron Knister > To: gpfsug main discussion list > Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major > release > Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f at nasa.gov> > Content-Type: text/plain; charset="utf-8"; format=flowed > > Hi All, > > At the SSUG in NY there was mention of operating systems as well as > feature deprecations that would occur in the lifecycle of the next major > release of GPFS. I'm not sure if this is public knowledge yet so I > haven't mentioned specifics but given the proposed release time frame of > the next major release I thought customers may appreciate having access > to this information so they could provide feedback about the potential > impact to their environment if these deprecations do occur. Any chance > someone from IBM could provide specifics here so folks can chime in? 
> > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 21:46:59 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 16:46:59 -0400 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> Message-ID: <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Hi Sven, Just wondering if you've had any additional thoughts/conversations about this. -Aaron On 9/8/17 5:21 PM, Sven Oehme wrote: > Hi, > > the code assumption is that the underlying device has no volatile write > cache, i was absolute sure we have that somewhere in the FAQ, but i > couldn't find it, so i will talk to somebody to correct this. > if i understand > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt?correct > one could enforce this by setting REQ_FUA, but thats not explicitly set > today, at least i can't see it. i will discuss this with one of our devs > who owns this code and come back. > > sven > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > wrote: > > Thanks Sven. I didn't think GPFS itself was caching anything on that > layer, but it's my understanding that O_DIRECT isn't sufficient to force > I/O to be flushed (e.g. the device itself might have a volatile caching > layer). Take someone using ZFS zvol's as NSDs. I can write() all day log > to that zvol (even with O_DIRECT) but there is absolutely no guarantee > those writes have been committed to stable storage and aren't just > sitting in RAM until an fsync() occurs (or some other bio function that > causes a flush). I also don't believe writing to a SATA drive with > O_DIRECT will force cache flushes of the drive's writeback cache.. > although I just tested that one and it seems to actually trigger a scsi > cache sync. Interesting. > > -Aaron > > On 9/7/17 10:55 PM, Sven Oehme wrote: > > I am not sure what exactly you are looking for but all > blockdevices are > > opened with O_DIRECT , we never cache anything on this layer . > > > > > > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > > > >> wrote: > > > >? ? ?Hi Everyone, > > > >? ? ?This is something that's come up in the past and has recently > resurfaced > >? ? ?with a project I've been working on, and that is-- it seems > to me as > >? ? ?though mmfsd never attempts to flush the cache of the block > devices its > >? ? ?writing to (looking at blktrace output seems to confirm > this). Is this > >? ? ?actually the case? I've looked at the gpl headers for linux > and I don't > >? ? ?see any sign of blkdev_fsync, blkdev_issue_flush, WRITE_FLUSH, or > >? ? ?REQ_FLUSH. I'm sure there's other ways to trigger this > behavior that > >? ? ?GPFS may very well be using that I've missed. That's why I'm > asking :) > > > >? ? ?I figure with FPO being pushed as an HDFS replacement using > commodity > >? ? ?drives this feature has *got* to be in the code somewhere. > > > >? ? ?-Aaron > > > >? ? ?-- > >? ? ?Aaron Knister > >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ?Goddard Space Flight Center > > (301) 286-2776 > >? ? ?_______________________________________________ > >? ? ?gpfsug-discuss mailing list > >? ? 
?gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Mon Oct 9 22:07:10 2017 From: oehmes at gmail.com (Sven Oehme) Date: Mon, 09 Oct 2017 21:07:10 +0000 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Message-ID: Hi, yeah sorry i intended to reply back before my vacation and forgot about it the the vacation flushed it all away :-D so right now the assumption in Scale/GPFS is that the underlying storage doesn't have any form of enabled volatile write cache. the problem seems to be that even if we set REQ_FUA some stacks or devices may not have implemented that at all or correctly, so even we would set it there is no guarantee that it will do what you think it does. the benefit of adding the flag at least would allow us to blame everything on the underlying stack/device , but i am not sure that will make somebody happy if bad things happen, therefore the requirement of a non-volatile device will still be required at all times underneath Scale. so if you think we should do this, please open a PMR with the details of your test so it can go its regular support path. you can mention me in the PMR as a reference as we already looked at the places the request would have to be added. Sven On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister wrote: > Hi Sven, > > Just wondering if you've had any additional thoughts/conversations about > this. > > -Aaron > > On 9/8/17 5:21 PM, Sven Oehme wrote: > > Hi, > > > > the code assumption is that the underlying device has no volatile write > > cache, i was absolute sure we have that somewhere in the FAQ, but i > > couldn't find it, so i will talk to somebody to correct this. > > if i understand > > > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt > correct > > one could enforce this by setting REQ_FUA, but thats not explicitly set > > today, at least i can't see it. i will discuss this with one of our devs > > who owns this code and come back. > > > > sven > > > > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > > wrote: > > > > Thanks Sven. I didn't think GPFS itself was caching anything on that > > layer, but it's my understanding that O_DIRECT isn't sufficient to > force > > I/O to be flushed (e.g. the device itself might have a volatile > caching > > layer). Take someone using ZFS zvol's as NSDs. 
I can write() all day > log > > to that zvol (even with O_DIRECT) but there is absolutely no > guarantee > > those writes have been committed to stable storage and aren't just > > sitting in RAM until an fsync() occurs (or some other bio function > that > > causes a flush). I also don't believe writing to a SATA drive with > > O_DIRECT will force cache flushes of the drive's writeback cache.. > > although I just tested that one and it seems to actually trigger a > scsi > > cache sync. Interesting. > > > > -Aaron > > > > On 9/7/17 10:55 PM, Sven Oehme wrote: > > > I am not sure what exactly you are looking for but all > > blockdevices are > > > opened with O_DIRECT , we never cache anything on this layer . > > > > > > > > > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > > > > > > >> wrote: > > > > > > Hi Everyone, > > > > > > This is something that's come up in the past and has recently > > resurfaced > > > with a project I've been working on, and that is-- it seems > > to me as > > > though mmfsd never attempts to flush the cache of the block > > devices its > > > writing to (looking at blktrace output seems to confirm > > this). Is this > > > actually the case? I've looked at the gpl headers for linux > > and I don't > > > see any sign of blkdev_fsync, blkdev_issue_flush, > WRITE_FLUSH, or > > > REQ_FLUSH. I'm sure there's other ways to trigger this > > behavior that > > > GPFS may very well be using that I've missed. That's why I'm > > asking :) > > > > > > I figure with FPO being pushed as an HDFS replacement using > > commodity > > > drives this feature has *got* to be in the code somewhere. > > > > > > -Aaron > > > > > > -- > > > Aaron Knister > > > NASA Center for Climate Simulation (Code 606.2) > > > Goddard Space Flight Center > > > (301) 286-2776 > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Oct 10 00:19:20 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 19:19:20 -0400 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Message-ID: <7090f583-d021-dd98-e55c-23eac83849ef@nasa.gov> Thanks, Sven. 
I think my goal was for the REQ_FUA flag to be used in alignment with the consistency expectations of the filesystem. Meaning if I was writing to a file on a filesystem (e.g. dd if=/dev/zero of=/gpfs/fs0/file1) that the write requests to the disk addresses containing data on the file wouldn't be issued with REQ_FUA. However, once the file was closed the close() wouldn't return until a disk buffer flush had occurred. For more important operations (e.g. metadata updates, log operations) I would expect/suspect REQ_FUA would be issued more frequently. The advantage here is it would allow GPFS to run ontop of block devices that don't perform well with the present synchronous workload of mmfsd (e.g. ZFS, and various other software-defined storage or hardware appliances) but that can perform well when only periodically (e.g. every few seconds) asked to flush pending data to disk. I also think this would be *really* important in an FPO environment where individual drives will probably have caches on by default and I'm not sure direct I/O is sufficient to force linux to issue scsi synchronize cache commands to those devices. I'm guessing that this is far from easy but I figured I'd ask. -Aaron On 10/9/17 5:07 PM, Sven Oehme wrote: > Hi, > > yeah sorry i intended to reply back before my vacation and forgot about > it the the vacation flushed it all away :-D > so right now the assumption in Scale/GPFS is that the underlying storage > doesn't have any form of enabled volatile write cache. the problem seems > to be that even if we set?REQ_FUA some stacks or devices may not have > implemented that at all or correctly, so even we would set it there is > no guarantee that it will do what you think it does. the benefit of > adding the flag at least would allow us to blame everything on the > underlying stack/device , but i am not sure that will make somebody > happy if bad things happen, therefore the requirement of a non-volatile > device will still be required at all times underneath Scale. > so if you think we should do this, please open a PMR with the details of > your test so it can go its regular support path. you can mention me in > the PMR as a reference as we already looked at the places the request > would have to be added.?? > > Sven > > > On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister > wrote: > > Hi Sven, > > Just wondering if you've had any additional thoughts/conversations about > this. > > -Aaron > > On 9/8/17 5:21 PM, Sven Oehme wrote: > > Hi, > > > > the code assumption is that the underlying device has no volatile > write > > cache, i was absolute sure we have that somewhere in the FAQ, but i > > couldn't find it, so i will talk to somebody to correct this. > > if i understand > > > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt?correct > > one could enforce this by setting REQ_FUA, but thats not > explicitly set > > today, at least i can't see it. i will discuss this with one of > our devs > > who owns this code and come back. > > > > sven > > > > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > > > >> wrote: > > > >? ? ?Thanks Sven. I didn't think GPFS itself was caching anything > on that > >? ? ?layer, but it's my understanding that O_DIRECT isn't > sufficient to force > >? ? ?I/O to be flushed (e.g. the device itself might have a > volatile caching > >? ? ?layer). Take someone using ZFS zvol's as NSDs. I can write() > all day log > >? ? ?to that zvol (even with O_DIRECT) but there is absolutely no > guarantee > >? ? 
?those writes have been committed to stable storage and aren't just > >? ? ?sitting in RAM until an fsync() occurs (or some other bio > function that > >? ? ?causes a flush). I also don't believe writing to a SATA drive with > >? ? ?O_DIRECT will force cache flushes of the drive's writeback cache.. > >? ? ?although I just tested that one and it seems to actually > trigger a scsi > >? ? ?cache sync. Interesting. > > > >? ? ?-Aaron > > > >? ? ?On 9/7/17 10:55 PM, Sven Oehme wrote: > >? ? ? > I am not sure what exactly you are looking for but all > >? ? ?blockdevices are > >? ? ? > opened with O_DIRECT , we never cache anything on this layer . > >? ? ? > > >? ? ? > > >? ? ? > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > >? ? ? > > > >? ? ? > > >? ? ? >>> wrote: > >? ? ? > > >? ? ? >? ? ?Hi Everyone, > >? ? ? > > >? ? ? >? ? ?This is something that's come up in the past and has > recently > >? ? ?resurfaced > >? ? ? >? ? ?with a project I've been working on, and that is-- it seems > >? ? ?to me as > >? ? ? >? ? ?though mmfsd never attempts to flush the cache of the block > >? ? ?devices its > >? ? ? >? ? ?writing to (looking at blktrace output seems to confirm > >? ? ?this). Is this > >? ? ? >? ? ?actually the case? I've looked at the gpl headers for linux > >? ? ?and I don't > >? ? ? >? ? ?see any sign of blkdev_fsync, blkdev_issue_flush, > WRITE_FLUSH, or > >? ? ? >? ? ?REQ_FLUSH. I'm sure there's other ways to trigger this > >? ? ?behavior that > >? ? ? >? ? ?GPFS may very well be using that I've missed. That's > why I'm > >? ? ?asking :) > >? ? ? > > >? ? ? >? ? ?I figure with FPO being pushed as an HDFS replacement using > >? ? ?commodity > >? ? ? >? ? ?drives this feature has *got* to be in the code somewhere. > >? ? ? > > >? ? ? >? ? ?-Aaron > >? ? ? > > >? ? ? >? ? ?-- > >? ? ? >? ? ?Aaron Knister > >? ? ? >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ? >? ? ?Goddard Space Flight Center > >? ? ? > (301) 286-2776 > >? ? ? >? ? ?_______________________________________________ > >? ? ? >? ? ?gpfsug-discuss mailing list > >? ? ? >? ? ?gpfsug-discuss at spectrumscale.org > > >? ? ? > >? ? ? > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? ? ? > > >? ? ? > > >? ? ? > > >? ? ? > _______________________________________________ > >? ? ? > gpfsug-discuss mailing list > >? ? ? > gpfsug-discuss at spectrumscale.org > > >? ? ? > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? ? ? > > > > >? ? ?-- > >? ? ?Aaron Knister > >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ?Goddard Space Flight Center > >? ? ?(301) 286-2776 > >? ? ?_______________________________________________ > >? ? ?gpfsug-discuss mailing list > >? ? ?gpfsug-discuss at spectrumscale.org > > >? ? 
?http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From vpuvvada at in.ibm.com Tue Oct 10 05:56:21 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 10 Oct 2017 10:26:21 +0530 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... 
This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Tue Oct 10 08:47:23 2017 From: john.hearns at asml.com (John Hearns) Date: Tue, 10 Oct 2017 07:47:23 +0000 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: > The queues are steadily rising and we've seen them over 1000000 ... There is definitely a song here... I see you playing the blues guitar... I can't answer your question directly. As I recall you are at the latest version? We recently had to update to 4.2.3.4 due to an AFM issue - where if the home NFS share was disconnected, a read operation would finish early and not re-start. One thing I would do is look at where the 'real' NFS mount is being done (apology - I assume an NFS home). Log on to bber-afmgw01 and find where the home filesystem is being mounted, which is below /var/mmfs/afm Have a ferret around in there - do you still have that filesystem mounted? -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: Monday, October 09, 2017 2:57 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] AFM fun (more!) Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... 
This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-appsgpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homesgpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://emea01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=01%7C01%7Cjohn.hearns%40asml.com%7Caa732d9965f64983c2e508d50f15424e%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=wVJhicLSj%2FWUjedvBKo6MG%2FYrtFAaWKxMeqiUrKRHfM%3D&reserved=0 -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. From john.hearns at asml.com Tue Oct 10 09:42:05 2017 From: john.hearns at asml.com (John Hearns) Date: Tue, 10 Oct 2017 08:42:05 +0000 Subject: [gpfsug-discuss] Recommended pagepool size on clients? Message-ID: May I ask how to size pagepool on clients? Somehow I hear an enormous tin can being opened behind me... and what sounds like lots of worms... Anyway, I currently have mmhealth reporting gpfs_pagepool_small. Pagepool is set to 1024M on clients, and I now note the documentation says you get this warning when pagepool is lower or equal to 1GB We did do some IOR benchmarking which shows better performance with an increased pagepool size. 
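The change itself is a one-liner; the value and node class below are placeholders rather than a recommendation, and whether -i is honoured for pagepool on every release is worth double-checking:

  mmchconfig pagepool=4G -N myClientNodeclass -i   # -i should apply it immediately as well as persistently
  mmdiag --memory                                  # then confirm on a client what is actually allocated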
I am looking for some rules of thumb for sizing for a 128 GByte RAM client. And yup, I know the answer will be 'depends on your workload'. I agree though that 1024M is too low. Illya,kuryakin at uncle.int -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From scottg at emailhosting.com Tue Oct 10 10:49:54 2017 From: scottg at emailhosting.com (Scott Goldman) Date: Tue, 10 Oct 2017 05:49:54 -0400 Subject: [gpfsug-discuss] changing default configuration values Message-ID: So, I think this brings up one of the slight frustrations I've always had with mmchconfig.. If I have a cluster to which new nodes will eventually be added, OR, I have standards I always wish to apply, there is no way to say "all FUTURE" nodes need to have my defaults.. I just have to remember to extend the changes in as new nodes are brought into the cluster. Is there a way to accomplish this? Thanks -- Original Message -- From: aaron.s.knister at nasa.gov Sent: October 9, 2017 2:56 PM To: gpfsug-discuss at spectrumscale.org Reply-to: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Thanks! Good to know. On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWorks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. > > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. 
> > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Tue Oct 10 13:02:14 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:02:14 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: Use mmchconfig and change the defaults, and then have a node class for "not the defaults"? Apply settings to a node class and add all new clients to the node class? Note there was some version of Scale where node classes were enumerated when the config was set for the node class, but in (4.2.3 at least), this works as expected, I.e. The node class is not expanded when doing mmchconfig -N Simon On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So, I think brings up one of the slight frustrations I've always had with >mmconfig.. > >If I have a cluster to which new nodes will eventually be added, OR, I >have standard I always wish to apply, there is no way to say "all FUTURE" >nodes need to have my defaults.. I just have to remember to extended the >changes in as new nodes are brought into the cluster. > >Is there a way to accomplish this? >Thanks > > Original Message >From: aaron.s.knister at nasa.gov >Sent: October 9, 2017 2:56 PM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Thanks! Good to know. > >On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >> Hi Aaron, >> >> The default value applies to all nodes in the cluster. 
Thus changing it >> will change all nodes in the cluster. You need to run mmchconfig to >> customize the node override again. >> >> >> Regards, The Spectrum Scale (GPFS) team >> >> >>------------------------------------------------------------------------- >>----------------------------------------- >> If you feel that your question can benefit other users of Spectrum >>Scale >> (GPFS), then please post it to the public IBM developerWroks Forum at >> >>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111111 >>1-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please contact >> 1-800-237-5511 in the United States or your local IBM Service Center in >> other countries. >> >> The forum is informally monitored as time permits and should not be >>used >> for priority messages to the Spectrum Scale (GPFS) team. >> >> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >> there a way to change the default value of a configuratiAaron Knister >> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >> of a configuration option without overriding any overrid >> >> From: Aaron Knister >> To: gpfsug main discussion list >> Date: 10/06/2017 06:30 PM >> Subject: [gpfsug-discuss] changing default configuration values >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Is there a way to change the default value of a configuration option >> without overriding any overrides in place? >> >> Take the following situation: >> >> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >> n001,n002) >> >> Is there a way to then set the default value of foo to qux without >> changing the value of foo for nodes n001 and n002? >> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> >>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_li >>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sb >>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4 >>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scottg at emailhosting.com Tue Oct 10 13:04:30 2017 From: scottg at emailhosting.com (Scott Goldman) Date: Tue, 10 Oct 2017 08:04:30 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: Message-ID: So when a node is added to the node class, my defaults" will be applied? If so,excellent. Thanks ? Original Message ? 
From: S.J.Thompson at bham.ac.uk Sent: October 10, 2017 8:02 AM To: gpfsug-discuss at spectrumscale.org Reply-to: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Use mmchconfig and change the defaults, and then have a node class for "not the defaults"? Apply settings to a node class and add all new clients to the node class? Note there was some version of Scale where node classes were enumerated when the config was set for the node class, but in (4.2.3 at least), this works as expected, I.e. The node class is not expanded when doing mmchconfig -N Simon On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So, I think brings up one of the slight frustrations I've always had with >mmconfig.. > >If I have a cluster to which new nodes will eventually be added, OR, I >have standard I always wish to apply, there is no way to say "all FUTURE" >nodes need to have my defaults.. I just have to remember to extended the >changes in as new nodes are brought into the cluster. > >Is there a way to accomplish this? >Thanks > >? Original Message >From: aaron.s.knister at nasa.gov >Sent: October 9, 2017 2:56 PM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Thanks! Good to know. > >On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >> Hi Aaron, >> >> The default value applies to all nodes in the cluster. Thus changing it >> will change all nodes in the cluster. You need to run mmchconfig to >> customize the node override again. >> >> >> Regards, The Spectrum Scale (GPFS) team >> >> >>------------------------------------------------------------------------- >>----------------------------------------- >> If you feel that your question can benefit other users of Spectrum >>Scale >> (GPFS), then please post it to the public IBM developerWroks Forum at >> >>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111111 >>1-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please contact >> 1-800-237-5511 in the United States or your local IBM Service Center in >> other countries. >> >> The forum is informally monitored as time permits and should not be >>used >> for priority messages to the Spectrum Scale (GPFS) team. >> >> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >> there a way to change the default value of a configuratiAaron Knister >> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >> of a configuration option without overriding any overrid >> >> From: Aaron Knister >> To: gpfsug main discussion list >> Date: 10/06/2017 06:30 PM >> Subject: [gpfsug-discuss] changing default configuration values >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Is there a way to change the default value of a configuration option >> without overriding any overrides in place? >> >> Take the following situation: >> >> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >> n001,n002) >> >> Is there a way to then set the default value of foo to qux without >> changing the value of foo for nodes n001 and n002? 
>> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> >>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_li >>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sb >>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4 >>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Robert.Oesterlin at nuance.com Tue Oct 10 13:27:45 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 10 Oct 2017 12:27:45 +0000 Subject: [gpfsug-discuss] changing default configuration values Message-ID: <1BFF991D-4ABD-4C3A-B6FB-41CEABFCD4FB@nuance.com> Yes, this is exactly what we do for our LROC enabled nodes. Add them to the node class and you're all set. Bob Oesterlin Sr Principal Storage Engineer, Nuance ?On 10/10/17, 7:03 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: Apply settings to a node class and add all new clients to the node class? From S.J.Thompson at bham.ac.uk Tue Oct 10 13:30:57 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:30:57 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: Yes, but obviously only when you recycle mmfsd on the node after adding it to the node class, e.g. page pool cannot be changed online. We do this all the time, e.g. We have nodes with different IB fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have classes for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different DCs etc) Simon On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So when a node is added to the node class, my defaults" will be applied? >If so,excellent. Thanks > > > Original Message >From: S.J.Thompson at bham.ac.uk >Sent: October 10, 2017 8:02 AM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Use mmchconfig and change the defaults, and then have a node class for >"not the defaults"? > >Apply settings to a node class and add all new clients to the node class? > >Note there was some version of Scale where node classes were enumerated >when the config was set for the node class, but in (4.2.3 at least), this >works as expected, I.e. 
The node class is not expanded when doing >mmchconfig -N > >Simon > >On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf >of scottg at emailhosting.com" behalf of scottg at emailhosting.com> wrote: > >>So, I think brings up one of the slight frustrations I've always had with >>mmconfig.. >> >>If I have a cluster to which new nodes will eventually be added, OR, I >>have standard I always wish to apply, there is no way to say "all FUTURE" >>nodes need to have my defaults.. I just have to remember to extended the >>changes in as new nodes are brought into the cluster. >> >>Is there a way to accomplish this? >>Thanks >> >> Original Message >>From: aaron.s.knister at nasa.gov >>Sent: October 9, 2017 2:56 PM >>To: gpfsug-discuss at spectrumscale.org >>Reply-to: gpfsug-discuss at spectrumscale.org >>Subject: Re: [gpfsug-discuss] changing default configuration values >> >>Thanks! Good to know. >> >>On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>> Hi Aaron, >>> >>> The default value applies to all nodes in the cluster. Thus changing it >>> will change all nodes in the cluster. You need to run mmchconfig to >>> customize the node override again. >>> >>> >>> Regards, The Spectrum Scale (GPFS) team >>> >>> >>>------------------------------------------------------------------------ >>>- >>>----------------------------------------- >>> If you feel that your question can benefit other users of Spectrum >>>Scale >>> (GPFS), then please post it to the public IBM developerWroks Forum at >>> >>>https://www.ibm.com/developerworks/community/forums/html/forum?id=111111 >>>1 >>>1-0000-0000-0000-000000000479. >>> >>> >>> If your query concerns a potential software error in Spectrum Scale >>> (GPFS) and you have an IBM software maintenance contract please contact >>> 1-800-237-5511 in the United States or your local IBM Service Center in >>> other countries. >>> >>> The forum is informally monitored as time permits and should not be >>>used >>> for priority messages to the Spectrum Scale (GPFS) team. >>> >>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >>> there a way to change the default value of a configuratiAaron Knister >>> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >>> of a configuration option without overriding any overrid >>> >>> From: Aaron Knister >>> To: gpfsug main discussion list >>> Date: 10/06/2017 06:30 PM >>> Subject: [gpfsug-discuss] changing default configuration values >>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>> >>> >>>------------------------------------------------------------------------ >>> >>> >>> >>> Is there a way to change the default value of a configuration option >>> without overriding any overrides in place? >>> >>> Take the following situation: >>> >>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>> n001,n002) >>> >>> Is there a way to then set the default value of foo to qux without >>> changing the value of foo for nodes n001 and n002? 
>>> >>> -Aaron >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> >>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_l >>>i >>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2S >>>b >>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_ >>>4 >>>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >>-- >>Aaron Knister >>NASA Center for Climate Simulation (Code 606.2) >>Goddard Space Flight Center >>(301) 286-2776 >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Tue Oct 10 13:32:25 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 10 Oct 2017 08:32:25 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Simon, Does that mean node classes don't work the way individual node names do with the "-i/-I" options? -Aaron On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: > Yes, but obviously only when you recycle mmfsd on the node after adding it > to the node class, e.g. page pool cannot be changed online. > > We do this all the time, e.g. We have nodes with different IB > fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have classes > for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different > DCs etc) > > Simon > > On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of scottg at emailhosting.com" behalf of scottg at emailhosting.com> wrote: > >> So when a node is added to the node class, my defaults" will be applied? >> If so,excellent. Thanks >> >> >> Original Message >> From: S.J.Thompson at bham.ac.uk >> Sent: October 10, 2017 8:02 AM >> To: gpfsug-discuss at spectrumscale.org >> Reply-to: gpfsug-discuss at spectrumscale.org >> Subject: Re: [gpfsug-discuss] changing default configuration values >> >> Use mmchconfig and change the defaults, and then have a node class for >> "not the defaults"? >> >> Apply settings to a node class and add all new clients to the node class? >> >> Note there was some version of Scale where node classes were enumerated >> when the config was set for the node class, but in (4.2.3 at least), this >> works as expected, I.e. 
The node class is not expanded when doing >> mmchconfig -N >> >> Simon >> >> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So, I think brings up one of the slight frustrations I've always had with >>> mmconfig.. >>> >>> If I have a cluster to which new nodes will eventually be added, OR, I >>> have standard I always wish to apply, there is no way to say "all FUTURE" >>> nodes need to have my defaults.. I just have to remember to extended the >>> changes in as new nodes are brought into the cluster. >>> >>> Is there a way to accomplish this? >>> Thanks >>> >>> Original Message >>> From: aaron.s.knister at nasa.gov >>> Sent: October 9, 2017 2:56 PM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Thanks! Good to know. >>> >>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>> Hi Aaron, >>>> >>>> The default value applies to all nodes in the cluster. Thus changing it >>>> will change all nodes in the cluster. You need to run mmchconfig to >>>> customize the node override again. >>>> >>>> >>>> Regards, The Spectrum Scale (GPFS) team >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> - >>>> ----------------------------------------- >>>> If you feel that your question can benefit other users of Spectrum >>>> Scale >>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>> >>>> https://www.ibm.com/developerworks/community/forums/html/forum?id=111111 >>>> 1 >>>> 1-0000-0000-0000-000000000479. >>>> >>>> >>>> If your query concerns a potential software error in Spectrum Scale >>>> (GPFS) and you have an IBM software maintenance contract please contact >>>> 1-800-237-5511 in the United States or your local IBM Service Center in >>>> other countries. >>>> >>>> The forum is informally monitored as time permits and should not be >>>> used >>>> for priority messages to the Spectrum Scale (GPFS) team. >>>> >>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >>>> there a way to change the default value of a configuratiAaron Knister >>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >>>> of a configuration option without overriding any overrid >>>> >>>> From: Aaron Knister >>>> To: gpfsug main discussion list >>>> Date: 10/06/2017 06:30 PM >>>> Subject: [gpfsug-discuss] changing default configuration values >>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> >>>> Is there a way to change the default value of a configuration option >>>> without overriding any overrides in place? >>>> >>>> Take the following situation: >>>> >>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>> n001,n002) >>>> >>>> Is there a way to then set the default value of foo to qux without >>>> changing the value of foo for nodes n001 and n002? 
>>>> >>>> -Aaron >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_l >>>> i >>>> stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2S >>>> b >>>> on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_ >>>> 4 >>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Tue Oct 10 13:36:14 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:36:14 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> References: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. 
Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? >>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. 
>>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? >>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Tue Oct 10 15:45:32 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 10:45:32 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: 
<8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: It's always helpful to check and confirm that you get what you expected. mmlsconfig shows the value in the configuration and "mmfsadm dump config" shows the value in the GPFS daemon currently running. [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1073741824 c10f1n11.gpfs.net: pagepool 1073741824 [root at c10f1n11 gitr]# mmchconfig pagepool=1500M -i -N c69bc2xn03 mmchconfig: Command successfully completed mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. [root at c10f1n11 gitr]# Tue Oct 10 10:36:49 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation started [root at c10f1n11 gitr]# Tue Oct 10 10:36:52 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0 [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G pagepool 1500M [c69bc2xn03] [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1572864000 c10f1n11.gpfs.net: pagepool 1073741824 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/10/2017 08:36 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. 
Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? >>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. 
>>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? >>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> _______________________________________________ >>> gpfsug-discuss mailing 
list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From scale at us.ibm.com Tue Oct 10 15:51:37 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 10:51:37 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: For a customer production system, "mmdiag --config" rather than "mmfsadm dump config" should be used. The mmdiag command is meant for end users while the "mmfsadm dump" command is a service aid that carries greater risks. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: IBM Spectrum Scale/Poughkeepsie/IBM To: gpfsug main discussion list Date: 10/10/2017 10:48 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: Enci Zhong It's always helpful to check and confirm that you get what you expected. mmlsconfig shows the value in the configuration and "mmfsadm dump config" shows the value in the GPFS daemon currently running. 
[root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1073741824 c10f1n11.gpfs.net: pagepool 1073741824 [root at c10f1n11 gitr]# mmchconfig pagepool=1500M -i -N c69bc2xn03 mmchconfig: Command successfully completed mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. [root at c10f1n11 gitr]# Tue Oct 10 10:36:49 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation started [root at c10f1n11 gitr]# Tue Oct 10 10:36:52 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0 [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G pagepool 1500M [c69bc2xn03] [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1572864000 c10f1n11.gpfs.net: pagepool 1073741824 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/10/2017 08:36 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? 
>>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. >>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? 
>>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 
286-2776
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL: 

From scale at us.ibm.com  Tue Oct 10 16:09:20 2017
From: scale at us.ibm.com (IBM Spectrum Scale)
Date: Tue, 10 Oct 2017 11:09:20 -0400
Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09)
In-Reply-To: 
References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112>
Message-ID: 

Bob,

The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of cases where data stops flowing through the connection the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful, steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place.

As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data being interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it is at least conceivable (though expected to be very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device.

Disconnect/reconnect scenarios appear to be uncommon. An entry like

[N] Reconnected to xxx.xxx.xxx.xxx nodename

in mmfs.log would be an indication that a reconnect has occurred. By itself, a reconnect does not imply that data or the file system was corrupted, since that depends on which RPCs were pending when the disconnection took place. In the case where the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written.

Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs.

Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It is therefore recommended that an efix for the problem be applied in a timely fashion.
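
As a quick first check, the GPFS logs can be searched for the reconnect message above. This is only a sketch, assuming the default log location /var/adm/ras/mmfs.log.latest and that remote shell access to the nodes (e.g. via mmdsh) is already set up in your cluster:

# count reconnect events in the current GPFS log on one node
grep -c "Reconnected to" /var/adm/ras/mmfs.log.latest

# or look across all nodes in the cluster (adjust the node list to your environment)
/usr/lpp/mmfs/bin/mmdsh -N all 'grep "Reconnected to" /var/adm/ras/mmfs.log.latest'

A match only shows that a reconnect has occurred; as noted above, it does not by itself mean that data or the file system was corrupted.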
Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Leo.Earl at uea.ac.uk Tue Oct 10 16:29:47 2017 From: Leo.Earl at uea.ac.uk (Leo Earl (ITCS - Staff)) Date: Tue, 10 Oct 2017 15:29:47 +0000 Subject: [gpfsug-discuss] AFM fun (more!) 
In-Reply-To: References: Message-ID: Hi Simon, (My first ever post - queue being shot down in flames) Whilst this doesn't answer any of your questions (directly) One thing we do tend to look at when we see (what appears) to be a static "Queue Length" value, is the data which is actually inflight from an AFM perspective, so that we can ascertain whether the reason is something like, a user writing huge files to the cache, which take time to sync with home, and thus remain in the queue, providing a static "Queue Length" [root at server ~]# mmafmctl gpfs getstate | awk ' $6 >=50 ' Fileset Name Fileset Target Fileset State Gateway Node Queue State Queue Length Queue numExec afmcancergenetics :/leo/res_hpc/afm-leo Dirty csgpfs01 Active 60 126157822 [root at server ~]# So for instance, by using the tsfindinode command, to have a look at the size of the file which is currently "inflight" from an AFM perspective: [root at server ~]# mmfsadm dump afm | more Fileset: afm-leo 12 (gpfs) mode: independent-writer queue: Normal myFileset MDS: home: proto: nfs port: 2049 lastCmd: 6 handler: Mounted Dirty refCount: 5 queue: delay 300 QLen 0+9 flushThds 4 maxFlushThds 4 numExec 436 qfs 0 err 0 i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 1 i/o: pReadGWs 0 (All) pReadChunkSize 134217728 pReadThresh: -2 >> Disabled << i/o: prefetchThresh 0 (Prefetch) iw: afmIwTakeoverTime 0 Priority Queue: (listed by execution order) (state: Active) Write [601414711.601379492] inflight (377743888481 @ 0) chunks 0 bytes 0 0 thread_id 7630 Write [601717612.601465868] inflight (462997479227 @ 0) chunks 0 bytes 0 1 thread_id 10200 Write [601717612.601465870] inflight (391663667550 @ 0) chunks 0 bytes 0 2 thread_id 10287 Write [601717612.601465871] inflight (377743888481 @ 0) chunks 0 bytes 0 3 thread_id 10333 Write [601717612.601573418] queued (387002794104 @ 0) chunks 0 bytes 0 4 Write [601414711.601650195] queued (342305480538 @ 0) chunks 0 bytes 0 5 ResetDirty [538455296.-1] queued etype normal normal 19061213 ResetDirty [601623334.-1] queued etype normal normal 19061213 RecoveryMarker [-1.-1] queued etype normal normal 19061213 Normal Queue: Empty (state: Active) Fileset: afmdata 20 (gpfs) #Use the file inode ID to determine the actual file which is inflight between cache and home [root at server cancergenetics]# tsfindinode -i 601379492 /gpfs/afm/cancergenetics > inode.out [root at server ~]# cat /root/inode.out 601379492 0 0xCCB6 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server ~]# ls -l /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq -rw-r--r-- 1 bbn16cku MED_pg 377743888481 Mar 25 05:48 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server I am not sure if that helps and you probably already know about it inflight checking... Kind Regards, Leo Leo Earl | Head of Research & Specialist Computing Room ITCS 01.16, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ +44 (0) 1603 593856 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 October 2017 05:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM fun (more!) Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. 
This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? 
Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Oct 10 17:03:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 16:03:35 +0000 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: So as you might expect, we've been poking at this all day. We'd typically get to ~1000 entries in the queue having taken access to the FS away from users (yeah its that bad), but the remaining items would stay for ever as far as we could see. By copying the file, removing and then moving the copied file, we're able to get it back into a clean state. But then we ran a sample user job, and instantly the next job hung up the queue (we're talking like <100MB files here). Interestingly we looked at the queue to see what was going on (with saferdump, always use saferdump!!!) Normal Queue: (listed by execution order) (state: Active) 95 Write [6060026.6060026] inflight (18 @ 0) thread_id 44812 96 Write [13808655.13808655] queued (18 @ 0) 97 Truncate [6060026] queued 98 Truncate [13808655] queued 124 Write [6060000.6060000] inflight (18 @ 0) thread_id 44835 125 Truncate [6060000] queued 159 Write [6060013.6060013] inflight (18 @ 0) thread_id 21329 160 Truncate [6060013] queued 171 Write [5953611.5953611] inflight (18 @ 0) thread_id 44837 172 Truncate [5953611] queued Note that each inode that is inflight is followed by a queued Truncate... We are running efix2, because there is an issue with truncate not working prior to this (it doesn't get sent to home), so this smells like an AFM bug to me. We have a PMR open... Simon From: > on behalf of "Leo Earl (ITCS - Staff)" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 10 October 2017 at 16:29 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] AFM fun (more!) Hi Simon, (My first ever post ? queue being shot down in flames) Whilst this doesn?t answer any of your questions (directly) One thing we do tend to look at when we see (what appears) to be a static ?Queue Length? value, is the data which is actually inflight from an AFM perspective, so that we can ascertain whether the reason is something like, a user writing huge files to the cache, which take time to sync with home, and thus remain in the queue, providing a static ?Queue Length? [root at server ~]# mmafmctl gpfs getstate | awk ' $6 >=50 ' Fileset Name Fileset Target Fileset State Gateway Node Queue State Queue Length Queue numExec afmcancergenetics :/leo/res_hpc/afm-leo Dirty csgpfs01 Active 60 126157822 [root at server ~]# So for instance, by using the tsfindinode command, to have a look at the size of the file which is currently ?inflight? 
from an AFM perspective: [root at server ~]# mmfsadm dump afm | more Fileset: afm-leo 12 (gpfs) mode: independent-writer queue: Normal myFileset MDS: home: proto: nfs port: 2049 lastCmd: 6 handler: Mounted Dirty refCount: 5 queue: delay 300 QLen 0+9 flushThds 4 maxFlushThds 4 numExec 436 qfs 0 err 0 i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 1 i/o: pReadGWs 0 (All) pReadChunkSize 134217728 pReadThresh: -2 >> Disabled << i/o: prefetchThresh 0 (Prefetch) iw: afmIwTakeoverTime 0 Priority Queue: (listed by execution order) (state: Active) Write [601414711.601379492] inflight (377743888481 @ 0) chunks 0 bytes 0 0 thread_id 7630 Write [601717612.601465868] inflight (462997479227 @ 0) chunks 0 bytes 0 1 thread_id 10200 Write [601717612.601465870] inflight (391663667550 @ 0) chunks 0 bytes 0 2 thread_id 10287 Write [601717612.601465871] inflight (377743888481 @ 0) chunks 0 bytes 0 3 thread_id 10333 Write [601717612.601573418] queued (387002794104 @ 0) chunks 0 bytes 0 4 Write [601414711.601650195] queued (342305480538 @ 0) chunks 0 bytes 0 5 ResetDirty [538455296.-1] queued etype normal normal 19061213 ResetDirty [601623334.-1] queued etype normal normal 19061213 RecoveryMarker [-1.-1] queued etype normal normal 19061213 Normal Queue: Empty (state: Active) Fileset: afmdata 20 (gpfs) #Use the file inode ID to determine the actual file which is inflight between cache and home [root at server cancergenetics]# tsfindinode -i 601379492 /gpfs/afm/cancergenetics > inode.out [root at server ~]# cat /root/inode.out 601379492 0 0xCCB6 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server ~]# ls -l /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq -rw-r--r-- 1 bbn16cku MED_pg 377743888481 Mar 25 05:48 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server I am not sure if that helps and you probably already know about it inflight checking? Kind Regards, Leo Leo Earl | Head of Research & Specialist Computing Room ITCS 01.16, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ +44 (0) 1603 593856 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 October 2017 05:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] AFM fun (more!) Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. 
mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Oct 10 19:00:55 2017 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 10 Oct 2017 18:00:55 +0000 Subject: [gpfsug-discuss] Recommended pagepool size on clients? 
In-Reply-To: References: Message-ID: if this is a new cluster and you use reasonable new HW, i probably would start with just the following settings on the clients : pagepool=4g,workerThreads=256,maxStatCache=0,maxFilesToCache=256k depending on what storage you use and what workload you have you may have to set a couple of other settings too, but that should be a good start. we plan to make this whole process significant easier in the future, The Next Major Scale release will eliminate the need for another ~20 parameters in special cases and we will simplify the communication setup a lot too. beyond that we started working on introducing tuning suggestions based on the running system environment but there is no release targeted for that yet. Sven On Tue, Oct 10, 2017 at 1:42 AM John Hearns wrote: > May I ask how to size pagepool on clients? Somehow I hear an enormous tin > can being opened behind me? and what sounds like lots of worms? > > > > Anyway, I currently have mmhealth reporting gpfs_pagepool_small. Pagepool > is set to 1024M on clients, > > and I now note the documentation says you get this warning when pagepool > is lower or equal to 1GB > > We did do some IOR benchmarking which shows better performance with an > increased pagepool size. > > > > I am looking for some rules of thumb for sizing for an 128Gbyte RAM client. > > And yup, I know the answer will be ?depends on your workload? > > I agree though that 1024M is too low. > > > > Illya,kuryakin at uncle.int > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdeluca at gmail.com Tue Oct 10 19:51:28 2017 From: bdeluca at gmail.com (Ben De Luca) Date: Tue, 10 Oct 2017 20:51:28 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: > Bob, > > The problem may occur when the TCP connection is broken between two nodes. 
> While in the vast majority of the cases when data stops flowing through the > connection, the result is one of the nodes getting expelled, there are > cases where the TCP connection simply breaks -- that is relatively rare but > happens on occasion. There is logic in the mmfsd daemon to detect the > disconnection and attempt to reconnect to the destination in question. If > the reconnect is successful then steps are taken to recover the state kept > by the daemons, and that includes resending some RPCs that were in flight > when the disconnection took place. > > As the flash describes, a problem in the logic to resend some RPCs was > causing one of the RPC headers to be omitted, resulting in the RPC data to > be interpreted as the (missing) header. Normally the result is an assert on > the receiving end, like the "logAssertFailed: !"Request and queue size > mismatch" assert described in the flash. However, it's at least > conceivable (though expected to very rare) that the content of the RPC data > could be interpreted as a valid RPC header. In the case of an RPC which > involves data transfer between an NSD client and NSD server, that might > result in incorrect data being written to some NSD device. > > Disconnect/reconnect scenarios appear to be uncommon. An entry like > > [N] Reconnected to xxx.xxx.xxx.xxx nodename > > in mmfs.log would be an indication that a reconnect has occurred. By > itself, the reconnect will not imply that data or the file system was > corrupted, since that will depend on what RPCs were pending when the > connection happened. In the case the assert above is hit, no corruption is > expected, since the daemon will go down before incorrect data gets written. > > Reconnects involving an NSD server are those which present the highest > risk, given that NSD-related RPCs are used to write data into NSDs > > Even on clusters that have not been subjected to disconnects/reconnects > before, such events might still happen in the future in case of network > glitches. It's then recommended that an efix for the problem be applied in > a timely fashion. > > > Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 > > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------ > ------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/ > forums/html/forum?id=11111111-0000-0000-0000-000000000479. > > If your query concerns a potential software error in Spectrum Scale (GPFS) > and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > > > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM > Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale > (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file > system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Can anyone from the Scale team comment? > > Anytime I see ?may result in file system corruption or undetected file > data corruption? it gets my attention. 
> > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > *Storage * > IBM My Notifications > Check out the *IBM Electronic Support* > > > > IBM Spectrum Scale > *: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect > function may result in file system corruption or undetected file data > corruption* > > IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 > levels, in which resending an NSD RPC after a network reconnect function > may result in file system corruption or undetected file data corruption. > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r= > IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=- > AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Tue Oct 10 23:15:11 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 00:15:11 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. 
While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. 
Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bdeluca at gmail.com Wed Oct 11 05:40:21 2017 From: bdeluca at gmail.com (Ben De Luca) Date: Wed, 11 Oct 2017 06:40:21 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: > Hi, I understood the failure to occur requires that the RPC payload of > the RPC resent without actual header can be mistaken for a valid RPC > header. The resend mechanism is probably not considering what the actual > content/target the RPC has. > So, in principle, the RPC could be to update a data block, or a metadata > block - so it may hit just a single data file or corrupt your entire file > system. > However, I think the likelihood that the RPC content can go as valid RPC > header is very low. > > > Mit freundlichen Gr??en / Kind regards > > > Dr. Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Thomas Wolter, Sven Schoo? 
> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org > Date: 10/10/2017 08:52 PM > Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum > Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in > file system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > does this corrupt the entire filesystem or just the open files that are > being written too? > > One is horrific and the other is just mildly bad. > > On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: > Bob, > > The problem may occur when the TCP connection is broken between two nodes. > While in the vast majority of the cases when data stops flowing through > the connection, the result is one of the nodes getting expelled, there are > cases where the TCP connection simply breaks -- that is relatively rare > but happens on occasion. There is logic in the mmfsd daemon to detect the > disconnection and attempt to reconnect to the destination in question. If > the reconnect is successful then steps are taken to recover the state kept > by the daemons, and that includes resending some RPCs that were in flight > when the disconnection took place. > > As the flash describes, a problem in the logic to resend some RPCs was > causing one of the RPC headers to be omitted, resulting in the RPC data to > be interpreted as the (missing) header. Normally the result is an assert > on the receiving end, like the "logAssertFailed: !"Request and queue size > mismatch" assert described in the flash. However, it's at least > conceivable (though expected to very rare) that the content of the RPC > data could be interpreted as a valid RPC header. In the case of an RPC > which involves data transfer between an NSD client and NSD server, that > might result in incorrect data being written to some NSD device. > > Disconnect/reconnect scenarios appear to be uncommon. An entry like > > [N] Reconnected to xxx.xxx.xxx.xxx nodename > > in mmfs.log would be an indication that a reconnect has occurred. By > itself, the reconnect will not imply that data or the file system was > corrupted, since that will depend on what RPCs were pending when the > connection happened. In the case the assert above is hit, no corruption is > expected, since the daemon will go down before incorrect data gets > written. > > Reconnects involving an NSD server are those which present the highest > risk, given that NSD-related RPCs are used to write data into NSDs > > Even on clusters that have not been subjected to disconnects/reconnects > before, such events might still happen in the future in case of network > glitches. It's then recommended that an efix for the problem be applied in > a timely fashion. > > > Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 > > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------ > ------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/ > forums/html/forum?id=11111111-0000-0000-0000-000000000479 > . 
> > If your query concerns a potential software error in Spectrum Scale (GPFS) > and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > > > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM > Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale > (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file > system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Can anyone from the Scale team comment? > > Anytime I see ?may result in file system corruption or undetected file > data corruption? it gets my attention. > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > > > > > > > > Storage > IBM My Notifications > Check out the IBM Electronic Support > > > > > > > > IBM Spectrum Scale > > > > : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect > function may result in file system corruption or undetected file data > corruption > > > > IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 > levels, in which resending an NSD RPC after a network reconnect function > may result in file system corruption or undetected file data corruption. > > > > > > > > > > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r= > IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=- > AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Wed Oct 11 07:08:33 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 11 Oct 2017 06:08:33 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? 
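
A rough way to check one's own exposure -- just a sketch, the definitive statement on which configurations are affected should of course come from IBM -- is to look at the NSD-to-device mapping:

mmlsnsd -M

Nodes that report a local device path for an NSD are reading and writing it directly over the SAN; nodes without a local path go through the NSD client/server RPC path that the flash describes.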
Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. 
However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. 
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From arc at b4restore.com Wed Oct 11 08:46:03 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 07:46:03 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Oct 11 09:01:47 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 11 Oct 2017 09:01:47 +0100 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Message-ID: <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From arc at b4restore.com Wed Oct 11 09:18:01 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 08:18:01 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> Message-ID: <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> Hi Jonathan, Yes I thought about that but the system is located at a customer site and they are not willing to do that, unfortunately. That's why I was hoping there was a way around it Andi R. Christiansen -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard Sent: 11. oktober 2017 10:02 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 09:32:37 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 08:32:37 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Message-ID: I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given its cluster configuration, I wouldn't be doing this on a cluster I cared about with explicit instruction from IBM support. So I suggest log a ticket with IBM. Simon From: > on behalf of "arc at b4restore.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 11 October 2017 at 08:46 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. 
But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale

Venlig hilsen / Best Regards

Andi R. Christiansen

From S.J.Thompson at bham.ac.uk Wed Oct 11 09:46:46 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support))
Date: Wed, 11 Oct 2017 08:46:46 +0000
Subject: [gpfsug-discuss] Checking a file-system for errors
Message-ID:

I'm just wondering if anyone could share any views on checking a file-system for errors.

For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode?

And then where does mmrestripefs -c fit into this?

"-c
 Scans the file system and compares replicas of
 metadata and data for conflicts. When conflicts
 are found, the -c option attempts to fix
 the replicas.
"

Which sort of sounds like it fixes things in the file-system, so how does that intersect (if at all) with mmfsck?

Thanks

Simon

From jonathan.buzzard at strath.ac.uk Wed Oct 11 09:53:34 2017
From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard)
Date: Wed, 11 Oct 2017 09:53:34 +0100
Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.
In-Reply-To: <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com>
References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com>
Message-ID: <1507712014.9906.5.camel@strath.ac.uk>

On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote:
> Hi Jonathan,
>
> Yes I thought about that but the system is located at a customer site
> and they are not willing to do that, unfortunately.
>
> That's why I was hoping there was a way around it
>

I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout.

JAB.

-- 
Jonathan A. Buzzard        Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG

From arc at b4restore.com Wed Oct 11 09:59:20 2017
From: arc at b4restore.com (Andi Rhod Christiansen)
Date: Wed, 11 Oct 2017 08:59:20 +0000
Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network.
In-Reply-To: <1507712014.9906.5.camel@strath.ac.uk>
References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> <1507712014.9906.5.camel@strath.ac.uk>
Message-ID:

Yes, I think my last resort might be to go to the customer with a separate switch and do the reconfiguration.

Thanks

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard
Sent: 11.
oktober 2017 10:54 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote: > Hi Jonathan, > > Yes I thought about that but the system is located at a customer site > and they are not willing to do that, unfortunately. > > That's why I was hoping there was a way around it > I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From arc at b4restore.com Wed Oct 11 10:02:08 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 09:02:08 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Message-ID: <674e2c9b6c3f450b8f85b2d36a504597@B4RWEX01.internal.b4restore.com> Hi Simon, I will do that before I go to the customer with a separate switch as a last resort :) Thanks Venlig hilsen / Best Regards Andi Rhod Christiansen From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: 11. oktober 2017 10:33 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given its cluster configuration, I wouldn't be doing this on a cluster I cared about with explicit instruction from IBM support. So I suggest log a ticket with IBM. Simon From: > on behalf of "arc at b4restore.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 11 October 2017 at 08:46 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Wed Oct 11 11:19:13 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 12:19:13 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Hm , mmfsck will return not very reliable results in online mode, especially it will report many issues which are just due to the transient states in a files system in operation. It should however not find less issues than in off-line mode. mmrestripefs -c does not do any logical checks, it just checks for differences of multiple replicas of the same data/metadata. File system errors can be caused by such discrepancies (if an odd/corrupt replica is used by the GPFS), but can also be caused (probably more likely) by logical errors / bugs when metadata were modified in the file system. In those cases, all the replicas are identical nevertheless corrupt (cannot be found by mmrestripefs) So, mmrestripefs -c is like scrubbing for silent data corruption (on its own, it cannot decide which is the correct replica!), while mmfsck checks the filesystem structure for logical consistency. If the contents of the replicas of a data block differ, mmfsck won't see any problem (as long as the fs metadata are consistent), but mmrestripefs -c will. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 10/11/2017 10:47 AM Subject: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org I'm just wondering if anyone could share any views on checking a file-system for errors. For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode? And then were does mmrestripefs -c fit into this? "-c Scans the file system and compares replicas of metadata and data for conflicts. When conflicts are found, the -c option attempts to fix the replicas. " Which sorta sounds like fix things in the file-system, so how does that intersect (if at all) with mmfsck? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 11:31:53 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 10:31:53 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? 
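(For concreteness, the sort of non-destructive sequence being discussed here might look like the following; "gpfs0" is just a placeholder device name and the flags are worth double-checking against the mmfsck and mmrestripefs man pages for the release in use:

   # online structural check, report only, fix nothing
   mmfsck gpfs0 -o -n

   # scan the file system and compare data/metadata replicas
   mmrestripefs gpfs0 -c

Note that, unlike "mmfsck -n", mmrestripefs -c will attempt to repair any replica mismatches it finds, so it is not purely a report.)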
Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? > >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Wed Oct 11 11:58:52 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 12:58:52 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: If you do both, you are on the safe side. 
I am not sure wether mmfsck reads both replica of the metadata (if it it does, than one could spare the mmrestripefs -c WRT metadata, but I don't think so), if not, one could still have luckily checked using valid metadata where maybe one (or more) MD block has (have) an invalid replica which might come up another time ... But the mmfsrestripefs -c is not only ensuring the sanity of the FS but also of the data stored within (which is not necessarily the same). Mostly, however, filesystem checks are only done if fs issues are indicated by errors in the logs. Do you have reason to assume your fs has probs? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 12:32 PM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 
7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? > >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 12:22:26 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 11:22:26 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Yes I get we should only be doing this if we think we have a problem. And the answer is, right now, we're not entirely clear. We have a couple of issues our users are reporting to us, and its not clear to us if they are related, an FS problem or ACLs getting in the way. We do have users who are trying to work on files getting IO error, and we have an AFM sync issue. The disks are all online, I poked the FS with tsdbfs and the files look OK - (small files, but content of the block matches). Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error reported to user when they access a file even if its not an offline file??) We have a PMR open with IBM on this already. But there's a wanting to be sure in our own minds that we don't have an underlying FS problem. I.e. I have confidence that I can tell my users, yes I know you are seeing weird stuff, but we have run checks and are not introducing data corruption. Simon On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Mostly, however, filesystem checks are only done if fs issues are >indicated by errors in the logs. Do you have reason to assume your fs has >probs? 
From stockf at us.ibm.com Wed Oct 11 12:55:18 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 11 Oct 2017 07:55:18 -0400 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Generally you should not run mmfsck unless you see MMFS_FSSTRUCT errors in your system logs. To my knowledge online mmfsck only checks for a subset of problems, notably lost blocks, but that situation does not indicate any problems with the file system. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 06:32 AM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr????en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch??ftsf??hrung: >Thomas Wolter, Sven Schoo?? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? 
> >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Wed Oct 11 13:30:49 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 11 Oct 2017 08:30:49 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? 
Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. 
However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. 
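(A quick way to check whether a cluster has actually seen the reconnect events described above is to grep the GPFS logs on every node, for example along these lines; mmdsh is assumed to be available under /usr/lpp/mmfs/bin, and any parallel ssh tool would do just as well:

   # look for reconnect events in the current GPFS log on all nodes
   /usr/lpp/mmfs/bin/mmdsh -N all 'grep "Reconnected to" /var/adm/ras/mmfs.log.latest'

As noted above, a hit here only shows that a reconnect happened, not that anything was corrupted, but it is a reasonable prompt to schedule the efix sooner rather than later.)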
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Wed Oct 11 15:01:54 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 16:01:54 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Usually, IO errors point to some basic problem reading/writing data . if there are repoducible errors, it's IMHO always a nice thing to trace GPFS for such an access. Often that reveals already the area where the cause lies and maybe even the details of it. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 01:22 PM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org Yes I get we should only be doing this if we think we have a problem. And the answer is, right now, we're not entirely clear. We have a couple of issues our users are reporting to us, and its not clear to us if they are related, an FS problem or ACLs getting in the way. We do have users who are trying to work on files getting IO error, and we have an AFM sync issue. The disks are all online, I poked the FS with tsdbfs and the files look OK - (small files, but content of the block matches). Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error reported to user when they access a file even if its not an offline file??) We have a PMR open with IBM on this already. But there's a wanting to be sure in our own minds that we don't have an underlying FS problem. I.e. 
I have confidence that I can tell my users, yes I know you are seeing weird stuff, but we have run checks and are not introducing data corruption. Simon On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Mostly, however, filesystem checks are only done if fs issues are >indicated by errors in the logs. Do you have reason to assume your fs has >probs? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 15:13:03 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 14:13:03 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: So with the help of IBM support and Venkat (thanks guys!), we think its a problem with DMAPI. As we initially saw this as an issue with AFM replication, we had traces from there, and had entries like: gpfsWrite exit: failed err 688 Now apparently err 688 relates to "DMAPI disposition", once we had this we were able to get someone to take a look at the HSM dsmrecalld, it was running, but had failed over to a node that wasn't able to service requests properly. (multiple NSD servers with different file-systems each running dsmrecalld, but I don't think you can scope nods XYZ to filesystem ABC but not DEF). Anyway once we got that fixed, a bunch of stuff in the AFM cache popped out (and a little poke for some stuff that hadn't updated metadata cache probably). So hopefully its now also solved for our other users. What is complicated here is that a DMAPI issue was giving intermittent IO errors, people could write into new folders, but not existing files, though I could (some sort of Schr?dinger's cat IO issue??). So hopefully we are fixed... Simon On 11/10/2017, 15:01, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Usually, IO errors point to some basic problem reading/writing data . >if there are repoducible errors, it's IMHO always a nice thing to trace >GPFS for such an access. Often that reveals already the area where the >cause lies and maybe even the details of it. > > > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: gpfsug main discussion list >Date: 10/11/2017 01:22 PM >Subject: Re: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >Yes I get we should only be doing this if we think we have a problem. > >And the answer is, right now, we're not entirely clear. 
>We have a couple of issues our users are reporting to us, and its not
>clear to us if they are related, an FS problem or ACLs getting in the way.
>
>We do have users who are trying to work on files getting IO error, and we
>have an AFM sync issue. The disks are all online, I poked the FS with
>tsdbfs and the files look OK - (small files, but content of the block
>matches).
>
>Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error
>reported to user when they access a file even if its not an offline
>file??)
>
>We have a PMR open with IBM on this already.
>
>But there's a wanting to be sure in our own minds that we don't have an
>underlying FS problem. I.e. I have confidence that I can tell my users,
>yes I know you are seeing weird stuff, but we have run checks and are not
>introducing data corruption.
>
>Simon
>
>On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf
>of UWEFALKE at de.ibm.com" wrote:
>
>>Mostly, however, filesystem checks are only done if fs issues are
>>indicated by errors in the logs. Do you have reason to assume your fs has
>>probs?
>
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From truongv at us.ibm.com Wed Oct 11 17:14:21 2017
From: truongv at us.ibm.com (Truong Vu)
Date: Wed, 11 Oct 2017 12:14:21 -0400
Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to the network
In-Reply-To:
References:
Message-ID:

What you can do is create a network alias to the old IP. Run mmchnode to change the hostname/IP of the non-quorum nodes first. Make one (or more) of the nodes you just changed a quorum node. Change all of the quorum nodes that are still on old IPs to non-quorum. Then change the IPs on them.

Thanks,
Tru.
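(As a rough sketch of that sequence -- the node names, interface name and addresses below are made up, and the exact mmchnode options should be checked against the man page for the release in use:

   # on each renumbered node, put the old address back as an alias so the
   # daemon can still reach the cluster members on their configured IPs
   ip addr add 10.10.10.11/24 dev eth0 label eth0:old

   # change the non-quorum nodes over to their new names/IPs first
   mmchnode --daemon-interface=node1-new.example.com -N node1-old.example.com

   # promote an already-changed node to quorum, demote the quorum nodes
   # that are still on old IPs
   mmchnode --quorum -N node1-new.example.com
   mmchnode --nonquorum -N node2-old.example.com

   # then change those former quorum nodes the same way and remove the
   # aliases once everything is on the new addressing
   mmchnode --daemon-interface=node2-new.example.com -N node2-old.example.com
   ip addr del 10.10.10.11/24 dev eth0

This of course assumes the nodes can talk to each other again, so it complements rather than replaces the temporary VLAN / separate switch suggestions earlier in the thread.)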
From Robert.Oesterlin at nuance.com Thu Oct 12 18:41:49 2017
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert)
Date: Thu, 12 Oct 2017 17:41:49 +0000
Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details
Message-ID: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com>

SC17 is only 1 month away! Here are the GPFS/Scale User group meeting details. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers.
NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Look for signs when you arrive Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeep.patil at in.ibm.com Fri Oct 13 09:20:56 2017 From: sandeep.patil at in.ibm.com (Sandeep Ramesh) Date: Fri, 13 Oct 2017 13:50:56 +0530 Subject: [gpfsug-discuss] New Redpapers on Spectrum Scale/ESS GUI Published Message-ID: Dear Spectrum Scale User Group Members, New Redpapers on Spectrum Scale GUI and ESS GUI has been published yesterday. To help keep the community informed. Monitoring and Managing IBM Spectrum Scale Using the GUI http://www.redbooks.ibm.com/redpieces/abstracts/redp5458.html?Open Monitoring and Managing the IBM Elastic Storage Server Using the GUI http://www.redbooks.ibm.com/redpieces/abstracts/redp5471.html?Open thx Spectrum Scale Dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Fri Oct 13 10:47:39 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 13 Oct 2017 09:47:39 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> References: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> Message-ID: I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: 12 October 2017 18:42 To: gpfsug main discussion list Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details SC17 is only 1 month away! Here are the GPFS/Scale User group meeting details. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 
6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Look for signs when you arrive Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 13 13:12:59 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 12:12:59 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: >I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? I believe (hope?) that's just a placeholder, and we'll actually use the time for something more engaging... Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 09:47:39 +0000 From: "Sobey, Richard A" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="utf-8" I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? From r.sobey at imperial.ac.uk Fri Oct 13 13:45:43 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 13 Oct 2017 12:45:43 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: Actually, I was being 100% serious :) Although it's a boring topic, it's nonetheless fairly crucial and I'd like to see more about it. I won't be at SC17 unless you're livestreaming it anyway. Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie Sent: 13 October 2017 13:13 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details >I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? I believe (hope?) that's just a placeholder, and we'll actually use the time for something more engaging... 
Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 09:47:39 +0000 From: "Sobey, Richard A" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="utf-8" I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From john.hearns at asml.com Fri Oct 13 13:56:18 2017 From: john.hearns at asml.com (John Hearns) Date: Fri, 13 Oct 2017 12:56:18 +0000 Subject: [gpfsug-discuss] How to simulate an NSD failure? Message-ID: I have set up a small testbed, consisting of three nodes. Two of the nodes have a disk which is being used as an NSD. This is being done for some preparation for fun and games with some whizzy new servers. The testbed has spinning drives. I have created two NSDs and have set the data replication to 1 (this is deliberate). I am trying to fail an NSD and find which files have parts on the failed NSD. A first test with 'mmdeldisk' didn't have much effect as SpectrumScale is smart enough to copy the data off the drive. I now take the drive offline and delete it by echo offline > /sys/block/sda/device/state echo 1 > /sys/block/sda/delete Short of going to the data centre and physically pulling the drive that's a pretty final way of stopping access to a drive. I then wrote 100 files to the filesystem, the node with the NSD did log "rejecting I/O to offline device" However mmlsdisk says that this disk is status 'ready' I am going to stop that NSD and run an mmdeldisk - at which point I do expect things to go south rapidly. I just am not understanding at what point a failed write would be detected? Or once a write fails are all the subsequent writes Routed off to the active NSD(s) ?? Sorry if I am asking an idiot question. Inspector.clouseau at surete.fr -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... 
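A minimal sketch of one way to drive the failure in the question above from the GPFS side instead of deleting the block device underneath it, using only standard mm-commands. The file system name fs0 and NSD name nsd2 are placeholders, and mmfileid can take a while on a large file system:

    # Tell GPFS to stop using the NSD; with data replication 1 its data becomes unavailable
    mmchdisk fs0 stop -d "nsd2"

    # Availability should now show "down" even though the status column may still read "ready"
    mmlsdisk fs0 -d nsd2

    # List the files that have blocks on that NSD
    mmfileid fs0 -d nsd2

    # Bring it back afterwards (GPFS revalidates the disk on start)
    mmchdisk fs0 start -d "nsd2"

The status/availability distinction is probably why mmlsdisk looks like it is lying after the echo trick above: "ready" is the administrative status, while availability is what flips to "down" once GPFS actually notices failed I/O on the device.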
URL: From S.J.Thompson at bham.ac.uk Fri Oct 13 14:38:26 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Fri, 13 Oct 2017 13:38:26 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: The slides from the Manchester meeting are at: http://files.gpfsug.org/presentations/2017/Manchester/09_licensing-update.p df We moved all of our socket licenses to per TB DME earlier this year, and then also have DME per drive for our Lenovo DSS-G system, which for various reasons is in a different cluster There are certainly people in IBM UK who understand this process if that was something you wanted to look at. Simon On 13/10/2017, 13:45, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Sobey, Richard A" wrote: >Actually, I was being 100% serious :) Although it's a boring topic, it's >nonetheless fairly crucial and I'd like to see more about it. I won't be >at SC17 unless you're livestreaming it anyway. > >Richard > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie >Sent: 13 October 2017 13:13 >To: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration >and program details > >>I *need* to see the presentation from the licensing session ? everyone?s >>favourite topic ? > > >I believe (hope?) that's just a placeholder, and we'll actually use the >time for something more engaging... > > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > >Message: 3 >Date: Fri, 13 Oct 2017 09:47:39 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > utlook.com> > >Content-Type: text/plain; charset="utf-8" > >I *need* to see the presentation from the licensing session ? everyone?s >favourite topic ? > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From heiner.billich at psi.ch Fri Oct 13 15:15:53 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Fri, 13 Oct 2017 14:15:53 +0000 Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home Message-ID: <94041E4C-3978-4D39-86EA-79629FC17AB8@psi.ch> Hello, Running an AFM IW cache we noticed that AFM starts the flushing of data from cache to home rather slow, say at 20MB/s, and only slowly increases to several 100MB/s after a few minutes. As soon as the pending queue gets no longer filled the data rate drops, again. I assume that this is a good behavior for WAN traffic where you don?t want to use the full bandwidth from the beginning but only if really needed. For our local setup with dedicated links I would prefer a much more aggressive behavior to get data transferred asap to home. Am I right, does AFM implement such a ?slow startup?, and is there a way to change this behavior? We did increase afmNumFlushThreads to 128. Currently we measure with many small files (1MB). 
For large files the behavior is different, we get a stable data rate from the beginning, but I did not yet try with a continuous write on the cache to see whether I see an increase after a while, too. Thank you, Heiner Billich -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From carlz at us.ibm.com Fri Oct 13 15:46:47 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 14:46:47 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 -Registrationandprogram details In-Reply-To: References: Message-ID: Hi Richard, I'm always happy to have a separate conversation if you have any questions about licensing. Ping me on my email address below. Same goes for anybody else who won't be at SC17. Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com >------------------------------ > >Message: 2 >Date: Fri, 13 Oct 2017 12:45:43 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > rod.outlook.com> > >Content-Type: text/plain; charset="us-ascii" > >Actually, I was being 100% serious :) Although it's a boring topic, >it's nonetheless fairly crucial and I'd like to see more about it. I >won't be at SC17 unless you're livestreaming it anyway. > >Richard > >won't be >>at SC17 unless you're livestreaming it anyway. >> >>Richard >> From sfadden at us.ibm.com Fri Oct 13 16:56:56 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Fri, 13 Oct 2017 15:56:56 +0000 Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Fri Oct 13 17:32:35 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Fri, 13 Oct 2017 16:32:35 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From alex at calicolabs.com Fri Oct 13 17:53:40 2017 From: alex at calicolabs.com (Alex Chekholko) Date: Fri, 13 Oct 2017 09:53:40 -0700 Subject: [gpfsug-discuss] How to simulate an NSD failure? In-Reply-To: References: Message-ID: John, I think a "philosophical" difference between GPFS code and newer filesystems which were written later, in the age of "commodity hardware", is that GPFS expects the underlying hardware to be very reliable. So "disks" are typically RAID arrays available via multiple paths. And network links should have no errors, and be highly reliable, etc. GPFS does not detect these things well as it does not expect them to fail. That's why you see some discussions around "improving network diagnostics" and "improving troubleshooting tools" and things like that. Having a failed NSD is highly unusual for a GPFS system and you should design your system so that situation does not happen. In your example here, if data is striped across two NSDs and one of them becomes inaccessible, when a client tries to write, it should get an I/O error, and perhaps even unmount the filesystem (depending on where you metadata lives). Regards, Alex On Fri, Oct 13, 2017 at 5:56 AM, John Hearns wrote: > I have set up a small testbed, consisting of three nodes. Two of the nodes > have a disk which is being used as an NSD. > > This is being done for some preparation for fun and games with some whizzy > new servers. 
The testbed has spinning drives. > > I have created two NSDs and have set the data replication to 1 (this is > deliberate). > > I am trying to fail an NSD and find which files have parts on the failed > NSD. > > A first test with ?mmdeldisk? didn?t have much effect as SpectrumScale is > smart enough to copy the data off the drive. > > > > I now take the drive offline and delete it by > > echo offline > /sys/block/sda/device/state > > echo 1 > /sys/block/sda/delete > > > > Short of going to the data centre and physically pulling the drive that?s > a pretty final way of stopping access to a drive. > > I then wrote 100 files to the filesystem, the node with the NSD did log > ?rejecting I/O to offline device? > > However mmlsdisk says that this disk is status ?ready? > > > > I am going to stop that NSD and run an mmdeldisk ? at which point I do > expect things to go south rapidly. > > I just am not understanding at what point a failed write would be > detected? Or once a write fails are all the subsequent writes > > Routed off to the active NSD(s) ?? > > > > Sorry if I am asking an idiot question. > > > > Inspector.clouseau at surete.fr > > > > > > > > > > > > > > > > > > > > > > > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mhabib73 at gmail.com Fri Oct 13 18:48:57 2017 From: mhabib73 at gmail.com (Muhammad Habib) Date: Fri, 13 Oct 2017 13:48:57 -0400 Subject: [gpfsug-discuss] How to simulate an NSD failure? In-Reply-To: References: Message-ID: If your devices/disks are multipath , make sure you remove all paths in order for disk to go offline. Also following line does not see correct: echo 1 > /sys/block/sda/delete , it should rather be echo 1 > /sys/block/sda/device/delete Further after you removed the disks , did you run the fdisk -l , to make sure its completely gone , also if the /var/log/messages confirms the disk is offline. Once all this confirmed then GPFS should take disks down and logs should tell you as well. Thanks M.Habib On Fri, Oct 13, 2017 at 8:56 AM, John Hearns wrote: > I have set up a small testbed, consisting of three nodes. Two of the nodes > have a disk which is being used as an NSD. > > This is being done for some preparation for fun and games with some whizzy > new servers. The testbed has spinning drives. > > I have created two NSDs and have set the data replication to 1 (this is > deliberate). 
> > I am trying to fail an NSD and find which files have parts on the failed > NSD. > > A first test with ?mmdeldisk? didn?t have much effect as SpectrumScale is > smart enough to copy the data off the drive. > > > > I now take the drive offline and delete it by > > echo offline > /sys/block/sda/device/state > > echo 1 > /sys/block/sda/delete > > > > Short of going to the data centre and physically pulling the drive that?s > a pretty final way of stopping access to a drive. > > I then wrote 100 files to the filesystem, the node with the NSD did log > ?rejecting I/O to offline device? > > However mmlsdisk says that this disk is status ?ready? > > > > I am going to stop that NSD and run an mmdeldisk ? at which point I do > expect things to go south rapidly. > > I just am not understanding at what point a failed write would be > detected? Or once a write fails are all the subsequent writes > > Routed off to the active NSD(s) ?? > > > > Sorry if I am asking an idiot question. > > > > Inspector.clouseau at surete.fr > > > > > > > > > > > > > > > > > > > > > > > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- This communication contains confidential information intended only for the persons to whom it is addressed. Any other distribution, copying or disclosure is strictly prohibited. If you have received this communication in error, please notify the sender and delete this e-mail message immediately. Le pr?sent message contient des renseignements de nature confidentielle r?serv?s uniquement ? l'usage du destinataire. Toute diffusion, distribution, divulgation, utilisation ou reproduction de la pr?sente communication, et de tout fichier qui y est joint, est strictement interdite. Si vous avez re?u le pr?sent message ?lectronique par erreur, veuillez informer imm?diatement l'exp?diteur et supprimer le message de votre ordinateur et de votre serveur. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gcorneau at us.ibm.com Fri Oct 13 19:50:05 2017 From: gcorneau at us.ibm.com (Glen Corneau) Date: Fri, 13 Oct 2017 13:50:05 -0500 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: , Message-ID: The original announcement letter for Spectrum Scale Data Management Edition (US version reference below) re-defines a terabyte (nice eh?, should be a tebibyte). 
My math agrees with yours, assuming your file system size is actually 1PB versus 1PiB Terabyte Terabyte is a unit of measure by which the Program can be licensed. A terabyte is 2 to the 40th power bytes. Licensee must obtain an entitlement for each terabyte available to the Program. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=an&subtype=ca&appname=gpateam&supplier=897&letternum=ENUS216-158 ------------------ Glen Corneau Power Systems Washington Systems Center gcorneau at us.ibm.com From: "Daniel Kidger" To: gpfsug-discuss at spectrumscale.org Date: 10/13/2017 11:32 AM Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Sent by: gpfsug-discuss-bounces at spectrumscale.org All, For me the URL looks to have got mangled. Should be: http://files.gpfsug.org/presentations/2017/Manchester/09_licensing-update.pdf with the index page that points to it here: http://www.spectrumscale.org/presentations/ Also a personal note, I have found increasing confusion of decimal v. binary units for storage capacity. I understand that Spectrum Scale uses binary TiB, but say an 1TB drive is in decimal so c. 10% difference. So a 1 Petabyte filesystem needs only 909 TiB of Spectrum Scale licenses. Any comments from others? Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: "Simon Thompson (IT Research Support)" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Date: Fri, Oct 13, 2017 2:38 PM The slides from the Manchester meeting are at: https://urldefense.proofpoint.com/v2/url?u=http-3A__files.gpfsug.org_presentations_2017_Manchester_09-5Flicensing-2Dupdate.p&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=f6qsuSorl92LShV92TTaXNyG3KU0VvuFN4YhT_LTTFc&e= df We moved all of our socket licenses to per TB DME earlier this year, and then also have DME per drive for our Lenovo DSS-G system, which for various reasons is in a different cluster There are certainly people in IBM UK who understand this process if that was something you wanted to look at. Simon On 13/10/2017, 13:45, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Sobey, Richard A" wrote: >Actually, I was being 100% serious :) Although it's a boring topic, it's >nonetheless fairly crucial and I'd like to see more about it. I won't be >at SC17 unless you're livestreaming it anyway. > >Richard > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie >Sent: 13 October 2017 13:13 >To: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration >and program details > >>I *need* to see the presentation from the licensing session ? everyone?s >>favourite topic ? > > >I believe (hope?) that's just a placeholder, and we'll actually use the >time for something more engaging... 
> > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > >Message: 3 >Date: Fri, 13 Oct 2017 09:47:39 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > utlook.com> > >Content-Type: text/plain; charset="utf-8" > >I *need* to see the presentation from the licensing session ? everyone?s >favourite topic ? > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=d-vphLEe_UlGazP6RdYAyyAA3Qv5S9IRVNuO1i9vjJc&m=rOPfwzvHMD3_MRZy2WHgOGtmYQya-jWx5d_s92EeJRk&s=LkQ4lwnC-ATFnHjydppCXDasUDijS9DUh0p-cFaM0NM&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 13 20:10:56 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 19:10:56 +0000 Subject: [gpfsug-discuss] Scale per TB (was: User group Meeting at SC17 - Registration and program details) In-Reply-To: References: Message-ID: Yeah, I know... It's actually an IBM thing, not just a Scale thing. Some time in the distant past, IBM decided that too few people were familiar with the term "tebibyte" or its official abbreviation "TiB", so in the IBM licensing catalog there is the "Terabyte" (really a tebibyte) and the "Decimal Terabyte" (an actual terabyte). When we made the capacity license we had to decide which one to use, and we decided to err on the side of giving people the larger amount. 
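For anyone who wants to check the arithmetic behind the 909 figure quoted earlier in the thread, the gap is just the ratio between the decimal and binary units; a quick sanity check from any shell, with no Scale commands involved:

    # 1 PB of decimal bytes expressed in binary tebibytes (1 TiB = 2^40 bytes)
    awk 'BEGIN { printf "%.1f TiB\n", 1e15 / 2^40 }'
    # prints: 909.5 TiB

    # the roughly 10% difference between a decimal TB and a binary TiB
    awk 'BEGIN { printf "%.2f%%\n", (2^40 / 1e12 - 1) * 100 }'
    # prints: 9.95%

So licensing in binary terabytes, as described above, needs slightly fewer capacity units for the same decimal-rated hardware.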
Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 13:50:05 -0500 From: "Glen Corneau" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="us-ascii" The original announcement letter for Spectrum Scale Data Management Edition (US version reference below) re-defines a terabyte (nice eh?, should be a tebibyte). My math agrees with yours, assuming your file system size is actually 1PB versus 1PiB Terabyte Terabyte is a unit of measure by which the Program can be licensed. A terabyte is 2 to the 40th power bytes. Licensee must obtain an entitlement for each terabyte available to the Program. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=an&subtype=ca&appname=gpateam&supplier=897&letternum=ENUS216-158 ------------------ Glen Corneau Power Systems Washington Systems Center gcorneau at us.ibm.com From: "Daniel Kidger" To: gpfsug-discuss at spectrumscale.org Date: 10/13/2017 11:32 AM Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Sent by: gpfsug-discuss-bounces at spectrumscale.org All, For me the URL looks to have got mangled. Should be: https://urldefense.proofpoint.com/v2/url?u=http-3A__files.gpfsug.org_presentations_2017_Manchester_09-5Flicensing-2Dupdate.pdf&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=obB2s7QQTgU9QMn1708Vpg&m=6slj4_SM9ZLHQtguBkVK7Xg2UN1RlDFCjKJoGhWn2wU&s=NU2Hs398IPSytPh8bYplXjFChhaF9G21Pt4YoHvbrPY&e= with the index page that points to it here: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.spectrumscale.org_presentations_&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=obB2s7QQTgU9QMn1708Vpg&m=6slj4_SM9ZLHQtguBkVK7Xg2UN1RlDFCjKJoGhWn2wU&s=CLN7JkpjQsfPdvOapYPGX3o7gHZj8AOh7tYSusTZJPE&e= Also a personal note, I have found increasing confusion of decimal v. binary units for storage capacity. I understand that Spectrum Scale uses binary TiB, but say an 1TB drive is in decimal so c. 10% difference. So a 1 Petabyte filesystem needs only 909 TiB of Spectrum Scale licenses. Any comments from others? Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com From a.khiredine at meteo.dz Sun Oct 15 13:44:42 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Sun, 15 Oct 2017 12:44:42 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Dear All, Is there a way to save the GPS configuration? 
OR how backup all GSS no backup of data or metadata only configuration for disaster recovery for example: stanza vdisk pdisk RAID code recovery group array Thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz From skylar2 at u.washington.edu Mon Oct 16 14:29:33 2017 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Mon, 16 Oct 2017 13:29:33 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server In-Reply-To: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> References: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Message-ID: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu> I'm not familiar with GSS, but we have a script that executes the following before backing up a GPFS filesystem so that we have human-readable configuration information: mmlsconfig mmlsnsd mmlscluster mmlsnode mmlsdisk ${FS_NAME} -L mmlsfileset ${FS_NAME} -L mmlspool ${FS_NAME} all -L mmlslicense -L mmlspolicy ${FS_NAME} -L And then executes this for the benefit of GPFS: mmbackupconfig Of course there's quite a bit of overlap for clusters that have more than one filesystem, and even more for filesystems that we backup at the fileset level, but disk is cheap and the hope is it'll make a DR scenario a little bit less harrowing. On Sun, Oct 15, 2017 at 12:44:42PM +0000, atmane khiredine wrote: > Dear All, > > Is there a way to save the GPS configuration? > > OR how backup all GSS > > no backup of data or metadata only configuration for disaster recovery > > for example: > stanza > vdisk > pdisk > RAID code > recovery group > array > > Thank you > > Atmane Khiredine > HPC System Administrator | Office National de la M?t?orologie > T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From heiner.billich at psi.ch Mon Oct 16 14:36:09 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Mon, 16 Oct 2017 13:36:09 +0000 Subject: [gpfsug-discuss] slow startup of AFM flush to home Message-ID: Hello Scott, Thank you. I did set afmFlushThreadDelay = 1 and did get a much faster startup. Setting to 0 didn?t improve further. I?m not sure how much we?ll need this in production when most of the time the queue is full. But for benchmarking during setup it?s helps a lot. (we run 4.2.3-4 on RHEL7) Kind regards, Heiner Scott Fadden did write: When an AFM gateway is flushing data to the target (home) it starts flushing with a few threads (Don't remember the number) and ramps up to afmNumFlushThreads. How quickly this ramp up occurs is controlled by afmFlushThreadDealy. The default is 5 seconds. So flushing only adds threads once every 5 seconds. This was an experimental parameter so your milage may vary. 
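A short sketch of how the two tunables discussed here might be applied. afmNumFlushThreads is a regular per-fileset AFM attribute; afmFlushThreadDelay is the experimental parameter Scott describes, and it is an assumption here that it is accepted as a cluster-wide mmchconfig attribute on this code level. fs0 and cachefset are placeholder names:

    # Raise the number of AFM flush threads on the cache fileset
    mmchfileset fs0 cachefset -p afmNumFlushThreads=128

    # Assumption: the experimental ramp-up delay (default 5 seconds) can be set via mmchconfig
    mmchconfig afmFlushThreadDelay=1 -i

    # Check the AFM attributes now in effect on the fileset
    mmlsfileset fs0 cachefset --afm -L

Since the delay parameter is experimental, it is worth re-checking after an upgrade that it still exists and still behaves the same way.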
Scott Fadden Spectrum Scale - Technical Marketing Phone: (503) 880-5833 sfadden at us.ibm.com http://www.ibm.com/systems/storage/spectrum/scale ----- Original message ----- From: "Billich Heinrich Rainer (PSI)" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home Date: Fri, Oct 13, 2017 10:16 AM Hello, Running an AFM IW cache we noticed that AFM starts the flushing of data from cache to home rather slow, say at 20MB/s, and only slowly increases to several 100MB/s after a few minutes. As soon as the pending queue gets no longer filled the data rate drops, again. I assume that this is a good behavior for WAN traffic where you don???t want to use the full bandwidth from the beginning but only if really needed. For our local setup with dedicated links I would prefer a much more aggressive behavior to get data transferred asap to home. Am I right, does AFM implement such a ???slow startup???, and is there a way to change this behavior? We did increase afmNumFlushThreads to 128. Currently we measure with many small files (1MB). For large files the behavior is different, we get a stable data rate from the beginning, but I did not yet try with a continuous write on the cache to see whether I see an increase after a while, too. Thank you, Heiner Billich -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From sfadden at us.ibm.com Mon Oct 16 16:34:33 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Mon, 16 Oct 2017 15:34:33 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server In-Reply-To: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu> References: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu>, <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Message-ID: An HTML attachment was scrubbed... URL: From er.a.ross at gmail.com Fri Oct 20 03:15:38 2017 From: er.a.ross at gmail.com (Eric Ross) Date: Thu, 19 Oct 2017 21:15:38 -0500 Subject: [gpfsug-discuss] file auditing capabilities Message-ID: I'm researching the file auditing capabilities possible with GPFS; I found this paper on the GPFS wiki: https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/f0cc9b82-a133-41b4-83fe-3f560e95b35a/attachment/0ab62645-e0ab-4377-81e7-abd11879bb75/media/Spectrum_Scale_Varonis_Audit_Logging.pdf I haven't found anything else on the subject, however. While I like the idea of being able to do this logging on the protocol node level, I'm also interested in the possibility of auditing files from native GPFS mounts. Additional digging uncovered references to Lightweight Events (LWE): http://files.gpfsug.org/presentations/2016/SC16/04_Scott_Fadden_Spectrum_Scale_Update.pdf Specifically, this references being able to use the policy engine to detect things like file opens, reads, and writes. Searching through the official GPFS documentation, I see references to these events in the transparent cloud tiering section: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_define_cloud_storage_tier.htm but, I don't see, or possibly have missed, the other section(s) defining what other EVENT parameters I can use. I'm curious to know more about these events, could anyone point me in the right direction? 
I'm wondering if I could use them to perform rudimentary auditing of the file system (e.g. a default policy in place to log a message of say user foo either wrote to and/or read from file bar). Thanks, -Eric From richardb+gpfsUG at ellexus.com Fri Oct 20 15:47:57 2017 From: richardb+gpfsUG at ellexus.com (Richard Booth) Date: Fri, 20 Oct 2017 15:47:57 +0100 Subject: [gpfsug-discuss] file auditing capabilities Message-ID: Hi Eric The company I work for could possibly help with this, Ellexus . Please feel free to get in touch if you need some help with this. Cheers Richard ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Thu, 19 Oct 2017 21:15:38 -0500 >> From: Eric Ross >> To: gpfsug-discuss at spectrumscale.org >> Subject: [gpfsug-discuss] file auditing capabilities >> Message-ID: >> > ail.com> >> Content-Type: text/plain; charset="UTF-8" >> >> I'm researching the file auditing capabilities possible with GPFS; I >> found this paper on the GPFS wiki: >> >> https://www.ibm.com/developerworks/community/wikis/form/anon >> ymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ >> f0cc9b82-a133-41b4-83fe-3f560e95b35a/attachment/0ab62645- >> e0ab-4377-81e7-abd11879bb75/media/Spectrum_Scale_Varonis_ >> Audit_Logging.pdf >> >> I haven't found anything else on the subject, however. >> >> While I like the idea of being able to do this logging on the protocol >> node level, I'm also interested in the possibility of auditing files >> from native GPFS mounts. >> >> Additional digging uncovered references to Lightweight Events (LWE): >> >> http://files.gpfsug.org/presentations/2016/SC16/04_Scott_Fad >> den_Spectrum_Scale_Update.pdf >> >> Specifically, this references being able to use the policy engine to >> detect things like file opens, reads, and writes. >> >> Searching through the official GPFS documentation, I see references to >> these events in the transparent cloud tiering section: >> >> https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/ >> com.ibm.spectrum.scale.v4r22.doc/bl1adm_define_cloud_storage_tier.htm >> >> but, I don't see, or possibly have missed, the other section(s) >> defining what other EVENT parameters I can use. >> >> I'm curious to know more about these events, could anyone point me in >> the right direction? >> >> I'm wondering if I could use them to perform rudimentary auditing of >> the file system (e.g. a default policy in place to log a message of >> say user foo either wrote to and/or read from file bar). >> >> Thanks, >> -Eric >> >> >> ------------------------------ >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> End of gpfsug-discuss Digest, Vol 69, Issue 38 >> ********************************************** >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 20 20:54:38 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 20 Oct 2017 19:54:38 +0000 Subject: [gpfsug-discuss] file auditing capabilities (Eric Ross) Message-ID: Disclaimer: all statements about future functionality are subject to change, and represent intentions only. That being said: Yes, we are working on File Audit Logging native to Spectrum Scale. 
The intention is to provide auditing capabilities in a protocol agnostic manner that will capture not only audit events that come through protocols but also GPFS/Scale native file system access events. The audit logs are written to a specified GPFS/Scale fileset in a format that is both human=-readable and easily parsable for automated consumption, reporting, or whatever else you might want to do with it. Currently, we intend to release this capability with Scale 5.0. The underlying technology for this is indeed LWE, which as some of you know is also underneath some other Scale features. The use of LWE allows us to do auditing very efficiently to minimize performance impact while also allowing scalability. We do not at this time have plans to expose LWE directly for end-user consumption -- it needs to be "packaged" in a more consumable way in order to be generally supportable. However, we do have intentions to expose other functionality on top of the LWE capability in the future. Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com From Stephan.Peinkofer at lrz.de Mon Oct 23 11:41:23 2017 From: Stephan.Peinkofer at lrz.de (Peinkofer, Stephan) Date: Mon, 23 Oct 2017 10:41:23 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management Message-ID: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Oct 23 12:00:50 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 23 Oct 2017 11:00:50 +0000 Subject: [gpfsug-discuss] el7.4 compatibility In-Reply-To: References: Message-ID: Just picking up this old thread, but... October updates: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html# linux 7.4 is now listed as supported with min scale version of 4.1.1.17 or 4.2.3.4 (incidentally 4.2.3.5 looks to have been released today). Simon On 27/09/2017, 09:16, "gpfsug-discuss-bounces at spectrumscale.org on behalf of kenneth.waegeman at ugent.be" wrote: >Hi, > >Is there already some information available of gpfs (and protocols) on >el7.4 ? > >Thanks! 
> >Kenneth > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Mon Oct 23 12:09:17 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 23 Oct 2017 13:09:17 +0200 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> References: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> Message-ID: You can lower LEASE_LIFETIME and GRACE_PERIOD to shorten the time it's in grace, to make it more bearable. Making export changes dynamic is something that's fixed in newer versions of nfs-ganesha than what's shipped with Scale: https://github.com/nfs-ganesha/nfs-ganesha/releases/tag/V2.4.0: "dynamic EXPORT configuration update (via dBus and SIGHUP)" Hopefully someone can comment on when we'll see nfs-ganesha v2.4+ included with Scale. -jf On Mon, Oct 23, 2017 at 12:41 PM, Peinkofer, Stephan < Stephan.Peinkofer at lrz.de> wrote: > Dear List, > > I?m currently working on a self service portal for managing NFS exports of > ISS. Basically something very similar to OpenStack Manila but tailored to > our specific needs. > While it was very easy to do this using the great REST API of ISS, I > stumbled across a fact that may be even a show stopper: According to the > documentation for mmnfs, each time we > create/change/delete a NFS export via mmnfs, ganesha service is restarted > on all nodes. > > I assume that this behaviour may cause problems (at least IO stalls) on > clients mounted the filesystem. So my question is, what is your experience > with CES NFS export management. > Do you see any problems when you add/change/delete exports and ganesha > gets restarted? > > Are there any (supported) workarounds for this problem? > > PS: As I think in 2017 CES Exports should be manageable without service > disruptions (and ganesha provides facilities to do so), I filed an RFE for > this: https://www.ibm.com/developerworks/rfe/execute? > use_case=viewRfe&CR_ID=111918 > > Many thanks in advance. > Best Regards, > Stephan Peinkofer > -- > Stephan Peinkofer > Dipl. Inf. (FH), M. Sc. (TUM) > > Leibniz Supercomputing Centre > Data and Storage Division > Boltzmannstra?e 1, 85748 Garching b. M?nchen > Tel: +49(0)89 35831-8715 <+49%2089%20358318715> Fax: +49(0)89 > 35831-9700 <+49%2089%20358319700> > URL: http://www.lrz.de > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Mon Oct 23 12:56:07 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Mon, 23 Oct 2017 17:26:07 +0530 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: Hi Stephan, I observed ganesha service getting restarted only after adding first nfs export. For rest of the operations (e.g. adding more nfs exports, changing nfs exports, removing nfs exports); ganesha service doesn't restart. My observations are based on following simple tests. I ran them against rhel7.3 test cluster having nfs-ganesha-2.5.2. tests: 1. created 1st nfs export - ganesha service was restarted 2. created 4 more nfs exports (mmnfs export add path) 3. changed 2 nfs exports (mmnfs export change path --nfschange); 4. removed all 5 exports one by one (mmnfs export remove path) 5. 
no nfs exports after step 4 on my test system. So, created a new nfs export (which will be the 1st nfs export). 6. change nfs export created in step 5 results observed: ganesha service restarted for test 1 and test 5. For rest tests (2,3,4,6); ganesha service didn't restart. Thanks, Chetan. From: "Peinkofer, Stephan" To: "gpfsug-discuss at spectrumscale.org" Date: 10/23/2017 04:11 PM Subject: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=uic-29lyJ5TCiTRi0FyznYhKJx5I7Vzu80WyYuZ4_iM&m=ghcZYswqgF3beYOogGGLsT1RyDRZrbLXdzp3Fbjmfrg&s=TUm7BM3sY75Nc20gOfhz9lvDgYJse0TM6-tIW8I1QiI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Mon Oct 23 13:16:17 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 23 Oct 2017 12:16:17 +0000 Subject: [gpfsug-discuss] Reminder: User group Meeting at SC17 - Registration and program details Message-ID: Reminder: Register for the SC17 User Group meeting if you are heading to SC17. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 
6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Centennial E Ballroom followed by reception in Centennial D Ballroom at 5:30pm Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Stephan.Peinkofer at lrz.de Mon Oct 23 13:20:47 2017 From: Stephan.Peinkofer at lrz.de (Peinkofer, Stephan) Date: Mon, 23 Oct 2017 12:20:47 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: <5BBED5D7-5E06-453F-B839-BC199EC74720@lrz.de> Dear Chetan, interesting. I?m running ISS 4.2.3-4 and it seems to ship with nfs-ganesha-2.3.2. So are you already using a future ISS version? Here is what I see: [root at datdsst102 pr74cu-dss-0002]# mmnfs export list Path Delegations Clients ---------------------------------------------------------- /dss/dsstestfs01/pr74cu-dss-0002 NONE 10.156.29.73 /dss/dsstestfs01/pr74cu-dss-0002 NONE 10.156.29.72 [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfschange "10.156.29.72(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfschange "10.156.29.72(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfsadd "10.156.29.74(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. 
[root at datdsst102 ~]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfsremove 10.156.29.74 datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de On 23. Oct 2017, at 13:56, Chetan R Kulkarni > wrote: Hi Stephan, I observed ganesha service getting restarted only after adding first nfs export. For rest of the operations (e.g. adding more nfs exports, changing nfs exports, removing nfs exports); ganesha service doesn't restart. My observations are based on following simple tests. I ran them against rhel7.3 test cluster having nfs-ganesha-2.5.2. tests: 1. created 1st nfs export - ganesha service was restarted 2. created 4 more nfs exports (mmnfs export add path) 3. changed 2 nfs exports (mmnfs export change path --nfschange); 4. removed all 5 exports one by one (mmnfs export remove path) 5. no nfs exports after step 4 on my test system. So, created a new nfs export (which will be the 1st nfs export). 6. change nfs export created in step 5 results observed: ganesha service restarted for test 1 and test 5. For rest tests (2,3,4,6); ganesha service didn't restart. Thanks, Chetan. "Peinkofer, Stephan" ---10/23/2017 04:11:33 PM---Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically From: "Peinkofer, Stephan" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/23/2017 04:11 PM Subject: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. 
M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=uic-29lyJ5TCiTRi0FyznYhKJx5I7Vzu80WyYuZ4_iM&m=ghcZYswqgF3beYOogGGLsT1RyDRZrbLXdzp3Fbjmfrg&s=TUm7BM3sY75Nc20gOfhz9lvDgYJse0TM6-tIW8I1QiI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Oct 23 14:42:51 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 23 Oct 2017 13:42:51 +0000 Subject: [gpfsug-discuss] Rainy days and Mondays and GPFS lying to me always get me down... Message-ID: Hi All, And I?m not really down, but it is a rainy Monday morning here and GPFS did give me a scare in the last hour, so I thought that was a funny subject line. So I have a >1 PB filesystem with 3 pools: 1) the system pool, which contains metadata only, 2) the data pool, which is where all I/O goes to by default, and 3) the capacity pool, which is where old crap gets migrated to. I logged on this morning to see an alert that my data pool was 100% full. I ran an mmdf from the cluster manager and, sure enough: (pool total) 509.3T 0 ( 0%) 0 ( 0%) I immediately tried copying a file to there and it worked, so I figured GPFS must be failing writes over to the capacity pool, but an mmlsattr on the file I copied showed it being in the data pool. Hmmm. I also noticed that ?df -h? said that the filesystem had 399 TB free, while mmdf said it only had 238 TB free. Hmmm. So after some fruitless poking around I decided that whatever was going to happen, I should kill the mmrestripefs I had running on the capacity pool ? let me emphasize that ? I had a restripe running on the capacity pool only (via the ?-P? option to mmrestripefs) but it was the data pool that said it was 100% full. I?m sure many of you have already figured out where this is going ? after killing the restripe I ran mmdf again and: (pool total) 509.3T 159T ( 31%) 1.483T ( 0%) I have never seen anything like this before ? any ideas, anyone? PMR time? Thanks! Kevin From valdis.kletnieks at vt.edu Mon Oct 23 19:13:05 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 23 Oct 2017 14:13:05 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: <32917.1508782385@turing-police.cc.vt.edu> On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From bbanister at jumptrading.com Mon Oct 23 19:23:33 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 23 Oct 2017 18:23:33 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: <32917.1508782385@turing-police.cc.vt.edu> References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: This becomes very disruptive when you have to add or remove many NFS exports. Is it possible to add and remove multiple entries at a time or is this YARFE time? -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of valdis.kletnieks at vt.edu Sent: Monday, October 23, 2017 1:13 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Experience with CES NFS export management On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From stefan.dietrich at desy.de Mon Oct 23 19:34:02 2017 From: stefan.dietrich at desy.de (Dietrich, Stefan) Date: Mon, 23 Oct 2017 20:34:02 +0200 (CEST) Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <2146307210.3678055.1508783642716.JavaMail.zimbra@desy.de> Hello Bryan, at least changing multiple entries at once is possible. You can copy /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf to e.g. /tmp, modify the export (remove/add nodes or options) and load the changed config via "mmnfs export load " That way, only a single restart is issued for Ganesha on the CES nodes. Adding/removing I did not try so far, to be honest for use-cases this is rather static. Regards, Stefan ----- Original Message ----- > From: "Bryan Banister" > To: "gpfsug main discussion list" > Sent: Monday, October 23, 2017 8:23:33 PM > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > This becomes very disruptive when you have to add or remove many NFS exports. 
> Is it possible to add and remove multiple entries at a time or is this YARFE > time? > -Bryan > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of > valdis.kletnieks at vt.edu > Sent: Monday, October 23, 2017 1:13 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > > On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > >> tests: >> 1. created 1st nfs export - ganesha service was restarted >> 2. created 4 more nfs exports (mmnfs export add path) >> 3. changed 2 nfs exports (mmnfs export change path --nfschange); >> 4. removed all 5 exports one by one (mmnfs export remove path) >> 5. no nfs exports after step 4 on my test system. So, created a new nfs >> export (which will be the 1st nfs export). >> 6. change nfs export created in step 5 > > mmnfs export change --nfsadd seems to generate a restart as well. > Particularly annoying when the currently running nfs.ganesha fails to > stop rpc.statd on the way down, and then bringing it back up fails because > the port is in use.... > > ________________________________ > > Note: This email is for the confidential use of the named addressee(s) only and > may contain proprietary, confidential or privileged information. If you are not > the intended recipient, you are hereby notified that any review, dissemination > or copying of this email is strictly prohibited, and to please notify the > sender immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The Company, > therefore, does not make any guarantees as to the completeness or accuracy of > this email or any attachments. This email is for informational purposes only > and does not constitute a recommendation, offer, request or solicitation of any > kind to buy, sell, subscribe, redeem or perform any type of transaction of a > financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From valdis.kletnieks at vt.edu Mon Oct 23 19:54:35 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 23 Oct 2017 14:54:35 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <53227.1508784875@turing-police.cc.vt.edu> On Mon, 23 Oct 2017 18:23:33 -0000, Bryan Banister said: > This becomes very disruptive when you have to add or remove many NFS exports. > Is it possible to add and remove multiple entries at a time or is this YARFE time? On the one hand, 'mmnfs export change [path] --nfsadd 'client1(options);client2(options);...)' is supported. On the other hand, after the initial install's rush of new NFS exports, the chances of having more than one client to change at a time are rather low. On the gripping hand, if a client later turns up an entire cluster that needs access, you can also say --nfsadd '172.28.40.0/23(options)' and get the whole cluster in one shot. -------------- next part -------------- A non-text attachment was scrubbed... 
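To make the batching point concrete, several clients (or a whole subnet) can be folded into one mmnfs call, so Ganesha is only bounced once for the whole set. A minimal sketch, with an assumed export path and illustrative client options rather than values from this thread:

mmnfs export change /gpfs/fs0/projects \
    --nfsadd '10.0.0.1(Access_Type=RW,Squash=no_root_squash);10.0.0.2(Access_Type=RO);172.28.40.0/23(Access_Type=RO)'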
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From oehmes at gmail.com Tue Oct 24 01:28:33 2017 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 24 Oct 2017 00:28:33 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: we can not commit on timelines on mailing lists, but this is a known issue and will be addressed in a future release. sven On Mon, Oct 23, 2017, 11:23 AM Bryan Banister wrote: > This becomes very disruptive when you have to add or remove many NFS > exports. Is it possible to add and remove multiple entries at a time or is > this YARFE time? > -Bryan > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto: > gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of > valdis.kletnieks at vt.edu > Sent: Monday, October 23, 2017 1:13 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > > On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > > > tests: > > 1. created 1st nfs export - ganesha service was restarted > > 2. created 4 more nfs exports (mmnfs export add path) > > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > > 4. removed all 5 exports one by one (mmnfs export remove path) > > 5. no nfs exports after step 4 on my test system. So, created a new nfs > > export (which will be the 1st nfs export). > > 6. change nfs export created in step 5 > > mmnfs export change --nfsadd seems to generate a restart as well. > Particularly annoying when the currently running nfs.ganesha fails to > stop rpc.statd on the way down, and then bringing it back up fails because > the port is in use.... > > ________________________________ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential or privileged information. > If you are not the intended recipient, you are hereby notified that any > review, dissemination or copying of this email is strictly prohibited, and > to please notify the sender immediately and destroy this email and any > attachments. Email transmission cannot be guaranteed to be secure or > error-free. The Company, therefore, does not make any guarantees as to the > completeness or accuracy of this email or any attachments. This email is > for informational purposes only and does not constitute a recommendation, > offer, request or solicitation of any kind to buy, sell, subscribe, redeem > or perform any type of transaction of a financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mnaineni at in.ibm.com Tue Oct 24 08:57:29 2017 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Tue, 24 Oct 2017 13:27:29 +0530 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: As others have answered, 4.2.3 spectrum can add or remove exports without restarting nfs-ganesha service. Changing an existing export does need nfs-ganesha restart though. If you want to change multiple existing exports, you could use undocumented option "--nfsnorestart" to mmnfs. 
This should add export changes to NFS configuration but it won't restart nfs-ganesha service, so you will not see immediate results of your changes in the running server. Whenever you want your changes reflected, you could manually restart the service using "mmces" command. Regards, Malahal. From: Bryan Banister To: gpfsug main discussion list Date: 10/23/2017 11:53 PM Subject: Re: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org This becomes very disruptive when you have to add or remove many NFS exports. Is it possible to add and remove multiple entries at a time or is this YARFE time? -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of valdis.kletnieks at vt.edu Sent: Monday, October 23, 2017 1:13 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Experience with CES NFS export management On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=dhIJJ5KI4U6ZUia7OPi_-AC3qBrYV9n93ww8Ffhl468&s=K4ii44lk1_auA_3g7SN-E1zmMZNtc1PqBSiQJVudc_w&e= -------------- next part -------------- An HTML attachment was scrubbed... 
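Combining the suggestions in this thread, bulk changes can be staged and then applied with a single disruption. This is a sketch only: the paths and addresses are invented, --nfsnorestart is the undocumented flag mentioned above, and the exact mmnfs/mmces syntax should be checked against the installed release:

# stage several changes; Ganesha keeps serving the old exports in the meantime
mmnfs export change /gpfs/fs0/export1 --nfsnorestart --nfsadd '10.0.0.3(Access_Type=RW)'
mmnfs export change /gpfs/fs0/export2 --nfsnorestart --nfsremove 10.0.0.4
# apply the staged configuration at a convenient time
mmces service stop NFS -a
mmces service start NFS -a
# alternative for many edits at once, as suggested earlier in the thread:
# copy /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf, edit the copy, then
mmnfs export load /tmp/gpfs.ganesha.exports.conf    # one restart covers all changes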
URL: From a.khiredine at meteo.dz Tue Oct 24 10:20:25 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Tue, 24 Oct 2017 09:20:25 +0000 Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6340C7B0@SDEB-EXC02.meteo.dz> Dear All we owning a solution for our HPC a GSS gpfs ??storage server native raid I noticed 3 days ago that a disk shows a single path my configuration is as follows GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions if I search with fdisk I have the following result 476 disk in GSS0 and GSS1 with an old file cat mmlspdisk.old ##### replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt,/dev/sdob" << - recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 2 activates 4 total << - while the disk contains the 2 paths ##### ls /dev/sdob /Dev/ sdob ls /dev/sdkt /Dev/sdkt mmlspdisk all >> mmlspdisk.log vi mmlspdisk.log replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt" << --- the disk contains 1 path recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 1 active 3 total here is the result of the log file in GSS1 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################## START LOG GSS1 ##################### 0 result ################# END LOG GSS 1 ##################### here is the result of the log file in GSS0 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################# START LOG GSS 0 ##################### Thu Sep 14 16:35:01.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4673959648 length 4112 err 5. Thu Sep 14 16:35:01.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 16:35:01.787 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 16:35:01.788 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5. Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5. Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5. Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz From valdis.kletnieks at vt.edu Tue Oct 24 15:36:46 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 24 Oct 2017 10:36:46 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <16412.1508855806@turing-police.cc.vt.edu> On Tue, 24 Oct 2017 13:27:29 +0530, "Malahal R Naineni" said: > If you want to change multiple existing exports, you could use > undocumented option "--nfsnorestart" to mmnfs. This should add export > changes to NFS configuration but it won't restart nfs-ganesha service, so > you will not see immediate results of your changes in the running server. > Whenever you want your changes reflected, you could manually restart the > service using "mmces" command. I owe you a beverage of your choice if we ever are in the same place at the same time - the fact that Ganesha got restarted on all nodes at once thus preventing a rolling restart and avoiding service interruption was the single biggest Ganesha wart we've encountered. :) -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From UWEFALKE at de.ibm.com Tue Oct 24 17:49:19 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 24 Oct 2017 18:49:19 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Message-ID: Hi, I am about to run nsdperf for testing the IB fabric in a new system comprising ESS (BE) and Intel-based nodes. nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one test using RDMA: client server RDMA x86-64 ppc-64 on crash ppc-64 x86-64 on crash x86-64 ppc-64 off success x86-64 x86-64 on success ppc-64 ppc-64 on success That implies that the nsdperf RDMA test might struggle with BE vs LE. However, I learned from a talk given at a GPFS workshop in Germany in 2015 that RDMA works between Power-BE and Intel boxes. Has anyone made similar or contrary experiences? Is it an nsdperf issue or more general (I have not yet attempted any GPFS mount)? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From olaf.weiser at de.ibm.com Tue Oct 24 20:31:06 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 24 Oct 2017 21:31:06 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From sdenham at gmail.com Tue Oct 24 21:35:40 2017 From: sdenham at gmail.com (Scott D) Date: Tue, 24 Oct 2017 15:35:40 -0500 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes In-Reply-To: References: Message-ID: I have run nsdperf with RDMA enabled against an ESS ppc-64 server without any problems. I don't have access to that system at the moment, and it was running a fairly old (4.1.x) version of GPFS, not that that should matter for nsdperf unless that source code has changed since 4.1. Scott Denham Staff Engineer Cray, Inc On Tue, Oct 24, 2017 at 11:49 AM, Uwe Falke wrote: > Hi, > I am about to run nsdperf for testing the IB fabric in a new system > comprising ESS (BE) and Intel-based nodes. > nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one > test using RDMA: > > client server RDMA > x86-64 ppc-64 on crash > ppc-64 x86-64 on crash > x86-64 ppc-64 off success > x86-64 x86-64 on success > ppc-64 ppc-64 on success > > That implies that the nsdperf RDMA test might struggle with BE vs LE. > However, I learned from a talk given at a GPFS workshop in Germany in 2015 > that RDMA works between Power-BE and Intel boxes. Has anyone made similar > or contrary experiences? Is it an nsdperf issue or more general (I have > not yet attempted any GPFS mount)? > > > > Mit freundlichen Gr??en / Kind regards > > > Dr. 
Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Thomas Wolter, Sven Schoo? > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Wed Oct 25 09:52:29 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 25 Oct 2017 10:52:29 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes In-Reply-To: References: Message-ID: Hi, Scott, thanks, good to hear that it worked for you. I can at least confirm that GPFS RDMA itself does work between x86-64 clients the ESS here, it appears just nsdperf has an issue in my particular environment. I'll see what IBM support can do for me as Olaf suggested. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Scott D To: gpfsug main discussion list Date: 10/24/2017 10:35 PM Subject: Re: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org I have run nsdperf with RDMA enabled against an ESS ppc-64 server without any problems. I don't have access to that system at the moment, and it was running a fairly old (4.1.x) version of GPFS, not that that should matter for nsdperf unless that source code has changed since 4.1. Scott Denham Staff Engineer Cray, Inc On Tue, Oct 24, 2017 at 11:49 AM, Uwe Falke wrote: Hi, I am about to run nsdperf for testing the IB fabric in a new system comprising ESS (BE) and Intel-based nodes. nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one test using RDMA: client server RDMA x86-64 ppc-64 on crash ppc-64 x86-64 on crash x86-64 ppc-64 off success x86-64 x86-64 on success ppc-64 ppc-64 on success That implies that the nsdperf RDMA test might struggle with BE vs LE. However, I learned from a talk given at a GPFS workshop in Germany in 2015 that RDMA works between Power-BE and Intel boxes. Has anyone made similar or contrary experiences? Is it an nsdperf issue or more general (I have not yet attempted any GPFS mount)? 
Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Tomasz.Wolski at ts.fujitsu.com Wed Oct 25 10:42:02 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 25 Oct 2017 09:42:02 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: <237580bb78cf4d9291c057926c90c265@R01UKEXCASM223.r01.fujitsu.local> Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" > To: gpfsug main discussion list > Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. 
There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data being interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to be very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs. Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. 
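For anyone who wants to know whether their cluster has already seen such reconnects, a log scan is sufficient. A minimal sketch, assuming the default log location used elsewhere on this list (/var/adm/ras/mmfs.log.latest) and that mmdsh or a similar parallel shell is available:

# on a single node
grep 'Reconnected to' /var/adm/ras/mmfs.log.latest
# across the cluster
mmdsh -N all "grep -c 'Reconnected to' /var/adm/ras/mmfs.log.latest"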
Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Oct 25 14:09:27 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 25 Oct 2017 09:09:27 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. 
On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. 
Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Wed Oct 25 14:33:46 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 25 Oct 2017 13:33:46 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi Felipe On a related note, would everything in 4.2.3-4 efix2 be included in 4.2.3-5? In particular the previously discussed around defect 1020461? There was no APAR for this defect when I last looked on 30th August. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Felipe Knop Sent: 25 October 2017 14:09 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" > To: gpfsug main discussion list > Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? 
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Oct 25 16:23:42 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 25 Oct 2017 11:23:42 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Richard, I see that 4.2.3-4 efix2 has two defects, 1032655 (IV99796) and 1020461 (IV99675), and both these fixes are included in 4.2.3.5 . 
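For completeness, one way to confirm which level a node is actually running before assuming it carries these fixes (a sketch only; the package query applies to RPM-based installs and package names can differ per distribution):

    # GPFS daemon build level on this node
    mmdiag --version
    # installed Spectrum Scale packages (RPM-based systems)
    rpm -qa | grep -i gpfs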
Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Sobey, Richard A" To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 09:34 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Felipe On a related note, would everything in 4.2.3-4 efix2 be included in 4.2.3-5? In particular the previously discussed around defect 1020461? There was no APAR for this defect when I last looked on 30th August. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Felipe Knop Sent: 25 October 2017 14:09 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org[ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? 
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=dhKhKiNBptpaDmggHSa8diP48O90VK2uzr-xo9C44uI&s=SCeTu6NeyjHm9D8S4VZVUnrALgCvNksAYTF9rfwD50g&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Wed Oct 25 17:17:09 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 25 Oct 2017 18:17:09 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Message-ID: Dear all, through some gpfsperf tests against an ESS block (config as is) I am seeing lots of waiters like NSDThread: on ThCond 0x3FFA800670A0 (FreePTrackCondvar), reason 'wait for free PTrack' That is not on file creation but on writing to an already existing file. What resource is the system short of here? IMHO it cannot be physical data tracks on pdisks (the test does not allocate any space, just rewrites an existing file)? The only shortage in threads I could see might be Total server worker threads: running 3042, desired 3072, forNSD 2, forGNR 3070, nsdBigBufferSize 16777216 nsdMultiQueue: 512, nsdMultiQueueType: 1, nsdMinWorkerThreads: 3072, nsdMaxWorkerThreads: 3072 where a difference of 30 is between desired and running number of worker threads (but that is only 1% and 30 more would not necessarily make a big difference). Mit freundlichen Grüßen / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Geschäftsführung: Thomas Wolter, Sven Schooß Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From vanfalen at mx1.ibm.com Wed Oct 25 22:26:50 2017 From: vanfalen at mx1.ibm.com (Emmanuel Barajas Gonzalez) Date: Wed, 25 Oct 2017 21:26:50 +0000 Subject: [gpfsug-discuss] Fileset quotas enforcement Message-ID: An HTML attachment was scrubbed... URL: From pinto at scinet.utoronto.ca Wed Oct 25 23:18:29 2017 From: pinto at scinet.utoronto.ca (Jaime Pinto) Date: Wed, 25 Oct 2017 18:18:29 -0400 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: <20171025181829.90173xxmr17nklo5@support.scinet.utoronto.ca> Did you try to run mmcheckquota on the device? I have observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting, so there is a slippage effect that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified?
mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES http://www.scinethpc.ca/testimonials ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. From rohwedder at de.ibm.com Thu Oct 26 08:18:46 2017 From: rohwedder at de.ibm.com (Markus Rohwedder) Date: Thu, 26 Oct 2017 09:18:46 +0200 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: Please also note that when you write as root, you are not restricted by the quota limits. See example: Write as non root user and run into hard limit: [mr at home-11 limited]$ dd if=/dev/urandom of=testfile2 bs=1000 count=100000 dd: error writing ?testfile2?: Disk quota exceeded 29885+0 records in 29884+0 records out 29884000 bytes (30 MB) copied, 3.38491 s, 8.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 322176 214784 322304 128 7 days | 5 0 0 0 none Now write as root user and exceed hard limit: [root at home-11 limited]# dd if=/dev/urandom of=testfile2 bs=1000 count=100000 100000+0 records in 100000+0 records out 100000000 bytes (100 MB) copied, 12.8355 s, 7.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 390656 214784 322304 8939520 7 days | 5 0 0 40 none Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development Phone: +49 7034 6430190 IBM Deutschland E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Martina K?deritz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Jaime Pinto" To: "gpfsug main discussion list" , "Emmanuel Barajas Gonzalez" Cc: gpfsug-discuss at spectrumscale.org Date: 10/26/2017 12:18 AM Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. 
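For reference, a sketch of the recheck Jaime suggests, using the device name from the original report (smfslv0); mmcheckquota rescans the file system's quota accounting and can be I/O intensive, so it is usually run during a quiet period:

    # re-sync quota accounting for the file system, then report fileset quotas again
    mmcheckquota smfslv0
    mmrepquota -j smfslv0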
Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES https://urldefense.proofpoint.com/v2/url?u=http-3A__www.scinethpc.ca_testimonials&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=mno73q7nVmqF-YTzo7pAF50B_45epo6ZzccTAwcroRU&e= ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=F6chd4olb-U3RlL5pcqBPagIccRSJd963K0FFS7auko&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 18932891.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From TOMP at il.ibm.com Thu Oct 26 10:09:56 2017 From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 26 Oct 2017 12:09:56 +0300 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: And this behavior can be changed using the enforceFilesetQuotaOnRoot options ( check mmchconfig man page) Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Markus Rohwedder" To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 26/10/2017 10:18 Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Please also note that when you write as root, you are not restricted by the quota limits. See example: Write as non root user and run into hard limit: [mr at home-11 limited]$ dd if=/dev/urandom of=testfile2 bs=1000 count=100000 dd: error writing ?testfile2?: Disk quota exceeded 29885+0 records in 29884+0 records out 29884000 bytes (30 MB) copied, 3.38491 s, 8.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 322176 214784 322304 128 7 days | 5 0 0 0 none Now write as root user and exceed hard limit: [root at home-11 limited]# dd if=/dev/urandom of=testfile2 bs=1000 count=100000 100000+0 records in 100000+0 records out 100000000 bytes (100 MB) copied, 12.8355 s, 7.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 390656 214784 322304 8939520 7 days | 5 0 0 40 none Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development Phone: +49 7034 6430190 IBM Deutschland E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Martina K?deritz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 "Jaime Pinto" ---10/26/2017 12:18:45 AM---Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the l From: "Jaime Pinto" To: "gpfsug main discussion list" , "Emmanuel Barajas Gonzalez" Cc: gpfsug-discuss at spectrumscale.org Date: 10/26/2017 12:18 AM Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). 
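Tying this back to Tomer's note at the top of this message, a hedged sketch of the configuration change he refers to (check the mmchconfig man page on your level for the exact behaviour and whether the daemon needs to be recycled):

    # count root's writes against fileset quota limits as well
    mmchconfig enforceFilesetQuotaOnRoot=yes
    # confirm the setting
    mmlsconfig enforceFilesetQuotaOnRoot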
How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES https://urldefense.proofpoint.com/v2/url?u=http-3A__www.scinethpc.ca_testimonials&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=mno73q7nVmqF-YTzo7pAF50B_45epo6ZzccTAwcroRU&e= ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=F6chd4olb-U3RlL5pcqBPagIccRSJd963K0FFS7auko&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=RxLph-CHLj5Iq5-RYe9eqHId7vsI_uuX4W-Y145ETD8&s=3cgWIXnSFvb65_5JkJDygm3hnSOeeCfYnDnPJdX-hWY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Thu Oct 26 10:16:20 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 26 Oct 2017 09:16:20 +0000 Subject: [gpfsug-discuss] Windows [10] and Spectrum Scale Message-ID: Hi all In the FAQ I note that Windows 10 is not supported at all, and neither is encryption on Windows nodes generally. However the context here is Spectrum Scale v4. Can I take it to mean that this also applies to Scale 4.1/4.2/...? Thanks Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From vanfalen at mx1.ibm.com Thu Oct 26 14:50:05 2017 From: vanfalen at mx1.ibm.com (Emmanuel Barajas Gonzalez) Date: Thu, 26 Oct 2017 13:50:05 +0000 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46716A4E467141C003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4642530E4641FB0003256C7C22581C5.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E463FD50E463FAF8003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46402D0E4640078003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4641128E4640ED0003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46416A8E4641450003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4644278E4643FF0003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E460E9D8E466F160003256C7C22581C5.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Thu Oct 26 18:03:58 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 26 Oct 2017 17:03:58 +0000 Subject: [gpfsug-discuss] Gartner 2017 - Distributed File systems and Object Storage Message-ID: Interesting read: https://www.gartner.com/doc/reprints?id=1-4IE870C&ct=171017&st=sb Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Christian.Fey at sva.de Fri Oct 27 07:30:31 2017 From: Christian.Fey at sva.de (Fey, Christian) Date: Fri, 27 Oct 2017 06:30:31 +0000 Subject: [gpfsug-discuss] how to deal with custom samba options in ces Message-ID: <2ee0987ac21d409b948ae13fbe3d9e97@sva.de> Hi all, I'm just in the process of migration different samba clusters to ces and I recognized, that some clusters have options set like "strict locking = yes" and I'm not sure how to deal with this. From what I know, there is no "CES way" to set every samba option. It would be possible to set with "net" commands I think but probably this will lead to an unsupported state. Anyone came through this? 
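One route that is sometimes described for parameters the CES tooling does not expose is the clustered registry the poster mentions via the net command; this is only a sketch, and whether it leaves the cluster in a supported state is exactly the open question here, so it should be cleared with IBM support first:

    # set a global Samba option in the CTDB-backed registry on a CES protocol node
    net conf setparm global "strict locking" yes
    # review the resulting registry configuration
    net conf list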
Mit freundlichen Grüßen / Best Regards Christian Fey SVA System Vertrieb Alexander GmbH Borsigstraße 14 65205 Wiesbaden Tel.: +49 6122 536-0 Fax: +49 6122 536-399 E-Mail: christian.fey at sva.de http://www.sva.de Geschäftsführung: Philipp Alexander, Sven Eichelbaum Sitz der Gesellschaft: Wiesbaden Registergericht: Amtsgericht Wiesbaden, HRB 10315 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From sannaik2 at in.ibm.com Fri Oct 27 08:06:50 2017 From: sannaik2 at in.ibm.com (Sandeep Naik1) Date: Fri, 27 Oct 2017 12:36:50 +0530 Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk In-Reply-To: References: Message-ID: Hi Atmane, The missing path from the old mmlspdisk output (/dev/sdob) and the one in the log file (/dev/sdge) do not match. This may be because the server was rebooted after the old mmlspdisk output was taken; path names are not guaranteed across reboots. The log is reporting a problem with /dev/sdge. You should check whether the OS can see the path /dev/sdge (use lsscsi). If the disk is accessible from the other path, then I don't believe it is a problem with the disk. Thanks, Sandeep Naik Elastic Storage Server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: atmane khiredine To: "gpfsug-discuss at spectrumscale.org" Date: 24/10/2017 02:50 PM Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear All, we own a GSS (GPFS Storage Server, native RAID) solution for our HPC. I noticed 3 days ago that one disk shows a single path. My configuration is as follows: GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions If I search with fdisk I get the following result: 476 disks in GSS0 and GSS1. From an old file, cat mmlspdisk.old: ##### replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt,/dev/sdob" << - recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 2 active 4 total << - while the disk contains the 2 paths ##### ls /dev/sdob /dev/sdob ls /dev/sdkt /dev/sdkt mmlspdisk all >> mmlspdisk.log vi mmlspdisk.log replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt" << --- the disk contains 1 path recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 1 active 3 total Here is the result of the log file in GSS1: grep e3d5s05 /var/adm/ras/mmfs.log.latest ################## START LOG GSS1 ##################### 0 result ################# END LOG GSS 1 ##################### Here is the result of the log file in GSS0: grep e3d5s05 /var/adm/ras/mmfs.log.latest ################# START LOG GSS 0 #####################
Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5. Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5. Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5. Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). 
Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? 
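Following Sandeep's suggestion above, a sketch of how the OS view of the two paths can be cross-checked against what GNR reports (device names taken from this thread; they may differ after a reboot):

    # does the OS still see the block devices behind both paths?
    lsscsi | egrep "sdge|sdkt|sdob"
    ls -l /dev/sdge /dev/sdkt /dev/sdob
    # current GNR view of the pdisk and its path count
    mmlspdisk all | grep -B 2 -A 10 e3d5s05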
thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=DXkezTwrVXsEOfvoqY7_DLS86P5FtQszjm9zok6upRU&m=QsMCUxg_qSYCs6Joccb2Brey1phAF_tJFrEnVD6LNoc&s=eSulhfhE2jQnmMrmb9_eoomafxb5xI3KL5Y6n3rH5CE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Fri Oct 27 20:48:08 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Fri, 27 Oct 2017 19:48:08 +0000 Subject: [gpfsug-discuss] how to deal with custom samba options in ces In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From johnbent at gmail.com Sat Oct 28 05:15:59 2017 From: johnbent at gmail.com (John Bent) Date: Fri, 27 Oct 2017 22:15:59 -0600 Subject: [gpfsug-discuss] Announcing IO-500 and soliciting submissions Message-ID: Hello GPFS community, After BoFs at last year's SC and the last two ISC's, the IO-500 is formalized and is now accepting submissions in preparation for our first IO-500 list at this year's SC BoF: http://sc17.supercomputing.org/presentation/?id=bof108&sess=sess319 The goal of the IO-500 is simple: to improve parallel file systems by ensuring that sites publish results of both "hero" and "anti-hero" runs and by sharing the tuning and configuration they applied to achieve those results. After receiving feedback from a few trial users, the framework is significantly improved: > git clone https://github.com/VI4IO/io-500-dev > cd io-500-dev > ./utilities/prepare.sh > ./io500.sh > # tune and rerun > # email results to submit at io500.org This, perhaps with a bit of tweaking and please consult our 'doc' directory for troubleshooting, should get a very small toy problem up and running quickly. It then does become a bit challenging to tune the problem size as well as the underlying file system configuration (e.g. striping parameters) to get a valid, and impressive, result. The basic format of the benchmark is to run both a "hero" and "antihero" IOR test as well as a "hero" and "antihero" mdtest. The write/create phase of these tests must last for at least five minutes to ensure that the test is not measuring cache speeds. One of the more challenging aspects is that there is a requirement to search through the metadata of the files that this benchmark creates. Currently we provide a simple serial version of this test (i.e. the GNU find command) as well as a simple python MPI parallel tree walking program. Even with the MPI program, the find can take an extremely long amount of time to finish. You are encouraged to replace these provided tools with anything of your own devise that satisfies the required functionality. This is one area where we particularly hope to foster innovation as we have heard from many file system admins that metadata search in current parallel file systems can be painfully slow. Now is your chance to show the community just how awesome we all know GPFS to be. We are excited to introduce this benchmark and foster this community. We hope you give the benchmark a try and join our community if you haven't already. 
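To make the earlier point about the metadata search phase a little more concrete: a very crude way to speed up the provided serial find on a single node is to fan it out over the top-level directories with xargs. This is only a sketch, not the io-500 reference tooling; the path and the match pattern below are placeholders and ignore the benchmark's exact matching rules:

# split the tree at the first level and run one find per directory, 16 at a time
find /gpfs/fs0/io500-run -mindepth 1 -maxdepth 1 -type d -print0 | \
  xargs -0 -P 16 -I{} find {} -type f -name '*01*' > matched.txt
wc -l matched.txt

Spreading the walk across multiple nodes (for example with the bundled python MPI tree walker, or MPI tools such as dwalk/dfind from mpiFileUtils) is where the larger gains are, and that is exactly the kind of replacement the rules encourage.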
Please let us know right away in any of our various communications channels (as described in our documentation) if you encounter any problems with the benchmark or have questions about tuning or have suggestions for others. We hope to see your results in email and to see you in person at the SC BoF. Thanks, IO 500 Committee John Bent, Julian Kunkle, Jay Lofstead -------------- next part -------------- An HTML attachment was scrubbed... URL: From a.khiredine at meteo.dz Sat Oct 28 08:29:49 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Sat, 28 Oct 2017 07:29:49 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 69, Issue 54 In-Reply-To: References: Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6340D83B@SDEB-EXC02.meteo.dz> dear Sandeep Naik, Thank you for that answer the OS can see all the path but gss sees only one path for one disk lssci indicates that I have 238 disk 6 SSD and 232 HDD but the gss indicates that it sees only one path with the cmd mmlspdisk all I think it's a disk problem but he sees it with another path if these a problem of SAS cable logically all the disk connect with the cable shows a single path Do you have any ideas ?? GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz ________________________________________ De : gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] de la part de gpfsug-discuss-request at spectrumscale.org [gpfsug-discuss-request at spectrumscale.org] Envoy? : vendredi 27 octobre 2017 08:06 ? : gpfsug-discuss at spectrumscale.org Objet : gpfsug-discuss Digest, Vol 69, Issue 54 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Gartner 2017 - Distributed File systems and Object Storage (Oesterlin, Robert) 2. how to deal with custom samba options in ces (Fey, Christian) 3. Re: GSS GPFS Storage Server show one path for one Disk (Sandeep Naik1) ---------------------------------------------------------------------- Message: 1 Date: Thu, 26 Oct 2017 17:03:58 +0000 From: "Oesterlin, Robert" To: gpfsug main discussion list Subject: [gpfsug-discuss] Gartner 2017 - Distributed File systems and Object Storage Message-ID: Content-Type: text/plain; charset="utf-8" Interesting read: https://www.gartner.com/doc/reprints?id=1-4IE870C&ct=171017&st=sb Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ Message: 2 Date: Fri, 27 Oct 2017 06:30:31 +0000 From: "Fey, Christian" To: gpfsug main discussion list Subject: [gpfsug-discuss] how to deal with custom samba options in ces Message-ID: <2ee0987ac21d409b948ae13fbe3d9e97 at sva.de> Content-Type: text/plain; charset="iso-8859-1" Hi all, I'm just in the process of migrating different samba clusters to CES and I noticed that some clusters have options set like "strict locking = yes", and I'm not sure how to deal with this. From what I know, there is no "CES way" to set every samba option. It would be possible to set them with "net" commands, I think, but this will probably lead to an unsupported state. Has anyone come across this? Mit freundlichen Grüßen / Best Regards Christian Fey SVA System Vertrieb Alexander GmbH Borsigstraße 14 65205 Wiesbaden Tel.: +49 6122 536-0 Fax: +49 6122 536-399 E-Mail: christian.fey at sva.de http://www.sva.de Geschäftsführung: Philipp Alexander, Sven Eichelbaum Sitz der Gesellschaft: Wiesbaden Registergericht: Amtsgericht Wiesbaden, HRB 10315 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: ------------------------------ Message: 3 Date: Fri, 27 Oct 2017 12:36:50 +0530 From: "Sandeep Naik1" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Message-ID: Content-Type: text/plain; charset="utf-8" Hi Atmane, The missing path from the old mmlspdisk output (/dev/sdob) and the one in the log file (/dev/sdge) do not match. This may be because the server was rebooted after the old mmlspdisk output was taken; the path names are not guaranteed to stay the same across reboots. The log is reporting a problem with /dev/sdge. You should check whether the OS can see the path /dev/sdge (use lsscsi). If the disk is accessible from the other path, then I don't believe it is a problem with the disk itself. 
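As a concrete sketch of that check — the device names /dev/sdge and /dev/sdob, the pdisk e3d5s05 and the recovery group BB1RGL are all taken from this thread — one can compare what the OS and what GNR currently report for the same pdisk:

lsscsi | egrep 'sdge|sdob'      # does the OS still list both block devices?
ls -l /dev/sdge /dev/sdob
mmlspdisk BB1RGL --pdisk e3d5s05 | egrep 'device|nPaths|state'

If lsscsi no longer shows one of the devices, the missing path is most likely below GPFS (HBA, SAS cable, expander or the drive itself) rather than something in the GNR layer.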
Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: atmane khiredine To: "gpfsug-discuss at spectrumscale.org" Date: 24/10/2017 02:50 PM Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear All we owning a solution for our HPC a GSS gpfs ??storage server native raid I noticed 3 days ago that a disk shows a single path my configuration is as follows GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions if I search with fdisk I have the following result 476 disk in GSS0 and GSS1 with an old file cat mmlspdisk.old ##### replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt,/dev/sdob" << - recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 2 activates 4 total << - while the disk contains the 2 paths ##### ls /dev/sdob /Dev/ sdob ls /dev/sdkt /Dev/sdkt mmlspdisk all >> mmlspdisk.log vi mmlspdisk.log replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt" << --- the disk contains 1 path recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 1 active 3 total here is the result of the log file in GSS1 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################## START LOG GSS1 ##################### 0 result ################# END LOG GSS 1 ##################### here is the result of the log file in GSS0 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################# START LOG GSS 0 ##################### Thu Sep 14 16:35:01.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4673959648 length 4112 err 5. Thu Sep 14 16:35:01.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 16:35:01.787 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 16:35:01.788 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5. Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5. Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5. Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=DXkezTwrVXsEOfvoqY7_DLS86P5FtQszjm9zok6upRU&m=QsMCUxg_qSYCs6Joccb2Brey1phAF_tJFrEnVD6LNoc&s=eSulhfhE2jQnmMrmb9_eoomafxb5xI3KL5Y6n3rH5CE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 69, Issue 54 ********************************************** From r.sobey at imperial.ac.uk Mon Oct 30 15:32:10 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 30 Oct 2017 15:32:10 +0000 Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 Message-ID: All, Since upgrading to Windows 10 build 1709 aka Autumn Creator's Update our Previous Versions is wonky... as in you just get a flat list of previous versions^^snapshots with no date associated with them. I am not able to test this on a server with the latest Samba release so at the moment I'm stuck reporting that GPFS 4.2.3-4 with smb 4.5.10 doesn't play nicely with Windows 10 1709. 
Screenshot is attached for an example. Can anyone corroborate my findings? Thanks Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: prv-ver.png Type: image/png Size: 16452 bytes Desc: prv-ver.png URL: From christof.schmitt at us.ibm.com Mon Oct 30 20:25:26 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 30 Oct 2017 20:25:26 +0000 Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From peter.smith at framestore.com Tue Oct 31 13:10:47 2017 From: peter.smith at framestore.com (Peter Smith) Date: Tue, 31 Oct 2017 13:10:47 +0000 Subject: [gpfsug-discuss] FreeBSD client? Message-ID: Hi Does such a thing exist? :-) TIA -- [image: Framestore] Peter Smith ? Senior Systems Engineer London ? New York ? Los Angeles ? Chicago ? Montr?al T +44 (0)20 7344 8000 ? M +44 (0)7816 123009 <+44%20%280%297816%20123009> 19-23 Wells Street, London W1T 3PQ Twitter ? Facebook ? framestore.com [image: https://www.framestore.com/] -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Oct 31 14:20:27 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 31 Oct 2017 14:20:27 +0000 Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 In-Reply-To: References: Message-ID: Thanks Christof, will do. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt Sent: 30 October 2017 20:25 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 Richard, in a quick test with Windows 10 Pro 1709 connecting to gpfs.smb 4.5.10_gpfs_21 i do not see the problem from the screenshot. All files reported in "Previous Versions" have a date associated. For debugging the problem on your system, i would suggest to enable traces and recreate the problem. Replace the x.x.x.x with the IP address of the Windows 10 client: mmprotocoltrace start network -c x.x.x.x mmprotocoltrace start smb -c x.x.x.x (open the "Previous Versions" dialog) mmprotocoltrace stop smb mmprotocoltrace stop network The best way to track the analysis would be through a PMR. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: "Sobey, Richard A" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "'gpfsug-discuss at spectrumscale.org'" Cc: Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709 Date: Mon, Oct 30, 2017 8:32 AM All, Since upgrading to Windows 10 build 1709 aka Autumn Creator?s Update our Previous Versions is wonky? as in you just get a flat list of previous versions^^snapshots with no date associated with them. I am not able to test this on a server with the latest Samba release so at the moment I?m stuck reporting that GPFS 4.2.3-4 with smb 4.5.10 doesn?t play nicely with Windows 10 1709. Screenshot is attached for an example. Can anyone corroborate my findings? 
Thanks Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=Bfd_a1yscUVzXzIRuwarah8UedH7U1Uln5AFFPQayR4&s=URMLuAJbrlEOj4xt3_7_Cm0Rj9DfFovuEUOGc4zQUUY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar2 at u.washington.edu Tue Oct 31 14:41:58 2017 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Tue, 31 Oct 2017 07:41:58 -0700 Subject: [gpfsug-discuss] FreeBSD client? In-Reply-To: References: Message-ID: <20171031144158.GC17659@illiuin> I doubt it, since IBM would need to tailor a kernel layer for FreeBSD (not the kind of thing you can run with the x86 Linux userspace emulation in FreeBSD), which would be a lot of work for not a lot of demand. On Tue, Oct 31, 2017 at 01:10:47PM +0000, Peter Smith wrote: > Hi > > Does such a thing exist? :-) -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine 
From Robert.Oesterlin at nuance.com Mon Oct 2 15:23:25 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 2 Oct 2017 14:23:25 +0000 Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 In-Reply-To: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> References: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> Message-ID: <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> I?d agree with Bryan here ? Depending on me to passively check this for new material isn?t a great mechanism. If it?s on the mailing list, I will probably see it in a timely manner. On a related note, IBM has too many places to look for ?timely? Scale information/tips/how-to?s. It really should be in central repository and then use links to this location. I?m not sure I have a great idea here, but I have a list of 6-8 placed bookmarked that I go looking for articles. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of Bryan Banister Reply-To: gpfsug main discussion list Date: Monday, October 2, 2017 at 9:11 AM To: gpfsug main discussion list Cc: Theodore Hoover Jr , Doris Conti Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 Thanks for posting this Sandeep! As Doris mentioned during the Spectrum Scale User Group meeting, IBM is doing a lot of good work to put this data out there, and many of us didn?t know that this site was an available resource. Bookmarking is good, but there unfortunately is not a way to ?watch? the space to get automatic notifications when new posts are available. I request that IBM announce these posts to this Spectrum Scale User Group list going forward to get the most impact from these offerings. Or post them to a space that can be watched so that we can get automatic notifications. Thanks again, -Bryan -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ulmer at ulmer.org Mon Oct 2 15:31:32 2017 From: ulmer at ulmer.org (Stephen Ulmer) Date: Mon, 2 Oct 2017 10:31:32 -0400 Subject: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 In-Reply-To: <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> References: <74180fbd92374e5aa4b4da620d970732@jumptrading.com> <1EEE35A6-B872-4D2D-A948-A444324FAF7F@nuance.com> Message-ID: <8A33571E-905B-41D8-A934-C984A90EF6F9@ulmer.org> I?ve been told in the past that the Spectrum Scale Wiki is the place to watch for the most timely information, and there is a way to "follow" the wiki so you get notified of updates. That being said, I?ve not gotten "following" it to work yet so I don?t know what that actually *means*. I?d love to get a daily digest of all of the changes to that Wiki ? or even just a URL I would watch with IFTTT that would actually show me links to all of the updates. -- Stephen > On Oct 2, 2017, at 10:23 AM, Oesterlin, Robert > wrote: > > I?d agree with Bryan here ? Depending on me to passively check this for new material isn?t a great mechanism. If it?s on the mailing list, I will probably see it in a timely manner. > > On a related note, IBM has too many places to look for ?timely? Scale information/tips/how-to?s. It really should be in central repository and then use links to this location. I?m not sure I have a great idea here, but I have a list of 6-8 placed bookmarked that I go looking for articles. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > 507-269-0413 > > > From: > on behalf of Bryan Banister > > Reply-To: gpfsug main discussion list > > Date: Monday, October 2, 2017 at 9:11 AM > To: gpfsug main discussion list > > Cc: Theodore Hoover Jr >, Doris Conti > > Subject: [EXTERNAL] Re: [gpfsug-discuss] Spectrum Scale Enablement Material - 1H 2017 > > Thanks for posting this Sandeep! > > As Doris mentioned during the Spectrum Scale User Group meeting, IBM is doing a lot of good work to put this data out there, and many of us didn?t know that this site was an available resource. > > Bookmarking is good, but there unfortunately is not a way to ?watch? the space to get automatic notifications when new posts are available. I request that IBM announce these posts to this Spectrum Scale User Group list going forward to get the most impact from these offerings. Or post them to a space that can be watched so that we can get automatic notifications. > > Thanks again, > -Bryan > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From christof.schmitt at us.ibm.com Mon Oct 2 18:12:37 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 2 Oct 2017 17:12:37 +0000 Subject: [gpfsug-discuss] number of SMBD processes In-Reply-To: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> References: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Image.image001.gif at 01D33B90.D2CAECC0.gif Type: image/gif Size: 6431 bytes Desc: not available URL: From ckerner at illinois.edu Mon Oct 2 19:20:39 2017 From: ckerner at illinois.edu (Chad Kerner) Date: Mon, 2 Oct 2017 13:20:39 -0500 Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image Message-ID: Has anyone tried building the GPL layer on the latest CentOS 7.4 image with the 3.10.0-693.2.2.el7.x86_64 kernel? This is trying to build 4.2.0.4 on Centos 7.4-1708. protector -Wformat=0 -Wno-format-security -I/usr/lpp/mmfs/src/gpl-linux -c kdump.c cc kdump.o kdump-kern.o kdump-kern-dwarfs.o -o kdump -lpthread kdump-kern.o: In function `GetOffset': kdump-kern.c:(.text+0x9): undefined reference to `page_offset_base' kdump-kern.o: In function `KernInit': kdump-kern.c:(.text+0x58): undefined reference to `page_offset_base' collect2: error: ld returned 1 exit status make[1]: *** [modules] Error 1 make[1]: Leaving directory `/usr/lpp/mmfs/src/gpl-linux' make: *** [Modules] Error 1 -------------------------------------------------------- Thanks, Chad -- Chad Kerner, Senior Storage Engineer Storage Enabling Technologies National Center for Supercomputing Applications University of Illinois, Urbana-Champaign -------------- next part -------------- An HTML attachment was scrubbed... URL: From JRLang at uwyo.edu Mon Oct 2 20:31:59 2017 From: JRLang at uwyo.edu (Jeffrey R. Lang) Date: Mon, 2 Oct 2017 19:31:59 +0000 Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image In-Reply-To: References: Message-ID: Chad I asked this same question last week. The answer is to upgrade to Scpectrum 4.2.3.4 jeff From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Chad Kerner Sent: Monday, October 2, 2017 1:21 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Builing on the latest centos 7.4 image Has anyone tried building the GPL layer on the latest CentOS 7.4 image with the 3.10.0-693.2.2.el7.x86_64 kernel? This is trying to build 4.2.0.4 on Centos 7.4-1708. protector -Wformat=0 -Wno-format-security -I/usr/lpp/mmfs/src/gpl-linux -c kdump.c cc kdump.o kdump-kern.o kdump-kern-dwarfs.o -o kdump -lpthread kdump-kern.o: In function `GetOffset': kdump-kern.c:(.text+0x9): undefined reference to `page_offset_base' kdump-kern.o: In function `KernInit': kdump-kern.c:(.text+0x58): undefined reference to `page_offset_base' collect2: error: ld returned 1 exit status make[1]: *** [modules] Error 1 make[1]: Leaving directory `/usr/lpp/mmfs/src/gpl-linux' make: *** [Modules] Error 1 -------------------------------------------------------- Thanks, Chad -- Chad Kerner, Senior Storage Engineer Storage Enabling Technologies National Center for Supercomputing Applications University of Illinois, Urbana-Champaign -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Mon Oct 2 22:24:43 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 2 Oct 2017 14:24:43 -0700 Subject: [gpfsug-discuss] User Meeting & SPXXL in NYC In-Reply-To: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> References: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> Message-ID: Trying to get details on availability. More when I hear back. -Kristy > On Oct 2, 2017, at 7:13 AM, Bryan Banister wrote: > > Hi Kristy, > > I lost half of my notes from the meeting and need to recreate them while the memories are still somewhat fresh. 
Would you please get and post the slides from the Spectrum Scale User Group meeting as soon as possible? > > Thanks for any help here! > -Bryan > > From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org ] On Behalf Of Kristy Kallback-Rose > Sent: Thursday, September 21, 2017 1:49 PM > To: gpfsug main discussion list > > Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC > > Note: External Email > Registration space is getting tight. We decided on a room reconfiguration today to make a little more room. So if you tried to register and were told it was full try again. If it fills up again and you want to register, but can?t drop me an email and I?ll see what we can do. > > Best, > Kristy > > On Sep 20, 2017, at 9:00 AM, Kristy Kallback-Rose > wrote: > > Thanks Doug. > > If you plan to go, *do register*. GPFS Day is free, but we need to know how many will attend. Register using the link on the HPCXXL event page below. > > Cheers, > Kristy > > On Sep 20, 2017, at 1:28 AM, Douglas O'flaherty > wrote: > > > Reminder that the SPXXL day on IBM Spectrum Scale in New York is open to all. It is Thursday the 28th. There is also a Power day on Wednesday. > > > For more information > http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ > > Doug > > Mobile > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Mon Oct 2 22:26:57 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 2 Oct 2017 21:26:57 +0000 Subject: [gpfsug-discuss] User Meeting & SPXXL in NYC In-Reply-To: References: <2c95ef1ff4f3427495e6e5e95a756f00@jumptrading.com> Message-ID: Kristy, Thanks for the quick response. I did reach out to Karthik about the File System Corruption (MMFSCK) presentation, which was really what I lost. I?m sure he?ll get me the presentation, so please don?t rush at this point on my account! 
Sorry for the fire drill, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Monday, October 02, 2017 4:25 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC Note: External Email ________________________________ Trying to get details on availability. More when I hear back. -Kristy On Oct 2, 2017, at 7:13 AM, Bryan Banister > wrote: Hi Kristy, I lost half of my notes from the meeting and need to recreate them while the memories are still somewhat fresh. Would you please get and post the slides from the Spectrum Scale User Group meeting as soon as possible? Thanks for any help here! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Thursday, September 21, 2017 1:49 PM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] User Meeting & SPXXL in NYC Note: External Email ________________________________ Registration space is getting tight. We decided on a room reconfiguration today to make a little more room. So if you tried to register and were told it was full try again. If it fills up again and you want to register, but can?t drop me an email and I?ll see what we can do. Best, Kristy On Sep 20, 2017, at 9:00 AM, Kristy Kallback-Rose > wrote: Thanks Doug. If you plan to go, *do register*. GPFS Day is free, but we need to know how many will attend. Register using the link on the HPCXXL event page below. Cheers, Kristy On Sep 20, 2017, at 1:28 AM, Douglas O'flaherty > wrote: Reminder that the SPXXL day on IBM Spectrum Scale in New York is open to all. It is Thursday the 28th. There is also a Power day on Wednesday. For more information http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ Doug Mobile _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From leslie.james.elliott at gmail.com Tue Oct 3 12:32:56 2017 From: leslie.james.elliott at gmail.com (leslie elliott) Date: Tue, 3 Oct 2017 21:32:56 +1000 Subject: [gpfsug-discuss] transparent cloud tiering Message-ID: hi I am trying to change the account for the cloud tier but am having some problems any hints would be appreciated I am not interested in the data locally or migrated but do not seem to be able to recall this so would just like to repurpose it with the new account I can see in the logs 2017-10-03_15:38:49.226+1000: [W] Snapshot quiesce of SG cloud01 snap -1/0 doing 'mmcrsnapshot :MCST.scan.6' timed out on node . Retrying if possible. which is no doubt the reason for the following mmcloudgateway account delete --cloud-nodeclass TCTNodeClass --cloud-name gpfscloud1234 mmcloudgateway: Sending the command to the first successful node starting with gpfs-dev02 mmcloudgateway: This may take a while... mmcloudgateway: Error detected on node gpfs-dev02 The return code is 94. The error is: MCSTG00084E: Command Failed with following reason: Unable to create snapshot for file system /gpfs/itscloud01, [Ljava.lang.String;@3353303e failed with: com.ibm.gpfsconnector.messages.GpfsConnectorException: Command [/usr/lpp/mmfs/bin/mmcrsnapshot, cloud01, MCST.scan.4] failed with the following return code: 78.. mmcloudgateway: Sending the command to the next node gpfs-dev04 mmcloudgateway: Error detected on node gpfs-dev04 The return code is 94. The error is: MCSTG00084E: Command Failed with following reason: Unable to create snapshot for file system /gpfs/cloud01, [Ljava.lang.String;@90a887ad failed with: com.ibm.gpfsconnector.messages.GpfsConnectorException: Command [/usr/lpp/mmfs/bin/mmcrsnapshot, cloud01, MCST.scan.6] failed with the following return code: 78.. mmcloudgateway: Command failed. Examine previous error messages to determine cause. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Oct 3 12:57:21 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 3 Oct 2017 07:57:21 -0400 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f@nasa.gov> Hi All, At the SSUG in NY there was mention of operating systems as well as feature deprecations that would occur in the lifecycle of the next major release of GPFS. I'm not sure if this is public knowledge yet so I haven't mentioned specifics but given the proposed release time frame of the next major release I thought customers may appreciate having access to this information so they could provide feedback about the potential impact to their environment if these deprecations do occur. Any chance someone from IBM could provide specifics here so folks can chime in? 
-Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From j.ouwehand at vumc.nl Wed Oct 4 12:59:45 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Wed, 4 Oct 2017 11:59:45 +0000 Subject: [gpfsug-discuss] number of SMBD processes In-Reply-To: References: <5594921EA5B3674AB44AD9276126AAF40170E41381@sp-mx-mbx4> Message-ID: <5594921EA5B3674AB44AD9276126AAF40170E4185E@sp-mx-mbx4> Hello Christof, Thank you very much for the explanation. You have point us in the right direction. Vriendelijke groet, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [VUmc_logo_samen_kiezen_voor_beter] Van: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Christof Schmitt Verzonden: maandag 2 oktober 2017 19:13 Aan: gpfsug-discuss at spectrumscale.org CC: gpfsug-discuss at spectrumscale.org Onderwerp: Re: [gpfsug-discuss] number of SMBD processes Hello, the short answer is that the "deadtime" parameter is not a supported parameter in Spectrum Scale. The longer answer is that setting "deadtime" likely does not solve any issue. "deadtime" was introduced in Samba mainly for older protocol versions. While it is implemented independent of protocol versions, not the statement about "no open files" for a connection to be closed. Spectrum Scale only supports SMB versions 2 and 3. Basically everything there is based on an open file handle. Most SMB 2/3 clients open at least the root directory of the export and register for change notifications there and the client then can wait for any time for changes. That is a valid case, and the open directory handle prevents the connection from being affected by any setting of the "deadtime" parameter. Clients that are no longer active and have not properly closed the connection are detected on the TCP level: # mmsmb config list | grep sock socket options TCP_NODELAY SO_KEEPALIVE TCP_KEEPCNT=4 TCP_KEEPIDLE=240 TCP_KEEPINTVL=15 Every client that no longer responds for 5 minutes will have the connection dropped (240s + 4x15s). On the other hand, if the SMB clients are still responding to TCP keep-alive packets, then the connection is considered valid. It might be interesting to look into the unwanted connections and possibly capture a network trace or look into the client systems to better understand the situation. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) ----- Original message ----- From: "Ouwehand, JJ" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" > Cc: Subject: [gpfsug-discuss] number of SMBD processes Date: Mon, Oct 2, 2017 6:35 AM Hello, Since we use new ?IBM Spectrum Scale SMB CES? nodes, we see that that the number of SMBD processes has increased significantly from ~ 4,000 to ~ 7,500. We also see that the SMBD processes are not closed. This is likely because the Samba global-parameter ?deadtime? is missing. ------------ https://www.samba.org/samba/docs/using_samba/ch11.html This global option sets the number of minutes that Samba will wait for an inactive client before closing its session with the Samba server. A client is considered inactive when it has no open files and no data is being sent from it. The default value for this option is 0, which means that Samba never closes any connection, regardless of how long they have been inactive. 
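A quick way to put numbers on that, as a sketch only (it assumes SMB on the standard port 445 and the iproute2 ss tool being available on the protocol node):

ss -tn state established '( sport = :445 )' | tail -n +2 | wc -l   # connections the TCP layer still considers alive
mmsmb config list | grep sock                                      # the keep-alive socket options described above

Comparing that count against the number of smbd processes over time gives a feel for whether the extra processes belong to half-dead clients, or to clients that are simply idle but still answering keep-alive probes.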
This can lead to unnecessary consumption of the server's resources by inactive clients. We recommend that you override the default as follows: [global] deadtime = 10 ------------ Is this Samba parameter ?deadtime? supported by IBM? Kindly regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [VUmc_logo_samen_kiezen_voor_beter] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=LCAKWPxQj5PMUf5YKTH3Z0zW9cDW--1AO_mljWE3ni8&s=y0FjQ5P-9Q7YjxyvuNNa4kdzHZKfrsjW81pGDLMNuig&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 6431 bytes Desc: image001.gif URL: From heiner.billich at psi.ch Wed Oct 4 18:26:03 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Wed, 4 Oct 2017 17:26:03 +0000 Subject: [gpfsug-discuss] AFM - prefetch of many small files - tuning - storage latency required to increase max socket buffer size ... Message-ID: <0A9C5A40-221C-46B5-B7E3-72A9D5A6D483@psi.ch> Hello, A while ago I asked the list for advice on how to tune AFM to speed-up the prefetch of small files (~1MB). In the meantime, we got some results which I want to share. We had to increase the maximum socket buffer sizes to very high values of 40-80MB. Consider that we use IP over Infiniband and the bandwidth-delay-product is about 5MB (1-10us latency). How do we explain this? The reads on the nfs server have a latency of about 20ms. This is physics of disks. Hence a single client can get up to 50 requests/s. Each request is 1MB. To get 1GB/s we need 20 clients in parallel. At all times we have about 20 requests pending. Looks like the server does allocate the socket buffer space before it asks for the data. Hence it allocates/blocks about 20MB at all times. Surprisingly it?s storage latency and not network latency that required us to increase the max. socket buffer size. For large files prefetch works and reduces the latency of reads drastically and no special tuning is required. We did test with kernel-nfs and gpfs 4.2.3 on RHEL7. Whether ganesha shows a similar pattern would be interesting to know. Once we fixed the nfs issues afm did show a nice parallel prefetch up to ~1GB/s with 1MB sized files without any tuning. Still much below the 4GB/s measured with iperf between the two nodes ?. Kind regards, Heiner -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From kkr at lbl.gov Wed Oct 4 22:44:10 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Wed, 4 Oct 2017 14:44:10 -0700 Subject: [gpfsug-discuss] NYC Meeting Presos (Workaround) Message-ID: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> Hi, I?m having some trouble getting links added to the SS/GPFS UG page, but I want to share the presos I have so far, a couple more are coming soon. 
So, as a workaround (as storage people we can appreciate workarounds, right?!), here are the links to the slides I have thus far: Spectrum Scale Object at CSCS: http://files.gpfsug.org/presentations/2017/NYC/Day4-ss_object_cscs.pdf File System Corruptions & Best Practices: http://files.gpfsug.org/presentations/2017/NYC/Day4-hpcxxl2017_fsck_karthik.pdf Spectrum Scale Cloud Enablement: http://files.gpfsug.org/presentations/2017/NYC/Day4-SpectrumScaleCloudEnablement%20Sept28.pdf IBM Spectrum Scale 4.2.3 Security Overview: http://files.gpfsug.org/presentations/2017/NYC/Day4-Spectrum%20Scale%20Security-%20User%20Group%20-%20NYCSept2017.pdf What?s New in Spectrum Scale: http://files.gpfsug.org/presentations/2017/NYC/Day4-NYC%20User%20Group%202017%20What's%20New%20v2.pdf Cheers, Kristy -------------- next part -------------- An HTML attachment was scrubbed... URL: From jez.tucker at gpfsug.org Thu Oct 5 11:11:53 2017 From: jez.tucker at gpfsug.org (Jez Tucker) Date: Thu, 5 Oct 2017 11:11:53 +0100 Subject: [gpfsug-discuss] NYC Meeting Presos (Workaround) In-Reply-To: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> References: <6C68F9C7-6FC2-4826-802D-3AFE366F39C8@lbl.gov> Message-ID: *waves hands*? - I can help here if you have issues.? Same for anyone else. ping me 1::1 On 04/10/17 22:44, Kristy Kallback-Rose wrote: > Hi, > > I?m having some trouble getting links added to the SS/GPFS UG page, > but I want to share the presos I have so far, a couple more are coming > soon. So, as a workaround (as storage people we can appreciate > workarounds, right?!), here are the links to the slides I have thus far: > > Spectrum Scale Object at CSCS: > http://files.gpfsug.org/presentations/2017/NYC/Day4-ss_object_cscs.pdf > > File System Corruptions & Best Practices: > http://files.gpfsug.org/presentations/2017/NYC/Day4-hpcxxl2017_fsck_karthik.pdf > > Spectrum Scale Cloud Enablement: > http://files.gpfsug.org/presentations/2017/NYC/Day4-SpectrumScaleCloudEnablement%20Sept28.pdf > > IBM Spectrum Scale 4.2.3 Security Overview: > http://files.gpfsug.org/presentations/2017/NYC/Day4-Spectrum%20Scale%20Security-%20User%20Group%20-%20NYCSept2017.pdf > > What?s New in Spectrum Scale: > http://files.gpfsug.org/presentations/2017/NYC/Day4-NYC%20User%20Group%202017%20What's%20New%20v2.pdf > > > Cheers, > Kristy > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From listymclistfaces at gmail.com Fri Oct 6 13:56:04 2017 From: listymclistfaces at gmail.com (listy mclistface) Date: Fri, 6 Oct 2017 13:56:04 +0100 Subject: [gpfsug-discuss] Client power failure Message-ID: Hi, Although our NSD nodes are on UPS etc, we have some clients which aren't. Do we run the risk of FS corruption if we drop client nodes mid write? -------------- next part -------------- An HTML attachment was scrubbed... URL: From jez.tucker at gpfsug.org Fri Oct 6 14:14:59 2017 From: jez.tucker at gpfsug.org (Jez Tucker) Date: Fri, 6 Oct 2017 14:14:59 +0100 Subject: [gpfsug-discuss] Client power failure In-Reply-To: References: Message-ID: <61604124-ec28-c930-7ea3-a20a6223b779@gpfsug.org> Hi ? Can we please refrain from completely anonymous emails ListyMcListFaces ;-) Ta ListMasterMcListAdmin On 06/10/17 13:56, listy mclistface wrote: > Hi, > > Although our NSD nodes are on UPS etc, we have some clients which > aren't.? 
?Do we run the risk of FS corruption if we drop client nodes > mid write? > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Oct 6 14:24:11 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 6 Oct 2017 13:24:11 +0000 Subject: [gpfsug-discuss] Client power failure Message-ID: I agree ? anonymous ones should be dropped from the list. Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of Jez Tucker Reply-To: "jez.tucker at gpfsug.org" , gpfsug main discussion list Date: Friday, October 6, 2017 at 8:17 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Client power failure Can we please refrain from completely anonymous emails ListyMcListFaces ;-) -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Fri Oct 6 14:45:38 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Fri, 6 Oct 2017 13:45:38 +0000 Subject: [gpfsug-discuss] Client power failure In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 6 21:39:28 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 6 Oct 2017 20:39:28 +0000 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release In-Reply-To: References: Message-ID: Hi Aaron, I appreciate your care with this. The user group are the first users to be briefed on this. We're not quite ready to put more in writing just yet, however I will be at SC17 and hope to be able to do so at that time. (I'll also take any other questions that people want to ask, including "where's my RFE?"...) I also want to add one note about the meaning of feature deprecation, because it's not well understood even within IBM: If we deprecate a feature with the next major release it does NOT mean we are dropping support there and then. It means we are announcing the INTENTION to drop support in some future release, and encourage you to (a) start making plans on migration to a supported alternative, and (b) chime in on what you need in order to be able to satisfactorily migrate if our proposed alternative is not adequate. regards, Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com ------------------------------ Message: 2 Date: Tue, 3 Oct 2017 07:57:21 -0400 From: Aaron Knister To: gpfsug main discussion list Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f at nasa.gov> Content-Type: text/plain; charset="utf-8"; format=flowed Hi All, At the SSUG in NY there was mention of operating systems as well as feature deprecations that would occur in the lifecycle of the next major release of GPFS. I'm not sure if this is public knowledge yet so I haven't mentioned specifics but given the proposed release time frame of the next major release I thought customers may appreciate having access to this information so they could provide feedback about the potential impact to their environment if these deprecations do occur. Any chance someone from IBM could provide specifics here so folks can chime in? 
-Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ------------------------------ From aaron.s.knister at nasa.gov Fri Oct 6 23:30:05 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 6 Oct 2017 18:30:05 -0400 Subject: [gpfsug-discuss] changing default configuration values Message-ID: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From scale at us.ibm.com Sat Oct 7 04:06:41 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 6 Oct 2017 23:06:41 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Hi Aaron, The default value applies to all nodes in the cluster. Thus changing it will change all nodes in the cluster. You need to run mmchconfig to customize the node override again. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 10/06/2017 06:30 PM Subject: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From john.hearns at asml.com Mon Oct 9 09:38:29 2017 From: john.hearns at asml.com (John Hearns) Date: Mon, 9 Oct 2017 08:38:29 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Aaron, The reply you just got her is absolutely the correct one. However, its worth contributing something here. I have recently bene dealing with the parameter verbsPorts - which is a list of the interfaces which verbs should use. I found on our cluyster it was set to use dual ports for all nodes, including servers, when only our servers have dual ports. I will follow the advice below and make a global change, then change back the configuration for the server. It is worth looking though at mmllnodeclass -all There is a rather rich set of nodeclasses, including clientNodes managerNodes nonNsdNodes nonQuorumNodes So if you want to make changes to a certain type of node in your cluster you will be able to achieve it using nodeclasses. Bond, James Bond commander.bond at mi6.gov.uk From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of IBM Spectrum Scale Sent: Saturday, October 07, 2017 5:07 AM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Hi Aaron, The default value applies to all nodes in the cluster. Thus changing it will change all nodes in the cluster. You need to run mmchconfig to customize the node override again. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. [Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is there a way to change the default value of a configurati]Aaron Knister ---10/06/2017 06:30:20 PM---Is there a way to change the default value of a configuration option without overriding any overrid From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM Subject: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Is there a way to change the default value of a configuration option without overriding any overrides in place? Take the following situation: - I set parameter foo=bar for all nodes (mmchconfig foo=bar) - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N n001,n002) Is there a way to then set the default value of foo to qux without changing the value of foo for nodes n001 and n002? 
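
To make the override behaviour above concrete, here is a rough sketch using pagepool as a stand-in parameter; the node names n001/n002 and all of the values are placeholders rather than anything taken from a real cluster:

# set the cluster-wide default; it applies to every node without an explicit override
mmchconfig pagepool=1G

# give two nodes their own value; this creates per-node override entries
mmchconfig pagepool=256M -N n001,n002

# changing the default again still leaves the n001/n002 overrides in place,
# so they have to be re-issued (or set back to the common value) explicitly
mmchconfig pagepool=2G
mmchconfig pagepool=2G -N n001,n002

# show the current setting for the attribute, including any node-specific entries
mmlsconfig pagepool

The point, as the Scale team note above, is that the -N overrides survive later changes to the cluster-wide value and have to be dealt with by hand.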
-Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From john.hearns at asml.com Mon Oct 9 09:44:28 2017 From: john.hearns at asml.com (John Hearns) Date: Mon, 9 Oct 2017 08:44:28 +0000 Subject: [gpfsug-discuss] Setting fo verbsRdmaSend Message-ID: We have a GPFS setup which is completely Infiniband connected. Version 4.2.3.4 I see that verbsRdmaCm is set to Disabled. Reading up about this, I am inclined to leave this disabled. Can anyone comment on the likely effects of changing it, and if there are any real benefits in performance? commander.bond at mi6.gov.uk -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.ouwehand at vumc.nl Mon Oct 9 10:13:07 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Mon, 9 Oct 2017 09:13:07 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus Message-ID: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Hello, Currently we are urgently looking for an on-access antivirus solution for our IBM Spectrum Scale SMB CES Cluster. Unfortunately IBM has no such solution. 
Does anyone have a good supported solution? Kind regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [cid:image003.png at 01D340EF.9527A0C0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.png Type: image/png Size: 8437 bytes Desc: image003.png URL: From r.sobey at imperial.ac.uk Mon Oct 9 10:16:35 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 9 Oct 2017 09:16:35 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus In-Reply-To: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> References: <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Message-ID: According to one of the presentations posted on this list a few days ago, there is "bulk antivirus scanning with Symantec AV" "coming soon". From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ouwehand, JJ Sent: 09 October 2017 10:13 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus Hello, Currently we are urgently looking for an on-access antivirus solution for our IBM Spectrum Scale SMB CES Cluster. Unfortunately IBM has no such solution. Does anyone have a good supported solution? Kind regards, Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT [cid:image001.png at 01D340E7.AF732BA0] -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8437 bytes Desc: image001.png URL: From daniel.kidger at uk.ibm.com Mon Oct 9 10:27:57 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Mon, 9 Oct 2017 09:27:57 +0000 Subject: [gpfsug-discuss] Spectrum Scale SMB antivirus In-Reply-To: References: , <5594921EA5B3674AB44AD9276126AAF40170E4225C@sp-mx-mbx4> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Image.image001.png at 01D340E7.AF732BA0.png
Type: image/png
Size: 8437 bytes
Desc: not available
URL: 

From a.khiredine at meteo.dz  Mon Oct 9 13:47:09 2017
From: a.khiredine at meteo.dz (atmane khiredine)
Date: Mon, 9 Oct 2017 12:47:09 +0000
Subject: [gpfsug-discuss] how gpfs work when disk fail
Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz>

dear all

how does gpfs work when a disk fails?

this is an example scenario of a disk failure

1 Server

2 Disks directly attached to the local node, 100GB each

mmlscluster

GPFS cluster information
========================
GPFS cluster name:        test.gpfs
GPFS cluster id:          174397273000001824
GPFS UID domain:          test.gpfs
Remote shell command:     /usr/bin/ssh
Remote file copy command: /usr/bin/scp
Repository type:          server-based

GPFS cluster configuration servers:
-----------------------------------
Primary server:   gpfs
Secondary server: (none)

Node  Daemon node name  IP address    Admin node name  Designation
-------------------------------------------------------------------
   1  gpfs              192.168.1.10  gpfs             quorum-manager

cat disk

%nsd:
device=/dev/sdb
nsd=nsda
servers=gpfs
usage=dataAndMetadata
pool=system

%nsd:
device=/dev/sdc
nsd=nsdb
servers=gpfs
usage=dataAndMetadata
pool=system

mmcrnsd -F disk.txt

mmlsnsd -X

Disk name    NSD volume ID     Device    Devtype  Node name  Remarks
---------------------------------------------------------------------------
nsdsdbgpfsa  C0A8000F59DB69E2  /dev/sdb  generic  gpfsa-ib   server node
nsdsdcgpfsa  C0A8000F59DB69E3  /dev/sdc  generic  gpfsa-ib   server node

mmcrfs gpfs -F disk.txt -B 1M -L 32M -T /gpfs -A no -m 2 -M 3 -r 2 -R 3

mmmount gpfs

df -h

gpfs  200G  3,8G  197G  2% /gpfs   <-- the filesystem shows 200GB

my question is the following:

if I write 180 GB of data in /gpfs and the disk /dev/sdb fails, how do the remaining disk and/or GPFS continue to hold all my data?

Thanks

Atmane Khiredine
HPC System Administrator | Office National de la Météorologie
Tél : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz

From S.J.Thompson at bham.ac.uk  Mon Oct 9 13:57:08 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support))
Date: Mon, 9 Oct 2017 12:57:08 +0000
Subject: [gpfsug-discuss] AFM fun (more!)
Message-ID: 

Hi All,

We're having fun (ok not fun ...) with AFM.

We have a file-set where the queue length isn't shortening: watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ...
How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon From janfrode at tanso.net Mon Oct 9 14:45:32 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 9 Oct 2017 15:45:32 +0200 Subject: [gpfsug-discuss] how gpfs work when disk fail In-Reply-To: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz> References: <4B32CB5C696F2849BDEF7DF9EACE884B6332574E@SDEB-EXC3.meteo.dz> Message-ID: You don't have room to write 180GB of file data, only ~100GB. When you write f.ex. 90 GB of file data, each filesystem block will get one copy written to each of your disks, occuppying 180 GB on total disk space. So you can always read if from the other disks if one should fail. This is controlled by your "-m 2 -r 2" settings, and the default failureGroup -1 since you didn't specify a failure group in your disk descriptor. Normally I would always specify a failure group when doing replication. -jf On Mon, Oct 9, 2017 at 2:47 PM, atmane khiredine wrote: > dear all > > how gpfs work when disk fail > > this is a example scenario when disk fail > > 1 Server > > 2 Disk directly attached to the local node 100GB > > mmlscluster > > GPFS cluster information > ======================== > GPFS cluster name: test.gpfs > GPFS cluster id: 174397273000001824 > GPFS UID domain: test.gpfs > Remote shell command: /usr/bin/ssh > Remote file copy command: /usr/bin/scp > Repository type: server-based > > GPFS cluster configuration servers: > ----------------------------------- > Primary server: gpfs > Secondary server: (none) > > Node Daemon node name IP address Admin node name Designation > ------------------------------------------------------------------- > 1 gpfs 192.168.1.10 gpfs quorum-manager > > cat disk > > %nsd: > device=/dev/sdb > nsd=nsda > servers=gpfs > usage=dataAndMetadata > pool=system > > %nsd: > device=/dev/sdc > nsd=nsdb > servers=gpfs > usage=dataAndMetadata > pool=system > > mmcrnsd -F disk.txt > > mmlsnsd -X > > Disk name NSD volume ID Device Devtype Node name Remarks > ------------------------------------------------------------ > --------------- > nsdsdbgpfsa C0A8000F59DB69E2 /dev/sdb generic gpfsa-ib server node > nsdsdcgpfsa C0A8000F59DB69E3 /dev/sdc generic gpfsa-ib server node > > > mmcrfs gpfs -F disk.txt -B 1M -L 32M -T /gpfs -A no -m 2 -M 3 -r 2 -R 3 > > mmmount gpfs > > df -h > > gpfs 200G 3,8G 197G 2% /gpfs <-- The Disk Have 200GB > > my question is the following ?? > > if I write 180 GB of data in /gpfs > and the disk /dev/sdb is fail > how the disk and/or GPFS continues to support all my data > > Thanks > > Atmane Khiredine > HPC System Administrator | Office National de la M?t?orologie > T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : > a.khiredine at meteo.dz > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Mon Oct 9 15:38:15 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 9 Oct 2017 14:38:15 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Oct 9 19:55:45 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 14:55:45 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: Thanks John! Funnily enough playing with node classes is what sent me down this path. I had a bunch of nodes defined (just over 1000) with a lower pagepool than the default. I then started using nodeclasses to clean up the config and I noticed that if you define a parameter with a nodeclass it doesn't override any previously set values for nodes in the node class. What I mean by that is if you do this: - mmchconfig pagepool=256M -N n001 - add node n001 to nodeclass mynodeclass - mmchconfig pagepool=256M -N mynodeclass after the 2nd chconfig there is still a definition for pagepool=256M for node n001. I tried to clean things up by doing "mmchconfig pagepool=DEFAULT -N n001" however the default value of the pagepool in our config is 1024M not the "1G" mmchconfig expects as the defualt value so I wasn't able to remove the explicit definition of pagepool for n001. What I ended up doing was an "mmchconfig pagepool=1024M -N n001" and that removed the explicit definitions. -Aaron On 10/9/17 4:38 AM, John Hearns wrote: > Aaron, > > The reply you just got her is absolutely the correct one. > > However, its worth contributing something here. I have recently bene > dealing with the parameter verbsPorts ? which is a list of the > interfaces which verbs should use. I found on our cluyster it was set to > use dual ports for all nodes, including servers, when only our servers > have dual ports.? I will follow the advice below and make a global > change, then change back the configuration for the server. > > It is worth looking though at? mmllnodeclass ?all > > There is a rather rich set of nodeclasses, including?? clientNodes > ??managerNodes nonNsdNodes? nonQuorumNodes > > So if you want to make changes to a certain type of node in your cluster > you will be able to achieve it using nodeclasses. 
> > Bond, James Bond > > commander.bond at mi6.gov.uk > > *From:* gpfsug-discuss-bounces at spectrumscale.org > [mailto:gpfsug-discuss-bounces at spectrumscale.org] *On Behalf Of *IBM > Spectrum Scale > *Sent:* Saturday, October 07, 2017 5:07 AM > *To:* gpfsug main discussion list > *Cc:* gpfsug-discuss-bounces at spectrumscale.org > *Subject:* Re: [gpfsug-discuss] changing default configuration values > > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 > . > > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > > To: gpfsug main discussion list > > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > ------------------------------------------------------------------------ > > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > -- The information contained in this communication and any attachments > is confidential and may be privileged, and is for the sole use of the > intended recipient(s). Any unauthorized review, use, disclosure or > distribution is prohibited. Unless explicitly stated otherwise in the > body of this communication or the attachment thereto (if any), the > information is provided on an AS-IS basis without any express or implied > warranties or liabilities. To the extent you are relying on this > information, you are doing so at your own risk. 
If you are not the > intended recipient, please notify the sender immediately by replying to > this message and destroy all copies of this message and any attachments. > Neither the sender nor the company/group of companies he or she > represents shall be liable for the proper and complete transmission of > the information contained in this communication, or for any delay in its > receipt. > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 19:56:25 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 14:56:25 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <83b44e14-015a-8806-8036-99384e0c9634@nasa.gov> Message-ID: <01c2a2bb-f332-e067-e7b5-6954df14c25d@nasa.gov> Thanks! Good to know. On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: > Hi Aaron, > > The default value applies to all nodes in the cluster. Thus changing it > will change all nodes in the cluster. You need to run mmchconfig to > customize the node override again. > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. > > > If your query concerns a potential software error in Spectrum Scale > (GPFS) and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 20:00:02 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 15:00:02 -0400 Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major release In-Reply-To: References: Message-ID: <49283f9f-12b1-6381-6890-37d16aa87635@nasa.gov> Thanks Carl. Unfortunately I won't be at SC17 this year but thankfully a number of my colleagues will be so I'll send them with a list of questions on my behalf :) On 10/6/17 4:39 PM, Carl Zetie wrote: > Hi Aaron, > > I appreciate your care with this. The user group are the first users to be briefed on this. > > We're not quite ready to put more in writing just yet, however I will be at SC17 and hope > to be able to do so at that time. (I'll also take any other questions that people want to > ask, including "where's my RFE?"...) > > I also want to add one note about the meaning of feature deprecation, because it's not well > understood even within IBM: If we deprecate a feature with the next major release it does > NOT mean we are dropping support there and then. It means we are announcing the INTENTION > to drop support in some future release, and encourage you to (a) start making plans on > migration to a supported alternative, and (b) chime in on what you need in order to be > able to satisfactorily migrate if our proposed alternative is not adequate. > > regards, > > > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > > ------------------------------ > > Message: 2 > Date: Tue, 3 Oct 2017 07:57:21 -0400 > From: Aaron Knister > To: gpfsug main discussion list > Subject: [gpfsug-discuss] OS/Feature deprecations in upcoming major > release > Message-ID: <0db1d387-9501-358c-2a97-681b0b9dfd4f at nasa.gov> > Content-Type: text/plain; charset="utf-8"; format=flowed > > Hi All, > > At the SSUG in NY there was mention of operating systems as well as > feature deprecations that would occur in the lifecycle of the next major > release of GPFS. I'm not sure if this is public knowledge yet so I > haven't mentioned specifics but given the proposed release time frame of > the next major release I thought customers may appreciate having access > to this information so they could provide feedback about the potential > impact to their environment if these deprecations do occur. Any chance > someone from IBM could provide specifics here so folks can chime in? 
> > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Mon Oct 9 21:46:59 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 16:46:59 -0400 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> Message-ID: <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Hi Sven, Just wondering if you've had any additional thoughts/conversations about this. -Aaron On 9/8/17 5:21 PM, Sven Oehme wrote: > Hi, > > the code assumption is that the underlying device has no volatile write > cache, i was absolute sure we have that somewhere in the FAQ, but i > couldn't find it, so i will talk to somebody to correct this. > if i understand > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt?correct > one could enforce this by setting REQ_FUA, but thats not explicitly set > today, at least i can't see it. i will discuss this with one of our devs > who owns this code and come back. > > sven > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > wrote: > > Thanks Sven. I didn't think GPFS itself was caching anything on that > layer, but it's my understanding that O_DIRECT isn't sufficient to force > I/O to be flushed (e.g. the device itself might have a volatile caching > layer). Take someone using ZFS zvol's as NSDs. I can write() all day log > to that zvol (even with O_DIRECT) but there is absolutely no guarantee > those writes have been committed to stable storage and aren't just > sitting in RAM until an fsync() occurs (or some other bio function that > causes a flush). I also don't believe writing to a SATA drive with > O_DIRECT will force cache flushes of the drive's writeback cache.. > although I just tested that one and it seems to actually trigger a scsi > cache sync. Interesting. > > -Aaron > > On 9/7/17 10:55 PM, Sven Oehme wrote: > > I am not sure what exactly you are looking for but all > blockdevices are > > opened with O_DIRECT , we never cache anything on this layer . > > > > > > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > > > >> wrote: > > > >? ? ?Hi Everyone, > > > >? ? ?This is something that's come up in the past and has recently > resurfaced > >? ? ?with a project I've been working on, and that is-- it seems > to me as > >? ? ?though mmfsd never attempts to flush the cache of the block > devices its > >? ? ?writing to (looking at blktrace output seems to confirm > this). Is this > >? ? ?actually the case? I've looked at the gpl headers for linux > and I don't > >? ? ?see any sign of blkdev_fsync, blkdev_issue_flush, WRITE_FLUSH, or > >? ? ?REQ_FLUSH. I'm sure there's other ways to trigger this > behavior that > >? ? ?GPFS may very well be using that I've missed. That's why I'm > asking :) > > > >? ? ?I figure with FPO being pushed as an HDFS replacement using > commodity > >? ? ?drives this feature has *got* to be in the code somewhere. > > > >? ? ?-Aaron > > > >? ? ?-- > >? ? ?Aaron Knister > >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ?Goddard Space Flight Center > > (301) 286-2776 > >? ? ?_______________________________________________ > >? ? ?gpfsug-discuss mailing list > >? ? 
?gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Mon Oct 9 22:07:10 2017 From: oehmes at gmail.com (Sven Oehme) Date: Mon, 09 Oct 2017 21:07:10 +0000 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Message-ID: Hi, yeah sorry i intended to reply back before my vacation and forgot about it the the vacation flushed it all away :-D so right now the assumption in Scale/GPFS is that the underlying storage doesn't have any form of enabled volatile write cache. the problem seems to be that even if we set REQ_FUA some stacks or devices may not have implemented that at all or correctly, so even we would set it there is no guarantee that it will do what you think it does. the benefit of adding the flag at least would allow us to blame everything on the underlying stack/device , but i am not sure that will make somebody happy if bad things happen, therefore the requirement of a non-volatile device will still be required at all times underneath Scale. so if you think we should do this, please open a PMR with the details of your test so it can go its regular support path. you can mention me in the PMR as a reference as we already looked at the places the request would have to be added. Sven On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister wrote: > Hi Sven, > > Just wondering if you've had any additional thoughts/conversations about > this. > > -Aaron > > On 9/8/17 5:21 PM, Sven Oehme wrote: > > Hi, > > > > the code assumption is that the underlying device has no volatile write > > cache, i was absolute sure we have that somewhere in the FAQ, but i > > couldn't find it, so i will talk to somebody to correct this. > > if i understand > > > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt > correct > > one could enforce this by setting REQ_FUA, but thats not explicitly set > > today, at least i can't see it. i will discuss this with one of our devs > > who owns this code and come back. > > > > sven > > > > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > > wrote: > > > > Thanks Sven. I didn't think GPFS itself was caching anything on that > > layer, but it's my understanding that O_DIRECT isn't sufficient to > force > > I/O to be flushed (e.g. the device itself might have a volatile > caching > > layer). Take someone using ZFS zvol's as NSDs. 
I can write() all day > log > > to that zvol (even with O_DIRECT) but there is absolutely no > guarantee > > those writes have been committed to stable storage and aren't just > > sitting in RAM until an fsync() occurs (or some other bio function > that > > causes a flush). I also don't believe writing to a SATA drive with > > O_DIRECT will force cache flushes of the drive's writeback cache.. > > although I just tested that one and it seems to actually trigger a > scsi > > cache sync. Interesting. > > > > -Aaron > > > > On 9/7/17 10:55 PM, Sven Oehme wrote: > > > I am not sure what exactly you are looking for but all > > blockdevices are > > > opened with O_DIRECT , we never cache anything on this layer . > > > > > > > > > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > > > > > > >> wrote: > > > > > > Hi Everyone, > > > > > > This is something that's come up in the past and has recently > > resurfaced > > > with a project I've been working on, and that is-- it seems > > to me as > > > though mmfsd never attempts to flush the cache of the block > > devices its > > > writing to (looking at blktrace output seems to confirm > > this). Is this > > > actually the case? I've looked at the gpl headers for linux > > and I don't > > > see any sign of blkdev_fsync, blkdev_issue_flush, > WRITE_FLUSH, or > > > REQ_FLUSH. I'm sure there's other ways to trigger this > > behavior that > > > GPFS may very well be using that I've missed. That's why I'm > > asking :) > > > > > > I figure with FPO being pushed as an HDFS replacement using > > commodity > > > drives this feature has *got* to be in the code somewhere. > > > > > > -Aaron > > > > > > -- > > > Aaron Knister > > > NASA Center for Climate Simulation (Code 606.2) > > > Goddard Space Flight Center > > > (301) 286-2776 > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Oct 10 00:19:20 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 9 Oct 2017 19:19:20 -0400 Subject: [gpfsug-discuss] mmfsd write behavior In-Reply-To: References: <0f61621f-84d9-e249-0dd7-c1a4d50fea86@nasa.gov> <26ffdadd-beae-e174-fbcf-ee9cab8a8f67@nasa.gov> Message-ID: <7090f583-d021-dd98-e55c-23eac83849ef@nasa.gov> Thanks, Sven. 
I think my goal was for the REQ_FUA flag to be used in alignment with the consistency expectations of the filesystem. Meaning if I was writing to a file on a filesystem (e.g. dd if=/dev/zero of=/gpfs/fs0/file1) that the write requests to the disk addresses containing data on the file wouldn't be issued with REQ_FUA. However, once the file was closed the close() wouldn't return until a disk buffer flush had occurred. For more important operations (e.g. metadata updates, log operations) I would expect/suspect REQ_FUA would be issued more frequently. The advantage here is it would allow GPFS to run ontop of block devices that don't perform well with the present synchronous workload of mmfsd (e.g. ZFS, and various other software-defined storage or hardware appliances) but that can perform well when only periodically (e.g. every few seconds) asked to flush pending data to disk. I also think this would be *really* important in an FPO environment where individual drives will probably have caches on by default and I'm not sure direct I/O is sufficient to force linux to issue scsi synchronize cache commands to those devices. I'm guessing that this is far from easy but I figured I'd ask. -Aaron On 10/9/17 5:07 PM, Sven Oehme wrote: > Hi, > > yeah sorry i intended to reply back before my vacation and forgot about > it the the vacation flushed it all away :-D > so right now the assumption in Scale/GPFS is that the underlying storage > doesn't have any form of enabled volatile write cache. the problem seems > to be that even if we set?REQ_FUA some stacks or devices may not have > implemented that at all or correctly, so even we would set it there is > no guarantee that it will do what you think it does. the benefit of > adding the flag at least would allow us to blame everything on the > underlying stack/device , but i am not sure that will make somebody > happy if bad things happen, therefore the requirement of a non-volatile > device will still be required at all times underneath Scale. > so if you think we should do this, please open a PMR with the details of > your test so it can go its regular support path. you can mention me in > the PMR as a reference as we already looked at the places the request > would have to be added.?? > > Sven > > > On Mon, Oct 9, 2017 at 1:47 PM Aaron Knister > wrote: > > Hi Sven, > > Just wondering if you've had any additional thoughts/conversations about > this. > > -Aaron > > On 9/8/17 5:21 PM, Sven Oehme wrote: > > Hi, > > > > the code assumption is that the underlying device has no volatile > write > > cache, i was absolute sure we have that somewhere in the FAQ, but i > > couldn't find it, so i will talk to somebody to correct this. > > if i understand > > > https://www.kernel.org/doc/Documentation/block/writeback_cache_control.txt?correct > > one could enforce this by setting REQ_FUA, but thats not > explicitly set > > today, at least i can't see it. i will discuss this with one of > our devs > > who owns this code and come back. > > > > sven > > > > > > On Thu, Sep 7, 2017 at 8:05 PM Aaron Knister > > > >> wrote: > > > >? ? ?Thanks Sven. I didn't think GPFS itself was caching anything > on that > >? ? ?layer, but it's my understanding that O_DIRECT isn't > sufficient to force > >? ? ?I/O to be flushed (e.g. the device itself might have a > volatile caching > >? ? ?layer). Take someone using ZFS zvol's as NSDs. I can write() > all day log > >? ? ?to that zvol (even with O_DIRECT) but there is absolutely no > guarantee > >? ? 
?those writes have been committed to stable storage and aren't just > >? ? ?sitting in RAM until an fsync() occurs (or some other bio > function that > >? ? ?causes a flush). I also don't believe writing to a SATA drive with > >? ? ?O_DIRECT will force cache flushes of the drive's writeback cache.. > >? ? ?although I just tested that one and it seems to actually > trigger a scsi > >? ? ?cache sync. Interesting. > > > >? ? ?-Aaron > > > >? ? ?On 9/7/17 10:55 PM, Sven Oehme wrote: > >? ? ? > I am not sure what exactly you are looking for but all > >? ? ?blockdevices are > >? ? ? > opened with O_DIRECT , we never cache anything on this layer . > >? ? ? > > >? ? ? > > >? ? ? > On Thu, Sep 7, 2017, 7:11 PM Aaron Knister > >? ? ? > > > >? ? ? > > >? ? ? >>> wrote: > >? ? ? > > >? ? ? >? ? ?Hi Everyone, > >? ? ? > > >? ? ? >? ? ?This is something that's come up in the past and has > recently > >? ? ?resurfaced > >? ? ? >? ? ?with a project I've been working on, and that is-- it seems > >? ? ?to me as > >? ? ? >? ? ?though mmfsd never attempts to flush the cache of the block > >? ? ?devices its > >? ? ? >? ? ?writing to (looking at blktrace output seems to confirm > >? ? ?this). Is this > >? ? ? >? ? ?actually the case? I've looked at the gpl headers for linux > >? ? ?and I don't > >? ? ? >? ? ?see any sign of blkdev_fsync, blkdev_issue_flush, > WRITE_FLUSH, or > >? ? ? >? ? ?REQ_FLUSH. I'm sure there's other ways to trigger this > >? ? ?behavior that > >? ? ? >? ? ?GPFS may very well be using that I've missed. That's > why I'm > >? ? ?asking :) > >? ? ? > > >? ? ? >? ? ?I figure with FPO being pushed as an HDFS replacement using > >? ? ?commodity > >? ? ? >? ? ?drives this feature has *got* to be in the code somewhere. > >? ? ? > > >? ? ? >? ? ?-Aaron > >? ? ? > > >? ? ? >? ? ?-- > >? ? ? >? ? ?Aaron Knister > >? ? ? >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ? >? ? ?Goddard Space Flight Center > >? ? ? > (301) 286-2776 > >? ? ? >? ? ?_______________________________________________ > >? ? ? >? ? ?gpfsug-discuss mailing list > >? ? ? >? ? ?gpfsug-discuss at spectrumscale.org > > >? ? ? > >? ? ? > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? ? ? > > >? ? ? > > >? ? ? > > >? ? ? > _______________________________________________ > >? ? ? > gpfsug-discuss mailing list > >? ? ? > gpfsug-discuss at spectrumscale.org > > >? ? ? > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >? ? ? > > > > >? ? ?-- > >? ? ?Aaron Knister > >? ? ?NASA Center for Climate Simulation (Code 606.2) > >? ? ?Goddard Space Flight Center > >? ? ?(301) 286-2776 > >? ? ?_______________________________________________ > >? ? ?gpfsug-discuss mailing list > >? ? ?gpfsug-discuss at spectrumscale.org > > >? ? 
?http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From vpuvvada at in.ibm.com Tue Oct 10 05:56:21 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Tue, 10 Oct 2017 10:26:21 +0530 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... 
This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Tue Oct 10 08:47:23 2017 From: john.hearns at asml.com (John Hearns) Date: Tue, 10 Oct 2017 07:47:23 +0000 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: > The queues are steadily rising and we've seen them over 1000000 ... There is definitely a song here... I see you playing the blues guitar... I can't answer your question directly. As I recall you are at the latest version? We recently had to update to 4.2.3.4 due to an AFM issue - where if the home NFS share was disconnected, a read operation would finish early and not re-start. One thing I would do is look at where the 'real' NFS mount is being done (apology - I assume an NFS home). Log on to bber-afmgw01 and find where the home filesystem is being mounted, which is below /var/mmfs/afm Have a ferret around in there - do you still have that filesystem mounted? -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: Monday, October 09, 2017 2:57 PM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] AFM fun (more!) Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... 
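To be concrete about the "have a ferret around" suggestion above, on the gateway I would start with something like this. It is only a sketch: it assumes an NFS home (which may not match a gpfs:// target), and the layout under /var/mmfs/afm may differ between releases:

  # on bber-afmgw01: is the home filesystem still mounted for AFM?
  mount | grep /var/mmfs/afm
  # and what is actually visible under the AFM mount directory?
  ls -l /var/mmfs/afm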
This is on one particular fileset e.g.:

mmafmctl rds-cache getstate
Mon Oct 9 08:43:58 2017

Fileset Name          Fileset Target                 Cache State  Gateway Node  Queue Length  Queue numExec
------------          --------------                 -----------  ------------  ------------  -------------
rds-projects-facility gpfs:///rds/projects/facility  Dirty        bber-afmgw01  3068953       520504
rds-projects-2015     gpfs:///rds/projects/2015      Active       bber-afmgw01  0             3
rds-projects-2016     gpfs:///rds/projects/2016      Dirty        bber-afmgw01  1482          70
rds-projects-2017     gpfs:///rds/projects/2017      Dirty        bber-afmgw01  713           9104
bear-apps             gpfs:///rds/bear-apps          Dirty        bber-afmgw02  3             2472770871
user-homes            gpfs:///rds/homes              Active       bber-afmgw02  0             19
bear-sysapps          gpfs:///rds/bear-sysapps       Active       bber-afmgw02  0             4

This is having the effect that other filesets on the same "Gateway" are not getting their queues processed.

Question 1. Can we force the gateway node for the other file-sets to our "02" node? I.e. so that we can get the queue services for the other filesets.

Question 2. How can we make AFM actually work for the "facility" file-set? If we shut down GPFS on the node, on the secondary node, we'll see log entries like:

2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations...

So I'm assuming the massive queue is all file remove operations?

Alarmingly, we are also seeing entries like:

2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5

Anyone any suggestions?

Thanks

Simon
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt.

From john.hearns at asml.com Tue Oct 10 09:42:05 2017
From: john.hearns at asml.com (John Hearns)
Date: Tue, 10 Oct 2017 08:42:05 +0000
Subject: [gpfsug-discuss] Recommended pagepool size on clients?
Message-ID:

May I ask how to size pagepool on clients? Somehow I hear an enormous tin can being opened behind me... and what sounds like lots of worms...

Anyway, I currently have mmhealth reporting gpfs_pagepool_small. Pagepool is set to 1024M on clients, and I now note the documentation says you get this warning when pagepool is lower than or equal to 1GB.

We did do some IOR benchmarking which shows better performance with an increased pagepool size.
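For reference, the kind of change we have been trying looks roughly like the following - the 4G value and the node class name are just placeholders, not a recommendation:

  # raise pagepool for the client node class (placeholder name 'clientNodes')
  mmchconfig pagepool=4G -N clientNodes
  # recycle GPFS on those nodes (mmshutdown/mmstartup), or use -i where the
  # release in use allows pagepool to be changed on a running daemon
  mmdsh -N all mmdiag --config | grep "pagepool "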
I am looking for some rules of thumb for sizing for a 128 GByte RAM client. And yup, I know the answer will be 'depends on your workload'. I agree though that 1024M is too low.

Illya,kuryakin at uncle.int
-- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt.
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From scottg at emailhosting.com Tue Oct 10 10:49:54 2017
From: scottg at emailhosting.com (Scott Goldman)
Date: Tue, 10 Oct 2017 05:49:54 -0400
Subject: [gpfsug-discuss] changing default configuration values
Message-ID:

So, I think this brings up one of the slight frustrations I've always had with mmchconfig..

If I have a cluster to which new nodes will eventually be added, OR, I have standards I always wish to apply, there is no way to say "all FUTURE" nodes need to have my defaults.. I just have to remember to extend the changes as new nodes are brought into the cluster.

Is there a way to accomplish this?
Thanks

----- Original Message -----
From: aaron.s.knister at nasa.gov
Sent: October 9, 2017 2:56 PM
To: gpfsug-discuss at spectrumscale.org
Reply-to: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] changing default configuration values

Thanks! Good to know.

On 10/6/17 11:06 PM, IBM Spectrum Scale wrote:
> Hi Aaron,
>
> The default value applies to all nodes in the cluster. Thus changing it
> will change all nodes in the cluster. You need to run mmchconfig to
> customize the node override again.
>
> Regards, The Spectrum Scale (GPFS) team
>
> ------------------------------------------------------------------------------------------------------------------
> If you feel that your question can benefit other users of Spectrum Scale
> (GPFS), then please post it to the public IBM developerWroks Forum at
> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.
>
> If your query concerns a potential software error in Spectrum Scale
> (GPFS) and you have an IBM software maintenance contract please contact
> 1-800-237-5511 in the United States or your local IBM Service Center in
> other countries.
>
> The forum is informally monitored as time permits and should not be used
> for priority messages to the Spectrum Scale (GPFS) team.
> > Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is > there a way to change the default value of a configuratiAaron Knister > ---10/06/2017 06:30:20 PM---Is there a way to change the default value > of a configuration option without overriding any overrid > > From: Aaron Knister > To: gpfsug main discussion list > Date: 10/06/2017 06:30 PM > Subject: [gpfsug-discuss] changing default configuration values > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Is there a way to change the default value of a configuration option > without overriding any overrides in place? > > Take the following situation: > > - I set parameter foo=bar for all nodes (mmchconfig foo=bar) > - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N > n001,n002) > > Is there a way to then set the default value of foo to qux without > changing the value of foo for nodes n001 and n002? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4zp2l8JqWmpHajoLXubAPSIS3K7GY&e= > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Tue Oct 10 13:02:14 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:02:14 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: Use mmchconfig and change the defaults, and then have a node class for "not the defaults"? Apply settings to a node class and add all new clients to the node class? Note there was some version of Scale where node classes were enumerated when the config was set for the node class, but in (4.2.3 at least), this works as expected, I.e. The node class is not expanded when doing mmchconfig -N Simon On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So, I think brings up one of the slight frustrations I've always had with >mmconfig.. > >If I have a cluster to which new nodes will eventually be added, OR, I >have standard I always wish to apply, there is no way to say "all FUTURE" >nodes need to have my defaults.. I just have to remember to extended the >changes in as new nodes are brought into the cluster. > >Is there a way to accomplish this? >Thanks > > Original Message >From: aaron.s.knister at nasa.gov >Sent: October 9, 2017 2:56 PM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Thanks! Good to know. > >On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >> Hi Aaron, >> >> The default value applies to all nodes in the cluster. 
Thus changing it >> will change all nodes in the cluster. You need to run mmchconfig to >> customize the node override again. >> >> >> Regards, The Spectrum Scale (GPFS) team >> >> >>------------------------------------------------------------------------- >>----------------------------------------- >> If you feel that your question can benefit other users of Spectrum >>Scale >> (GPFS), then please post it to the public IBM developerWroks Forum at >> >>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111111 >>1-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please contact >> 1-800-237-5511 in the United States or your local IBM Service Center in >> other countries. >> >> The forum is informally monitored as time permits and should not be >>used >> for priority messages to the Spectrum Scale (GPFS) team. >> >> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >> there a way to change the default value of a configuratiAaron Knister >> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >> of a configuration option without overriding any overrid >> >> From: Aaron Knister >> To: gpfsug main discussion list >> Date: 10/06/2017 06:30 PM >> Subject: [gpfsug-discuss] changing default configuration values >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Is there a way to change the default value of a configuration option >> without overriding any overrides in place? >> >> Take the following situation: >> >> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >> n001,n002) >> >> Is there a way to then set the default value of foo to qux without >> changing the value of foo for nodes n001 and n002? >> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> >>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_li >>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sb >>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4 >>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scottg at emailhosting.com Tue Oct 10 13:04:30 2017 From: scottg at emailhosting.com (Scott Goldman) Date: Tue, 10 Oct 2017 08:04:30 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: Message-ID: So when a node is added to the node class, my defaults" will be applied? If so,excellent. Thanks ? Original Message ? 
From: S.J.Thompson at bham.ac.uk Sent: October 10, 2017 8:02 AM To: gpfsug-discuss at spectrumscale.org Reply-to: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] changing default configuration values Use mmchconfig and change the defaults, and then have a node class for "not the defaults"? Apply settings to a node class and add all new clients to the node class? Note there was some version of Scale where node classes were enumerated when the config was set for the node class, but in (4.2.3 at least), this works as expected, I.e. The node class is not expanded when doing mmchconfig -N Simon On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So, I think brings up one of the slight frustrations I've always had with >mmconfig.. > >If I have a cluster to which new nodes will eventually be added, OR, I >have standard I always wish to apply, there is no way to say "all FUTURE" >nodes need to have my defaults.. I just have to remember to extended the >changes in as new nodes are brought into the cluster. > >Is there a way to accomplish this? >Thanks > >? Original Message >From: aaron.s.knister at nasa.gov >Sent: October 9, 2017 2:56 PM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Thanks! Good to know. > >On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >> Hi Aaron, >> >> The default value applies to all nodes in the cluster. Thus changing it >> will change all nodes in the cluster. You need to run mmchconfig to >> customize the node override again. >> >> >> Regards, The Spectrum Scale (GPFS) team >> >> >>------------------------------------------------------------------------- >>----------------------------------------- >> If you feel that your question can benefit other users of Spectrum >>Scale >> (GPFS), then please post it to the public IBM developerWroks Forum at >> >>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111111 >>1-0000-0000-0000-000000000479. >> >> >> If your query concerns a potential software error in Spectrum Scale >> (GPFS) and you have an IBM software maintenance contract please contact >> 1-800-237-5511 in the United States or your local IBM Service Center in >> other countries. >> >> The forum is informally monitored as time permits and should not be >>used >> for priority messages to the Spectrum Scale (GPFS) team. >> >> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >> there a way to change the default value of a configuratiAaron Knister >> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >> of a configuration option without overriding any overrid >> >> From: Aaron Knister >> To: gpfsug main discussion list >> Date: 10/06/2017 06:30 PM >> Subject: [gpfsug-discuss] changing default configuration values >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Is there a way to change the default value of a configuration option >> without overriding any overrides in place? >> >> Take the following situation: >> >> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >> n001,n002) >> >> Is there a way to then set the default value of foo to qux without >> changing the value of foo for nodes n001 and n002? 
>> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> >>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_li >>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sb >>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_4 >>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Robert.Oesterlin at nuance.com Tue Oct 10 13:27:45 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 10 Oct 2017 12:27:45 +0000 Subject: [gpfsug-discuss] changing default configuration values Message-ID: <1BFF991D-4ABD-4C3A-B6FB-41CEABFCD4FB@nuance.com> Yes, this is exactly what we do for our LROC enabled nodes. Add them to the node class and you're all set. Bob Oesterlin Sr Principal Storage Engineer, Nuance ?On 10/10/17, 7:03 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: Apply settings to a node class and add all new clients to the node class? From S.J.Thompson at bham.ac.uk Tue Oct 10 13:30:57 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:30:57 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: Yes, but obviously only when you recycle mmfsd on the node after adding it to the node class, e.g. page pool cannot be changed online. We do this all the time, e.g. We have nodes with different IB fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have classes for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different DCs etc) Simon On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on behalf of scottg at emailhosting.com" wrote: >So when a node is added to the node class, my defaults" will be applied? >If so,excellent. Thanks > > > Original Message >From: S.J.Thompson at bham.ac.uk >Sent: October 10, 2017 8:02 AM >To: gpfsug-discuss at spectrumscale.org >Reply-to: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] changing default configuration values > >Use mmchconfig and change the defaults, and then have a node class for >"not the defaults"? > >Apply settings to a node class and add all new clients to the node class? > >Note there was some version of Scale where node classes were enumerated >when the config was set for the node class, but in (4.2.3 at least), this >works as expected, I.e. 
The node class is not expanded when doing >mmchconfig -N > >Simon > >On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf >of scottg at emailhosting.com" behalf of scottg at emailhosting.com> wrote: > >>So, I think brings up one of the slight frustrations I've always had with >>mmconfig.. >> >>If I have a cluster to which new nodes will eventually be added, OR, I >>have standard I always wish to apply, there is no way to say "all FUTURE" >>nodes need to have my defaults.. I just have to remember to extended the >>changes in as new nodes are brought into the cluster. >> >>Is there a way to accomplish this? >>Thanks >> >> Original Message >>From: aaron.s.knister at nasa.gov >>Sent: October 9, 2017 2:56 PM >>To: gpfsug-discuss at spectrumscale.org >>Reply-to: gpfsug-discuss at spectrumscale.org >>Subject: Re: [gpfsug-discuss] changing default configuration values >> >>Thanks! Good to know. >> >>On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>> Hi Aaron, >>> >>> The default value applies to all nodes in the cluster. Thus changing it >>> will change all nodes in the cluster. You need to run mmchconfig to >>> customize the node override again. >>> >>> >>> Regards, The Spectrum Scale (GPFS) team >>> >>> >>>------------------------------------------------------------------------ >>>- >>>----------------------------------------- >>> If you feel that your question can benefit other users of Spectrum >>>Scale >>> (GPFS), then please post it to the public IBM developerWroks Forum at >>> >>>https://www.ibm.com/developerworks/community/forums/html/forum?id=111111 >>>1 >>>1-0000-0000-0000-000000000479. >>> >>> >>> If your query concerns a potential software error in Spectrum Scale >>> (GPFS) and you have an IBM software maintenance contract please contact >>> 1-800-237-5511 in the United States or your local IBM Service Center in >>> other countries. >>> >>> The forum is informally monitored as time permits and should not be >>>used >>> for priority messages to the Spectrum Scale (GPFS) team. >>> >>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >>> there a way to change the default value of a configuratiAaron Knister >>> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >>> of a configuration option without overriding any overrid >>> >>> From: Aaron Knister >>> To: gpfsug main discussion list >>> Date: 10/06/2017 06:30 PM >>> Subject: [gpfsug-discuss] changing default configuration values >>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>> >>> >>>------------------------------------------------------------------------ >>> >>> >>> >>> Is there a way to change the default value of a configuration option >>> without overriding any overrides in place? >>> >>> Take the following situation: >>> >>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>> n001,n002) >>> >>> Is there a way to then set the default value of foo to qux without >>> changing the value of foo for nodes n001 and n002? 
>>> >>> -Aaron >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> >>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_l >>>i >>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2S >>>b >>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_ >>>4 >>>zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>> >>> >>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> >>-- >>Aaron Knister >>NASA Center for Climate Simulation (Code 606.2) >>Goddard Space Flight Center >>(301) 286-2776 >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Tue Oct 10 13:32:25 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 10 Oct 2017 08:32:25 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: Message-ID: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Simon, Does that mean node classes don't work the way individual node names do with the "-i/-I" options? -Aaron On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: > Yes, but obviously only when you recycle mmfsd on the node after adding it > to the node class, e.g. page pool cannot be changed online. > > We do this all the time, e.g. We have nodes with different IB > fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have classes > for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different > DCs etc) > > Simon > > On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of scottg at emailhosting.com" behalf of scottg at emailhosting.com> wrote: > >> So when a node is added to the node class, my defaults" will be applied? >> If so,excellent. Thanks >> >> >> Original Message >> From: S.J.Thompson at bham.ac.uk >> Sent: October 10, 2017 8:02 AM >> To: gpfsug-discuss at spectrumscale.org >> Reply-to: gpfsug-discuss at spectrumscale.org >> Subject: Re: [gpfsug-discuss] changing default configuration values >> >> Use mmchconfig and change the defaults, and then have a node class for >> "not the defaults"? >> >> Apply settings to a node class and add all new clients to the node class? >> >> Note there was some version of Scale where node classes were enumerated >> when the config was set for the node class, but in (4.2.3 at least), this >> works as expected, I.e. 
The node class is not expanded when doing >> mmchconfig -N >> >> Simon >> >> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So, I think brings up one of the slight frustrations I've always had with >>> mmconfig.. >>> >>> If I have a cluster to which new nodes will eventually be added, OR, I >>> have standard I always wish to apply, there is no way to say "all FUTURE" >>> nodes need to have my defaults.. I just have to remember to extended the >>> changes in as new nodes are brought into the cluster. >>> >>> Is there a way to accomplish this? >>> Thanks >>> >>> Original Message >>> From: aaron.s.knister at nasa.gov >>> Sent: October 9, 2017 2:56 PM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Thanks! Good to know. >>> >>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>> Hi Aaron, >>>> >>>> The default value applies to all nodes in the cluster. Thus changing it >>>> will change all nodes in the cluster. You need to run mmchconfig to >>>> customize the node override again. >>>> >>>> >>>> Regards, The Spectrum Scale (GPFS) team >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> - >>>> ----------------------------------------- >>>> If you feel that your question can benefit other users of Spectrum >>>> Scale >>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>> >>>> https://www.ibm.com/developerworks/community/forums/html/forum?id=111111 >>>> 1 >>>> 1-0000-0000-0000-000000000479. >>>> >>>> >>>> If your query concerns a potential software error in Spectrum Scale >>>> (GPFS) and you have an IBM software maintenance contract please contact >>>> 1-800-237-5511 in the United States or your local IBM Service Center in >>>> other countries. >>>> >>>> The forum is informally monitored as time permits and should not be >>>> used >>>> for priority messages to the Spectrum Scale (GPFS) team. >>>> >>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 PM---Is >>>> there a way to change the default value of a configuratiAaron Knister >>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default value >>>> of a configuration option without overriding any overrid >>>> >>>> From: Aaron Knister >>>> To: gpfsug main discussion list >>>> Date: 10/06/2017 06:30 PM >>>> Subject: [gpfsug-discuss] changing default configuration values >>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>> >>>> >>>> ------------------------------------------------------------------------ >>>> >>>> >>>> >>>> Is there a way to change the default value of a configuration option >>>> without overriding any overrides in place? >>>> >>>> Take the following situation: >>>> >>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>> n001,n002) >>>> >>>> Is there a way to then set the default value of foo to qux without >>>> changing the value of foo for nodes n001 and n002? 
>>>> >>>> -Aaron >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_l >>>> i >>>> stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2S >>>> b >>>> on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL11_ >>>> 4 >>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>> >>>> >>>> >>>> >>>> >>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Tue Oct 10 13:36:14 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 12:36:14 +0000 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> References: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. 
Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? >>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. 
>>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? >>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Tue Oct 10 15:45:32 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 10:45:32 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: 
<8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: It's always helpful to check and confirm that you get what you expected. mmlsconfig shows the value in the configuration and "mmfsadm dump config" shows the value in the GPFS daemon currently running. [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1073741824 c10f1n11.gpfs.net: pagepool 1073741824 [root at c10f1n11 gitr]# mmchconfig pagepool=1500M -i -N c69bc2xn03 mmchconfig: Command successfully completed mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. [root at c10f1n11 gitr]# Tue Oct 10 10:36:49 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation started [root at c10f1n11 gitr]# Tue Oct 10 10:36:52 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0 [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G pagepool 1500M [c69bc2xn03] [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1572864000 c10f1n11.gpfs.net: pagepool 1073741824 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/10/2017 08:36 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. 
Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? >>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. 
>>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? >>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> _______________________________________________ >>> gpfsug-discuss mailing 
list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From scale at us.ibm.com Tue Oct 10 15:51:37 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 10:51:37 -0400 Subject: [gpfsug-discuss] changing default configuration values In-Reply-To: References: <8ba78966-ee2f-6ebc-67fa-bc8de9e0a583@nasa.gov> Message-ID: For a customer production system, "mmdiag --config" rather than "mmfsadm dump config" should be used. The mmdiag command is meant for end users while the "mmfsadm dump" command is a service aid that carries greater risks. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: IBM Spectrum Scale/Poughkeepsie/IBM To: gpfsug main discussion list Date: 10/10/2017 10:48 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: Enci Zhong It's always helpful to check and confirm that you get what you expected. mmlsconfig shows the value in the configuration and "mmfsadm dump config" shows the value in the GPFS daemon currently running. 
[root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1073741824 c10f1n11.gpfs.net: pagepool 1073741824 [root at c10f1n11 gitr]# mmchconfig pagepool=1500M -i -N c69bc2xn03 mmchconfig: Command successfully completed mmchconfig: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process. [root at c10f1n11 gitr]# Tue Oct 10 10:36:49 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation started [root at c10f1n11 gitr]# Tue Oct 10 10:36:52 EDT 2017: mmcommon pushSdr_async: mmsdrfs propagation completed; mmdsh rc=0 [root at c10f1n11 gitr]# mmlsconfig pagepool pagepool 1G pagepool 1500M [c69bc2xn03] [root at c69bc2xn03 gitr]# mmdsh -N all mmdiag --config |grep "pagepool " c69bc2xn03.gpfs.net: pagepool 1572864000 c10f1n11.gpfs.net: pagepool 1073741824 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/10/2017 08:36 AM Subject: Re: [gpfsug-discuss] changing default configuration values Sent by: gpfsug-discuss-bounces at spectrumscale.org They do, but ... I don't know what happens to a running node if its then added to a nodeclass. Ie.. Would it apply the options it can immediately, or only once the node is recycled? Pass... Making an mmchconfig change to the node class after its a member would work as expected. Simon On 10/10/2017, 13:32, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Aaron Knister" wrote: >Simon, > >Does that mean node classes don't work the way individual node names do >with the "-i/-I" options? > >-Aaron > >On 10/10/17 8:30 AM, Simon Thompson (IT Research Support) wrote: >> Yes, but obviously only when you recycle mmfsd on the node after adding >>it >> to the node class, e.g. page pool cannot be changed online. >> >> We do this all the time, e.g. We have nodes with different IB >> fabrics/cards in clusters, so use mlx4_0/... And mlx5/... And have >>classes >> for (e.g.) "FDR" and "EDR" nodes. (different fabric numbers in different >> DCs etc) >> >> Simon >> >> On 10/10/2017, 13:04, "gpfsug-discuss-bounces at spectrumscale.org on >>behalf >> of scottg at emailhosting.com" > behalf of scottg at emailhosting.com> wrote: >> >>> So when a node is added to the node class, my defaults" will be >>>applied? >>> If so,excellent. Thanks >>> >>> >>> Original Message >>> From: S.J.Thompson at bham.ac.uk >>> Sent: October 10, 2017 8:02 AM >>> To: gpfsug-discuss at spectrumscale.org >>> Reply-to: gpfsug-discuss at spectrumscale.org >>> Subject: Re: [gpfsug-discuss] changing default configuration values >>> >>> Use mmchconfig and change the defaults, and then have a node class for >>> "not the defaults"? 
>>> >>> Apply settings to a node class and add all new clients to the node >>>class? >>> >>> Note there was some version of Scale where node classes were enumerated >>> when the config was set for the node class, but in (4.2.3 at least), >>>this >>> works as expected, I.e. The node class is not expanded when doing >>> mmchconfig -N >>> >>> Simon >>> >>> On 10/10/2017, 10:49, "gpfsug-discuss-bounces at spectrumscale.org on >>>behalf >>> of scottg at emailhosting.com" >>on >>> behalf of scottg at emailhosting.com> wrote: >>> >>>> So, I think brings up one of the slight frustrations I've always had >>>>with >>>> mmconfig.. >>>> >>>> If I have a cluster to which new nodes will eventually be added, OR, I >>>> have standard I always wish to apply, there is no way to say "all >>>>FUTURE" >>>> nodes need to have my defaults.. I just have to remember to extended >>>>the >>>> changes in as new nodes are brought into the cluster. >>>> >>>> Is there a way to accomplish this? >>>> Thanks >>>> >>>> Original Message >>>> From: aaron.s.knister at nasa.gov >>>> Sent: October 9, 2017 2:56 PM >>>> To: gpfsug-discuss at spectrumscale.org >>>> Reply-to: gpfsug-discuss at spectrumscale.org >>>> Subject: Re: [gpfsug-discuss] changing default configuration values >>>> >>>> Thanks! Good to know. >>>> >>>> On 10/6/17 11:06 PM, IBM Spectrum Scale wrote: >>>>> Hi Aaron, >>>>> >>>>> The default value applies to all nodes in the cluster. Thus changing >>>>>it >>>>> will change all nodes in the cluster. You need to run mmchconfig to >>>>> customize the node override again. >>>>> >>>>> >>>>> Regards, The Spectrum Scale (GPFS) team >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> - >>>>> ----------------------------------------- >>>>> If you feel that your question can benefit other users of Spectrum >>>>> Scale >>>>> (GPFS), then please post it to the public IBM developerWroks Forum at >>>>> >>>>> >>>>>https://www.ibm.com/developerworks/community/forums/html/forum?id=1111 >>>>>11 >>>>> 1 >>>>> 1-0000-0000-0000-000000000479. >>>>> >>>>> >>>>> If your query concerns a potential software error in Spectrum Scale >>>>> (GPFS) and you have an IBM software maintenance contract please >>>>>contact >>>>> 1-800-237-5511 in the United States or your local IBM Service Center >>>>>in >>>>> other countries. >>>>> >>>>> The forum is informally monitored as time permits and should not be >>>>> used >>>>> for priority messages to the Spectrum Scale (GPFS) team. >>>>> >>>>> Inactive hide details for Aaron Knister ---10/06/2017 06:30:20 >>>>>PM---Is >>>>> there a way to change the default value of a configuratiAaron Knister >>>>> ---10/06/2017 06:30:20 PM---Is there a way to change the default >>>>>value >>>>> of a configuration option without overriding any overrid >>>>> >>>>> From: Aaron Knister >>>>> To: gpfsug main discussion list >>>>> Date: 10/06/2017 06:30 PM >>>>> Subject: [gpfsug-discuss] changing default configuration values >>>>> Sent by: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> >>>>> >>>>>---------------------------------------------------------------------- >>>>>-- >>>>> >>>>> >>>>> >>>>> Is there a way to change the default value of a configuration option >>>>> without overriding any overrides in place? 
>>>>> >>>>> Take the following situation: >>>>> >>>>> - I set parameter foo=bar for all nodes (mmchconfig foo=bar) >>>>> - I set parameter foo to baz for a few nodes (mmchconfig foo=baz -N >>>>> n001,n002) >>>>> >>>>> Is there a way to then set the default value of foo to qux without >>>>> changing the value of foo for nodes n001 and n002? >>>>> >>>>> -Aaron >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> >>>>> >>>>>https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman >>>>>_l >>>>> i >>>>> >>>>>stinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM >>>>>2S >>>>> b >>>>> >>>>>on4Lbbi4w&m=PvuGmTBHKi7CNLU4X2GzUXkHzzezwTSrL4EdgwI0wrk&s=ma-IogZTBRL1 >>>>>1_ >>>>> 4 >>>>> zp2l8JqWmpHajoLXubAPSIS3K7GY&e= >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>>> >>>> >>>> -- >>>> Aaron Knister >>>> NASA Center for Climate Simulation (Code 606.2) >>>> Goddard Space Flight Center >>>> (301) 286-2776 >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= >> > >-- >Aaron Knister >NASA Center for Climate Simulation (Code 606.2) >Goddard Space Flight Center >(301) 
286-2776 >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=e65BFH8sbo9Rz_8O15Lp81iEl8c7sLX9XVoktigGCKQ&s=vx5MjLOzZyNXgWwOW65YgbWPFnu7xjYkfRqdXdj5_JE&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From scale at us.ibm.com Tue Oct 10 16:09:20 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 10 Oct 2017 11:09:20 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. 
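As a quick illustration only (a sketch, not an official procedure, and assuming the default log location /var/adm/ras/mmfs.log.latest), one way to check whether any node in the cluster has logged such reconnect events might be:

mmdsh -N all "grep 'Reconnected to' /var/adm/ras/mmfs.log.latest"

Any matches only show that a reconnect has occurred at some point; as noted above, that by itself does not mean data or the file system has been corrupted.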
Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Leo.Earl at uea.ac.uk Tue Oct 10 16:29:47 2017 From: Leo.Earl at uea.ac.uk (Leo Earl (ITCS - Staff)) Date: Tue, 10 Oct 2017 15:29:47 +0000 Subject: [gpfsug-discuss] AFM fun (more!) 
In-Reply-To: References: Message-ID: Hi Simon, (My first ever post - queue being shot down in flames) Whilst this doesn't answer any of your questions (directly) One thing we do tend to look at when we see (what appears) to be a static "Queue Length" value, is the data which is actually inflight from an AFM perspective, so that we can ascertain whether the reason is something like, a user writing huge files to the cache, which take time to sync with home, and thus remain in the queue, providing a static "Queue Length" [root at server ~]# mmafmctl gpfs getstate | awk ' $6 >=50 ' Fileset Name Fileset Target Fileset State Gateway Node Queue State Queue Length Queue numExec afmcancergenetics :/leo/res_hpc/afm-leo Dirty csgpfs01 Active 60 126157822 [root at server ~]# So for instance, by using the tsfindinode command, to have a look at the size of the file which is currently "inflight" from an AFM perspective: [root at server ~]# mmfsadm dump afm | more Fileset: afm-leo 12 (gpfs) mode: independent-writer queue: Normal myFileset MDS: home: proto: nfs port: 2049 lastCmd: 6 handler: Mounted Dirty refCount: 5 queue: delay 300 QLen 0+9 flushThds 4 maxFlushThds 4 numExec 436 qfs 0 err 0 i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 1 i/o: pReadGWs 0 (All) pReadChunkSize 134217728 pReadThresh: -2 >> Disabled << i/o: prefetchThresh 0 (Prefetch) iw: afmIwTakeoverTime 0 Priority Queue: (listed by execution order) (state: Active) Write [601414711.601379492] inflight (377743888481 @ 0) chunks 0 bytes 0 0 thread_id 7630 Write [601717612.601465868] inflight (462997479227 @ 0) chunks 0 bytes 0 1 thread_id 10200 Write [601717612.601465870] inflight (391663667550 @ 0) chunks 0 bytes 0 2 thread_id 10287 Write [601717612.601465871] inflight (377743888481 @ 0) chunks 0 bytes 0 3 thread_id 10333 Write [601717612.601573418] queued (387002794104 @ 0) chunks 0 bytes 0 4 Write [601414711.601650195] queued (342305480538 @ 0) chunks 0 bytes 0 5 ResetDirty [538455296.-1] queued etype normal normal 19061213 ResetDirty [601623334.-1] queued etype normal normal 19061213 RecoveryMarker [-1.-1] queued etype normal normal 19061213 Normal Queue: Empty (state: Active) Fileset: afmdata 20 (gpfs) #Use the file inode ID to determine the actual file which is inflight between cache and home [root at server cancergenetics]# tsfindinode -i 601379492 /gpfs/afm/cancergenetics > inode.out [root at server ~]# cat /root/inode.out 601379492 0 0xCCB6 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server ~]# ls -l /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq -rw-r--r-- 1 bbn16cku MED_pg 377743888481 Mar 25 05:48 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server I am not sure if that helps and you probably already know about it inflight checking... Kind Regards, Leo Leo Earl | Head of Research & Specialist Computing Room ITCS 01.16, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ +44 (0) 1603 593856 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 October 2017 05:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM fun (more!) Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. 
This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? 
Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Oct 10 17:03:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 10 Oct 2017 16:03:35 +0000 Subject: [gpfsug-discuss] AFM fun (more!) In-Reply-To: References: Message-ID: So as you might expect, we've been poking at this all day. We'd typically get to ~1000 entries in the queue having taken access to the FS away from users (yeah its that bad), but the remaining items would stay for ever as far as we could see. By copying the file, removing and then moving the copied file, we're able to get it back into a clean state. But then we ran a sample user job, and instantly the next job hung up the queue (we're talking like <100MB files here). Interestingly we looked at the queue to see what was going on (with saferdump, always use saferdump!!!) Normal Queue: (listed by execution order) (state: Active) 95 Write [6060026.6060026] inflight (18 @ 0) thread_id 44812 96 Write [13808655.13808655] queued (18 @ 0) 97 Truncate [6060026] queued 98 Truncate [13808655] queued 124 Write [6060000.6060000] inflight (18 @ 0) thread_id 44835 125 Truncate [6060000] queued 159 Write [6060013.6060013] inflight (18 @ 0) thread_id 21329 160 Truncate [6060013] queued 171 Write [5953611.5953611] inflight (18 @ 0) thread_id 44837 172 Truncate [5953611] queued Note that each inode that is inflight is followed by a queued Truncate... We are running efix2, because there is an issue with truncate not working prior to this (it doesn't get sent to home), so this smells like an AFM bug to me. We have a PMR open... Simon From: > on behalf of "Leo Earl (ITCS - Staff)" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 10 October 2017 at 16:29 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] AFM fun (more!) Hi Simon, (My first ever post ? queue being shot down in flames) Whilst this doesn?t answer any of your questions (directly) One thing we do tend to look at when we see (what appears) to be a static ?Queue Length? value, is the data which is actually inflight from an AFM perspective, so that we can ascertain whether the reason is something like, a user writing huge files to the cache, which take time to sync with home, and thus remain in the queue, providing a static ?Queue Length? [root at server ~]# mmafmctl gpfs getstate | awk ' $6 >=50 ' Fileset Name Fileset Target Fileset State Gateway Node Queue State Queue Length Queue numExec afmcancergenetics :/leo/res_hpc/afm-leo Dirty csgpfs01 Active 60 126157822 [root at server ~]# So for instance, by using the tsfindinode command, to have a look at the size of the file which is currently ?inflight? 
from an AFM perspective: [root at server ~]# mmfsadm dump afm | more Fileset: afm-leo 12 (gpfs) mode: independent-writer queue: Normal myFileset MDS: home: proto: nfs port: 2049 lastCmd: 6 handler: Mounted Dirty refCount: 5 queue: delay 300 QLen 0+9 flushThds 4 maxFlushThds 4 numExec 436 qfs 0 err 0 i/o: readBuf: 33554432 writeBuf: 2097152 sparseReadThresh: 134217728 pReadThreads 1 i/o: pReadGWs 0 (All) pReadChunkSize 134217728 pReadThresh: -2 >> Disabled << i/o: prefetchThresh 0 (Prefetch) iw: afmIwTakeoverTime 0 Priority Queue: (listed by execution order) (state: Active) Write [601414711.601379492] inflight (377743888481 @ 0) chunks 0 bytes 0 0 thread_id 7630 Write [601717612.601465868] inflight (462997479227 @ 0) chunks 0 bytes 0 1 thread_id 10200 Write [601717612.601465870] inflight (391663667550 @ 0) chunks 0 bytes 0 2 thread_id 10287 Write [601717612.601465871] inflight (377743888481 @ 0) chunks 0 bytes 0 3 thread_id 10333 Write [601717612.601573418] queued (387002794104 @ 0) chunks 0 bytes 0 4 Write [601414711.601650195] queued (342305480538 @ 0) chunks 0 bytes 0 5 ResetDirty [538455296.-1] queued etype normal normal 19061213 ResetDirty [601623334.-1] queued etype normal normal 19061213 RecoveryMarker [-1.-1] queued etype normal normal 19061213 Normal Queue: Empty (state: Active) Fileset: afmdata 20 (gpfs) #Use the file inode ID to determine the actual file which is inflight between cache and home [root at server cancergenetics]# tsfindinode -i 601379492 /gpfs/afm/cancergenetics > inode.out [root at server ~]# cat /root/inode.out 601379492 0 0xCCB6 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server ~]# ls -l /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq -rw-r--r-- 1 bbn16cku MED_pg 377743888481 Mar 25 05:48 /gpfs/afm/cancergenetics/Claudia/fastq/PD7446i.fastq [root at server I am not sure if that helps and you probably already know about it inflight checking? Kind Regards, Leo Leo Earl | Head of Research & Specialist Computing Room ITCS 01.16, University of East Anglia, Norwich Research Park, Norwich NR4 7TJ +44 (0) 1603 593856 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 October 2017 05:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] AFM fun (more!) Simon, >Question 1. >Can we force the gateway node for the other file-sets to our "02" node. >I.e. So that we can get the queue services for the other filesets. AFM automatically maps the fileset to gateway node, and today there is no option available for users to assign fileset to a particular gateway node. This feature will be supported in future releases. >Question 2. >How can we make AFM actually work for the "facility" file-set. If we shut >down GPFS on the node, on the secondary node, we'll see log entires like: >2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove >operations... >So I'm assuming the massive queue is all file remove operations? These are the files which were created in cache, and were deleted before they get replicated to home. AFM recovery will delete them locally. Yes, it is possible that most of these operations are local remove operations.Try finding those operations using dump command. 
mmfsadm saferdump afm all | grep 'Remove\|Rmdir' | grep local | wc -l >Alarmingly, we are also seeing entires like: >2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache >fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote >error 5 Traces are needed to verify IO errors. Also try disabling the parallel IO and see if replication speed improves. mmchfileset device fileset -p afmParallelWriteThreshold=disable ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/09/2017 06:27 PM Subject: [gpfsug-discuss] AFM fun (more!) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, We're having fun (ok not fun ...) with AFM. We have a file-set where the queue length isn't shortening, watching it over 5 sec periods, the queue length increases by ~600-1000 items, and the numExec goes up by about 15k. The queues are steadily rising and we've seen them over 1000000 ... This is on one particular fileset e.g.: mmafmctl rds-cache getstate Mon Oct 9 08:43:58 2017 Fileset Name Fileset Target Cache State Gateway Node Queue Length Queue numExec ------------ -------------- ------------- ------------ ------------ ------------- rds-projects-facility gpfs:///rds/projects/facility Dirty bber-afmgw01 3068953 520504 rds-projects-2015 gpfs:///rds/projects/2015 Active bber-afmgw01 0 3 rds-projects-2016 gpfs:///rds/projects/2016 Dirty bber-afmgw01 1482 70 rds-projects-2017 gpfs:///rds/projects/2017 Dirty bber-afmgw01 713 9104 bear-apps gpfs:///rds/bear-apps Dirty bber-afmgw02 3 2472770871 user-homes gpfs:///rds/homes Active bber-afmgw02 0 19 bear-sysapps gpfs:///rds/bear-sysapps Active bber-afmgw02 0 4 This is having the effect that other filesets on the same "Gateway" are not getting their queues processed. Question 1. Can we force the gateway node for the other file-sets to our "02" node. I.e. So that we can get the queue services for the other filesets. Question 2. How can we make AFM actually work for the "facility" file-set. If we shut down GPFS on the node, on the secondary node, we'll see log entires like: 2017-10-09_13:35:30.330+0100: [I] AFM: Found 1069575 local remove operations... So I'm assuming the massive queue is all file remove operations? Alarmingly, we are also seeing entires like: 2017-10-09_13:54:26.591+0100: [E] AFM: WriteSplit file system rds-cache fileset rds-projects-2017 file IDs [5389550.5389550.-1.-1,R] name remote error 5 Anyone any suggestions? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=_THXlsTtzTaQQnCD5iwucKoQnoVZmXwtZksU6YDO5O8&s=LlIrCk36ptPJs1Oix2ekZdUAMcH7ZE7GRlKzRK1_NPI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Oct 10 19:00:55 2017 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 10 Oct 2017 18:00:55 +0000 Subject: [gpfsug-discuss] Recommended pagepool size on clients? 
In-Reply-To: References: Message-ID: if this is a new cluster and you use reasonable new HW, i probably would start with just the following settings on the clients : pagepool=4g,workerThreads=256,maxStatCache=0,maxFilesToCache=256k depending on what storage you use and what workload you have you may have to set a couple of other settings too, but that should be a good start. we plan to make this whole process significant easier in the future, The Next Major Scale release will eliminate the need for another ~20 parameters in special cases and we will simplify the communication setup a lot too. beyond that we started working on introducing tuning suggestions based on the running system environment but there is no release targeted for that yet. Sven On Tue, Oct 10, 2017 at 1:42 AM John Hearns wrote: > May I ask how to size pagepool on clients? Somehow I hear an enormous tin > can being opened behind me? and what sounds like lots of worms? > > > > Anyway, I currently have mmhealth reporting gpfs_pagepool_small. Pagepool > is set to 1024M on clients, > > and I now note the documentation says you get this warning when pagepool > is lower or equal to 1GB > > We did do some IOR benchmarking which shows better performance with an > increased pagepool size. > > > > I am looking for some rules of thumb for sizing for an 128Gbyte RAM client. > > And yup, I know the answer will be ?depends on your workload? > > I agree though that 1024M is too low. > > > > Illya,kuryakin at uncle.int > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bdeluca at gmail.com Tue Oct 10 19:51:28 2017 From: bdeluca at gmail.com (Ben De Luca) Date: Tue, 10 Oct 2017 20:51:28 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: > Bob, > > The problem may occur when the TCP connection is broken between two nodes. 
> While in the vast majority of the cases when data stops flowing through the > connection, the result is one of the nodes getting expelled, there are > cases where the TCP connection simply breaks -- that is relatively rare but > happens on occasion. There is logic in the mmfsd daemon to detect the > disconnection and attempt to reconnect to the destination in question. If > the reconnect is successful then steps are taken to recover the state kept > by the daemons, and that includes resending some RPCs that were in flight > when the disconnection took place. > > As the flash describes, a problem in the logic to resend some RPCs was > causing one of the RPC headers to be omitted, resulting in the RPC data to > be interpreted as the (missing) header. Normally the result is an assert on > the receiving end, like the "logAssertFailed: !"Request and queue size > mismatch" assert described in the flash. However, it's at least > conceivable (though expected to very rare) that the content of the RPC data > could be interpreted as a valid RPC header. In the case of an RPC which > involves data transfer between an NSD client and NSD server, that might > result in incorrect data being written to some NSD device. > > Disconnect/reconnect scenarios appear to be uncommon. An entry like > > [N] Reconnected to xxx.xxx.xxx.xxx nodename > > in mmfs.log would be an indication that a reconnect has occurred. By > itself, the reconnect will not imply that data or the file system was > corrupted, since that will depend on what RPCs were pending when the > connection happened. In the case the assert above is hit, no corruption is > expected, since the daemon will go down before incorrect data gets written. > > Reconnects involving an NSD server are those which present the highest > risk, given that NSD-related RPCs are used to write data into NSDs > > Even on clusters that have not been subjected to disconnects/reconnects > before, such events might still happen in the future in case of network > glitches. It's then recommended that an efix for the problem be applied in > a timely fashion. > > > Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 > > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------ > ------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/ > forums/html/forum?id=11111111-0000-0000-0000-000000000479. > > If your query concerns a potential software error in Spectrum Scale (GPFS) > and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > > > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM > Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale > (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file > system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Can anyone from the Scale team comment? > > Anytime I see ?may result in file system corruption or undetected file > data corruption? it gets my attention. 
> > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > *Storage * > IBM My Notifications > Check out the *IBM Electronic Support* > > > > IBM Spectrum Scale > *: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect > function may result in file system corruption or undetected file data > corruption* > > IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 > levels, in which resending an NSD RPC after a network reconnect function > may result in file system corruption or undetected file data corruption. > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r= > IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=- > AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Tue Oct 10 23:15:11 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 00:15:11 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. 
While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. 
Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bdeluca at gmail.com Wed Oct 11 05:40:21 2017 From: bdeluca at gmail.com (Ben De Luca) Date: Wed, 11 Oct 2017 06:40:21 +0200 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: > Hi, I understood the failure to occur requires that the RPC payload of > the RPC resent without actual header can be mistaken for a valid RPC > header. The resend mechanism is probably not considering what the actual > content/target the RPC has. > So, in principle, the RPC could be to update a data block, or a metadata > block - so it may hit just a single data file or corrupt your entire file > system. > However, I think the likelihood that the RPC content can go as valid RPC > header is very low. > > > Mit freundlichen Gr??en / Kind regards > > > Dr. Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Thomas Wolter, Sven Schoo? 
> Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org > Date: 10/10/2017 08:52 PM > Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum > Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in > file system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > does this corrupt the entire filesystem or just the open files that are > being written too? > > One is horrific and the other is just mildly bad. > > On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: > Bob, > > The problem may occur when the TCP connection is broken between two nodes. > While in the vast majority of the cases when data stops flowing through > the connection, the result is one of the nodes getting expelled, there are > cases where the TCP connection simply breaks -- that is relatively rare > but happens on occasion. There is logic in the mmfsd daemon to detect the > disconnection and attempt to reconnect to the destination in question. If > the reconnect is successful then steps are taken to recover the state kept > by the daemons, and that includes resending some RPCs that were in flight > when the disconnection took place. > > As the flash describes, a problem in the logic to resend some RPCs was > causing one of the RPC headers to be omitted, resulting in the RPC data to > be interpreted as the (missing) header. Normally the result is an assert > on the receiving end, like the "logAssertFailed: !"Request and queue size > mismatch" assert described in the flash. However, it's at least > conceivable (though expected to very rare) that the content of the RPC > data could be interpreted as a valid RPC header. In the case of an RPC > which involves data transfer between an NSD client and NSD server, that > might result in incorrect data being written to some NSD device. > > Disconnect/reconnect scenarios appear to be uncommon. An entry like > > [N] Reconnected to xxx.xxx.xxx.xxx nodename > > in mmfs.log would be an indication that a reconnect has occurred. By > itself, the reconnect will not imply that data or the file system was > corrupted, since that will depend on what RPCs were pending when the > connection happened. In the case the assert above is hit, no corruption is > expected, since the daemon will go down before incorrect data gets > written. > > Reconnects involving an NSD server are those which present the highest > risk, given that NSD-related RPCs are used to write data into NSDs > > Even on clusters that have not been subjected to disconnects/reconnects > before, such events might still happen in the future in case of network > glitches. It's then recommended that an efix for the problem be applied in > a timely fashion. > > > Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 > > > > Regards, The Spectrum Scale (GPFS) team > > ------------------------------------------------------------ > ------------------------------------------------------ > If you feel that your question can benefit other users of Spectrum Scale > (GPFS), then please post it to the public IBM developerWroks Forum at > https://www.ibm.com/developerworks/community/ > forums/html/forum?id=11111111-0000-0000-0000-000000000479 > . 
> > If your query concerns a potential software error in Spectrum Scale (GPFS) > and you have an IBM software maintenance contract please contact > 1-800-237-5511 in the United States or your local IBM Service Center in > other countries. > > The forum is informally monitored as time permits and should not be used > for priority messages to the Spectrum Scale (GPFS) team. > > > > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM > Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale > (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file > system corruption or undetected file data corruption (2017.10.09) > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > Can anyone from the Scale team comment? > > Anytime I see ?may result in file system corruption or undetected file > data corruption? it gets my attention. > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > > > > > > > > Storage > IBM My Notifications > Check out the IBM Electronic Support > > > > > > > > IBM Spectrum Scale > > > > : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect > function may result in file system corruption or undetected file data > corruption > > > > IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 > levels, in which resending an NSD RPC after a network reconnect function > may result in file system corruption or undetected file data corruption. > > > > > > > > > > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r= > IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=- > AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Wed Oct 11 07:08:33 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 11 Oct 2017 06:08:33 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? 
Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. 
However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. 
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From arc at b4restore.com Wed Oct 11 08:46:03 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 07:46:03 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Oct 11 09:01:47 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 11 Oct 2017 09:01:47 +0100 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Message-ID: <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From arc at b4restore.com Wed Oct 11 09:18:01 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 08:18:01 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> Message-ID: <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> Hi Jonathan, Yes I thought about that but the system is located at a customer site and they are not willing to do that, unfortunately. That's why I was hoping there was a way around it Andi R. Christiansen -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard Sent: 11. oktober 2017 10:02 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 09:32:37 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 08:32:37 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Message-ID: I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given its cluster configuration, I wouldn't be doing this on a cluster I cared about with explicit instruction from IBM support. So I suggest log a ticket with IBM. Simon From: > on behalf of "arc at b4restore.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 11 October 2017 at 08:46 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. 
But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Oct 11 09:46:46 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 08:46:46 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors Message-ID: I'm just wondering if anyone could share any views on checking a file-system for errors. For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode? And then were does mmrestripefs -c fit into this? "-c Scans the file system and compares replicas of metadata and data for conflicts. When conflicts are found, the -c option attempts to fix the replicas. " Which sorta sounds like fix things in the file-system, so how does that intersect (if at all) with mmfsck? Thanks Simon From jonathan.buzzard at strath.ac.uk Wed Oct 11 09:53:34 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 11 Oct 2017 09:53:34 +0100 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> Message-ID: <1507712014.9906.5.camel@strath.ac.uk> On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote: > Hi Jonathan, > > Yes I thought about that but the system is located at a customer site > and they are not willing to do that, unfortunately. > > That's why I was hoping there was a way around it > I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From arc at b4restore.com Wed Oct 11 09:59:20 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 08:59:20 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: <1507712014.9906.5.camel@strath.ac.uk> References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> <8b9180bf-0bef-4e42-020b-28a9610012a1@strath.ac.uk> <9fcbdf3fa2df4df5bd25f4e93d2a3e79@B4RWEX01.internal.b4restore.com> <1507712014.9906.5.camel@strath.ac.uk> Message-ID: Yes i think my last resort might be to go to customer with a separate switch and do the reconfiguration. Thanks ? -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard Sent: 11. 
oktober 2017 10:54 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote: > Hi Jonathan, > > Yes I thought about that but the system is located at a customer site > and they are not willing to do that, unfortunately. > > That's why I was hoping there was a way around it > I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From arc at b4restore.com Wed Oct 11 10:02:08 2017 From: arc at b4restore.com (Andi Rhod Christiansen) Date: Wed, 11 Oct 2017 09:02:08 +0000 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. In-Reply-To: References: <3e6e1727224143ac9b8488d16f40fcb3@B4RWEX01.internal.b4restore.com> Message-ID: <674e2c9b6c3f450b8f85b2d36a504597@B4RWEX01.internal.b4restore.com> Hi Simon, I will do that before I go to the customer with a separate switch as a last resort :) Thanks Venlig hilsen / Best Regards Andi Rhod Christiansen From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: 11. oktober 2017 10:33 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given its cluster configuration, I wouldn't be doing this on a cluster I cared about with explicit instruction from IBM support. So I suggest log a ticket with IBM. Simon From: > on behalf of "arc at b4restore.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 11 October 2017 at 08:46 To: "gpfsug-discuss at spectrumscale.org" > Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Wed Oct 11 11:19:13 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 12:19:13 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Hm , mmfsck will return not very reliable results in online mode, especially it will report many issues which are just due to the transient states in a files system in operation. It should however not find less issues than in off-line mode. mmrestripefs -c does not do any logical checks, it just checks for differences of multiple replicas of the same data/metadata. File system errors can be caused by such discrepancies (if an odd/corrupt replica is used by the GPFS), but can also be caused (probably more likely) by logical errors / bugs when metadata were modified in the file system. In those cases, all the replicas are identical nevertheless corrupt (cannot be found by mmrestripefs) So, mmrestripefs -c is like scrubbing for silent data corruption (on its own, it cannot decide which is the correct replica!), while mmfsck checks the filesystem structure for logical consistency. If the contents of the replicas of a data block differ, mmfsck won't see any problem (as long as the fs metadata are consistent), but mmrestripefs -c will. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Date: 10/11/2017 10:47 AM Subject: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org I'm just wondering if anyone could share any views on checking a file-system for errors. For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode? And then were does mmrestripefs -c fit into this? "-c Scans the file system and compares replicas of metadata and data for conflicts. When conflicts are found, the -c option attempts to fix the replicas. " Which sorta sounds like fix things in the file-system, so how does that intersect (if at all) with mmfsck? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 11:31:53 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 10:31:53 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? 
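Concretely, the sequence I have in mind is something like the following (just a sketch; "gpfs0" stands in for our file system device, and the offline pass assumes we can unmount it everywhere for a while):

   # online check, report only, file system stays mounted
   mmfsck gpfs0 -o -n

   # offline check, report only, needs the file system unmounted on all nodes
   mmumount gpfs0 -a
   mmfsck gpfs0 -n

   # compare data and metadata replicas and attempt to fix any mismatches
   mmrestripefs gpfs0 -c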
Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? > >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Wed Oct 11 11:58:52 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 12:58:52 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: If you do both, you are on the safe side. 
I am not sure wether mmfsck reads both replica of the metadata (if it it does, than one could spare the mmrestripefs -c WRT metadata, but I don't think so), if not, one could still have luckily checked using valid metadata where maybe one (or more) MD block has (have) an invalid replica which might come up another time ... But the mmfsrestripefs -c is not only ensuring the sanity of the FS but also of the data stored within (which is not necessarily the same). Mostly, however, filesystem checks are only done if fs issues are indicated by errors in the logs. Do you have reason to assume your fs has probs? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 12:32 PM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 
7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? > >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 12:22:26 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 11:22:26 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Yes I get we should only be doing this if we think we have a problem. And the answer is, right now, we're not entirely clear. We have a couple of issues our users are reporting to us, and its not clear to us if they are related, an FS problem or ACLs getting in the way. We do have users who are trying to work on files getting IO error, and we have an AFM sync issue. The disks are all online, I poked the FS with tsdbfs and the files look OK - (small files, but content of the block matches). Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error reported to user when they access a file even if its not an offline file??) We have a PMR open with IBM on this already. But there's a wanting to be sure in our own minds that we don't have an underlying FS problem. I.e. I have confidence that I can tell my users, yes I know you are seeing weird stuff, but we have run checks and are not introducing data corruption. Simon On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Mostly, however, filesystem checks are only done if fs issues are >indicated by errors in the logs. Do you have reason to assume your fs has >probs? 
From stockf at us.ibm.com Wed Oct 11 12:55:18 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Wed, 11 Oct 2017 07:55:18 -0400 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Generally you should not run mmfsck unless you see MMFS_FSSTRUCT errors in your system logs. To my knowledge online mmfsck only checks for a subset of problems, notably lost blocks, but that situation does not indicate any problems with the file system. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 06:32 AM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org OK thanks, So if I run mmfsck in online mode and it says: "File system is clean. Exit status 0:10:0." Then I can assume there is no benefit to running in offline mode? But it would also be prudent to run "mmrestripefs -c" to be sure my filesystem is happy? Thanks Simon On 11/10/2017, 11:19, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Hm , mmfsck will return not very reliable results in online mode, >especially it will report many issues which are just due to the transient >states in a files system in operation. >It should however not find less issues than in off-line mode. > >mmrestripefs -c does not do any logical checks, it just checks for >differences of multiple replicas of the same data/metadata. >File system errors can be caused by such discrepancies (if an odd/corrupt >replica is used by the GPFS), but can also be caused (probably more >likely) by logical errors / bugs when metadata were modified in the file >system. In those cases, all the replicas are identical nevertheless >corrupt (cannot be found by mmrestripefs) > >So, mmrestripefs -c is like scrubbing for silent data corruption (on its >own, it cannot decide which is the correct replica!), while mmfsck checks >the filesystem structure for logical consistency. >If the contents of the replicas of a data block differ, mmfsck won't see >any problem (as long as the fs metadata are consistent), but mmrestripefs >-c will. > > >Mit freundlichen Gr????en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch??ftsf??hrung: >Thomas Wolter, Sven Schoo?? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: "gpfsug-discuss at spectrumscale.org" > >Date: 10/11/2017 10:47 AM >Subject: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >I'm just wondering if anyone could share any views on checking a >file-system for errors. > >For example, we could use mmfsck in online and offline mode. Does online >mode detect errors (but not fix) things that would be found in offline >mode? 
> >And then were does mmrestripefs -c fit into this? > >"-c > Scans the file system and compares replicas of > metadata and data for conflicts. When conflicts > are found, the -c option attempts to fix > the replicas. >" > >Which sorta sounds like fix things in the file-system, so how does that >intersect (if at all) with mmfsck? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIFAw&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=V8K9eELGXftg3ELG2jV1OYptzOZ-j9OdBkpgvJXV_IM&s=MdJhKZ9vW4uhTesz1LqKiEWo6gZAEXjtgw0RXnlJSgY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Wed Oct 11 13:30:49 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 11 Oct 2017 08:30:49 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? 
Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. 
However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. 
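A quick way to confirm the level currently running on each node before scheduling the efix (a sketch; the package query assumes an RPM-based distribution):

   # report the running daemon build level
   mmdiag --version

   # and the installed base package version
   rpm -q gpfs.base

Nodes still on an affected 4.1.x or 4.2.x level without the efix should be updated as described in the flash above.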
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Wed Oct 11 15:01:54 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 11 Oct 2017 16:01:54 +0200 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: Usually, IO errors point to some basic problem reading/writing data . if there are repoducible errors, it's IMHO always a nice thing to trace GPFS for such an access. Often that reveals already the area where the cause lies and maybe even the details of it. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 10/11/2017 01:22 PM Subject: Re: [gpfsug-discuss] Checking a file-system for errors Sent by: gpfsug-discuss-bounces at spectrumscale.org Yes I get we should only be doing this if we think we have a problem. And the answer is, right now, we're not entirely clear. We have a couple of issues our users are reporting to us, and its not clear to us if they are related, an FS problem or ACLs getting in the way. We do have users who are trying to work on files getting IO error, and we have an AFM sync issue. The disks are all online, I poked the FS with tsdbfs and the files look OK - (small files, but content of the block matches). Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error reported to user when they access a file even if its not an offline file??) We have a PMR open with IBM on this already. But there's a wanting to be sure in our own minds that we don't have an underlying FS problem. I.e. 
I have confidence that I can tell my users, yes I know you are seeing weird stuff, but we have run checks and are not introducing data corruption. Simon On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Mostly, however, filesystem checks are only done if fs issues are >indicated by errors in the logs. Do you have reason to assume your fs has >probs? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Oct 11 15:13:03 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 11 Oct 2017 14:13:03 +0000 Subject: [gpfsug-discuss] Checking a file-system for errors In-Reply-To: References: Message-ID: So with the help of IBM support and Venkat (thanks guys!), we think its a problem with DMAPI. As we initially saw this as an issue with AFM replication, we had traces from there, and had entries like: gpfsWrite exit: failed err 688 Now apparently err 688 relates to "DMAPI disposition", once we had this we were able to get someone to take a look at the HSM dsmrecalld, it was running, but had failed over to a node that wasn't able to service requests properly. (multiple NSD servers with different file-systems each running dsmrecalld, but I don't think you can scope nods XYZ to filesystem ABC but not DEF). Anyway once we got that fixed, a bunch of stuff in the AFM cache popped out (and a little poke for some stuff that hadn't updated metadata cache probably). So hopefully its now also solved for our other users. What is complicated here is that a DMAPI issue was giving intermittent IO errors, people could write into new folders, but not existing files, though I could (some sort of Schr?dinger's cat IO issue??). So hopefully we are fixed... Simon On 11/10/2017, 15:01, "gpfsug-discuss-bounces at spectrumscale.org on behalf of UWEFALKE at de.ibm.com" wrote: >Usually, IO errors point to some basic problem reading/writing data . >if there are repoducible errors, it's IMHO always a nice thing to trace >GPFS for such an access. Often that reveals already the area where the >cause lies and maybe even the details of it. > > > > >Mit freundlichen Gr??en / Kind regards > > >Dr. Uwe Falke > >IT Specialist >High Performance Computing Services / Integrated Technology Services / >Data Center Services >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland >Rathausstr. 7 >09111 Chemnitz >Phone: +49 371 6978 2165 >Mobile: +49 175 575 2877 >E-Mail: uwefalke at de.ibm.com >-------------------------------------------------------------------------- >----------------------------------------------------------------- >IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: >Thomas Wolter, Sven Schoo? >Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, >HRB 17122 > > > > >From: "Simon Thompson (IT Research Support)" >To: gpfsug main discussion list >Date: 10/11/2017 01:22 PM >Subject: Re: [gpfsug-discuss] Checking a file-system for errors >Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > >Yes I get we should only be doing this if we think we have a problem. > >And the answer is, right now, we're not entirely clear. 
> >We have a couple of issues our users are reporting to us, and its not >clear to us if they are related, an FS problem or ACLs getting in the way. > >We do have users who are trying to work on files getting IO error, and we >have an AFM sync issue. The disks are all online, I poked the FS with >tsdbfs and the files look OK - (small files, but content of the block >matches). > >Maybe we have a problem with DMAPI and TSM/HSM (could that cause IO error >reported to user when they access a file even if its not an offline >file??) > >We have a PMR open with IBM on this already. > >But there's a wanting to be sure in our own minds that we don't have an >underlying FS problem. I.e. I have confidence that I can tell my users, >yes I know you are seeing weird stuff, but we have run checks and are not >introducing data corruption. > >Simon > >On 11/10/2017, 11:58, "gpfsug-discuss-bounces at spectrumscale.org on behalf >of UWEFALKE at de.ibm.com" behalf of UWEFALKE at de.ibm.com> wrote: > >>Mostly, however, filesystem checks are only done if fs issues are >>indicated by errors in the logs. Do you have reason to assume your fs has >>probs? > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From truongv at us.ibm.com Wed Oct 11 17:14:21 2017 From: truongv at us.ibm.com (Truong Vu) Date: Wed, 11 Oct 2017 12:14:21 -0400 Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to the network In-Reply-To: References: Message-ID: What you can do is create network alias to the old IP. Run mmchnode to change hostname/IP for non-quorum nodes first. Make one (or more) of the nodes you just change a quorum node. Change all of the quorum nodes that still on old IPs to non-quorum. Then change IPs on them. Thanks, Tru. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 10/11/2017 04:53 AM Subject: gpfsug-discuss Digest, Vol 69, Issue 26 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=21OH1KjxVbfDBz9Kdr0USitreLsyXEbP9rHC7Vxmhw0&e= or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Changing ip on spectrum scale cluster with every node down and not connected to network. (Andi Rhod Christiansen) 2. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Jonathan Buzzard) 3. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Andi Rhod Christiansen) 4. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Simon Thompson (IT Research Support)) 5. 
Checking a file-system for errors (Simon Thompson (IT Research Support)) 6. Re: Changing ip on spectrum scale cluster with every node down and not connected to network. (Jonathan Buzzard) ---------------------------------------------------------------------- Message: 1 Date: Wed, 11 Oct 2017 07:46:03 +0000 From: Andi Rhod Christiansen To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <3e6e1727224143ac9b8488d16f40fcb3 at B4RWEX01.internal.b4restore.com> Content-Type: text/plain; charset="us-ascii" Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... URL: < https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_pipermail_gpfsug-2Ddiscuss_attachments_20171011_820adb01_attachment-2D0001.html&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=NrezaW_ayd5u-bE6ppJ6p3FBluuDTtv6KHqb4TwaGsY&e= > ------------------------------ Message: 2 Date: Wed, 11 Oct 2017 09:01:47 +0100 From: Jonathan Buzzard To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <8b9180bf-0bef-4e42-020b-28a9610012a1 at strath.ac.uk> Content-Type: text/plain; charset=windows-1252; format=flowed On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ------------------------------ Message: 3 Date: Wed, 11 Oct 2017 08:18:01 +0000 From: Andi Rhod Christiansen To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <9fcbdf3fa2df4df5bd25f4e93d2a3e79 at B4RWEX01.internal.b4restore.com> Content-Type: text/plain; charset="us-ascii" Hi Jonathan, Yes I thought about that but the system is located at a customer site and they are not willing to do that, unfortunately. That's why I was hoping there was a way around it Andi R. 
Christiansen -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathan Buzzard Sent: 11. oktober 2017 10:02 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. On 11/10/17 08:46, Andi Rhod Christiansen wrote: [SNIP] > It is not possible to turn on the switch ports as the cluster has the > same ips right now as another cluster on the new switch. > Er, yes it is. Spin up a new temporary VLAN, drop all the ports for the cluster in the new temporary VLAN and then bring them up. Basically any switch on which you can remotely down the ports is going to support VLAN's. Even the crappy 16 port GbE switch I have at home supports them. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=21OH1KjxVbfDBz9Kdr0USitreLsyXEbP9rHC7Vxmhw0&e= ------------------------------ Message: 4 Date: Wed, 11 Oct 2017 08:32:37 +0000 From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: Content-Type: text/plain; charset="us-ascii" I think you really want a PMR for this. There are some files you could potentially edit and copy around, but given its cluster configuration, I wouldn't be doing this on a cluster I cared about with explicit instruction from IBM support. So I suggest log a ticket with IBM. Simon From: > on behalf of "arc at b4restore.com" > Reply-To: "gpfsug-discuss at spectrumscale.org< mailto:gpfsug-discuss at spectrumscale.org>" > Date: Wednesday, 11 October 2017 at 08:46 To: "gpfsug-discuss at spectrumscale.org< mailto:gpfsug-discuss at spectrumscale.org>" > Subject: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Hi, Does anyone know how to change the ips on all the nodes within a cluster when gpfs and interfaces are down? Right now the cluster has been shutdown and all ports disconnected(ports has been shut down on new switch) The problem is that when I try to execute any mmchnode command(as the ibm documentation states) the command fails, and that makes sense as the ip on the interface has been changed without the deamon knowing.. But is there a way to do it manually within the configuration files so that the gpfs daemon updates the ips of all nodes within the cluster or does anyone know of a hack around to do it without having network access. It is not possible to turn on the switch ports as the cluster has the same ips right now as another cluster on the new switch. Hope you understand, relatively new to gpfs/spectrum scale Venlig hilsen / Best Regards Andi R. Christiansen -------------- next part -------------- An HTML attachment was scrubbed... 
URL: < https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_pipermail_gpfsug-2Ddiscuss_attachments_20171011_cd962e6b_attachment-2D0001.html&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=Iy6NQR-GJD1Hkc0A0C96Jkesrs6h-6HpOnnw3MOQmi4&e= > ------------------------------ Message: 5 Date: Wed, 11 Oct 2017 08:46:46 +0000 From: "Simon Thompson (IT Research Support)" To: "gpfsug-discuss at spectrumscale.org" Subject: [gpfsug-discuss] Checking a file-system for errors Message-ID: Content-Type: text/plain; charset="us-ascii" I'm just wondering if anyone could share any views on checking a file-system for errors. For example, we could use mmfsck in online and offline mode. Does online mode detect errors (but not fix) things that would be found in offline mode? And then were does mmrestripefs -c fit into this? "-c Scans the file system and compares replicas of metadata and data for conflicts. When conflicts are found, the -c option attempts to fix the replicas. " Which sorta sounds like fix things in the file-system, so how does that intersect (if at all) with mmfsck? Thanks Simon ------------------------------ Message: 6 Date: Wed, 11 Oct 2017 09:53:34 +0100 From: Jonathan Buzzard To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Changing ip on spectrum scale cluster with every node down and not connected to network. Message-ID: <1507712014.9906.5.camel at strath.ac.uk> Content-Type: text/plain; charset="UTF-8" On Wed, 2017-10-11 at 08:18 +0000, Andi Rhod Christiansen wrote: > Hi Jonathan, > > Yes I thought about that but the system is located at a customer site > and they are not willing to do that, unfortunately. > > That's why I was hoping there was a way around it > I would go back to them saying it's either a temporary VLAN, we have to down the other cluster to make the change, or re-cable it to a new unconnected switch. If the customer continues to be completely unreasonable then it's their lookout. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HQmkdQWQHoc1Nu6Mg_g8NVugim3OiUUy5n0QgLQcbkM&m=3xds8LVU2TdfiaqkM91LA06caiYHJleBqSwOZ6ff81M&s=21OH1KjxVbfDBz9Kdr0USitreLsyXEbP9rHC7Vxmhw0&e= End of gpfsug-discuss Digest, Vol 69, Issue 26 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Thu Oct 12 18:41:49 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 12 Oct 2017 17:41:49 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> SC17 is only 1 month away! Here are the GPFS/Scale User group meeting details. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. 
NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Look for signs when you arrive Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From sandeep.patil at in.ibm.com Fri Oct 13 09:20:56 2017 From: sandeep.patil at in.ibm.com (Sandeep Ramesh) Date: Fri, 13 Oct 2017 13:50:56 +0530 Subject: [gpfsug-discuss] New Redpapers on Spectrum Scale/ESS GUI Published Message-ID: Dear Spectrum Scale User Group Members, New Redpapers on Spectrum Scale GUI and ESS GUI has been published yesterday. To help keep the community informed. Monitoring and Managing IBM Spectrum Scale Using the GUI http://www.redbooks.ibm.com/redpieces/abstracts/redp5458.html?Open Monitoring and Managing the IBM Elastic Storage Server Using the GUI http://www.redbooks.ibm.com/redpieces/abstracts/redp5471.html?Open thx Spectrum Scale Dev -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Fri Oct 13 10:47:39 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 13 Oct 2017 09:47:39 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> References: <4766234D-55ED-49C8-B443-BDD94EE1785A@nuance.com> Message-ID: I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: 12 October 2017 18:42 To: gpfsug main discussion list Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details SC17 is only 1 month away! Here are the GPFS/Scale User group meeting details. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 
6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Look for signs when you arrive Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 13 13:12:59 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 12:12:59 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: >I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? I believe (hope?) that's just a placeholder, and we'll actually use the time for something more engaging... Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 09:47:39 +0000 From: "Sobey, Richard A" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="utf-8" I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? From r.sobey at imperial.ac.uk Fri Oct 13 13:45:43 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Fri, 13 Oct 2017 12:45:43 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: Actually, I was being 100% serious :) Although it's a boring topic, it's nonetheless fairly crucial and I'd like to see more about it. I won't be at SC17 unless you're livestreaming it anyway. Richard -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie Sent: 13 October 2017 13:13 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details >I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? I believe (hope?) that's just a placeholder, and we'll actually use the time for something more engaging... 
Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 09:47:39 +0000 From: "Sobey, Richard A" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="utf-8" I *need* to see the presentation from the licensing session ? everyone?s favourite topic ? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From john.hearns at asml.com Fri Oct 13 13:56:18 2017 From: john.hearns at asml.com (John Hearns) Date: Fri, 13 Oct 2017 12:56:18 +0000 Subject: [gpfsug-discuss] How to simulate an NSD failure? Message-ID: I have set up a small testbed, consisting of three nodes. Two of the nodes have a disk which is being used as an NSD. This is being done for some preparation for fun and games with some whizzy new servers. The testbed has spinning drives. I have created two NSDs and have set the data replication to 1 (this is deliberate). I am trying to fail an NSD and find which files have parts on the failed NSD. A first test with 'mmdeldisk' didn't have much effect as SpectrumScale is smart enough to copy the data off the drive. I now take the drive offline and delete it by echo offline > /sys/block/sda/device/state echo 1 > /sys/block/sda/delete Short of going to the data centre and physically pulling the drive that's a pretty final way of stopping access to a drive. I then wrote 100 files to the filesystem, the node with the NSD did log "rejecting I/O to offline device" However mmlsdisk says that this disk is status 'ready' I am going to stop that NSD and run an mmdeldisk - at which point I do expect things to go south rapidly. I just am not understanding at what point a failed write would be detected? Or once a write fails are all the subsequent writes Routed off to the active NSD(s) ?? Sorry if I am asking an idiot question. Inspector.clouseau at surete.fr -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From S.J.Thompson at bham.ac.uk Fri Oct 13 14:38:26 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Fri, 13 Oct 2017 13:38:26 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: Message-ID: The slides from the Manchester meeting are at: http://files.gpfsug.org/presentations/2017/Manchester/09_licensing-update.p df We moved all of our socket licenses to per TB DME earlier this year, and then also have DME per drive for our Lenovo DSS-G system, which for various reasons is in a different cluster There are certainly people in IBM UK who understand this process if that was something you wanted to look at. Simon On 13/10/2017, 13:45, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Sobey, Richard A" wrote: >Actually, I was being 100% serious :) Although it's a boring topic, it's >nonetheless fairly crucial and I'd like to see more about it. I won't be >at SC17 unless you're livestreaming it anyway. > >Richard > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie >Sent: 13 October 2017 13:13 >To: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration >and program details > >>I *need* to see the presentation from the licensing session ? everyone?s >>favourite topic ? > > >I believe (hope?) that's just a placeholder, and we'll actually use the >time for something more engaging... > > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > >Message: 3 >Date: Fri, 13 Oct 2017 09:47:39 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > utlook.com> > >Content-Type: text/plain; charset="utf-8" > >I *need* to see the presentation from the licensing session ? everyone?s >favourite topic ? > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From heiner.billich at psi.ch Fri Oct 13 15:15:53 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Fri, 13 Oct 2017 14:15:53 +0000 Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home Message-ID: <94041E4C-3978-4D39-86EA-79629FC17AB8@psi.ch> Hello, Running an AFM IW cache we noticed that AFM starts the flushing of data from cache to home rather slow, say at 20MB/s, and only slowly increases to several 100MB/s after a few minutes. As soon as the pending queue gets no longer filled the data rate drops, again. I assume that this is a good behavior for WAN traffic where you don?t want to use the full bandwidth from the beginning but only if really needed. For our local setup with dedicated links I would prefer a much more aggressive behavior to get data transferred asap to home. Am I right, does AFM implement such a ?slow startup?, and is there a way to change this behavior? We did increase afmNumFlushThreads to 128. Currently we measure with many small files (1MB). 
For large files the behavior is different, we get a stable data rate from the beginning, but I did not yet try with a continuous write on the cache to see whether I see an increase after a while, too. Thank you, Heiner Billich -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From carlz at us.ibm.com Fri Oct 13 15:46:47 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 14:46:47 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 -Registrationandprogram details In-Reply-To: References: Message-ID: Hi Richard, I'm always happy to have a separate conversation if you have any questions about licensing. Ping me on my email address below. Same goes for anybody else who won't be at SC17. Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com >------------------------------ > >Message: 2 >Date: Fri, 13 Oct 2017 12:45:43 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > rod.outlook.com> > >Content-Type: text/plain; charset="us-ascii" > >Actually, I was being 100% serious :) Although it's a boring topic, >it's nonetheless fairly crucial and I'd like to see more about it. I >won't be at SC17 unless you're livestreaming it anyway. > >Richard > >won't be >>at SC17 unless you're livestreaming it anyway. >> >>Richard >> From sfadden at us.ibm.com Fri Oct 13 16:56:56 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Fri, 13 Oct 2017 15:56:56 +0000 Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Fri Oct 13 17:32:35 2017 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Fri, 13 Oct 2017 16:32:35 +0000 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From alex at calicolabs.com Fri Oct 13 17:53:40 2017 From: alex at calicolabs.com (Alex Chekholko) Date: Fri, 13 Oct 2017 09:53:40 -0700 Subject: [gpfsug-discuss] How to simulate an NSD failure? In-Reply-To: References: Message-ID: John, I think a "philosophical" difference between GPFS code and newer filesystems which were written later, in the age of "commodity hardware", is that GPFS expects the underlying hardware to be very reliable. So "disks" are typically RAID arrays available via multiple paths. And network links should have no errors, and be highly reliable, etc. GPFS does not detect these things well as it does not expect them to fail. That's why you see some discussions around "improving network diagnostics" and "improving troubleshooting tools" and things like that. Having a failed NSD is highly unusual for a GPFS system and you should design your system so that situation does not happen. In your example here, if data is striped across two NSDs and one of them becomes inaccessible, when a client tries to write, it should get an I/O error, and perhaps even unmount the filesystem (depending on where you metadata lives). Regards, Alex On Fri, Oct 13, 2017 at 5:56 AM, John Hearns wrote: > I have set up a small testbed, consisting of three nodes. Two of the nodes > have a disk which is being used as an NSD. > > This is being done for some preparation for fun and games with some whizzy > new servers. 
The testbed has spinning drives. > > I have created two NSDs and have set the data replication to 1 (this is > deliberate). > > I am trying to fail an NSD and find which files have parts on the failed > NSD. > > A first test with ?mmdeldisk? didn?t have much effect as SpectrumScale is > smart enough to copy the data off the drive. > > > > I now take the drive offline and delete it by > > echo offline > /sys/block/sda/device/state > > echo 1 > /sys/block/sda/delete > > > > Short of going to the data centre and physically pulling the drive that?s > a pretty final way of stopping access to a drive. > > I then wrote 100 files to the filesystem, the node with the NSD did log > ?rejecting I/O to offline device? > > However mmlsdisk says that this disk is status ?ready? > > > > I am going to stop that NSD and run an mmdeldisk ? at which point I do > expect things to go south rapidly. > > I just am not understanding at what point a failed write would be > detected? Or once a write fails are all the subsequent writes > > Routed off to the active NSD(s) ?? > > > > Sorry if I am asking an idiot question. > > > > Inspector.clouseau at surete.fr > > > > > > > > > > > > > > > > > > > > > > > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mhabib73 at gmail.com Fri Oct 13 18:48:57 2017 From: mhabib73 at gmail.com (Muhammad Habib) Date: Fri, 13 Oct 2017 13:48:57 -0400 Subject: [gpfsug-discuss] How to simulate an NSD failure? In-Reply-To: References: Message-ID: If your devices/disks are multipath , make sure you remove all paths in order for disk to go offline. Also following line does not see correct: echo 1 > /sys/block/sda/delete , it should rather be echo 1 > /sys/block/sda/device/delete Further after you removed the disks , did you run the fdisk -l , to make sure its completely gone , also if the /var/log/messages confirms the disk is offline. Once all this confirmed then GPFS should take disks down and logs should tell you as well. Thanks M.Habib On Fri, Oct 13, 2017 at 8:56 AM, John Hearns wrote: > I have set up a small testbed, consisting of three nodes. Two of the nodes > have a disk which is being used as an NSD. > > This is being done for some preparation for fun and games with some whizzy > new servers. The testbed has spinning drives. > > I have created two NSDs and have set the data replication to 1 (this is > deliberate). 
> > I am trying to fail an NSD and find which files have parts on the failed > NSD. > > A first test with ?mmdeldisk? didn?t have much effect as SpectrumScale is > smart enough to copy the data off the drive. > > > > I now take the drive offline and delete it by > > echo offline > /sys/block/sda/device/state > > echo 1 > /sys/block/sda/delete > > > > Short of going to the data centre and physically pulling the drive that?s > a pretty final way of stopping access to a drive. > > I then wrote 100 files to the filesystem, the node with the NSD did log > ?rejecting I/O to offline device? > > However mmlsdisk says that this disk is status ?ready? > > > > I am going to stop that NSD and run an mmdeldisk ? at which point I do > expect things to go south rapidly. > > I just am not understanding at what point a failed write would be > detected? Or once a write fails are all the subsequent writes > > Routed off to the active NSD(s) ?? > > > > Sorry if I am asking an idiot question. > > > > Inspector.clouseau at surete.fr > > > > > > > > > > > > > > > > > > > > > > > -- The information contained in this communication and any attachments is > confidential and may be privileged, and is for the sole use of the intended > recipient(s). Any unauthorized review, use, disclosure or distribution is > prohibited. Unless explicitly stated otherwise in the body of this > communication or the attachment thereto (if any), the information is > provided on an AS-IS basis without any express or implied warranties or > liabilities. To the extent you are relying on this information, you are > doing so at your own risk. If you are not the intended recipient, please > notify the sender immediately by replying to this message and destroy all > copies of this message and any attachments. Neither the sender nor the > company/group of companies he or she represents shall be liable for the > proper and complete transmission of the information contained in this > communication, or for any delay in its receipt. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- This communication contains confidential information intended only for the persons to whom it is addressed. Any other distribution, copying or disclosure is strictly prohibited. If you have received this communication in error, please notify the sender and delete this e-mail message immediately. Le pr?sent message contient des renseignements de nature confidentielle r?serv?s uniquement ? l'usage du destinataire. Toute diffusion, distribution, divulgation, utilisation ou reproduction de la pr?sente communication, et de tout fichier qui y est joint, est strictement interdite. Si vous avez re?u le pr?sent message ?lectronique par erreur, veuillez informer imm?diatement l'exp?diteur et supprimer le message de votre ordinateur et de votre serveur. -------------- next part -------------- An HTML attachment was scrubbed... URL: From gcorneau at us.ibm.com Fri Oct 13 19:50:05 2017 From: gcorneau at us.ibm.com (Glen Corneau) Date: Fri, 13 Oct 2017 13:50:05 -0500 Subject: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details In-Reply-To: References: , Message-ID: The original announcement letter for Spectrum Scale Data Management Edition (US version reference below) re-defines a terabyte (nice eh?, should be a tebibyte). 
My math agrees with yours, assuming your file system size is actually 1PB versus 1PiB Terabyte Terabyte is a unit of measure by which the Program can be licensed. A terabyte is 2 to the 40th power bytes. Licensee must obtain an entitlement for each terabyte available to the Program. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=an&subtype=ca&appname=gpateam&supplier=897&letternum=ENUS216-158 ------------------ Glen Corneau Power Systems Washington Systems Center gcorneau at us.ibm.com From: "Daniel Kidger" To: gpfsug-discuss at spectrumscale.org Date: 10/13/2017 11:32 AM Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Sent by: gpfsug-discuss-bounces at spectrumscale.org All, For me the URL looks to have got mangled. Should be: http://files.gpfsug.org/presentations/2017/Manchester/09_licensing-update.pdf with the index page that points to it here: http://www.spectrumscale.org/presentations/ Also a personal note, I have found increasing confusion of decimal v. binary units for storage capacity. I understand that Spectrum Scale uses binary TiB, but say an 1TB drive is in decimal so c. 10% difference. So a 1 Petabyte filesystem needs only 909 TiB of Spectrum Scale licenses. Any comments from others? Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: "Simon Thompson (IT Research Support)" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Date: Fri, Oct 13, 2017 2:38 PM The slides from the Manchester meeting are at: https://urldefense.proofpoint.com/v2/url?u=http-3A__files.gpfsug.org_presentations_2017_Manchester_09-5Flicensing-2Dupdate.p&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=f6qsuSorl92LShV92TTaXNyG3KU0VvuFN4YhT_LTTFc&e= df We moved all of our socket licenses to per TB DME earlier this year, and then also have DME per drive for our Lenovo DSS-G system, which for various reasons is in a different cluster There are certainly people in IBM UK who understand this process if that was something you wanted to look at. Simon On 13/10/2017, 13:45, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Sobey, Richard A" wrote: >Actually, I was being 100% serious :) Although it's a boring topic, it's >nonetheless fairly crucial and I'd like to see more about it. I won't be >at SC17 unless you're livestreaming it anyway. > >Richard > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Carl Zetie >Sent: 13 October 2017 13:13 >To: gpfsug-discuss at spectrumscale.org >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration >and program details > >>I *need* to see the presentation from the licensing session ? everyone?s >>favourite topic ? > > >I believe (hope?) that's just a placeholder, and we'll actually use the >time for something more engaging... 
> > > > Carl Zetie > Offering Manager for Spectrum Scale, IBM > > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > >Message: 3 >Date: Fri, 13 Oct 2017 09:47:39 +0000 >From: "Sobey, Richard A" >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - > Registration and program details >Message-ID: > utlook.com> > >Content-Type: text/plain; charset="utf-8" > >I *need* to see the presentation from the licensing session ? everyone?s >favourite topic ? > > > > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=HlQDuUjgJx4p54QzcXd0_zTwf4Cr2t3NINalNhLTA2E&m=KYrJfwSF_szhZLoeymYYs56zMYadeNvFnz8Mybi9cz8&s=qLRlo57JZyTw3gzwHJuEFhGyYqjmQGS6fc3h9lfWT_0&e= Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=d-vphLEe_UlGazP6RdYAyyAA3Qv5S9IRVNuO1i9vjJc&m=rOPfwzvHMD3_MRZy2WHgOGtmYQya-jWx5d_s92EeJRk&s=LkQ4lwnC-ATFnHjydppCXDasUDijS9DUh0p-cFaM0NM&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 13 20:10:56 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 13 Oct 2017 19:10:56 +0000 Subject: [gpfsug-discuss] Scale per TB (was: User group Meeting at SC17 - Registration and program details) In-Reply-To: References: Message-ID: Yeah, I know... It's actually an IBM thing, not just a Scale thing. Some time in the distant past, IBM decided that too few people were familiar with the term "tebibyte" or its official abbreviation "TiB", so in the IBM licensing catalog there is the "Terabyte" (really a tebibyte) and the "Decimal Terabyte" (an actual terabyte). When we made the capacity license we had to decide which one to use, and we decided to err on the side of giving people the larger amount. 
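As a quick worked example of the binary-versus-decimal conversion being discussed in this thread (a sketch only, using the 2 to the 40th power definition above; the 1 PB figure is illustrative):

  # Decimal petabyte expressed in the 2^40-byte "Terabyte" (TiB) unit used for licensing
  echo 'scale=2; 10^15 / 2^40' | bc
  # prints 909.49, i.e. roughly 909 TiB of capacity entitlements cover a 1 PB (decimal) file system

This is the same arithmetic behind the 909 TiB per petabyte figure quoted elsewhere in this thread.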
Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com Message: 3 Date: Fri, 13 Oct 2017 13:50:05 -0500 From: "Glen Corneau" To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Message-ID: Content-Type: text/plain; charset="us-ascii" The original announcement letter for Spectrum Scale Data Management Edition (US version reference below) re-defines a terabyte (nice eh?, should be a tebibyte). My math agrees with yours, assuming your file system size is actually 1PB versus 1PiB Terabyte Terabyte is a unit of measure by which the Program can be licensed. A terabyte is 2 to the 40th power bytes. Licensee must obtain an entitlement for each terabyte available to the Program. https://www-01.ibm.com/common/ssi/cgi-bin/ssialias?infotype=an&subtype=ca&appname=gpateam&supplier=897&letternum=ENUS216-158 ------------------ Glen Corneau Power Systems Washington Systems Center gcorneau at us.ibm.com From: "Daniel Kidger" To: gpfsug-discuss at spectrumscale.org Date: 10/13/2017 11:32 AM Subject: Re: [gpfsug-discuss] User group Meeting at SC17 - Registration and program details Sent by: gpfsug-discuss-bounces at spectrumscale.org All, For me the URL looks to have got mangled. Should be: https://urldefense.proofpoint.com/v2/url?u=http-3A__files.gpfsug.org_presentations_2017_Manchester_09-5Flicensing-2Dupdate.pdf&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=obB2s7QQTgU9QMn1708Vpg&m=6slj4_SM9ZLHQtguBkVK7Xg2UN1RlDFCjKJoGhWn2wU&s=NU2Hs398IPSytPh8bYplXjFChhaF9G21Pt4YoHvbrPY&e= with the index page that points to it here: https://urldefense.proofpoint.com/v2/url?u=http-3A__www.spectrumscale.org_presentations_&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=obB2s7QQTgU9QMn1708Vpg&m=6slj4_SM9ZLHQtguBkVK7Xg2UN1RlDFCjKJoGhWn2wU&s=CLN7JkpjQsfPdvOapYPGX3o7gHZj8AOh7tYSusTZJPE&e= Also a personal note, I have found increasing confusion of decimal v. binary units for storage capacity. I understand that Spectrum Scale uses binary TiB, but say an 1TB drive is in decimal so c. 10% difference. So a 1 Petabyte filesystem needs only 909 TiB of Spectrum Scale licenses. Any comments from others? Daniel Dr Daniel Kidger IBM Technical Sales Specialist Software Defined Solution Sales +44-(0)7818 522 266 daniel.kidger at uk.ibm.com From a.khiredine at meteo.dz Sun Oct 15 13:44:42 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Sun, 15 Oct 2017 12:44:42 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Dear All, Is there a way to save the GPS configuration? 
OR how backup all GSS no backup of data or metadata only configuration for disaster recovery for example: stanza vdisk pdisk RAID code recovery group array Thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz From skylar2 at u.washington.edu Mon Oct 16 14:29:33 2017 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Mon, 16 Oct 2017 13:29:33 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server In-Reply-To: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> References: <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Message-ID: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu> I'm not familiar with GSS, but we have a script that executes the following before backing up a GPFS filesystem so that we have human-readable configuration information: mmlsconfig mmlsnsd mmlscluster mmlsnode mmlsdisk ${FS_NAME} -L mmlsfileset ${FS_NAME} -L mmlspool ${FS_NAME} all -L mmlslicense -L mmlspolicy ${FS_NAME} -L And then executes this for the benefit of GPFS: mmbackupconfig Of course there's quite a bit of overlap for clusters that have more than one filesystem, and even more for filesystems that we backup at the fileset level, but disk is cheap and the hope is it'll make a DR scenario a little bit less harrowing. On Sun, Oct 15, 2017 at 12:44:42PM +0000, atmane khiredine wrote: > Dear All, > > Is there a way to save the GPS configuration? > > OR how backup all GSS > > no backup of data or metadata only configuration for disaster recovery > > for example: > stanza > vdisk > pdisk > RAID code > recovery group > array > > Thank you > > Atmane Khiredine > HPC System Administrator | Office National de la M?t?orologie > T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From heiner.billich at psi.ch Mon Oct 16 14:36:09 2017 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Mon, 16 Oct 2017 13:36:09 +0000 Subject: [gpfsug-discuss] slow startup of AFM flush to home Message-ID: Hello Scott, Thank you. I did set afmFlushThreadDelay = 1 and did get a much faster startup. Setting to 0 didn?t improve further. I?m not sure how much we?ll need this in production when most of the time the queue is full. But for benchmarking during setup it?s helps a lot. (we run 4.2.3-4 on RHEL7) Kind regards, Heiner Scott Fadden did write: When an AFM gateway is flushing data to the target (home) it starts flushing with a few threads (Don't remember the number) and ramps up to afmNumFlushThreads. How quickly this ramp up occurs is controlled by afmFlushThreadDealy. The default is 5 seconds. So flushing only adds threads once every 5 seconds. This was an experimental parameter so your milage may vary. 
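For reference, a minimal sketch of how the two AFM parameters named in this exchange might be inspected and adjusted. The parameter names come from the discussion above, but the command scope is an assumption on my part (per-fileset via mmchfileset -p versus cluster-wide via mmchconfig), so verify it against the documentation for your release, particularly since afmFlushThreadDelay is described here as experimental:

  # Assumed names: file system "fs0", cache fileset "cache01" - both are placeholders
  mmlsfileset fs0 cache01 --afm -L                    # show current AFM attributes of the cache fileset
  mmchfileset fs0 cache01 -p afmNumFlushThreads=128   # more parallel flush threads to home
  mmchconfig afmFlushThreadDelay=1 -i                 # shorten the ramp-up delay described above
                                                      # (assumed to be a cluster-wide setting on this level)

Whether the fileset must be unlinked before changing AFM attributes varies by release, so treat this purely as a starting point.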
Scott Fadden Spectrum Scale - Technical Marketing Phone: (503) 880-5833 sfadden at us.ibm.com http://www.ibm.com/systems/storage/spectrum/scale ----- Original message ----- From: "Billich Heinrich Rainer (PSI)" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [gpfsug-discuss] AFM: Slow startup of flush from cache to home Date: Fri, Oct 13, 2017 10:16 AM Hello, Running an AFM IW cache we noticed that AFM starts the flushing of data from cache to home rather slow, say at 20MB/s, and only slowly increases to several 100MB/s after a few minutes. As soon as the pending queue gets no longer filled the data rate drops, again. I assume that this is a good behavior for WAN traffic where you don???t want to use the full bandwidth from the beginning but only if really needed. For our local setup with dedicated links I would prefer a much more aggressive behavior to get data transferred asap to home. Am I right, does AFM implement such a ???slow startup???, and is there a way to change this behavior? We did increase afmNumFlushThreads to 128. Currently we measure with many small files (1MB). For large files the behavior is different, we get a stable data rate from the beginning, but I did not yet try with a continuous write on the cache to see whether I see an increase after a while, too. Thank you, Heiner Billich -- Paul Scherrer Institut Science IT Heiner Billich WHGA 106 CH 5232 Villigen PSI 056 310 36 02 https://www.psi.ch From sfadden at us.ibm.com Mon Oct 16 16:34:33 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Mon, 16 Oct 2017 15:34:33 +0000 Subject: [gpfsug-discuss] Backup All Cluster GSS GPFS Storage Server In-Reply-To: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu> References: <20171016132932.g5j7vep2frxnsvpf@utumno.gs.washington.edu>, <4B32CB5C696F2849BDEF7DF9EACE884B633F4ACF@SDEB-EXC01.meteo.dz> Message-ID: An HTML attachment was scrubbed... URL: From er.a.ross at gmail.com Fri Oct 20 03:15:38 2017 From: er.a.ross at gmail.com (Eric Ross) Date: Thu, 19 Oct 2017 21:15:38 -0500 Subject: [gpfsug-discuss] file auditing capabilities Message-ID: I'm researching the file auditing capabilities possible with GPFS; I found this paper on the GPFS wiki: https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/f0cc9b82-a133-41b4-83fe-3f560e95b35a/attachment/0ab62645-e0ab-4377-81e7-abd11879bb75/media/Spectrum_Scale_Varonis_Audit_Logging.pdf I haven't found anything else on the subject, however. While I like the idea of being able to do this logging on the protocol node level, I'm also interested in the possibility of auditing files from native GPFS mounts. Additional digging uncovered references to Lightweight Events (LWE): http://files.gpfsug.org/presentations/2016/SC16/04_Scott_Fadden_Spectrum_Scale_Update.pdf Specifically, this references being able to use the policy engine to detect things like file opens, reads, and writes. Searching through the official GPFS documentation, I see references to these events in the transparent cloud tiering section: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_define_cloud_storage_tier.htm but, I don't see, or possibly have missed, the other section(s) defining what other EVENT parameters I can use. I'm curious to know more about these events, could anyone point me in the right direction? 
I'm wondering if I could use them to perform rudimentary auditing of the file system (e.g. a default policy in place to log a message of say user foo either wrote to and/or read from file bar). Thanks, -Eric From richardb+gpfsUG at ellexus.com Fri Oct 20 15:47:57 2017 From: richardb+gpfsUG at ellexus.com (Richard Booth) Date: Fri, 20 Oct 2017 15:47:57 +0100 Subject: [gpfsug-discuss] file auditing capabilities Message-ID: Hi Eric The company I work for could possibly help with this, Ellexus . Please feel free to get in touch if you need some help with this. Cheers Richard ---------------------------------------------------------------------- >> >> Message: 1 >> Date: Thu, 19 Oct 2017 21:15:38 -0500 >> From: Eric Ross >> To: gpfsug-discuss at spectrumscale.org >> Subject: [gpfsug-discuss] file auditing capabilities >> Message-ID: >> > ail.com> >> Content-Type: text/plain; charset="UTF-8" >> >> I'm researching the file auditing capabilities possible with GPFS; I >> found this paper on the GPFS wiki: >> >> https://www.ibm.com/developerworks/community/wikis/form/anon >> ymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ >> f0cc9b82-a133-41b4-83fe-3f560e95b35a/attachment/0ab62645- >> e0ab-4377-81e7-abd11879bb75/media/Spectrum_Scale_Varonis_ >> Audit_Logging.pdf >> >> I haven't found anything else on the subject, however. >> >> While I like the idea of being able to do this logging on the protocol >> node level, I'm also interested in the possibility of auditing files >> from native GPFS mounts. >> >> Additional digging uncovered references to Lightweight Events (LWE): >> >> http://files.gpfsug.org/presentations/2016/SC16/04_Scott_Fad >> den_Spectrum_Scale_Update.pdf >> >> Specifically, this references being able to use the policy engine to >> detect things like file opens, reads, and writes. >> >> Searching through the official GPFS documentation, I see references to >> these events in the transparent cloud tiering section: >> >> https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/ >> com.ibm.spectrum.scale.v4r22.doc/bl1adm_define_cloud_storage_tier.htm >> >> but, I don't see, or possibly have missed, the other section(s) >> defining what other EVENT parameters I can use. >> >> I'm curious to know more about these events, could anyone point me in >> the right direction? >> >> I'm wondering if I could use them to perform rudimentary auditing of >> the file system (e.g. a default policy in place to log a message of >> say user foo either wrote to and/or read from file bar). >> >> Thanks, >> -Eric >> >> >> ------------------------------ >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> End of gpfsug-discuss Digest, Vol 69, Issue 38 >> ********************************************** >> > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Fri Oct 20 20:54:38 2017 From: carlz at us.ibm.com (Carl Zetie) Date: Fri, 20 Oct 2017 19:54:38 +0000 Subject: [gpfsug-discuss] file auditing capabilities (Eric Ross) Message-ID: Disclaimer: all statements about future functionality are subject to change, and represent intentions only. That being said: Yes, we are working on File Audit Logging native to Spectrum Scale. 
The intention is to provide auditing capabilities in a protocol agnostic manner that will capture not only audit events that come through protocols but also GPFS/Scale native file system access events. The audit logs are written to a specified GPFS/Scale fileset in a format that is both human=-readable and easily parsable for automated consumption, reporting, or whatever else you might want to do with it. Currently, we intend to release this capability with Scale 5.0. The underlying technology for this is indeed LWE, which as some of you know is also underneath some other Scale features. The use of LWE allows us to do auditing very efficiently to minimize performance impact while also allowing scalability. We do not at this time have plans to expose LWE directly for end-user consumption -- it needs to be "packaged" in a more consumable way in order to be generally supportable. However, we do have intentions to expose other functionality on top of the LWE capability in the future. Carl Zetie Offering Manager for Spectrum Scale, IBM (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com From Stephan.Peinkofer at lrz.de Mon Oct 23 11:41:23 2017 From: Stephan.Peinkofer at lrz.de (Peinkofer, Stephan) Date: Mon, 23 Oct 2017 10:41:23 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management Message-ID: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Oct 23 12:00:50 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 23 Oct 2017 11:00:50 +0000 Subject: [gpfsug-discuss] el7.4 compatibility In-Reply-To: References: Message-ID: Just picking up this old thread, but... October updates: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html# linux 7.4 is now listed as supported with min scale version of 4.1.1.17 or 4.2.3.4 (incidentally 4.2.3.5 looks to have been released today). Simon On 27/09/2017, 09:16, "gpfsug-discuss-bounces at spectrumscale.org on behalf of kenneth.waegeman at ugent.be" wrote: >Hi, > >Is there already some information available of gpfs (and protocols) on >el7.4 ? > >Thanks! 
> >Kenneth > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Mon Oct 23 12:09:17 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 23 Oct 2017 13:09:17 +0200 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> References: <4E7DC602-508E-45F3-A4F3-3839B69DB4BA@lrz.de> Message-ID: You can lower LEASE_LIFETIME and GRACE_PERIOD to shorten the time it's in grace, to make it more bearable. Making export changes dynamic is something that's fixed in newer versions of nfs-ganesha than what's shipped with Scale: https://github.com/nfs-ganesha/nfs-ganesha/releases/tag/V2.4.0: "dynamic EXPORT configuration update (via dBus and SIGHUP)" Hopefully someone can comment on when we'll see nfs-ganesha v2.4+ included with Scale. -jf On Mon, Oct 23, 2017 at 12:41 PM, Peinkofer, Stephan < Stephan.Peinkofer at lrz.de> wrote: > Dear List, > > I?m currently working on a self service portal for managing NFS exports of > ISS. Basically something very similar to OpenStack Manila but tailored to > our specific needs. > While it was very easy to do this using the great REST API of ISS, I > stumbled across a fact that may be even a show stopper: According to the > documentation for mmnfs, each time we > create/change/delete a NFS export via mmnfs, ganesha service is restarted > on all nodes. > > I assume that this behaviour may cause problems (at least IO stalls) on > clients mounted the filesystem. So my question is, what is your experience > with CES NFS export management. > Do you see any problems when you add/change/delete exports and ganesha > gets restarted? > > Are there any (supported) workarounds for this problem? > > PS: As I think in 2017 CES Exports should be manageable without service > disruptions (and ganesha provides facilities to do so), I filed an RFE for > this: https://www.ibm.com/developerworks/rfe/execute? > use_case=viewRfe&CR_ID=111918 > > Many thanks in advance. > Best Regards, > Stephan Peinkofer > -- > Stephan Peinkofer > Dipl. Inf. (FH), M. Sc. (TUM) > > Leibniz Supercomputing Centre > Data and Storage Division > Boltzmannstra?e 1, 85748 Garching b. M?nchen > Tel: +49(0)89 35831-8715 <+49%2089%20358318715> Fax: +49(0)89 > 35831-9700 <+49%2089%20358319700> > URL: http://www.lrz.de > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Mon Oct 23 12:56:07 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Mon, 23 Oct 2017 17:26:07 +0530 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: Hi Stephan, I observed ganesha service getting restarted only after adding first nfs export. For rest of the operations (e.g. adding more nfs exports, changing nfs exports, removing nfs exports); ganesha service doesn't restart. My observations are based on following simple tests. I ran them against rhel7.3 test cluster having nfs-ganesha-2.5.2. tests: 1. created 1st nfs export - ganesha service was restarted 2. created 4 more nfs exports (mmnfs export add path) 3. changed 2 nfs exports (mmnfs export change path --nfschange); 4. removed all 5 exports one by one (mmnfs export remove path) 5. 
no nfs exports after step 4 on my test system. So, created a new nfs export (which will be the 1st nfs export). 6. change nfs export created in step 5 results observed: ganesha service restarted for test 1 and test 5. For rest tests (2,3,4,6); ganesha service didn't restart. Thanks, Chetan. From: "Peinkofer, Stephan" To: "gpfsug-discuss at spectrumscale.org" Date: 10/23/2017 04:11 PM Subject: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=uic-29lyJ5TCiTRi0FyznYhKJx5I7Vzu80WyYuZ4_iM&m=ghcZYswqgF3beYOogGGLsT1RyDRZrbLXdzp3Fbjmfrg&s=TUm7BM3sY75Nc20gOfhz9lvDgYJse0TM6-tIW8I1QiI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Mon Oct 23 13:16:17 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 23 Oct 2017 12:16:17 +0000 Subject: [gpfsug-discuss] Reminder: User group Meeting at SC17 - Registration and program details Message-ID: Reminder: Register for the SC17 User Group meeting if you are heading to SC17. Thanks to IBM for finding the space, providing the refreshments, and help organizing the speakers. NOTE: You MUST register to attend the user group meeting here: https://www.ibm.com/it-infrastructure/us-en/supercomputing-2017/ Date: Sunday November 12th, 2017 Time: 12:30PM ? 
6PM Location: Hyatt Regency Denver at Colorado Convention Center Room Location: Centennial E Ballroom followed by reception in Centennial D Ballroom at 5:30pm Agenda Start End Duration Title 12:30 13:00 30 Welcome, Kristy Kallback-Rose, Bob Oesterlin (User Group), Doris Conti (IBM) 13:00 13:30 30 Spectrum Scale Update (including blueprints and AWS), Scott Fadden 13:30 13:40 10 ESS Update, Puneet Chaudhary 13:40 13:50 10 Licensing 13:50 14:05 15 Customer talk - Children's Mercy Hospital 14:05 14:20 15 Customer talk - Max Dellbr?ck Center, Alf Wachsmann 14:20 14:35 15 Customer talk - University of Pennsylvania Medical 14:35 15:00 25 Live Debug Session, Tomer Perry 15:00 15:30 30 Break 15:30 15:50 20 Customer talk - DESY / European XFEL, Martin Gasthuber 15:50 16:05 15 Customer talk - University of Birmingham, Simon Thompson 16:05 16:25 20 Fast Restripe FS, Hai Zhong Zhou 16:25 16:45 20 TCT - 1 Bio Files Live Demo, Rob Basham 16:45 17:05 20 High Performance Data Analysis (incl customer), Piyush Chaudhary 17:05 17:30 25 Performance Enhancements for CORAL, Sven Oehme 17:30 18:00 30 Ask the developers Note: Refreshments will be served during this meeting Refreshments will be served at 5.30pm Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Stephan.Peinkofer at lrz.de Mon Oct 23 13:20:47 2017 From: Stephan.Peinkofer at lrz.de (Peinkofer, Stephan) Date: Mon, 23 Oct 2017 12:20:47 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: <5BBED5D7-5E06-453F-B839-BC199EC74720@lrz.de> Dear Chetan, interesting. I?m running ISS 4.2.3-4 and it seems to ship with nfs-ganesha-2.3.2. So are you already using a future ISS version? Here is what I see: [root at datdsst102 pr74cu-dss-0002]# mmnfs export list Path Delegations Clients ---------------------------------------------------------- /dss/dsstestfs01/pr74cu-dss-0002 NONE 10.156.29.73 /dss/dsstestfs01/pr74cu-dss-0002 NONE 10.156.29.72 [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfschange "10.156.29.72(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfschange "10.156.29.72(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. [root at datdsst102 pr74cu-dss-0002]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfsadd "10.156.29.74(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys,manage_gids=true)" datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. 
[root at datdsst102 ~]# mmnfs export change /dss/dsstestfs01/pr74cu-dss-0002 --nfsremove 10.156.29.74 datdsst102.dss.lrz.de: Redirecting to /bin/systemctl stop nfs-ganesha.service datdsst102.dss.lrz.de: Redirecting to /bin/systemctl start nfs-ganesha.service NFS Configuration successfully changed. NFS server restarted on all NFS nodes on which NFS server is running. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de On 23. Oct 2017, at 13:56, Chetan R Kulkarni > wrote: Hi Stephan, I observed ganesha service getting restarted only after adding first nfs export. For rest of the operations (e.g. adding more nfs exports, changing nfs exports, removing nfs exports); ganesha service doesn't restart. My observations are based on following simple tests. I ran them against rhel7.3 test cluster having nfs-ganesha-2.5.2. tests: 1. created 1st nfs export - ganesha service was restarted 2. created 4 more nfs exports (mmnfs export add path) 3. changed 2 nfs exports (mmnfs export change path --nfschange); 4. removed all 5 exports one by one (mmnfs export remove path) 5. no nfs exports after step 4 on my test system. So, created a new nfs export (which will be the 1st nfs export). 6. change nfs export created in step 5 results observed: ganesha service restarted for test 1 and test 5. For rest tests (2,3,4,6); ganesha service didn't restart. Thanks, Chetan. "Peinkofer, Stephan" ---10/23/2017 04:11:33 PM---Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically From: "Peinkofer, Stephan" > To: "gpfsug-discuss at spectrumscale.org" > Date: 10/23/2017 04:11 PM Subject: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Dear List, I?m currently working on a self service portal for managing NFS exports of ISS. Basically something very similar to OpenStack Manila but tailored to our specific needs. While it was very easy to do this using the great REST API of ISS, I stumbled across a fact that may be even a show stopper: According to the documentation for mmnfs, each time we create/change/delete a NFS export via mmnfs, ganesha service is restarted on all nodes. I assume that this behaviour may cause problems (at least IO stalls) on clients mounted the filesystem. So my question is, what is your experience with CES NFS export management. Do you see any problems when you add/change/delete exports and ganesha gets restarted? Are there any (supported) workarounds for this problem? PS: As I think in 2017 CES Exports should be manageable without service disruptions (and ganesha provides facilities to do so), I filed an RFE for this: https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=111918 Many thanks in advance. Best Regards, Stephan Peinkofer -- Stephan Peinkofer Dipl. Inf. (FH), M. Sc. (TUM) Leibniz Supercomputing Centre Data and Storage Division Boltzmannstra?e 1, 85748 Garching b. 
M?nchen Tel: +49(0)89 35831-8715 Fax: +49(0)89 35831-9700 URL: http://www.lrz.de _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=uic-29lyJ5TCiTRi0FyznYhKJx5I7Vzu80WyYuZ4_iM&m=ghcZYswqgF3beYOogGGLsT1RyDRZrbLXdzp3Fbjmfrg&s=TUm7BM3sY75Nc20gOfhz9lvDgYJse0TM6-tIW8I1QiI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Oct 23 14:42:51 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 23 Oct 2017 13:42:51 +0000 Subject: [gpfsug-discuss] Rainy days and Mondays and GPFS lying to me always get me down... Message-ID: Hi All, And I?m not really down, but it is a rainy Monday morning here and GPFS did give me a scare in the last hour, so I thought that was a funny subject line. So I have a >1 PB filesystem with 3 pools: 1) the system pool, which contains metadata only, 2) the data pool, which is where all I/O goes to by default, and 3) the capacity pool, which is where old crap gets migrated to. I logged on this morning to see an alert that my data pool was 100% full. I ran an mmdf from the cluster manager and, sure enough: (pool total) 509.3T 0 ( 0%) 0 ( 0%) I immediately tried copying a file to there and it worked, so I figured GPFS must be failing writes over to the capacity pool, but an mmlsattr on the file I copied showed it being in the data pool. Hmmm. I also noticed that ?df -h? said that the filesystem had 399 TB free, while mmdf said it only had 238 TB free. Hmmm. So after some fruitless poking around I decided that whatever was going to happen, I should kill the mmrestripefs I had running on the capacity pool ? let me emphasize that ? I had a restripe running on the capacity pool only (via the ?-P? option to mmrestripefs) but it was the data pool that said it was 100% full. I?m sure many of you have already figured out where this is going ? after killing the restripe I ran mmdf again and: (pool total) 509.3T 159T ( 31%) 1.483T ( 0%) I have never seen anything like this before ? any ideas, anyone? PMR time? Thanks! Kevin From valdis.kletnieks at vt.edu Mon Oct 23 19:13:05 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 23 Oct 2017 14:13:05 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: Message-ID: <32917.1508782385@turing-police.cc.vt.edu> On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From bbanister at jumptrading.com Mon Oct 23 19:23:33 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 23 Oct 2017 18:23:33 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: <32917.1508782385@turing-police.cc.vt.edu> References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: This becomes very disruptive when you have to add or remove many NFS exports. Is it possible to add and remove multiple entries at a time or is this YARFE time? -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of valdis.kletnieks at vt.edu Sent: Monday, October 23, 2017 1:13 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Experience with CES NFS export management On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From stefan.dietrich at desy.de Mon Oct 23 19:34:02 2017 From: stefan.dietrich at desy.de (Dietrich, Stefan) Date: Mon, 23 Oct 2017 20:34:02 +0200 (CEST) Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <2146307210.3678055.1508783642716.JavaMail.zimbra@desy.de> Hello Bryan, at least changing multiple entries at once is possible. You can copy /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf to e.g. /tmp, modify the export (remove/add nodes or options) and load the changed config via "mmnfs export load " That way, only a single restart is issued for Ganesha on the CES nodes. Adding/removing I did not try so far, to be honest for use-cases this is rather static. Regards, Stefan ----- Original Message ----- > From: "Bryan Banister" > To: "gpfsug main discussion list" > Sent: Monday, October 23, 2017 8:23:33 PM > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > This becomes very disruptive when you have to add or remove many NFS exports. 
> Is it possible to add and remove multiple entries at a time or is this YARFE > time? > -Bryan > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org > [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of > valdis.kletnieks at vt.edu > Sent: Monday, October 23, 2017 1:13 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > > On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > >> tests: >> 1. created 1st nfs export - ganesha service was restarted >> 2. created 4 more nfs exports (mmnfs export add path) >> 3. changed 2 nfs exports (mmnfs export change path --nfschange); >> 4. removed all 5 exports one by one (mmnfs export remove path) >> 5. no nfs exports after step 4 on my test system. So, created a new nfs >> export (which will be the 1st nfs export). >> 6. change nfs export created in step 5 > > mmnfs export change --nfsadd seems to generate a restart as well. > Particularly annoying when the currently running nfs.ganesha fails to > stop rpc.statd on the way down, and then bringing it back up fails because > the port is in use.... > > ________________________________ > > Note: This email is for the confidential use of the named addressee(s) only and > may contain proprietary, confidential or privileged information. If you are not > the intended recipient, you are hereby notified that any review, dissemination > or copying of this email is strictly prohibited, and to please notify the > sender immediately and destroy this email and any attachments. Email > transmission cannot be guaranteed to be secure or error-free. The Company, > therefore, does not make any guarantees as to the completeness or accuracy of > this email or any attachments. This email is for informational purposes only > and does not constitute a recommendation, offer, request or solicitation of any > kind to buy, sell, subscribe, redeem or perform any type of transaction of a > financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From valdis.kletnieks at vt.edu Mon Oct 23 19:54:35 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 23 Oct 2017 14:54:35 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <53227.1508784875@turing-police.cc.vt.edu> On Mon, 23 Oct 2017 18:23:33 -0000, Bryan Banister said: > This becomes very disruptive when you have to add or remove many NFS exports. > Is it possible to add and remove multiple entries at a time or is this YARFE time? On the one hand, 'mmnfs export change [path] --nfsadd 'client1(options);client2(options);...)' is supported. On the other hand, after the initial install's rush of new NFS exports, the chances of having more than one client to change at a time are rather low. On the gripping hand, if a client later turns up an entire cluster that needs access, you can also say --nfsadd '172.28.40.0/23(options)' and get the whole cluster in one shot. -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From oehmes at gmail.com Tue Oct 24 01:28:33 2017 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 24 Oct 2017 00:28:33 +0000 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: we can not commit on timelines on mailing lists, but this is a known issue and will be addressed in a future release. sven On Mon, Oct 23, 2017, 11:23 AM Bryan Banister wrote: > This becomes very disruptive when you have to add or remove many NFS > exports. Is it possible to add and remove multiple entries at a time or is > this YARFE time? > -Bryan > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org [mailto: > gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of > valdis.kletnieks at vt.edu > Sent: Monday, October 23, 2017 1:13 PM > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Experience with CES NFS export management > > On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > > > tests: > > 1. created 1st nfs export - ganesha service was restarted > > 2. created 4 more nfs exports (mmnfs export add path) > > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > > 4. removed all 5 exports one by one (mmnfs export remove path) > > 5. no nfs exports after step 4 on my test system. So, created a new nfs > > export (which will be the 1st nfs export). > > 6. change nfs export created in step 5 > > mmnfs export change --nfsadd seems to generate a restart as well. > Particularly annoying when the currently running nfs.ganesha fails to > stop rpc.statd on the way down, and then bringing it back up fails because > the port is in use.... > > ________________________________ > > Note: This email is for the confidential use of the named addressee(s) > only and may contain proprietary, confidential or privileged information. > If you are not the intended recipient, you are hereby notified that any > review, dissemination or copying of this email is strictly prohibited, and > to please notify the sender immediately and destroy this email and any > attachments. Email transmission cannot be guaranteed to be secure or > error-free. The Company, therefore, does not make any guarantees as to the > completeness or accuracy of this email or any attachments. This email is > for informational purposes only and does not constitute a recommendation, > offer, request or solicitation of any kind to buy, sell, subscribe, redeem > or perform any type of transaction of a financial product. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mnaineni at in.ibm.com Tue Oct 24 08:57:29 2017 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Tue, 24 Oct 2017 13:27:29 +0530 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: As others have answered, 4.2.3 spectrum can add or remove exports without restarting nfs-ganesha service. Changing an existing export does need nfs-ganesha restart though. If you want to change multiple existing exports, you could use undocumented option "--nfsnorestart" to mmnfs. 
This should add export changes to NFS configuration but it won't restart nfs-ganesha service, so you will not see immediate results of your changes in the running server. Whenever you want your changes reflected, you could manually restart the service using "mmces" command. Regards, Malahal. From: Bryan Banister To: gpfsug main discussion list Date: 10/23/2017 11:53 PM Subject: Re: [gpfsug-discuss] Experience with CES NFS export management Sent by: gpfsug-discuss-bounces at spectrumscale.org This becomes very disruptive when you have to add or remove many NFS exports. Is it possible to add and remove multiple entries at a time or is this YARFE time? -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of valdis.kletnieks at vt.edu Sent: Monday, October 23, 2017 1:13 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Experience with CES NFS export management On Mon, 23 Oct 2017 17:26:07 +0530, "Chetan R Kulkarni" said: > tests: > 1. created 1st nfs export - ganesha service was restarted > 2. created 4 more nfs exports (mmnfs export add path) > 3. changed 2 nfs exports (mmnfs export change path --nfschange); > 4. removed all 5 exports one by one (mmnfs export remove path) > 5. no nfs exports after step 4 on my test system. So, created a new nfs > export (which will be the 1st nfs export). > 6. change nfs export created in step 5 mmnfs export change --nfsadd seems to generate a restart as well. Particularly annoying when the currently running nfs.ganesha fails to stop rpc.statd on the way down, and then bringing it back up fails because the port is in use.... ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oaQVLOYto6Ftb8wAbynvIiIdh2UEjHxQByDz70-6a_0&m=dhIJJ5KI4U6ZUia7OPi_-AC3qBrYV9n93ww8Ffhl468&s=K4ii44lk1_auA_3g7SN-E1zmMZNtc1PqBSiQJVudc_w&e= -------------- next part -------------- An HTML attachment was scrubbed... 
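Pulling the suggestions in this thread together, a rough sketch of batching several export changes behind a single Ganesha restart could look like the following. This is only an illustration: the export paths and client addresses are made up, "--nfsnorestart" is the undocumented option Malahal mentions above (so its exact behaviour should be verified on your code level), and the mmces invocation should be checked against the man page for your release.

# hypothetical exports/clients; queue up changes without touching the running Ganesha
mmnfs export change /dss/dsstestfs01/example-export1 --nfsnorestart --nfsadd "10.156.29.80(access_type=RW,squash=no_root_squash,protocols=4,transports=tcp,sectype=sys)"
mmnfs export change /dss/dsstestfs01/example-export2 --nfsnorestart --nfsremove 10.156.29.81
# once all changes are queued, pick them up with a single restart of the NFS
# service on the CES nodes (replace -a with -N <node> for a rolling restart)
mmces service stop NFS -a
mmces service start NFS -a

An alternative with the same effect, as Stefan described earlier in the thread, is to copy /var/mmfs/ces/nfs-config/gpfs.ganesha.exports.conf aside, edit all exports in one pass and load the result back with "mmnfs export load", which also triggers only one restart.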
URL: From a.khiredine at meteo.dz Tue Oct 24 10:20:25 2017 From: a.khiredine at meteo.dz (atmane khiredine) Date: Tue, 24 Oct 2017 09:20:25 +0000 Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6340C7B0@SDEB-EXC02.meteo.dz> Dear All we owning a solution for our HPC a GSS gpfs ??storage server native raid I noticed 3 days ago that a disk shows a single path my configuration is as follows GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions if I search with fdisk I have the following result 476 disk in GSS0 and GSS1 with an old file cat mmlspdisk.old ##### replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt,/dev/sdob" << - recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 2 activates 4 total << - while the disk contains the 2 paths ##### ls /dev/sdob /Dev/ sdob ls /dev/sdkt /Dev/sdkt mmlspdisk all >> mmlspdisk.log vi mmlspdisk.log replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt" << --- the disk contains 1 path recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 1 active 3 total here is the result of the log file in GSS1 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################## START LOG GSS1 ##################### 0 result ################# END LOG GSS 1 ##################### here is the result of the log file in GSS0 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################# START LOG GSS 0 ##################### Thu Sep 14 16:35:01.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4673959648 length 4112 err 5. Thu Sep 14 16:35:01.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 16:35:01.787 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 16:35:01.788 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5. Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5. Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5. Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. 
Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? thank you Atmane Khiredine HPC System Administrator | Office National de la M?t?orologie T?l : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz From valdis.kletnieks at vt.edu Tue Oct 24 15:36:46 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 24 Oct 2017 10:36:46 -0400 Subject: [gpfsug-discuss] Experience with CES NFS export management In-Reply-To: References: <32917.1508782385@turing-police.cc.vt.edu> Message-ID: <16412.1508855806@turing-police.cc.vt.edu> On Tue, 24 Oct 2017 13:27:29 +0530, "Malahal R Naineni" said: > If you want to change multiple existing exports, you could use > undocumented option "--nfsnorestart" to mmnfs. This should add export > changes to NFS configuration but it won't restart nfs-ganesha service, so > you will not see immediate results of your changes in the running server. > Whenever you want your changes reflected, you could manually restart the > service using "mmces" command. I owe you a beverage of your choice if we ever are in the same place at the same time - the fact that Ganesha got restarted on all nodes at once thus preventing a rolling restart and avoiding service interruption was the single biggest Ganesha wart we've encountered. :) -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From UWEFALKE at de.ibm.com Tue Oct 24 17:49:19 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Tue, 24 Oct 2017 18:49:19 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Message-ID: Hi, I am about to run nsdperf for testing the IB fabric in a new system comprising ESS (BE) and Intel-based nodes. nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one test using RDMA: client server RDMA x86-64 ppc-64 on crash ppc-64 x86-64 on crash x86-64 ppc-64 off success x86-64 x86-64 on success ppc-64 ppc-64 on success That implies that the nsdperf RDMA test might struggle with BE vs LE. However, I learned from a talk given at a GPFS workshop in Germany in 2015 that RDMA works between Power-BE and Intel boxes. Has anyone made similar or contrary experiences? Is it an nsdperf issue or more general (I have not yet attempted any GPFS mount)? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From olaf.weiser at de.ibm.com Tue Oct 24 20:31:06 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 24 Oct 2017 21:31:06 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From sdenham at gmail.com Tue Oct 24 21:35:40 2017 From: sdenham at gmail.com (Scott D) Date: Tue, 24 Oct 2017 15:35:40 -0500 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes In-Reply-To: References: Message-ID: I have run nsdperf with RDMA enabled against an ESS ppc-64 server without any problems. I don't have access to that system at the moment, and it was running a fairly old (4.1.x) version of GPFS, not that that should matter for nsdperf unless that source code has changed since 4.1. Scott Denham Staff Engineer Cray, Inc On Tue, Oct 24, 2017 at 11:49 AM, Uwe Falke wrote: > Hi, > I am about to run nsdperf for testing the IB fabric in a new system > comprising ESS (BE) and Intel-based nodes. > nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one > test using RDMA: > > client server RDMA > x86-64 ppc-64 on crash > ppc-64 x86-64 on crash > x86-64 ppc-64 off success > x86-64 x86-64 on success > ppc-64 ppc-64 on success > > That implies that the nsdperf RDMA test might struggle with BE vs LE. > However, I learned from a talk given at a GPFS workshop in Germany in 2015 > that RDMA works between Power-BE and Intel boxes. Has anyone made similar > or contrary experiences? Is it an nsdperf issue or more general (I have > not yet attempted any GPFS mount)? > > > > Mit freundlichen Gr??en / Kind regards > > > Dr. 
Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------ > ------------------------------------------------------------ > ------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Thomas Wolter, Sven Schoo? > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Wed Oct 25 09:52:29 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 25 Oct 2017 10:52:29 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes In-Reply-To: References: Message-ID: Hi, Scott, thanks, good to hear that it worked for you. I can at least confirm that GPFS RDMA itself does work between x86-64 clients the ESS here, it appears just nsdperf has an issue in my particular environment. I'll see what IBM support can do for me as Olaf suggested. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Scott D To: gpfsug main discussion list Date: 10/24/2017 10:35 PM Subject: Re: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org I have run nsdperf with RDMA enabled against an ESS ppc-64 server without any problems. I don't have access to that system at the moment, and it was running a fairly old (4.1.x) version of GPFS, not that that should matter for nsdperf unless that source code has changed since 4.1. Scott Denham Staff Engineer Cray, Inc On Tue, Oct 24, 2017 at 11:49 AM, Uwe Falke wrote: Hi, I am about to run nsdperf for testing the IB fabric in a new system comprising ESS (BE) and Intel-based nodes. nsdperf crashes reliably when invoking ESS nodes and x86-64 nodes in one test using RDMA: client server RDMA x86-64 ppc-64 on crash ppc-64 x86-64 on crash x86-64 ppc-64 off success x86-64 x86-64 on success ppc-64 ppc-64 on success That implies that the nsdperf RDMA test might struggle with BE vs LE. However, I learned from a talk given at a GPFS workshop in Germany in 2015 that RDMA works between Power-BE and Intel boxes. Has anyone made similar or contrary experiences? Is it an nsdperf issue or more general (I have not yet attempted any GPFS mount)? 
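For reference, a quick way to confirm that GPFS-level RDMA is really in use between the Intel clients and the ESS (a sketch only -- the exact log wording varies between releases) is to check that the RDMA settings are enabled and that the daemon logged its VERBS startup and connections on both sides:

# show the RDMA-related settings currently configured
mmlsconfig verbsRdma
mmlsconfig verbsPorts
# look for "VERBS RDMA" startup/connection messages on a client and on an ESS node
grep "VERBS RDMA" /var/adm/ras/mmfs.log.latest

If those messages are present on both sides, the daemon itself is using RDMA and the crash is more likely confined to nsdperf.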
Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Tomasz.Wolski at ts.fujitsu.com Wed Oct 25 10:42:02 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 25 Oct 2017 09:42:02 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: <237580bb78cf4d9291c057926c90c265@R01UKEXCASM223.r01.fujitsu.local> Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" > To: gpfsug main discussion list > Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. 
There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. 
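As a practical check, past reconnect events on a node can be spotted with something along these lines (a sketch only; the log location matches the one used elsewhere in this thread, and rotated logs would need to be searched for older events):

grep "Reconnected to" /var/adm/ras/mmfs.log.latest

or, to sweep all nodes in one pass where mmdsh is available:

/usr/lpp/mmfs/bin/mmdsh -N all 'grep "Reconnected to" /var/adm/ras/mmfs.log.latest'

Finding no matches only means that no reconnect has been logged since the last log rotation; it does not remove the need for the fix.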
Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Oct 25 14:09:27 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 25 Oct 2017 09:09:27 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. 
On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. 
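As a quick way to check whether a node has actually gone through a reconnect, the log entry quoted above can be searched for directly. This is only a sketch, assuming the default daemon log location used elsewhere in this thread:

# look for reconnect events in the current and the previous daemon log
grep -H "Reconnected to" /var/adm/ras/mmfs.log.latest /var/adm/ras/mmfs.log.previous

# count them across the cluster, if the mmdsh helper shipped with GPFS is available
mmdsh -N all 'grep -c "Reconnected to" /var/adm/ras/mmfs.log.latest'

A count of zero everywhere does not remove the need for the fix; it only means the risky code path has not been exercised yet.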
Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Wed Oct 25 14:33:46 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 25 Oct 2017 13:33:46 +0000 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Hi Felipe On a related note, would everything in 4.2.3-4 efix2 be included in 4.2.3-5? In particular the previously discussed around defect 1020461? There was no APAR for this defect when I last looked on 30th August. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Felipe Knop Sent: 25 October 2017 14:09 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" > To: gpfsug main discussion list > Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke > wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? 
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca > To: gpfsug main discussion list > Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale > wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" > To: gpfsug main discussion list > Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Oct 25 16:23:42 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 25 Oct 2017 11:23:42 -0400 Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) In-Reply-To: References: <1344123299.1552.1507558135492.JavaMail.webinst@w30112> Message-ID: Richard, I see that 4.2.3-4 efix2 has two defects, 1032655 (IV99796) and 1020461 (IV99675), and both these fixes are included in 4.2.3.5 . 
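For anyone wanting to confirm which level (and therefore which of these fixes) a node is actually running, a minimal check is sketched below; the rpm query only applies to Linux nodes:

# build level reported by the daemon
mmdiag --version

# GPFS packages installed on this node
rpm -qa | grep -i gpfs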
Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Sobey, Richard A" To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 09:34 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Felipe On a related note, would everything in 4.2.3-4 efix2 be included in 4.2.3-5? In particular the previously discussed around defect 1020461? There was no APAR for this defect when I last looked on 30th August. Thanks Richard From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Felipe Knop Sent: 25 October 2017 14:09 To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, The fix (APAR 1IJ00398) has been included in 4.2.3.5, despite the APAR number having been omitted from the list of fixes in the PTF. Regards, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Tomasz.Wolski at ts.fujitsu.com" To: IBM Spectrum Scale , gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Date: 10/25/2017 05:42 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Thank you for the information. I was checking changelog for GPFS 4.2.3.5 standard edition, but it does not mention APAR IJ00398: https://www-01.ibm.com/support/docview.wss?rs=0&uid=isg400003555 This update addresses the following APARs: IJ00031 IJ00094 IJ00397 IV99611 IV99675 IV99676 IV99677 IV99678 IV99679 IV99680 IV99709 IV99796. On the other hand Flash Notification advices upgrading to GPFS 4.2.3.5: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E Could you please verify if that version contains the fix? Best regards, Tomasz Wolski From: Felipe Knop [mailto:knop at us.ibm.com] On Behalf Of IBM Spectrum Scale Sent: Wednesday, October 11, 2017 2:31 PM To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org; Wolski, Tomasz Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Tomasz, Though the error in the NSD RPC has been the most visible manifestation of the problem, this issue could end up affecting other RPCs as well, so the fix should be applied even for SAN configurations. 
Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Tomasz.Wolski at ts.fujitsu.com" To: gpfsug main discussion list Date: 10/11/2017 02:09 AM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, From what I understand this does not affect FC SAN cluster configuration, but mostly NSD IO communication? Best regards, Tomasz Wolski From: gpfsug-discuss-bounces at spectrumscale.org[ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ben De Luca Sent: Wednesday, October 11, 2017 6:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Lyle thanks for the update, has this issue always existed, or just in v4.1 and 4.2? It seems that the likely hood of this event is very low but of course you encourage people to update asap. On 11 October 2017 at 00:15, Uwe Falke wrote: Hi, I understood the failure to occur requires that the RPC payload of the RPC resent without actual header can be mistaken for a valid RPC header. The resend mechanism is probably not considering what the actual content/target the RPC has. So, in principle, the RPC could be to update a data block, or a metadata block - so it may hit just a single data file or corrupt your entire file system. However, I think the likelihood that the RPC content can go as valid RPC header is very low. Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? 
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Ben De Luca To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 10/10/2017 08:52 PM Subject: Re: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org does this corrupt the entire filesystem or just the open files that are being written too? One is horrific and the other is just mildly bad. On 10 October 2017 at 17:09, IBM Spectrum Scale wrote: Bob, The problem may occur when the TCP connection is broken between two nodes. While in the vast majority of the cases when data stops flowing through the connection, the result is one of the nodes getting expelled, there are cases where the TCP connection simply breaks -- that is relatively rare but happens on occasion. There is logic in the mmfsd daemon to detect the disconnection and attempt to reconnect to the destination in question. If the reconnect is successful then steps are taken to recover the state kept by the daemons, and that includes resending some RPCs that were in flight when the disconnection took place. As the flash describes, a problem in the logic to resend some RPCs was causing one of the RPC headers to be omitted, resulting in the RPC data to be interpreted as the (missing) header. Normally the result is an assert on the receiving end, like the "logAssertFailed: !"Request and queue size mismatch" assert described in the flash. However, it's at least conceivable (though expected to very rare) that the content of the RPC data could be interpreted as a valid RPC header. In the case of an RPC which involves data transfer between an NSD client and NSD server, that might result in incorrect data being written to some NSD device. Disconnect/reconnect scenarios appear to be uncommon. An entry like [N] Reconnected to xxx.xxx.xxx.xxx nodename in mmfs.log would be an indication that a reconnect has occurred. By itself, the reconnect will not imply that data or the file system was corrupted, since that will depend on what RPCs were pending when the connection happened. In the case the assert above is hit, no corruption is expected, since the daemon will go down before incorrect data gets written. Reconnects involving an NSD server are those which present the highest risk, given that NSD-related RPCs are used to write data into NSDs Even on clusters that have not been subjected to disconnects/reconnects before, such events might still happen in the future in case of network glitches. It's then recommended that an efix for the problem be applied in a timely fashion. Reference: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010668 Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 10/09/2017 10:38 AM Subject: [gpfsug-discuss] FW: [EXTERNAL] FLASH: IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption (2017.10.09) Sent by: gpfsug-discuss-bounces at spectrumscale.org Can anyone from the Scale team comment? Anytime I see ?may result in file system corruption or undetected file data corruption? it gets my attention. Bob Oesterlin Sr Principal Storage Engineer, Nuance Storage IBM My Notifications Check out the IBM Electronic Support IBM Spectrum Scale : IBM Spectrum Scale (GPFS) V4.1 and 4.2 levels: network reconnect function may result in file system corruption or undetected file data corruption IBM has identified a problem with IBM Spectrum Scale (GPFS) V4.1 and V4.2 levels, in which resending an NSD RPC after a network reconnect function may result in file system corruption or undetected file data corruption. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=xzMAvLVkhyTD1vOuTRa4PJfiWgFQ6VHBQgr1Gj9LPDw&s=-AQv2Qlt2IRW2q9kNgnj331p8D631Zp0fHnxOuVR0pA&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=sYR0tp1p1aSwQn0RM9QP4zPQxUppP2L6GbiIIiozJUI&s=3ZjFj8wR05Z0PLErKreAAKH50vaxxvbM1H-NngJWJwI&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=6Cn5SWezFnXqdilAZWuwqHSTl02jHaLfB0EAjtCdj08&s=k9ELfnZlXmHnLuRR0_ltTav1nm-VsmcC6nhgggynBEo&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=dhKhKiNBptpaDmggHSa8diP48O90VK2uzr-xo9C44uI&s=SCeTu6NeyjHm9D8S4VZVUnrALgCvNksAYTF9rfwD50g&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Wed Oct 25 17:17:09 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Wed, 25 Oct 2017 18:17:09 +0200 Subject: [gpfsug-discuss] nsdperf crash testing RDMA between Power BE and Intel nodes Message-ID: Dear all, through some gpfsperf tests against an ESS block (config as is) I am seeing lots of waiters like NSDThread: on ThCond 0x3FFA800670A0 (FreePTrackCondvar), reason 'wait for free PTrack' That is not on file creation but on writing to an already existing file. what ressource is the system short of here? IMHO it cannot be physical data tracks on pdisks (the test does not allocate any space, just rewrites an existing file)? The only shortage in threads i could see might be Total server worker threads: running 3042, desired 3072, forNSD 2, forGNR 3070, nsdBigBufferSize 16777216 nsdMultiQueue: 512, nsdMultiQueueType: 1, nsdMinWorkerThreads: 3072, nsdMaxWorkerThreads: 3072 where a difference of 30 is between desired and running number of worker threads (but that is only 1% and 30 more would not necessarily make a big difference). Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From vanfalen at mx1.ibm.com Wed Oct 25 22:26:50 2017 From: vanfalen at mx1.ibm.com (Emmanuel Barajas Gonzalez) Date: Wed, 25 Oct 2017 21:26:50 +0000 Subject: [gpfsug-discuss] Fileset quotas enforcement Message-ID: An HTML attachment was scrubbed... URL: From pinto at scinet.utoronto.ca Wed Oct 25 23:18:29 2017 From: pinto at scinet.utoronto.ca (Jaime Pinto) Date: Wed, 25 Oct 2017 18:18:29 -0400 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: <20171025181829.90173xxmr17nklo5@support.scinet.utoronto.ca> Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? 
mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES http://www.scinethpc.ca/testimonials ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. From rohwedder at de.ibm.com Thu Oct 26 08:18:46 2017 From: rohwedder at de.ibm.com (Markus Rohwedder) Date: Thu, 26 Oct 2017 09:18:46 +0200 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: Please also note that when you write as root, you are not restricted by the quota limits. See example: Write as non root user and run into hard limit: [mr at home-11 limited]$ dd if=/dev/urandom of=testfile2 bs=1000 count=100000 dd: error writing ?testfile2?: Disk quota exceeded 29885+0 records in 29884+0 records out 29884000 bytes (30 MB) copied, 3.38491 s, 8.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 322176 214784 322304 128 7 days | 5 0 0 0 none Now write as root user and exceed hard limit: [root at home-11 limited]# dd if=/dev/urandom of=testfile2 bs=1000 count=100000 100000+0 records in 100000+0 records out 100000000 bytes (100 MB) copied, 12.8355 s, 7.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 390656 214784 322304 8939520 7 days | 5 0 0 40 none Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development Phone: +49 7034 6430190 IBM Deutschland E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Martina K?deritz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Jaime Pinto" To: "gpfsug main discussion list" , "Emmanuel Barajas Gonzalez" Cc: gpfsug-discuss at spectrumscale.org Date: 10/26/2017 12:18 AM Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. 
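A minimal run of that check against the file system from the quoted mmrepquota output, purely as a sketch (the device name smfslv0 is taken from that output; -v only makes the corrections visible):

# recompute the in-doubt quota accounting for the whole device
mmcheckquota -v smfslv0

# then re-check the fileset limits
mmrepquota -j smfslv0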
Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES https://urldefense.proofpoint.com/v2/url?u=http-3A__www.scinethpc.ca_testimonials&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=mno73q7nVmqF-YTzo7pAF50B_45epo6ZzccTAwcroRU&e= ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=F6chd4olb-U3RlL5pcqBPagIccRSJd963K0FFS7auko&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 18932891.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From TOMP at il.ibm.com Thu Oct 26 10:09:56 2017 From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 26 Oct 2017 12:09:56 +0300 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: Message-ID: And this behavior can be changed using the enforceFilesetQuotaOnRoot options ( check mmchconfig man page) Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Markus Rohwedder" To: gpfsug main discussion list Cc: gpfsug-discuss-bounces at spectrumscale.org Date: 26/10/2017 10:18 Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Please also note that when you write as root, you are not restricted by the quota limits. See example: Write as non root user and run into hard limit: [mr at home-11 limited]$ dd if=/dev/urandom of=testfile2 bs=1000 count=100000 dd: error writing ?testfile2?: Disk quota exceeded 29885+0 records in 29884+0 records out 29884000 bytes (30 MB) copied, 3.38491 s, 8.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 322176 214784 322304 128 7 days | 5 0 0 0 none Now write as root user and exceed hard limit: [root at home-11 limited]# dd if=/dev/urandom of=testfile2 bs=1000 count=100000 100000+0 records in 100000+0 records out 100000000 bytes (100 MB) copied, 12.8355 s, 7.8 MB/s [root at home-11 limited]# mmrepquota gpfs0 Block Limits | File Limits Name type KB quota limit in_doubt grace | files quota limit in_doubt grace ... limited FILESET 390656 214784 322304 8939520 7 days | 5 0 0 40 none Mit freundlichen Gr??en / Kind regards Dr. Markus Rohwedder Spectrum Scale GUI Development Phone: +49 7034 6430190 IBM Deutschland E-Mail: rohwedder at de.ibm.com Am Weiher 24 65451 Kelsterbach Germany IBM Deutschland Research & Development GmbH / Vorsitzender des Aufsichtsrats: Martina K?deritz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 "Jaime Pinto" ---10/26/2017 12:18:45 AM---Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the l From: "Jaime Pinto" To: "gpfsug main discussion list" , "Emmanuel Barajas Gonzalez" Cc: gpfsug-discuss at spectrumscale.org Date: 10/26/2017 12:18 AM Subject: Re: [gpfsug-discuss] Fileset quotas enforcement Sent by: gpfsug-discuss-bounces at spectrumscale.org Did you try to run mmcheckquota on the device I observed that in the most recent versions (for the last 3 years) there is a real long lag for GPFS to process the internal accounting. So there is a slippage effects that skews quota operations. mmcheckquota is supposed to reset and zero all those cumulative deltas effective immediately. Jaime Quoting "Emmanuel Barajas Gonzalez" : > Hello spectrum scale team! I'm working on the implementation of > quotas per fileset and I followed the basic instructions described in > the documentation. Currently the gpfs device has per-fileset quotas > and there is one fileset with a block soft and a hard limit set. My > problem is that I'm being able to write more and more files beyond > the quota (the grace period has expired as well). 
How can I make > sure quotas will be enforced and that no user will be able to consume > more space than specified? mmrepquota smfslv0 > Block Limits > | > Name fileset type KB quota limit > in_doubt grace | > root root USR 512 0 0 > 0 none | > root cp1 USR 64128 0 0 > 0 none | > system root GRP 512 0 0 > 0 none | > system cp1 GRP 64128 0 0 > 0 none | > valid root GRP 0 0 0 > 0 none | > root root FILESET 512 0 0 > 0 none | > cp1 root FILESET 64128 2048 2048 > 0 expired | > Thanks in advance ! Best regards, > __________________________________________________________________________________ > Emmanuel Barajas Gonzalez TRANSPARENT CLOUD TIERING FOR DS8000 > > Phone: > 52-33-3669-7000 x5547 E-mail: vanfalen at mx1.ibm.com[1] Follow me: > @van_falen > 2200 Camino A El Castillo > El Salto, JAL 45680 > Mexico > > > > > Links: > ------ > [1] mailto:vanfalen at mx1.ibm.com > ************************************ TELL US ABOUT YOUR SUCCESS STORIES https://urldefense.proofpoint.com/v2/url?u=http-3A__www.scinethpc.ca_testimonials&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=mno73q7nVmqF-YTzo7pAF50B_45epo6ZzccTAwcroRU&e= ************************************ --- Jaime Pinto SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 ---------------------------------------------------------------- This message was sent using IMP at SciNet Consortium, University of Toronto. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=2KW8at7r2duccqfXKnq1b4-HmXZvC48q9hZ-RpiIFZ0&s=F6chd4olb-U3RlL5pcqBPagIccRSJd963K0FFS7auko&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=RxLph-CHLj5Iq5-RYe9eqHId7vsI_uuX4W-Y145ETD8&s=3cgWIXnSFvb65_5JkJDygm3hnSOeeCfYnDnPJdX-hWY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Thu Oct 26 10:16:20 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 26 Oct 2017 09:16:20 +0000 Subject: [gpfsug-discuss] Windows [10] and Spectrum Scale Message-ID: Hi all In the FAQ I note that Windows 10 is not supported at all, and neither is encryption on Windows nodes generally. However the context here is Spectrum Scale v4. Can I take it to mean that this also applies to Scale 4.1/4.2/...? Thanks Richard -------------- next part -------------- An HTML attachment was scrubbed... URL: From vanfalen at mx1.ibm.com Thu Oct 26 14:50:05 2017 From: vanfalen at mx1.ibm.com (Emmanuel Barajas Gonzalez) Date: Thu, 26 Oct 2017 13:50:05 +0000 Subject: [gpfsug-discuss] Fileset quotas enforcement In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46716A4E467141C003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4642530E4641FB0003256C7C22581C5.gif Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E463FD50E463FAF8003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46402D0E4640078003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4641128E4640ED0003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E46416A8E4641450003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E4644278E4643FF0003256C7C22581C5.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image._1_E460E9D8E466F160003256C7C22581C5.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Thu Oct 26 18:03:58 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 26 Oct 2017 17:03:58 +0000 Subject: [gpfsug-discuss] Gartner 2017 - Distributed File systems and Object Storage Message-ID: Interesting read: https://www.gartner.com/doc/reprints?id=1-4IE870C&ct=171017&st=sb Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From Christian.Fey at sva.de Fri Oct 27 07:30:31 2017 From: Christian.Fey at sva.de (Fey, Christian) Date: Fri, 27 Oct 2017 06:30:31 +0000 Subject: [gpfsug-discuss] how to deal with custom samba options in ces Message-ID: <2ee0987ac21d409b948ae13fbe3d9e97@sva.de> Hi all, I'm just in the process of migration different samba clusters to ces and I recognized, that some clusters have options set like "strict locking = yes" and I'm not sure how to deal with this. From what I know, there is no "CES way" to set every samba option. It would be possible to set with "net" commands I think but probably this will lead to an unsupported state. Anyone came through this? 
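For reference, the SMB options that CES itself manages can be listed with mmsmb before deciding anything, and a direct registry change (the "net" route mentioned above, with the same support caveat) would look roughly like the sketch below; the /usr/lpp/mmfs/bin paths are an assumption based on the gpfs.smb packaging:

# list the SMB options Spectrum Scale currently manages
/usr/lpp/mmfs/bin/mmsmb config list

# what a direct registry change would look like
/usr/lpp/mmfs/bin/net conf setparm global "strict locking" yes
/usr/lpp/mmfs/bin/net conf list | grep -i "strict locking"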
Mit freundlichen Gr??en / Best Regards Christian Fey SVA System Vertrieb Alexander GmbH Borsigstra?e 14 65205 Wiesbaden Tel.: +49 6122 536-0 Fax: +49 6122 536-399 E-Mail: christian.fey at sva.de http://www.sva.de Gesch?ftsf?hrung: Philipp Alexander, Sven Eichelbaum Sitz der Gesellschaft: Wiesbaden Registergericht: Amtsgericht Wiesbaden, HRB 10315 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From sannaik2 at in.ibm.com Fri Oct 27 08:06:50 2017 From: sannaik2 at in.ibm.com (Sandeep Naik1) Date: Fri, 27 Oct 2017 12:36:50 +0530 Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk In-Reply-To: References: Message-ID: Hi Atmane, The missing path from old mmlspdisk (/dev/sdob) and the log file (/dev/sdge) do not match. This may be because server was booted after the old mmlspdisk was taken. The path name are not guarantied across reboot. The log is reporting problem with /dev/sdge. You should check if OS can see path /dev/sdge (use lsscsi). If the disk is accessible from other path than I don't believe it is problem with the disk. Thanks, Sandeep Naik Elastic Storage server / GPFS Test ETZ-B, Hinjewadi Pune India (+91) 8600994314 From: atmane khiredine To: "gpfsug-discuss at spectrumscale.org" Date: 24/10/2017 02:50 PM Subject: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk Sent by: gpfsug-discuss-bounces at spectrumscale.org Dear All we owning a solution for our HPC a GSS gpfs ??storage server native raid I noticed 3 days ago that a disk shows a single path my configuration is as follows GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions if I search with fdisk I have the following result 476 disk in GSS0 and GSS1 with an old file cat mmlspdisk.old ##### replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt,/dev/sdob" << - recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 2 activates 4 total << - while the disk contains the 2 paths ##### ls /dev/sdob /Dev/ sdob ls /dev/sdkt /Dev/sdkt mmlspdisk all >> mmlspdisk.log vi mmlspdisk.log replacementPriority = 1000 name = "e3d5s05" device = "/dev/sdkt" << --- the disk contains 1 path recoveryGroup = "BB1RGL" declusteredArray = "DA2" state = "ok" userLocation = "Enclosure 2021-20E-SV25262728 Drawer 5 Slot 5" userCondition = "normal" nPaths = 1 active 3 total here is the result of the log file in GSS1 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################## START LOG GSS1 ##################### 0 result ################# END LOG GSS 1 ##################### here is the result of the log file in GSS0 grep e3d5s05 /var/adm/ras/mmfs.log.latest ################# START LOG GSS 0 ##################### Thu Sep 14 16:35:01.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4673959648 length 4112 err 5. Thu Sep 14 16:35:01.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 16:35:01.787 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 16:35:01.788 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. 
Thu Sep 14 16:35:03.709 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:53:13.209 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3658399408 length 4112 err 5. Thu Sep 14 17:53:13.210 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:53:15.685 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:56:10.410 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 796658640 length 4112 err 5. Thu Sep 14 17:56:10.411 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Thu Sep 14 17:56:10.593 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 738304 length 512 err 5. Thu Sep 14 17:56:11.236 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 17:56:11.237 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Thu Sep 14 17:56:13.127 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Thu Sep 14 17:59:14.322 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Thu Sep 14 18:02:16.580 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:08:01.464 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 682228176 length 4112 err 5. Fri Sep 15 00:08:01.465 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:08:03.391 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:21:41.785 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4063038688 length 4112 err 5. Fri Sep 15 00:21:41.786 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:21:42.559 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:21:42.560 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:21:44.336 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:36:11.899 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2503485424 length 4112 err 5. Fri Sep 15 00:36:11.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:36:12.676 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:36:12.677 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:36:14.458 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:40:16.038 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4113538928 length 4112 err 5. Fri Sep 15 00:40:16.039 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. 
Fri Sep 15 00:40:16.801 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:40:16.802 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:40:18.307 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:47:11.468 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4185195728 length 4112 err 5. Fri Sep 15 00:47:11.469 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:47:12.238 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:47:12.239 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:47:13.995 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:51:01.323 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 1637135520 length 4112 err 5. Fri Sep 15 00:51:01.324 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:51:01.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:51:01.487 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:51:03.437 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:55:27.595 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 3646618336 length 4112 err 5. Fri Sep 15 00:55:27.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 00:55:27.749 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 00:55:27.750 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 00:55:29.675 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 00:58:29.900 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:15:44.428 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 768931040 length 4112 err 5. Fri Sep 15 02:15:44.429 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Fri Sep 15 02:15:44.596 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:15:44.597 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Fri Sep 15 02:15:46.486 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Fri Sep 15 02:18:46.826 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:21:47.317 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). 
Fri Sep 15 02:24:47.723 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Fri Sep 15 02:27:48.152 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Fri Sep 15 02:30:48.392 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:18.434 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 2733386136 length 264 err 5. Sun Sep 24 15:40:18.435 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:19.326 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:41.619 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 3021316920 length 520 err 5. Sun Sep 24 15:40:41.620 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:42.446 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:40:57.977 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on read: sector 4939800712 length 264 err 5. Sun Sep 24 15:40:57.978 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Sun Sep 24 15:40:58.133 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:40:58.134 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to error. Sun Sep 24 15:40:58.984 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. Sun Sep 24 15:44:00.932 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b03 (ACK/NAK timeout). Sun Sep 24 15:47:02.352 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Sun Sep 24 15:50:03.149 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x28 Read (10): Check Condition: skey=0x0b (aborted command) asc/ascq=0x4b04 (NAK received). Mon Sep 25 08:31:07.906 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge: I/O error on write: sector 942033152 length 264 err 5. Mon Sep 25 08:31:07.907 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from ok to diagnosing. Mon Sep 25 08:31:07.908 2017: [D] Pdisk e3d5s05 of RG BB1RGL path //gss0-ib0/dev/sdge: SCSI op=0x00 Test Unit Ready: Ioctl or RPC Failed: err=19. Mon Sep 25 08:31:07.909 2017: [D] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge status changed from active to noDevice. Mon Sep 25 08:31:07.910 2017: [E] Pdisk e3d5s05 of RG BB1RGL path /dev/sdge failed; location 'SV25262728-5-5'. Mon Sep 25 08:31:08.770 2017: [D] Pdisk e3d5s05 of RG BB1RGL state changed from diagnosing to ok. ################## END LOG ##################### is it a HW or SW problem? 
thank you

Atmane Khiredine
HPC System Administrator | Office National de la Météorologie
Tél : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From christof.schmitt at us.ibm.com  Fri Oct 27 20:48:08 2017
From: christof.schmitt at us.ibm.com (Christof Schmitt)
Date: Fri, 27 Oct 2017 19:48:08 +0000
Subject: [gpfsug-discuss] how to deal with custom samba options in ces
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...
URL:

From johnbent at gmail.com  Sat Oct 28 05:15:59 2017
From: johnbent at gmail.com (John Bent)
Date: Fri, 27 Oct 2017 22:15:59 -0600
Subject: [gpfsug-discuss] Announcing IO-500 and soliciting submissions
Message-ID:

Hello GPFS community,

After BoFs at last year's SC and the last two ISCs, the IO-500 is formalized and is now accepting submissions in preparation for our first IO-500 list at this year's SC BoF:
http://sc17.supercomputing.org/presentation/?id=bof108&sess=sess319

The goal of the IO-500 is simple: to improve parallel file systems by ensuring that sites publish results of both "hero" and "anti-hero" runs and by sharing the tuning and configuration they applied to achieve those results.

After receiving feedback from a few trial users, the framework is significantly improved:

> git clone https://github.com/VI4IO/io-500-dev
> cd io-500-dev
> ./utilities/prepare.sh
> ./io500.sh
> # tune and rerun
> # email results to submit at io500.org

This should get a very small toy problem up and running quickly, perhaps with a bit of tweaking; please consult our 'doc' directory for troubleshooting. It then becomes a bit challenging to tune the problem size, as well as the underlying file system configuration (e.g. striping parameters), to get a valid, and impressive, result.

The basic format of the benchmark is to run both a "hero" and an "anti-hero" IOR test as well as a "hero" and an "anti-hero" mdtest. The write/create phase of these tests must last for at least five minutes to ensure that the test is not measuring cache speeds.

One of the more challenging aspects is the requirement to search through the metadata of the files that this benchmark creates. Currently we provide a simple serial version of this test (i.e. the GNU find command) as well as a simple Python MPI parallel tree walking program. Even with the MPI program, the find can take an extremely long time to finish. You are encouraged to replace these provided tools with anything of your own devising that satisfies the required functionality. This is one area where we particularly hope to foster innovation, as we have heard from many file system admins that metadata search in current parallel file systems can be painfully slow.

Now is your chance to show the community just how awesome we all know GPFS to be. We are excited to introduce this benchmark and foster this community. We hope you give the benchmark a try and join our community if you haven't already. Please let us know right away, in any of our various communications channels (as described in our documentation), if you encounter any problems with the benchmark, have questions about tuning, or have suggestions for others. We hope to see your results in email and to see you in person at the SC BoF.

Thanks,

IO 500 Committee
John Bent, Julian Kunkle, Jay Lofstead
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
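One GPFS-native way to attack the metadata-search phase mentioned above is the Spectrum Scale policy engine rather than a tree walk. The sketch below is only an illustration and is not part of the IO-500 kit: the list name, the file-name pattern, the file system path and the output prefix are all placeholder assumptions, and a real submission would still have to report whatever the IO-500 find rules require.

------------
/* io500-find.policy -- hypothetical rule set: list files by name pattern */
RULE EXTERNAL LIST 'io500found' EXEC ''
RULE 'io500find' LIST 'io500found' WHERE NAME LIKE 'io500%'
------------
# scan the tree in parallel and keep the matching-file lists under the -f prefix
mmapplypolicy /gpfs/fs0/io500-run -P io500-find.policy -f /tmp/io500found -I defer -N all
------------

On a large namespace a deferred LIST scan like this typically finishes far faster than a serial find, which is presumably why the IO-500 authors invite site-specific replacements here.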
From a.khiredine at meteo.dz  Sat Oct 28 08:29:49 2017
From: a.khiredine at meteo.dz (atmane khiredine)
Date: Sat, 28 Oct 2017 07:29:49 +0000
Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 69, Issue 54
In-Reply-To:
References:
Message-ID: <4B32CB5C696F2849BDEF7DF9EACE884B6340D83B@SDEB-EXC02.meteo.dz>

Dear Sandeep Naik,

Thank you for that answer. The OS can see all the paths, but GSS sees only one path for this one disk: lsscsi shows that I have 238 disks (6 SSD and 232 HDD), yet mmlspdisk all still reports only one path for this pdisk. I think it is a disk problem, even though the disk is still seen through the other path; if it were a SAS cable problem, then logically all the disks connected to that cable would show a single path. Do you have any ideas?

GSS configuration: 4 enclosures, 6 SSDs, 2 empty slots, 238 total disks, 0 NVRAM partitions

Atmane Khiredine
HPC System Administrator | Office National de la Météorologie
Tél : +213 21 50 73 93 # 303 | Fax : +213 21 50 79 40 | E-mail : a.khiredine at meteo.dz
________________________________________
From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of gpfsug-discuss-request at spectrumscale.org [gpfsug-discuss-request at spectrumscale.org]
Sent: Friday, 27 October 2017 08:06
To: gpfsug-discuss at spectrumscale.org
Subject: gpfsug-discuss Digest, Vol 69, Issue 54

Send gpfsug-discuss mailing list submissions to
        gpfsug-discuss at spectrumscale.org

To subscribe or unsubscribe via the World Wide Web, visit
        http://gpfsug.org/mailman/listinfo/gpfsug-discuss
or, via email, send a message with subject or body 'help' to
        gpfsug-discuss-request at spectrumscale.org

You can reach the person managing the list at
        gpfsug-discuss-owner at spectrumscale.org

When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..."

Today's Topics:

   1. Gartner 2017 - Distributed File systems and Object Storage (Oesterlin, Robert)
   2. how to deal with custom samba options in ces (Fey, Christian)
   3. Re: GSS GPFS Storage Server show one path for one Disk (Sandeep Naik1)

----------------------------------------------------------------------

Message: 1
Date: Thu, 26 Oct 2017 17:03:58 +0000
From: "Oesterlin, Robert"
To: gpfsug main discussion list
Subject: [gpfsug-discuss] Gartner 2017 - Distributed File systems and Object Storage
Message-ID:
Content-Type: text/plain; charset="utf-8"

Interesting read:

https://www.gartner.com/doc/reprints?id=1-4IE870C&ct=171017&st=sb

Bob Oesterlin
Sr Principal Storage Engineer, Nuance
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

------------------------------

Message: 2
Date: Fri, 27 Oct 2017 06:30:31 +0000
From: "Fey, Christian"
To: gpfsug main discussion list
Subject: [gpfsug-discuss] how to deal with custom samba options in ces
Message-ID: <2ee0987ac21d409b948ae13fbe3d9e97 at sva.de>
Content-Type: text/plain; charset="iso-8859-1"

Hi all,

I'm just in the process of migrating different Samba clusters to CES, and I noticed that some clusters have options set like "strict locking = yes"; I'm not sure how to deal with this. As far as I know, there is no "CES way" to set every Samba option. It would probably be possible to set them with "net" commands, but that will likely lead to an unsupported state. Has anyone come across this?

Mit freundlichen Grüßen / Best Regards

Christian Fey
SVA System Vertrieb Alexander GmbH
Borsigstraße 14
65205 Wiesbaden
Tel.: +49 6122 536-0
Fax: +49 6122 536-399
E-Mail: christian.fey at sva.de
http://www.sva.de
Geschäftsführung: Philipp Alexander, Sven Eichelbaum
Sitz der Gesellschaft: Wiesbaden
Registergericht: Amtsgericht Wiesbaden, HRB 10315
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: smime.p7s
Type: application/x-pkcs7-signature
Size: 5467 bytes
Desc: not available
URL:
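As background to Christian's question: the registry-based "net" interface he mentions does exist in stock Samba, and a rough sketch of it is below. The option name is simply copied from his mail, and whether hand-editing the registry this way keeps a CES cluster in a supported state is exactly the open question in this thread, so treat it as an illustration only and confirm with IBM support first.

------------
# read-only: show what the cluster currently has in the Samba registry
net conf list

# hypothetical example of setting and removing a global option by hand
net conf setparm global 'strict locking' yes
net conf delparm global 'strict locking'
------------

CES also ships the mmsmb command for the options IBM exposes officially; anything not reachable through mmsmb is where the supportability concern starts.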
------------------------------

Message: 3
Date: Fri, 27 Oct 2017 12:36:50 +0530
From: "Sandeep Naik1"
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] GSS GPFS Storage Server show one path for one Disk
Message-ID:
Content-Type: text/plain; charset="utf-8"

Hi Atmane,

The missing path from the old mmlspdisk output (/dev/sdob) and the path in the log file (/dev/sdge) do not match. This may be because the server was rebooted after the old mmlspdisk output was captured; device path names are not guaranteed to stay the same across reboots. The log is reporting a problem with /dev/sdge, so you should check whether the OS can still see the path /dev/sdge (use lsscsi). If the disk is accessible through the other path, then I don't believe the problem is with the disk itself.

Thanks,

Sandeep Naik
Elastic Storage server / GPFS Test
ETZ-B, Hinjewadi Pune India
(+91) 8600994314
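A minimal sketch of the checks Sandeep suggests, run on the GSS server that owns the paths; the device and pdisk names are simply the ones quoted in this thread and will differ on another system.

------------
# does the OS still see the path named in the mmfs log?
lsscsi | grep -w sdge
ls -l /dev/sdge /dev/sdkt

# how many SCSI devices does the OS see in total?
lsscsi | wc -l

# what does GPFS Native RAID itself currently report for the pdisk?
mmlspdisk BB1RGL --pdisk e3d5s05
------------

If the OS still shows the device but mmlspdisk keeps reporting only one active path, the NAK/ACK-timeout errors earlier in the log make the SAS path (cable, expander) worth ruling out before replacing the disk.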
------------------------------

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

End of gpfsug-discuss Digest, Vol 69, Issue 54
**********************************************

From r.sobey at imperial.ac.uk  Mon Oct 30 15:32:10 2017
From: r.sobey at imperial.ac.uk (Sobey, Richard A)
Date: Mon, 30 Oct 2017 15:32:10 +0000
Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709
Message-ID:

All,

Since upgrading to Windows 10 build 1709 aka Autumn Creator's Update our Previous Versions is wonky... as in you just get a flat list of previous versions^^snapshots with no date associated with them. I am not able to test this on a server with the latest Samba release so at the moment I'm stuck reporting that GPFS 4.2.3-4 with smb 4.5.10 doesn't play nicely with Windows 10 1709. Screenshot is attached for an example. Can anyone corroborate my findings?

Thanks

Richard
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: prv-ver.png
Type: image/png
Size: 16452 bytes
Desc: prv-ver.png
URL:
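The dates Windows shows under Previous Versions are derived from the snapshot list the SMB server exposes for the share. Assuming the CES SMB stack builds those entries from the file system's snapshots in the usual Samba shadow-copy fashion (an assumption - the traces requested in the reply below are the authoritative way to see what is actually sent to the client), a quick look at both sides might be the following; gpfs0 is a placeholder name and the Samba tool paths on a CES node may differ.

------------
# list the snapshots behind the exported file system and check their names
mmlssnapshot gpfs0

# check which shadow-copy/snapshot related SMB options are in effect
# (if the configuration lives in the Samba registry, 'net conf list' shows it instead)
testparm -s 2>/dev/null | grep -i shadow
------------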
From christof.schmitt at us.ibm.com  Mon Oct 30 20:25:26 2017
From: christof.schmitt at us.ibm.com (Christof Schmitt)
Date: Mon, 30 Oct 2017 20:25:26 +0000
Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709
In-Reply-To:
References:
Message-ID:

An HTML attachment was scrubbed...
URL:

From peter.smith at framestore.com  Tue Oct 31 13:10:47 2017
From: peter.smith at framestore.com (Peter Smith)
Date: Tue, 31 Oct 2017 13:10:47 +0000
Subject: [gpfsug-discuss] FreeBSD client?
Message-ID:

Hi

Does such a thing exist? :-)

TIA

--
Peter Smith · Senior Systems Engineer
Framestore
London · New York · Los Angeles · Chicago · Montréal
T +44 (0)20 7344 8000 · M +44 (0)7816 123009
19-23 Wells Street, London W1T 3PQ
Twitter · Facebook · framestore.com
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From r.sobey at imperial.ac.uk  Tue Oct 31 14:20:27 2017
From: r.sobey at imperial.ac.uk (Sobey, Richard A)
Date: Tue, 31 Oct 2017 14:20:27 +0000
Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709
In-Reply-To:
References:
Message-ID:

Thanks Christof, will do.

Richard

From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Christof Schmitt
Sent: 30 October 2017 20:25
To: gpfsug-discuss at spectrumscale.org
Cc: gpfsug-discuss at spectrumscale.org
Subject: Re: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709

Richard,

in a quick test with Windows 10 Pro 1709 connecting to gpfs.smb 4.5.10_gpfs_21 I do not see the problem from the screenshot. All files reported in "Previous Versions" have a date associated.

For debugging the problem on your system, I would suggest enabling traces and recreating the problem. Replace the x.x.x.x with the IP address of the Windows 10 client:

mmprotocoltrace start network -c x.x.x.x
mmprotocoltrace start smb -c x.x.x.x
(open the "Previous Versions" dialog)
mmprotocoltrace stop smb
mmprotocoltrace stop network

The best way to track the analysis would be through a PMR.

Regards,

Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ
christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469)

----- Original message -----
From: "Sobey, Richard A"
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: "'gpfsug-discuss at spectrumscale.org'"
Cc:
Subject: [gpfsug-discuss] Snapshots / previous versions issues in Windows10 1709
Date: Mon, Oct 30, 2017 8:32 AM

All,

Since upgrading to Windows 10 build 1709 aka Autumn Creator's Update our Previous Versions is wonky... as in you just get a flat list of previous versions^^snapshots with no date associated with them. I am not able to test this on a server with the latest Samba release so at the moment I'm stuck reporting that GPFS 4.2.3-4 with smb 4.5.10 doesn't play nicely with Windows 10 1709.
Thanks Richard _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=5Nn7eUPeYe291x8f39jKybESLKv_W_XtkTkS8fTR-NI&m=Bfd_a1yscUVzXzIRuwarah8UedH7U1Uln5AFFPQayR4&s=URMLuAJbrlEOj4xt3_7_Cm0Rj9DfFovuEUOGc4zQUUY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From skylar2 at u.washington.edu Tue Oct 31 14:41:58 2017 From: skylar2 at u.washington.edu (Skylar Thompson) Date: Tue, 31 Oct 2017 07:41:58 -0700 Subject: [gpfsug-discuss] FreeBSD client? In-Reply-To: References: Message-ID: <20171031144158.GC17659@illiuin> I doubt it, since IBM would need to tailor a kernel layer for FreeBSD (not the kind of thing you can run with the x86 Linux userspace emulation in FreeBSD), which would be a lot of work for not a lot of demand. On Tue, Oct 31, 2017 at 01:10:47PM +0000, Peter Smith wrote: > Hi > > Does such a thing exist? :-) -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine