From TOMP at il.ibm.com Thu Nov 1 07:37:03 2018
From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 1 Nov 2018 09:37:03 +0200 Subject: [gpfsug-discuss] V5 client limit? In-Reply-To: References: Message-ID:

Kristy, If you mean the maximum number of nodes that can mount a filesystem (which implies a limit on the number of nodes in related clusters), then the number hasn't changed since 3.4.0.13 - and it's still 16384. Just to clarify, this is the theoretical limit - I don't think anyone has tried more than 14-15k nodes.

Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625

From: Kristy Kallback-Rose To: gpfsug main discussion list Date: 31/10/2018 23:08 Subject: [gpfsug-discuss] V5 client limit? Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi, Can someone tell me the max # of GPFS native clients under 5.x? Everything I can find is dated. Thanks Kristy

_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From kkr at lbl.gov Thu Nov 1 18:31:41 2018
From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Thu, 1 Nov 2018 11:31:41 -0700 Subject: [gpfsug-discuss] V5 client limit? In-Reply-To: References: Message-ID: <58DAFCE0-DECF-4612-8704-81C025069584@lbl.gov>

Yes, OK. I was wondering if there was an updated number with v5. That answers it. Thank you, Kristy

> On Nov 1, 2018, at 12:37 AM, Tomer Perry wrote: > > Kristy, > > If you mean the maximum number of nodes that can mount a filesystem (which implies a limit on the number of nodes in related clusters), then the number hasn't changed since 3.4.0.13 - and it's still 16384. > Just to clarify, this is the theoretical limit - I don't think anyone has tried more than 14-15k nodes. > > > Regards, > > Tomer Perry > Scalable I/O Development (Spectrum Scale) > email: tomp at il.ibm.com > 1 Azrieli Center, Tel Aviv 67021, Israel > Global Tel: +1 720 3422758 > Israel Tel: +972 3 9188625 > Mobile: +972 52 2554625 > > > > > From: Kristy Kallback-Rose > To: gpfsug main discussion list > Date: 31/10/2018 23:08 > Subject: [gpfsug-discuss] V5 client limit? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > Can someone tell me the max # of GPFS native clients under 5.x? Everything I can find is dated. > > Thanks > Kristy > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part -------------- An HTML attachment was scrubbed... URL:

From Robert.Oesterlin at nuance.com Thu Nov 1 18:45:35 2018
From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 1 Nov 2018 18:45:35 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas Message-ID: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com>

Anyone in or familiar with the Dallas area who can suggest the best option here: Van shuttle, Train, Uber/Lyft?

Bob Oesterlin Sr Principal Storage Engineer, Nuance

-------------- next part -------------- An HTML attachment was scrubbed...
URL: From novosirj at rutgers.edu Thu Nov 1 22:40:21 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 1 Nov 2018 22:40:21 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas In-Reply-To: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com> References: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com> Message-ID: <3FE0DD08-C334-4243-B169-9DEBBEBE71EC@rutgers.edu> I?m not going this year, or local to Dallas, but I do travel and have a lot of experience traveling from airports to city centers. If I were going, I?d take the DART Orange Line. Looks like a 52 minute ride ? where you get off probably depends on your hotel, but I put in the convention center here: https://www.google.com/maps/dir/Kay+Bailey+Hutchison+Convention+Center+Dallas,+South+Griffin+Street,+Dallas,+TX/DFW+Terminal+A,+2040+S+International+Pkwy,+Irving,+TX+75063/@32.9109009,-97.0712812,13z/am=t/data=!4m14!4m13!1m5!1m1!1s0x864e991a403efaa9:0xae0261a23eab57d2!2m2!1d-96.8002849!2d32.7743895!1m5!1m1!1s0x864c2a4300afd38d:0x3e0ecb50c933781d!2m2!1d-97.0357045!2d32.9048736!3e3 I don?t personally do business with UBER or Lyft ? I feel like the ?gig economy? is just another way people are getting ripped off and don?t want to be a part of it. > On Nov 1, 2018, at 2:45 PM, Oesterlin, Robert wrote: > > Anyone in the Dallas area/familiar that can suggest the best option here: Van shuttle, Train, Uber/Lyft? > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance -- ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From babbott at oarc.rutgers.edu Fri Nov 2 03:30:31 2018 From: babbott at oarc.rutgers.edu (Bill Abbott) Date: Fri, 2 Nov 2018 03:30:31 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas In-Reply-To: <3FE0DD08-C334-4243-B169-9DEBBEBE71EC@rutgers.edu> References: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com> <3FE0DD08-C334-4243-B169-9DEBBEBE71EC@rutgers.edu> Message-ID: <5BDBC4D8.1080003@oarc.rutgers.edu> SuperShuttle is $40-50 round trip, quick, reliable and in pretty much every city. Bill On 11/1/18 6:40 PM, Ryan Novosielski wrote: > I?m not going this year, or local to Dallas, but I do travel and have a lot of experience traveling from airports to city centers. If I were going, I?d take the DART Orange Line. Looks like a 52 minute ride ? where you get off probably depends on your hotel, but I put in the convention center here: > > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.google.com%2Fmaps%2Fdir%2FKay%2BBailey%2BHutchison%2BConvention%2BCenter%2BDallas%2C%2BSouth%2BGriffin%2BStreet%2C%2BDallas%2C%2BTX%2FDFW%2BTerminal%2BA%2C%2B2040%2BS%2BInternational%2BPkwy%2C%2BIrving%2C%2BTX%2B75063%2F%4032.9109009%2C-97.0712812%2C13z%2Fam%3Dt%2Fdata%3D!4m14!4m13!1m5!1m1!1s0x864e991a403efaa9%3A0xae0261a23eab57d2!2m2!1d-96.8002849!2d32.7743895!1m5!1m1!1s0x864c2a4300afd38d%3A0x3e0ecb50c933781d!2m2!1d-97.0357045!2d32.9048736!3e3&data=02%7C01%7Cbabbott%40rutgers.edu%7Ce04f2c06af1440bdd05e08d6407132df%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636767252267150866&sdata=xvKo%2BtQo8sxoDqeOp2ZAtaIedHNu87r4QTuOIMXWoKA%3D&reserved=0 > > I don?t personally do business with UBER or Lyft ? I feel like the ?gig economy? is just another way people are getting ripped off and don?t want to be a part of it. 
> >> On Nov 1, 2018, at 2:45 PM, Oesterlin, Robert wrote: >> >> Anyone in the Dallas area/familiar that can suggest the best option here: Van shuttle, Train, Uber/Lyft? >> >> Bob Oesterlin >> Sr Principal Storage Engineer, Nuance > > -- > ____ > || \\UTGERS, |---------------------------*O*--------------------------- > ||_// the State | Ryan Novosielski - novosirj at rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark > `' > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbabbott%40rutgers.edu%7Ce04f2c06af1440bdd05e08d6407132df%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636767252267150866&sdata=CsnXrY0YwZAdQbuJ43GgH9P%2BEKQcWFm6xkg7jX5ySmE%3D&reserved=0 From chris.schlipalius at pawsey.org.au Fri Nov 2 09:37:44 2018 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Fri, 2 Nov 2018 17:37:44 +0800 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas Message-ID: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> Hi all, so I?ve used Super Shuttle booked online for both New Orleans SC round trip and Austin SC just to the hotel, travelling solo and a Sheraton hotel shuttle back to the airport (as a solo travel option, Super is a good price). In Austin for SC my boss actually took the bus to his hotel! For SC18 my colleagues and I will prob pre-book a van transfer as there?s a few of us. Some of the Aussie IBM staff are hiring a car to get to their hotel, so if theres a few who can share, that?s also a good share option if you can park or drop the rental car at or near your hotel. Regards, Chris > On 2 Nov 2018, at 4:02 pm, gpfsug-discuss-request at spectrumscale.org wrote: > > Re: Slightly OT: Getting from DFW to SC17 hotels in Dallas From mark.fellows at stfc.ac.uk Fri Nov 2 11:45:58 2018 From: mark.fellows at stfc.ac.uk (Mark Fellows - UKRI STFC) Date: Fri, 2 Nov 2018 11:45:58 +0000 Subject: [gpfsug-discuss] Hello Message-ID: Hi all, Just introducing myself as a new subscriber to the mailing list. I work at the Hartree Centre within the Science and Technology Facilities Council near Warrington, UK. Our role is to work with industry to promote the use of high performance technologies and data analytics to solve problems and deliver gains in productivity. We also support academic researchers in UK based and international science. We have Spectrum Scale installations on linux (x86 for data storage/Power for HPC clusters) and I've recently been involved with deploying and upgrading some small ESS systems. As a relatively new user of SS I may initially have more questions than answers but hope to be able to exchange some thoughts and ideas within the group. Best regards, Mark Mark Fellows HPC Systems Administrator Platforms and Infrastructure Group Telephone - 01925 603413 | Email - mark.fellows at stfc.ac.uk Hartree Centre, Science & Technology Facilities Council Daresbury Laboratory, Keckwick Lane, Daresbury, Warrington, WA4 4AD, UK -------------- next part -------------- An HTML attachment was scrubbed... URL: From damir.krstic at gmail.com Fri Nov 2 15:55:27 2018 From: damir.krstic at gmail.com (Damir Krstic) Date: Fri, 2 Nov 2018 10:55:27 -0500 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. 
Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> Message-ID: Hi, Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: -V 15.01 (4.2.0.0) File system version I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. Damir On Tue, May 22, 2018 at 10:43 AM wrote: > Hello All, > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month > ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That > is we have not run the mmchconfig release=LATEST command) > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. > All the ?ps hangs? happen when jobs run related to a Java process or many > Java threads (example: GATK ) > The hangs are pretty random, and have no particular pattern except that we > know that it is related to just Java or some jobs reading from directories > with about 600000 files. > > I have raised an IBM critical service request about a month ago related to > this - PMR: 24090,L6Q,000. > However, According to the ticket - they seemed to feel that it might not > be related to GPFS. > Although, we are sure that these hangs started to appear only after we > upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > One of the other reasons we are not able to prove that it is GPFS is > because, we are unable to capture any logs/traces from GPFS once the hang > happens. > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting > difficult to get any dumps from GPFS. > > Also - According to the IBM ticket, they seemed to have a seen a ?ps > hang" issue and we have to run mmchconfig release=LATEST command, and that > will resolve the issue. > However we are not comfortable making the permanent change to Filesystem > version 5. and since we don?t see any near solution to these hangs - we are > thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know > the cluster was stable. > > Can downgrading GPFS take us back to exactly the previous GPFS config > state? > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i > reinstall all rpms to a previous version? or is there anything else that i > need to make sure with respect to GPFS configuration? > Because i think that GPFS 5.0 might have updated internal default GPFS > configuration parameters , and i am not sure if downgrading GPFS will > change them back to what they were in GPFS 4.2.3.2 > > Our previous state: > > 2 Storage clusters - 4.2.3.2 > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage clusters ) > > Our current state: > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > 1 Compute cluster - 5.0.0.2 > > Do i need to downgrade all the clusters to go to the previous state ? or > is it ok if we just downgrade the compute cluster to previous version? > > Any advice on the best steps forward, would greatly help. > > Thanks, > > Lohit > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From luke.raimbach at googlemail.com Fri Nov 2 16:24:07 2018 From: luke.raimbach at googlemail.com (Luke Raimbach) Date: Fri, 2 Nov 2018 16:24:07 +0000 Subject: [gpfsug-discuss] RFE: Inode Expansion Message-ID: Dear Spectrum Scale Experts, I would really like to have a callback made available for the file system manager executing an Inode Expansion event. You know, with all the nice variables output, etc. Kind Regards, Luke Raimbach -------------- next part -------------- An HTML attachment was scrubbed... URL: From sveta at cbio.mskcc.org Fri Nov 2 16:09:35 2018 From: sveta at cbio.mskcc.org (Mazurkova, Svetlana/Information Systems) Date: Fri, 2 Nov 2018 12:09:35 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> Message-ID: <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Hi Damir, It was related to specific user jobs and mmap (?). We opened PMR with IBM and have patch from IBM, since than we don?t see issue. Regards, Sveta. > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > Hi, > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > -V 15.01 (4.2.0.0) File system version > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > Damir > > On Tue, May 22, 2018 at 10:43 AM > wrote: > Hello All, > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > I have raised an IBM critical service request about a month ago related to this - PMR: 24090,L6Q,000. > However, According to the ticket - they seemed to feel that it might not be related to GPFS. > Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > Also - According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run mmchconfig release=LATEST command, and that will resolve the issue. > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? 
or is there anything else that i need to make sure with respect to GPFS configuration? > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > Our previous state: > > 2 Storage clusters - 4.2.3.2 > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage clusters ) > > Our current state: > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > 1 Compute cluster - 5.0.0.2 > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > Any advice on the best steps forward, would greatly help. > > Thanks, > > Lohit > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Nov 2 16:29:19 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 2 Nov 2018 12:29:19 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Message-ID: Yes, We have upgraded to 5.0.1-0.5, which has the patch for the issue. The related IBM case number was :?TS001010674 Regards, Lohit On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems , wrote: > Hi Damir, > > It was related to specific user jobs and mmap (?). We opened PMR with IBM and have patch from IBM, since than we don?t see issue. > > Regards, > > Sveta. > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > Hi, > > > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > > ?-V ? ? ? ? ? ? ? ? 15.01 (4.2.0.0) ? ? ? ? ?File system version > > > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > > > Damir > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > Hello All, > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > > > > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > > > > > > > I have raised an IBM critical service request about a month ago related to this -?PMR: 24090,L6Q,000. > > > > However, According to the ticket ?- they seemed to feel that it might not be related to GPFS. 
> > > > Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > > > > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > > > > > > > Also ?- According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run ?mmchconfig release=LATEST command, and that will resolve the issue. > > > > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > > > > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > > > > > > > Our previous state: > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > 1 Compute cluster - 4.2.3.2 ?( remote mounts the above 2 storage clusters ) > > > > > > > > Our current state: > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > > > > > > > Any advice on the best steps forward, would greatly help. > > > > > > > > Thanks, > > > > > > > > Lohit > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Nov 2 16:31:12 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 2 Nov 2018 12:31:12 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Message-ID: <5469f6aa-3f82-47b2-8b82-a599edfa2f16@Spark> Also - You could just upgrade one of the clients to this version, and test to see if the hang still occurs. You do not have to upgrade the NSD servers, to test. Regards, Lohit On Nov 2, 2018, 12:29 PM -0400, valleru at cbio.mskcc.org, wrote: > Yes, > > We have upgraded to 5.0.1-0.5, which has the patch for the issue. > The related IBM case number was :?TS001010674 > > Regards, > Lohit > > On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems , wrote: > > Hi Damir, > > > > It was related to specific user jobs and mmap (?). 
We opened PMR with IBM and have patch from IBM, since than we don?t see issue. > > > > Regards, > > > > Sveta. > > > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > > > Hi, > > > > > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > > > ?-V ? ? ? ? ? ? ? ? 15.01 (4.2.0.0) ? ? ? ? ?File system version > > > > > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > > > > > Damir > > > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > > Hello All, > > > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > > > > > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > > > > > > > > > I have raised an IBM critical service request about a month ago related to this -?PMR: 24090,L6Q,000. > > > > > However, According to the ticket ?- they seemed to feel that it might not be related to GPFS. > > > > > Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > > > > > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > > > > > > > > > Also ?- According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run ?mmchconfig release=LATEST command, and that will resolve the issue. > > > > > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > > > > > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > > > > > > > > > Our previous state: > > > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > > 1 Compute cluster - 4.2.3.2 ?( remote mounts the above 2 storage clusters ) > > > > > > > > > > Our current state: > > > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? 
> > > > > > > > > > Any advice on the best steps forward, would greatly help. > > > > > > > > > > Thanks, > > > > > > > > > > Lohit > > > > > _______________________________________________ > > > > > gpfsug-discuss mailing list > > > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Sat Nov 3 20:21:50 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Sat, 3 Nov 2018 20:21:50 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas In-Reply-To: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> References: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> Message-ID: <09CC2A72-2D2C-4722-87CB-A4B1093D90BC@rutgers.edu> I took the bus back to the airport in Austin (the Airport Flyer). Was a good experience. If Austin is the city I?m thinking of, I took SuperShuttle to the hotel (I believe because I arrived late at night) and was the fourth hotel that got dropped off, which roughly doubled the trip time. There is that risk with the shared-ride shuttles. In recent years, the only location without a solid public transit option was New Orleans (I used it anyway). They have an express airport bus, but the hours and frequency are not ideal (there?s a local bus as well, which is quite a bit slower). SLC had good light rail service, Denver has good rail service from the airport to downtown, and Atlanta has good subway service (all of these I?ve used before). Typically the transit option is less than $10 round-trip (Denver?s is above-average at $9 each way), sometimes even less than $5. > On Nov 2, 2018, at 5:37 AM, Chris Schlipalius wrote: > > Hi all, so I?ve used Super Shuttle booked online for both New Orleans SC round trip and Austin SC just to the hotel, travelling solo and a Sheraton hotel shuttle back to the airport (as a solo travel option, Super is a good price). > In Austin for SC my boss actually took the bus to his hotel! > > For SC18 my colleagues and I will prob pre-book a van transfer as there?s a few of us. > Some of the Aussie IBM staff are hiring a car to get to their hotel, so if theres a few who can share, that?s also a good share option if you can park or drop the rental car at or near your hotel. > > Regards, Chris > >> On 2 Nov 2018, at 4:02 pm, gpfsug-discuss-request at spectrumscale.org wrote: >> >> Re: Slightly OT: Getting from DFW to SC17 hotels in Dallas > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From henrik.cednert at filmlance.se Tue Nov 6 06:23:44 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 06:23:44 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Message-ID: Hi there For some reason my mail didn?t get through. Trying again. Apologies if there's duplicates... The welcome mail said that a brief introduction might be in order. 
So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... 
Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From henrik.cednert at filmlance.se Mon Nov 5 20:25:13 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Mon, 5 Nov 2018 20:25:13 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Message-ID: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. 
Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Tue Nov 6 09:35:27 2018 From: viccornell at gmail.com (Vic Cornell) Date: Tue, 6 Nov 2018 09:35:27 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> References: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> Hi Cedric, Welcome to the mailing list! 
Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell > On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) wrote: > > Hi there > > The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. > > But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. > > I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. > > I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( > > Versions involved: > Windows 10 Enterprise 2016 LTSB > IBM GPFS Express Edition 4.1.0.4 > IBM GPFS Express Edition License and Prerequisites 4.1 > IBM GPFS GSKit 8.0.0.32 > > Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. > > Cheers and many thanks in advance for helping me out here. I?m all ears. > > > root at M5-CLIPSTER02 ~ > $ cat /var/adm/ras/mmfs.log.latest > Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting > Removing old /var/adm/ras/mmfs.log.* files: > mmtrace: The tracefmt.exe or tracelog.exe command can not be found. > mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. > Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. > Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license > Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... > Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... > Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... > Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. > Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. > Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... 
> Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... > Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... > Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... > Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... > Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... > Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... > Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... > Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... > Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... > Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... > Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... > Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... > Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... > Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... > Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs > Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs > Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. > Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready > Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all > Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 > > > > > -- > Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance > Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? > > Disclaimer > > > The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Tue Nov 6 10:08:03 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Tue, 6 Nov 2018 10:08:03 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: An HTML attachment was scrubbed... URL: From henrik.cednert at filmlance.se Mon Nov 5 20:21:10 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Mon, 5 Nov 2018 20:21:10 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Message-ID: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? 
Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. 
Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lgayne at us.ibm.com Tue Nov 6 13:46:34 2018 From: lgayne at us.ibm.com (Lyle Gayne) Date: Tue, 6 Nov 2018 08:46:34 -0500 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: Message-ID: Vipul or Heather should be able to assist. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/06/2018 07:00 AM Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. 
ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ? \_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 6 13:52:03 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 6 Nov 2018 13:52:03 +0000 Subject: [gpfsug-discuss] Spectrum Scale and Firewalls In-Reply-To: References: <10239ED8-7E0D-4420-8BEC-F17F0606BE64@bham.ac.uk> Message-ID: Just to close the loop on this, IBM support confirmed it?s a bug in mmnetverify and will be fixed in a later PTF. 
(I didn?t feel the need for an EFIX for this) Simon From: on behalf of Simon Thompson Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 19 October 2018 at 14:39 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Spectrum Scale and Firewalls Yeah we have the perfmon ports open, and GUI ports open on the GUI nodes. But basically this is just a storage cluster and everything else (protocols etc) run in remote clusters. I?ve just opened a ticket ? no longer a PMR in the new support centre for Scale Simon From: on behalf of "knop at us.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 19 October 2018 at 14:05 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Spectrum Scale and Firewalls Simon, Depending on what functions are being used in Scale, other ports may also get used, as documented in https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_firewall.htm On the other hand, I'd initially speculate that you might be hitting a problem in mmnetverify itself. (perhaps some aspect in mmnetverify is not taking into account that ports other than 22, 1191, 60000-61000 may be getting blocked by the firewall) Could you open a PMR for this one? Thanks, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Simon Thompson ---10/19/2018 06:41:27 AM---Hi, We?re having some issues bringing up firewalls on som]Simon Thompson ---10/19/2018 06:41:27 AM---Hi, We?re having some issues bringing up firewalls on some of our NSD nodes. The problem I was actua From: Simon Thompson To: "gpfsug-discuss at spectrumscale.org" Date: 10/19/2018 06:41 AM Subject: [gpfsug-discuss] Spectrum Scale and Firewalls Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, We?re having some issues bringing up firewalls on some of our NSD nodes. The problem I was actually trying to diagnose I don?t think is firewall related but still ? We have port 22 and 1191 open and also 60000-61000, we also set: # mmlsconfig tscTcpPort tscTcpPort 1191 # mmlsconfig tscCmdPortRange tscCmdPortRange 60000-61000 https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_firewallforinternalcommn.htm Claims this is sufficient ? Running mmnetverify: # mmnetverify all --target-nodes rds-er-mgr01 rds-pg-mgr01 checking local configuration. Operation interface: Success. rds-pg-mgr01 checking communication with node rds-er-mgr01. Operation resolution: Success. Operation ping: Success. Operation shell: Success. Operation copy: Success. Operation time: Success. Operation daemon-port: Success. Operation sdrserv-port: Success. Operation tsccmd-port: Success. Operation data-small: Success. Operation data-medium: Success. Operation data-large: Success. Could not connect to port 46326 on node rds-pg-mgr01 (10.20.0.56): timed out. This may indicate a firewall configuration issue. Operation bandwidth-node: Fail. rds-pg-mgr01 checking cluster communications. Issues Found: rds-er-mgr01 could not connect to rds-pg-mgr01 (TCP, port 46326). mmnetverify: Command failed. Examine previous error messages to determine cause. Note that the port number mentioned changes if we run mmnetverify multiple times. The two clients in this test are running 5.0.2 code. If I run in verbose mode I see: Checking network communication with node rds-er-mgr01. 
Port range restricted by cluster configuration: 60000 - 61000. rds-er-mgr01: connecting to node rds-pg-mgr01. rds-er-mgr01: exchanged 256.0M bytes with rds-pg-mgr01. Write size: 16.0M bytes. Network statistics for rds-er-mgr01 during data exchange: packets sent: 68112 packets received: 72452 Network Traffic between rds-er-mgr01 and rds-pg-mgr01 port 60000 ok. Operation data-large: Success. Checking network bandwidth. rds-er-mgr01: connecting to node rds-pg-mgr01. Could not connect to port 36277 on node rds-pg-mgr01 (10.20.0.56): timed out. This may indicate a firewall configuration issue. Operation bandwidth-node: Fail. So for many of the tests it looks like its using port 60000 as expected, is this just a bug in mmnetverify or am I doing something silly? Thanks Simon_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 107 bytes Desc: image001.gif URL: From henrik.cednert at filmlance.se Tue Nov 6 11:25:57 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 11:25:57 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? On 6 Nov 2018, at 11:08, Daniel Kidger > wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. 
(I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) > wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... 
Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From lgayne at us.ibm.com Tue Nov 6 14:02:48 2018 From: lgayne at us.ibm.com (Lyle Gayne) Date: Tue, 6 Nov 2018 09:02:48 -0500 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com><45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> Message-ID: Yes, Henrik. For information on which OS levels are supported at which Spectrum Scale release levels, you should always consult our Spectrum Scale FAQ. This info is in Section 2 or 3 of the FAQ. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: gpfsug main discussion list Date: 11/06/2018 09:00 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ? \_(?)_/? On 6 Nov 2018, at 11:08, Daniel Kidger wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 
19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) < henrik.cednert at filmlance.se> wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. 
Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From henrik.cednert at filmlance.se Tue Nov 6 14:12:27 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 14:12:27 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com><45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se>, Message-ID: Hello Ah yes, I never thought I was an issue since DDN sent me the v4 installer. Now I know better. Cheers -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? On 6 Nov 2018, at 15:02, Lyle Gayne > wrote: Yes, Henrik. For information on which OS levels are supported at which Spectrum Scale release levels, you should always consult our Spectrum Scale FAQ. This info is in Section 2 or 3 of the FAQ. Thanks, Lyle "Henrik Cednert (Filmlance)" ---11/06/2018 09:00:15 AM---Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to From: "Henrik Cednert (Filmlance)" > To: gpfsug main discussion list > Date: 11/06/2018 09:00 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? On 6 Nov 2018, at 11:08, Daniel Kidger > wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. 
At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) > wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... 
Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From vpaul at us.ibm.com Tue Nov 6 16:54:38 2018 From: vpaul at us.ibm.com (Vipul Paul) Date: Tue, 6 Nov 2018 08:54:38 -0800 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: Message-ID: Hello Henrik, I see that you are trying GPFS 4.1.0.4 on Windows 10. This will not work. You need to upgrade to GPFS 5.0.2 as that is the first release that supports Windows 10. Please see the FAQ https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows "Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. 
At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Thanks. -- Vipul Paul | IBM Spectrum Scale (GPFS) Development | vpaul at us.ibm.com | (503) 747-1389 (tie 997) From: Lyle Gayne/Poughkeepsie/IBM To: gpfsug main discussion list Cc: Vipul Paul/Portland/IBM, Heather J MacPherson/Beaverton/IBM at IBMUS Date: 11/06/2018 05:46 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Vipul or Heather should be able to assist. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/06/2018 07:00 AM Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. 
Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_( ?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From henrik.cednert at filmlance.se Wed Nov 7 06:31:45 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Wed, 7 Nov 2018 06:31:45 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable Message-ID: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Hello I will try my luck here. Trying to monitor capacity on our gpfs system via observium. For some reason hrStorageTable doesn?t pick up that gpfs mount point though. In diskTable it?s visible but I cannot use diskTable when monitoring via observium, has to be hrStorageTable (I was told by observium dev). Output of a few snmpwalks and more at the bottom. Are there any obvious reasons for Centos 6.7 to not pick up a gpfs mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not sure if it?s even possible to in some way force it to include it in hrStorageTable?? Apologies if this isn?t the list for questions like this. But feels like there has to be one or two peeps here monitoring their systems here. 
=) All these commands ran on that host: df -h | grep ddnnas0 /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 mount | grep ddnnas0 /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) snmpwalk -v2c -c secret localhost hrStorageDescr HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy snmpwalk -v2c -c secret localhost dskPath UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 UCD-SNMP-MIB::dskPath.2 = STRING: / yum list | grep net-snmp Failed to set locale, defaulting to C net-snmp.x86_64 1:5.5-60.el6 @base net-snmp-libs.x86_64 1:5.5-60.el6 @base net-snmp-perl.x86_64 1:5.5-60.el6 @base net-snmp-utils.x86_64 1:5.5-60.el6 @base net-snmp-devel.i686 1:5.5-60.el6 base net-snmp-devel.x86_64 1:5.5-60.el6 base net-snmp-libs.i686 1:5.5-60.el6 base net-snmp-python.x86_64 1:5.5-60.el6 base Cheers and thanks -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Wed Nov 7 08:13:04 2018 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 7 Nov 2018 08:13:04 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: An HTML attachment was scrubbed... URL: From janfrode at tanso.net Wed Nov 7 11:20:37 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 7 Nov 2018 12:20:37 +0100 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: Looking at the CHANGELOG for net-snmp, it seems it needs to know about each filesystem it's going to support, and I see no GPFS/mmfs. It has entries like: - Added simfs (OpenVZ filesystem) to hrStorageTable and hrFSTable. - Added CVFS (CentraVision File System) to hrStorageTable and - Added OCFS2 (Oracle Cluster FS) to hrStorageTable and hrFSTable - report gfs filesystems in hrStorageTable and hrFSTable. and also it didn't understand filesystems larger than 8 TB before version 5.7. 
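A quick way to confirm this on the host itself (a sketch, not from the original thread -- it only uses stock Linux/net-snmp tools plus the community string from the snmpwalk output above):

# The filesystem type string snmpd has to recognize is the third field -- "gpfs"
grep gpfs /proc/mounts

# Installed net-snmp build; 5.5 also predates the >8 TB hrStorageTable fix in 5.7
rpm -q net-snmp

# What the agent currently reports as filesystems -- /ddnnas0 will be missing here
snmptable -v2c -c secret localhost HOST-RESOURCES-MIB::hrFSTable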
I think your best option is to look at implementing the GPFS snmp agent agent https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_snmp.htm -- and see if it provides the data you need -- but it most likely won't affect the hrStorage table. And of course, please upgrade to something newer than v4.1.x. There's lots of improvements on monitoring in v4.2.3 and v5.x (but beware that v5 doesn't work with RHEL6). -jf On Wed, Nov 7, 2018 at 9:05 AM Henrik Cednert (Filmlance) < henrik.cednert at filmlance.se> wrote: > Hello > > I will try my luck here. Trying to monitor capacity on our gpfs system via > observium. For some reason hrStorageTable doesn?t pick up that gpfs mount > point though. In diskTable it?s visible but I cannot use diskTable when > monitoring via observium, has to be hrStorageTable (I was told by observium > dev). Output of a few snmpwalks and more at the bottom. > > Are there any obvious reasons for Centos 6.7 to not pick up a gpfs > mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not > sure if it?s even possible to in some way force it to include it in > hrStorageTable?? > > Apologies if this isn?t the list for questions like this. But feels like > there has to be one or two peeps here monitoring their systems here. =) > > > All these commands ran on that host: > > df -h | grep ddnnas0 > /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 > > > mount | grep ddnnas0 > /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) > > > snmpwalk -v2c -c secret localhost hrStorageDescr > HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory > HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory > HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers > HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory > HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space > HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / > HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm > HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot > HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy > HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash > HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy > HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var > HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy > > > snmpwalk -v2c -c secret localhost dskPath > UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 > UCD-SNMP-MIB::dskPath.2 = STRING: / > > > yum list | grep net-snmp > Failed to set locale, defaulting to C > net-snmp.x86_64 1:5.5-60.el6 > @base > net-snmp-libs.x86_64 1:5.5-60.el6 > @base > net-snmp-perl.x86_64 1:5.5-60.el6 > @base > net-snmp-utils.x86_64 1:5.5-60.el6 > @base > net-snmp-devel.i686 1:5.5-60.el6 > base > net-snmp-devel.x86_64 1:5.5-60.el6 > base > net-snmp-libs.i686 1:5.5-60.el6 > base > net-snmp-python.x86_64 1:5.5-60.el6 > base > > > Cheers and thanks > > -- > Henrik Cednert */ * + 46 704 71 89 54 */* CTO */ * *Filmlance* > Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. > ?\_(?)_/? > > *Disclaimer* > > The information contained in this communication from the sender is > confidential. It is intended solely for use by the recipient and others > authorized to receive it. If you are not the recipient, you are hereby > notified that any disclosure, copying, distribution or taking action in > relation of the contents of this information is strictly prohibited and may > be unlawful. 
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Wed Nov 7 11:29:11 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 7 Nov 2018 12:29:11 +0100 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: Looks like this is all it should take to add GPFS support to net-snmp: $ git diff diff --git a/agent/mibgroup/hardware/fsys/fsys_mntent.c b/agent/mibgroup/hardware/fsys/fsys_mntent.c index 62e2953..4950879 100644 --- a/agent/mibgroup/hardware/fsys/fsys_mntent.c +++ b/agent/mibgroup/hardware/fsys/fsys_mntent.c @@ -136,6 +136,7 @@ _fsys_type( char *typename ) else if ( !strcmp(typename, MNTTYPE_TMPFS) || !strcmp(typename, MNTTYPE_GFS) || !strcmp(typename, MNTTYPE_GFS2) || + !strcmp(typename, MNTTYPE_GPFS) || !strcmp(typename, MNTTYPE_XFS) || !strcmp(typename, MNTTYPE_JFS) || !strcmp(typename, MNTTYPE_VXFS) || diff --git a/agent/mibgroup/hardware/fsys/mnttypes.h b/agent/mibgroup/hardware/fsys/mnttypes.h index bb1b401..d3f0c60 100644 --- a/agent/mibgroup/hardware/fsys/mnttypes.h +++ b/agent/mibgroup/hardware/fsys/mnttypes.h @@ -121,6 +121,9 @@ #ifndef MNTTYPE_GFS2 #define MNTTYPE_GFS2 "gfs2" #endif +#ifndef MNTTYPE_GPFS +#define MNTTYPE_GPFS "gpfs" +#endif #ifndef MNTTYPE_XFS #define MNTTYPE_XFS "xfs" #endif On Wed, Nov 7, 2018 at 12:20 PM Jan-Frode Myklebust wrote: > Looking at the CHANGELOG for net-snmp, it seems it needs to know about > each filesystem it's going to support, and I see no GPFS/mmfs. It has > entries like: > > - Added simfs (OpenVZ filesystem) to hrStorageTable and hrFSTable. > - Added CVFS (CentraVision File System) to hrStorageTable and > - Added OCFS2 (Oracle Cluster FS) to hrStorageTable and hrFSTable > - report gfs filesystems in hrStorageTable and hrFSTable. > > > and also it didn't understand filesystems larger than 8 TB before version > 5.7. > > I think your best option is to look at implementing the GPFS snmp agent > agent > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_snmp.htm > -- and see if it provides the data you need -- but it most likely won't > affect the hrStorage table. > > And of course, please upgrade to something newer than v4.1.x. There's lots > of improvements on monitoring in v4.2.3 and v5.x (but beware that v5 > doesn't work with RHEL6). > > > -jf > > On Wed, Nov 7, 2018 at 9:05 AM Henrik Cednert (Filmlance) < > henrik.cednert at filmlance.se> wrote: > >> Hello >> >> I will try my luck here. Trying to monitor capacity on our gpfs system >> via observium. For some reason hrStorageTable doesn?t pick up that gpfs >> mount point though. In diskTable it?s visible but I cannot use diskTable >> when monitoring via observium, has to be hrStorageTable (I was told by >> observium dev). Output of a few snmpwalks and more at the bottom. >> >> Are there any obvious reasons for Centos 6.7 to not pick up a gpfs >> mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not >> sure if it?s even possible to in some way force it to include it in >> hrStorageTable?? >> >> Apologies if this isn?t the list for questions like this. But feels like >> there has to be one or two peeps here monitoring their systems here. 
=) >> >> >> All these commands ran on that host: >> >> df -h | grep ddnnas0 >> /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 >> >> >> mount | grep ddnnas0 >> /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) >> >> >> snmpwalk -v2c -c secret localhost hrStorageDescr >> HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory >> HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory >> HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers >> HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory >> HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space >> HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / >> HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm >> HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot >> HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy >> HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash >> HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy >> HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var >> HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy >> >> >> snmpwalk -v2c -c secret localhost dskPath >> UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 >> UCD-SNMP-MIB::dskPath.2 = STRING: / >> >> >> yum list | grep net-snmp >> Failed to set locale, defaulting to C >> net-snmp.x86_64 1:5.5-60.el6 >> @base >> net-snmp-libs.x86_64 1:5.5-60.el6 >> @base >> net-snmp-perl.x86_64 1:5.5-60.el6 >> @base >> net-snmp-utils.x86_64 1:5.5-60.el6 >> @base >> net-snmp-devel.i686 1:5.5-60.el6 >> base >> net-snmp-devel.x86_64 1:5.5-60.el6 >> base >> net-snmp-libs.i686 1:5.5-60.el6 >> base >> net-snmp-python.x86_64 1:5.5-60.el6 >> base >> >> >> Cheers and thanks >> >> -- >> Henrik Cednert */ * + 46 704 71 89 54 */* CTO */ * *Filmlance* >> Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. >> ?\_(?)_/? >> >> *Disclaimer* >> >> The information contained in this communication from the sender is >> confidential. It is intended solely for use by the recipient and others >> authorized to receive it. If you are not the recipient, you are hereby >> notified that any disclosure, copying, distribution or taking action in >> relation of the contents of this information is strictly prohibited and may >> be unlawful. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Nov 7 13:02:32 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 7 Nov 2018 13:02:32 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: <5da641cc-a171-2d9b-f917-a4470279237f@strath.ac.uk> On 07/11/2018 11:20, Jan-Frode Myklebust wrote: [SNIP] > > And of course, please upgrade to something newer than v4.1.x. There's > lots of improvements on monitoring in v4.2.3 and v5.x (but beware that > v5 doesn't work with RHEL6). > I would suggest that getting off CentOS 6.7 to more recent release should also be a priority. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From aaron.s.knister at nasa.gov Wed Nov 7 23:37:37 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 7 Nov 2018 18:37:37 -0500 Subject: [gpfsug-discuss] Unexpected data in message/Bad message Message-ID: We're experiencing client nodes falling out of the cluster with errors that look like this: ?Tue Nov 6 15:10:34.939 2018: [E] Unexpected data in message. Header dump: 00000000 0000 0000 00000047 00000000 00 00 0000 00000000 00000000 0000 0000 Tue Nov 6 15:10:34.942 2018: [E] [0/0] 512 more bytes were available: Tue Nov 6 15:10:34.965 2018: [N] Close connection to 10.100.X.X nsdserver1 (Unexpected error 120) Tue Nov 6 15:10:34.966 2018: [E] Network error on 10.100.X.X nsdserver1 , Check connectivity Tue Nov 6 15:10:36.726 2018: [N] Restarting mmsdrserv Tue Nov 6 15:10:38.850 2018: [E] Bad message Tue Nov 6 15:10:38.851 2018: [X] The mmfs daemon is shutting down abnormally. Tue Nov 6 15:10:38.852 2018: [N] mmfsd is shutting down. Tue Nov 6 15:10:38.853 2018: [N] Reason for shutdown: LOGSHUTDOWN called The cluster is running various PTF Levels of 4.1.1. Has anyone seen this before? I'm struggling to understand what it means from a technical point of view. Was GPFS expecting a larger message than it received? Did it receive all of the bytes it expected and some of it was corrupt? It says "512 more bytes were available" but then doesn't show any additional bytes. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Robert.Oesterlin at nuance.com Thu Nov 8 20:40:05 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 8 Nov 2018 20:40:05 +0000 Subject: [gpfsug-discuss] SSUG @ SC18 - Location details Message-ID: <07491620-1D44-4A9C-9C92-A7DA634304CE@nuance.com> Location: Omni Dallas Hotel 555 S Lamar Dallas, Texas 75202 United States The Omni is connected to Kay Bailey Convention Center via skybridge on 2nd Floor. Dallas Ballroom A - 3rd Floor IBM Spectrum Scale User Group Meeting Sunday, November 11, 2018 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From heiner.billich at psi.ch Fri Nov 9 12:46:31 2018 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Fri, 9 Nov 2018 12:46:31 +0000 Subject: [gpfsug-discuss] CES - samba - how can I disable shadow_copy2, i.e. snapshots Message-ID: Hello, we run CES with smbd on a filesystem _without_ snapshots. I would like to completely remove the shadow_copy2 vfs object in samba which exposes the snapshots to windows clients: We don't offer snapshots as service to clients and if I create a snapshot I don't want it to be exposed to clients. I'm also not sure how much additional directory traversals this vfs object causes, shadow_copy2 has to search for the snapshot directories again and again, just to learn that there are no snapshots available. Now the file samba_registry.def (/usr/lpp/mmfs/share/samba/samba_registry.def) doesn't allow to change the settings for shadow_config2 in samba's configuration. Hm, is it o.k. to edit samba_registry.def? That's probably not what IBM intended. But with mmsnapdir I can change the name of the snapshot directories, which would require me to edit the locked settings, too, so it seems a bit restrictive. I didn?t search all documentation, if there is an option do disable shadow_copy2 with some command I would be happy to learn. Any comments or ideas are welcome. 
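Before touching samba_registry.def, something like the following should show the current state (a sketch only: fs0 is a placeholder device name, and the commands assume the standard GPFS/CES admin tools):

# Confirm the exported file system really has no snapshots
mmlssnapshot fs0

# Show the snapshot directory name shadow_copy2 would probe for
mmsnapdir fs0 -q

# Dump the SMB options CES actually applies, including the locked "vfs objects" line
mmsmb config list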
Also if you think I should just create a bogus .snapdirs at root level to get rid of the error messages and that's it, please let me know. we run scale 5.0.1-1 on RHEL4 x86_64. We will upgrade to 5.0.2-1 soon, but I didn?t' check that version yet. Cheers Heiner Billich What I would like to change in samba's configuration: 52c52 < vfs objects = syncops gpfs fileid time_audit --- > vfs objects = shadow_copy2 syncops gpfs fileid time_audit 72a73,76 > shadow:snapdir = .snapshots > shadow:fixinodes = yes > shadow:snapdirseverywhere = yes > shadow:sort = desc -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch From jonathan.buzzard at strath.ac.uk Fri Nov 9 13:26:50 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 9 Nov 2018 13:26:50 +0000 Subject: [gpfsug-discuss] CES - samba - how can I disable shadow_copy2, i.e. snapshots In-Reply-To: References: Message-ID: <82e1aa3d-566a-3a4a-e841-9f92f30546c6@strath.ac.uk> On 09/11/2018 12:46, Billich Heinrich Rainer (PSI) wrote: > Hello, > > we run CES with smbd on a filesystem _without_ snapshots. I would > like to completely remove the shadow_copy2 vfs object in samba which > exposes the snapshots to windows clients: > > We don't offer snapshots as service to clients and if I create a > snapshot I don't want it to be exposed to clients. I'm also not sure > how much additional directory traversals this vfs object causes, > shadow_copy2 has to search for the snapshot directories again and > again, just to learn that there are no snapshots available. > The shadow_copy2 VFS only exposes snapshots to clients if they are in a very specific format. The chances of you doing this with "management" snapshots you are creating are about ?, assuming you are using the command line. If you are using the GUI then all bets are off. Perhaps someone with experience of the GUI can add their wisdom here. The VFS even if loaded will only create I/O on the server if the client clicks on previous versions tab in Windows Explorer. Given that you don't offer previous version snapshots, then there will be very little of this going on and even if they do then the initial amount of I/O will be limited to basically the equivalent of an `ls` in the shadow copy snapshot directory. So absolutely nothing to get worked up about. With the proviso about doing snapshots from the GUI (never used the new GUI in GPFS, only played with the old one, and don't trust IBM to change it again) you are completely overthinking this. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Robert.Oesterlin at nuance.com Fri Nov 9 14:07:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 9 Nov 2018 14:07:01 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Message-ID: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. 
Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Nov 9 14:13:19 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 9 Nov 2018 09:13:19 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- Stephen > On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: > > Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. > > ESS 5.3.1.1 > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 17.0 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Fri Nov 9 16:02:12 2018 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 9 Nov 2018 11:02:12 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 ("4.2.3" may not get recognized by the command) The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Stephen Ulmer To: gpfsug main discussion list Date: 11/09/2018 09:13 AM Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Sent by: gpfsug-discuss-bounces at spectrumscale.org It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- Stephen On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. 
ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From ulmer at ulmer.org Fri Nov 9 16:08:17 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 9 Nov 2018 11:08:17 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: <8EDF9168-698B-4EA7-9D2A-F33D7B8AF265@ulmer.org> You rock. -- Stephen > On Nov 9, 2018, at 11:02 AM, Felipe Knop > wrote: > > Stephen, Bob, > > A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: > 4.2.3.0 and 4.2.3.9 > > ("4.2.3" may not get recognized by the command) > > The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) > > Felipe > > ---- > Felipe Knop knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 T/L 293-9314 > > > > Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 11/09/2018 09:13 AM > Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > It had better work ? I?m literally going to be doing exactly the same thing in two weeks? > > -- > Stephen > > > On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: > > Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. > > ESS 5.3.1.1 > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 > mmvdisk: Creating file system 'test'. 
> mmvdisk: (mmcrfs) Incorrect option: --version 17.0 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Nov 9 16:11:05 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 9 Nov 2018 16:11:05 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Message-ID: <6951D57C-3714-4F26-A7AD-92B8D79501EC@nuance.com> That did it, thanks. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Felipe Knop Reply-To: gpfsug main discussion list Date: Friday, November 9, 2018 at 10:04 AM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Nov 9 16:12:02 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 9 Nov 2018 16:12:02 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: Looking in ?mmprodname? looks like if you wanted to use 17, it would be 1700 (for 1709 based on what Felipe mentions below). (I wonder what 99.0.0.0 does ?) Simon From: on behalf of "knop at us.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 9 November 2018 at 16:02 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 ("4.2.3" may not get recognized by the command) The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exac]Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- From: Stephen Ulmer To: gpfsug main discussion list Date: 11/09/2018 09:13 AM Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ It had better work ? I?m literally going to be doing exactly the same thing in two weeks? 
-- Stephen On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 106 bytes Desc: image001.gif URL: From bzhang at ca.ibm.com Fri Nov 9 16:37:08 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Fri, 9 Nov 2018 11:37:08 -0500 Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: Hi all, We are going to host our next technical webinar. everyone is welcome to register and attend. Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? ?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B383659.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B754293.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B720798.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B162231.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B680907.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B309846.gif Type: image/gif Size: 1243 bytes Desc: not available URL: From jonbernard at gmail.com Sat Nov 10 20:37:35 2018 From: jonbernard at gmail.com (Jon Bernard) Date: Sat, 10 Nov 2018 14:37:35 -0600 Subject: [gpfsug-discuss] If you're attending KubeCon'18 In-Reply-To: References: Message-ID: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov Folks, > > Please let me know if anyone is attending KubeCon'18 in Seattle this > December (via private e-mail). We will be there and would like to meet in > person with people that already use or consider using > Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, > problems, visions. > > P.S. If you are not attending KubeCon, but are interested in the topic, > shoot me an e-mail anyway. > > Best, > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Sun Nov 11 18:07:17 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Sun, 11 Nov 2018 13:07:17 -0500 Subject: [gpfsug-discuss] Unexpected data in message/Bad message In-Reply-To: References: Message-ID: Hi Aaron, The header dump shows all zeroes were received for the header. So no valid magic, version, originator, etc. The "512 more bytes" would have been the meat after the header. Very unexpected hence the shutdown. Logs around that event involving the machines noted in that trace would be required to evaluate further. This is not common. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 11/07/2018 06:38 PM Subject: [gpfsug-discuss] Unexpected data in message/Bad message Sent by: gpfsug-discuss-bounces at spectrumscale.org We're experiencing client nodes falling out of the cluster with errors that look like this: Tue Nov 6 15:10:34.939 2018: [E] Unexpected data in message. Header dump: 00000000 0000 0000 00000047 00000000 00 00 0000 00000000 00000000 0000 0000 Tue Nov 6 15:10:34.942 2018: [E] [0/0] 512 more bytes were available: Tue Nov 6 15:10:34.965 2018: [N] Close connection to 10.100.X.X nsdserver1 (Unexpected error 120) Tue Nov 6 15:10:34.966 2018: [E] Network error on 10.100.X.X nsdserver1 , Check connectivity Tue Nov 6 15:10:36.726 2018: [N] Restarting mmsdrserv Tue Nov 6 15:10:38.850 2018: [E] Bad message Tue Nov 6 15:10:38.851 2018: [X] The mmfs daemon is shutting down abnormally. Tue Nov 6 15:10:38.852 2018: [N] mmfsd is shutting down. Tue Nov 6 15:10:38.853 2018: [N] Reason for shutdown: LOGSHUTDOWN called The cluster is running various PTF Levels of 4.1.1. Has anyone seen this before? I'm struggling to understand what it means from a technical point of view. Was GPFS expecting a larger message than it received? Did it receive all of the bytes it expected and some of it was corrupt? It says "512 more bytes were available" but then doesn't show any additional bytes. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From madhu at corehive.com Sun Nov 11 19:58:27 2018 From: madhu at corehive.com (Madhu Konidena) Date: Sun, 11 Nov 2018 14:58:27 -0500 Subject: [gpfsug-discuss] If you're attending KubeCon'18 In-Reply-To: References: Message-ID: <2a3e90be-92bd-489d-a9bc-c1f6b6eae5de@corehive.com> I will be there at both. Please stop by our booth at SC18 for a quick chat. ? Madhu Konidena Madhu at CoreHive.com? On Nov 10, 2018, at 3:37 PM, Jon Bernard wrote: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: >Hi Vasily, > >I will be at Kubecon with colleagues from Tower Research Capital (and >at >SC). We have a few hundred nodes across several Kubernetes clusters, >most >of them mounting Scale from the host. > >Jon > >On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: > >> Folks, >> >> Please let me know if anyone is attending KubeCon'18 in Seattle this >> December (via private e-mail). We will be there and would like to >meet in >> person with people that already use or consider using >> Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, >> problems, visions. >> >> P.S. If you are not attending KubeCon, but are interested in the >topic, >> shoot me an e-mail anyway. 
>>
>> Best,
>> --
>> Vasily Tarasov,
>> Research Staff Member,
>> Storage Systems Research,
>> IBM Research - Almaden
>>
>> _______________________________________________
>> gpfsug-discuss mailing list
>> gpfsug-discuss at spectrumscale.org
>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>>
>
>------------------------------------------------------------------------
>
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 102.png
Type: image/png
Size: 18340 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: sc18-campaignasset_10.png
Type: image/png
Size: 354312 bytes
Desc: not available
URL:

From heiner.billich at psi.ch Wed Nov 14 16:20:12 2018
From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI))
Date: Wed, 14 Nov 2018 16:20:12 +0000
Subject: [gpfsug-discuss] CES - suspend a node and don't start smb/nfs at mmstartup/boot
Message-ID:

Hello,

how can I prevent smb, ctdb, nfs (and object) from starting when I reboot the node or restart gpfs on a suspended ces node? Being able to do this would make updates much easier.

With

# mmces node suspend --stop

I can move all IPs to other CES nodes and stop all CES services, which also releases the ces-shared-root-directory and allows me to unmount the underlying filesystem. But after a reboot/restart only the IPs stay on the other nodes; the CES services start up again. Hm, sometimes I would very much prefer the services to stay down as long as the node is suspended and to keep the node out of the CES cluster as much as possible.

I did not try rough things like just renaming smbd, as this seems likely to create unwanted issues.

Thank you,

Cheers,

Heiner Billich
--
Paul Scherrer Institut
Heiner Billich
System Engineer Scientific Computing
Science IT / High Performance Computing
WHGA/106
Forschungsstrasse 111
5232 Villigen PSI
Switzerland

Phone +41 56 310 36 02
heiner.billich at psi.ch
https://www.psi.ch

From: on behalf of Madhu Konidena
Reply-To: gpfsug main discussion list
Date: Sunday 11 November 2018 at 22:06
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] If you're attending KubeCon'18

I will be there at both. Please stop by our booth at SC18 for a quick chat.

Madhu Konidena
[cid:ii_d4d3894a4c2f4773]
Madhu at CoreHive.com

On Nov 10, 2018, at 3:37 PM, Jon Bernard > wrote:
Hi Vasily,
I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host.
Jon
On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote:
Folks, Please let me know if anyone is attending KubeCon'18 in Seattle this December (via private e-mail). We will be there and would like to meet in person with people that already use or consider using Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, problems, visions. P.S. If you are not attending KubeCon, but are interested in the topic, shoot me an e-mail anyway.
Best, -- Vasily Tarasov, Research Staff Member, Storage Systems Research, IBM Research - Almaden _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 18341 bytes Desc: image001.png URL: From skylar2 at uw.edu Wed Nov 14 16:27:31 2018 From: skylar2 at uw.edu (Skylar Thompson) Date: Wed, 14 Nov 2018 16:27:31 +0000 Subject: [gpfsug-discuss] CES - suspend a node and don't start smb/nfs at mmstartup/boot In-Reply-To: References: Message-ID: <20181114162731.a7etjs4g3gftgsyv@utumno.gs.washington.edu> Hi Heiner, Try doing "mmces service stop -N " and/or "mmces service disable -N ". You'll definitely want the node suspended first, since I don't think the service commands do an address migration first. On Wed, Nov 14, 2018 at 04:20:12PM +0000, Billich Heinrich Rainer (PSI) wrote: > Hello, > > how can I prevent smb, ctdb, nfs (and object) to start when I reboot the node or restart gpfs on a suspended ces node? Being able to do this would make updates much easier > > With > > # mmces node suspend ???stop > > I can move all IPs to other CES nodes and stop all CES services, what also releases the ces-shared-root-directory and allows to unmount the underlying filesystem. > But after a reboot/restart only the IPs stay on the on the other nodes, the CES services start up. Hm, sometimes I would very much prefer the services to stay down as long as the nodes is suspended and to keep the node out of the CES cluster as much as possible. > > I did not try rough things like just renaming smbd, this seems likely to create unwanted issues. > > Thank you, > > Cheers, > > Heiner Billich > -- > Paul Scherrer Institut > Heiner Billich > System Engineer Scientific Computing > Science IT / High Performance Computing > WHGA/106 > Forschungsstrasse 111 > 5232 Villigen PSI > Switzerland > > Phone +41 56 310 36 02 > heiner.billich at psi.ch > https://www.psi.ch > > > > From: on behalf of Madhu Konidena > Reply-To: gpfsug main discussion list > Date: Sunday 11 November 2018 at 22:06 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] If you're attending KubeCon'18 > > I will be there at both. Please stop by our booth at SC18 for a quick chat. > > Madhu Konidena > [cid:ii_d4d3894a4c2f4773] > Madhu at CoreHive.com > > > > On Nov 10, 2018, at 3:37 PM, Jon Bernard > wrote: > Hi Vasily, > I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. > Jon > On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: > Folks, Please let me know if anyone is attending KubeCon'18 in Seattle this December (via private e-mail). We will be there and would like to meet in person with people that already use or consider using Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, problems, visions. P.S. If you are not attending KubeCon, but are interested in the topic, shoot me an e-mail anyway. 
Best, -- Vasily Tarasov, Research Staff Member, Storage Systems Research, IBM Research - Almaden > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From novosirj at rutgers.edu Wed Nov 14 15:28:31 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Wed, 14 Nov 2018 15:28:31 +0000 Subject: [gpfsug-discuss] GSS Software Release? Message-ID: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- From scale at us.ibm.com Thu Nov 15 13:26:18 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Thu, 15 Nov 2018 08:26:18 -0500 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> Message-ID: AFAIK GSS/DSS are handled by Lenovo not IBM so you would need to contact them for release plans. I do not know which version of GPFS was included in GSS 3.3a but I can tell you that GPFS 3.5 is out of service and GPFS 4.1.x will be end of service in April 2019. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Ryan Novosielski To: "gpfsug-discuss at spectrumscale.org" Date: 11/15/2018 12:03 AM Subject: [gpfsug-discuss] GSS Software Release? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Thu Nov 15 14:01:28 2018 From: carlz at us.ibm.com (Carl Zetie) Date: Thu, 15 Nov 2018 14:01:28 +0000 Subject: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) Message-ID: >any idea when a newer GSS software release than 3.3a will be released? That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. Carl Zetie Program Director Offering Management for Spectrum Scale, IBM ---- (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com From alvise.dorigo at psi.ch Thu Nov 15 15:22:25 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Thu, 15 Nov 2018 15:22:25 +0000 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon Message-ID: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> Hello, I'm using mmperfmon to get writing stats on NSD during a write activity on a GPFS filesystem (Lenovo system with dss-g-2.0a). I use this command: # mmperfmon query 'sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --number-buckets 48 -b 1 to get the stats. 
What it returns is a list of valid values followed by a longer list of 'null' as shown below: Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp gpfs_nsdfs_bytes_written 1 2018-11-15-16:15:57 746586112 2 2018-11-15-16:15:58 704643072 3 2018-11-15-16:15:59 805306368 4 2018-11-15-16:16:00 754974720 5 2018-11-15-16:16:01 754974720 6 2018-11-15-16:16:02 763363328 7 2018-11-15-16:16:03 746586112 8 2018-11-15-16:16:04 746848256 9 2018-11-15-16:16:05 780140544 10 2018-11-15-16:16:06 679923712 11 2018-11-15-16:16:07 746618880 12 2018-11-15-16:16:08 780140544 13 2018-11-15-16:16:09 746586112 14 2018-11-15-16:16:10 763363328 15 2018-11-15-16:16:11 780173312 16 2018-11-15-16:16:12 721420288 17 2018-11-15-16:16:13 796917760 18 2018-11-15-16:16:14 763363328 19 2018-11-15-16:16:15 738197504 20 2018-11-15-16:16:16 738197504 21 2018-11-15-16:16:17 null 22 2018-11-15-16:16:18 null 23 2018-11-15-16:16:19 null 24 2018-11-15-16:16:20 null 25 2018-11-15-16:16:21 null 26 2018-11-15-16:16:22 null 27 2018-11-15-16:16:23 null 28 2018-11-15-16:16:24 null 29 2018-11-15-16:16:25 null 30 2018-11-15-16:16:26 null 31 2018-11-15-16:16:27 null 32 2018-11-15-16:16:28 null 33 2018-11-15-16:16:29 null 34 2018-11-15-16:16:30 null 35 2018-11-15-16:16:31 null 36 2018-11-15-16:16:32 null 37 2018-11-15-16:16:33 null 38 2018-11-15-16:16:34 null 39 2018-11-15-16:16:35 null 40 2018-11-15-16:16:36 null 41 2018-11-15-16:16:37 null 42 2018-11-15-16:16:38 null 43 2018-11-15-16:16:39 null 44 2018-11-15-16:16:40 null 45 2018-11-15-16:16:41 null 46 2018-11-15-16:16:42 null 47 2018-11-15-16:16:43 null 48 2018-11-15-16:16:44 null If I run again and again I still get the same pattern: valid data (even 0 in case of not write activity) followed by more null data. Is that normal ? If not, is there a way to get only non-null data by fine-tuning pmcollector's configuration file ? The corresponding ZiMon sensor (GPFSNSDFS) have period=1. The ZiMon version is 4.2.3-7. Thanks, Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From aposthuma at lenovo.com Thu Nov 15 15:56:44 2018 From: aposthuma at lenovo.com (Andre Posthuma) Date: Thu, 15 Nov 2018 15:56:44 +0000 Subject: [gpfsug-discuss] [External] Re: GSS Software Release? (Ryan Novosielski) In-Reply-To: References: Message-ID: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> Hello, GSS 3.3b was released last week, with a number of Spectrum Scale versions available : 5.0.1.2 4.2.3.11 4.1.1.20 DSS-G 2.2a was released yesterday, with 2 Spectrum Scale versions available : 5.0.2.1 4.2.3.11 Best Regards Andre Posthuma IT Specialist HPC Services Lenovo United Kingdom +44 7841782363 aposthuma at lenovo.com ? Lenovo.com Twitter | Facebook | Instagram | Blogs | Forums -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Carl Zetie Sent: Thursday, November 15, 2018 2:01 PM To: gpfsug-discuss at spectrumscale.org Subject: [External] Re: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) >any idea when a newer GSS software release than 3.3a will be released? That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. 
Carl Zetie Program Director Offering Management for Spectrum Scale, IBM ---- (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From matthew.robinson02 at gmail.com Thu Nov 15 17:53:14 2018 From: matthew.robinson02 at gmail.com (Matthew Robinson) Date: Thu, 15 Nov 2018 12:53:14 -0500 Subject: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) In-Reply-To: References: Message-ID: Hi Ryan, As an Ex-GSS PE guy for Lenovo, a new GSS update could almost be expected every 3-4 months in a year. I would not be surprised if Lenovo GSS-DSS development started to not update the GSS solution and only focused on DSS updates. That is just my best guess from this point. I agree with Carl this should be a quick open and close case for the Lenovo product engineer that still works on the GSS solution. Kind regards, MattRob On Thu, Nov 15, 2018 at 9:02 AM Carl Zetie wrote: > > >any idea when a newer GSS software release than 3.3a will be released? > > That is definitely a question only our friends at Lenovo can answer. If > you don't get a response here (I'm not sure if any Lenovites are active on > the list), you'll need to address it directly to Lenovo, e.g. your account > team. > > > Carl Zetie > Program Director > Offering Management for Spectrum Scale, IBM > ---- > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Matthew Robinson Comptia A+, Net+ 919.909.0494 matthew.robinson02 at gmail.com The greatest discovery of my generation is that man can alter his life simply by altering his attitude of mind. - William James, Harvard Psychologist. -------------- next part -------------- An HTML attachment was scrubbed... URL: From andy_kurth at ncsu.edu Thu Nov 15 18:28:46 2018 From: andy_kurth at ncsu.edu (Andy Kurth) Date: Thu, 15 Nov 2018 13:28:46 -0500 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> Message-ID: Public information on GSS updates seems nonexistent. You can find some clues if you have access to Lenovo's restricted download site . It looks like gss3.3b was released in late September. There are gss3.3b download options that include either 4.2.3-9 or 4.1.1-20. Earlier this month they also released some GPFS-only updates for 4.3.2-11 and 5.0.1-2. It looks like these are meant to be applied on top of gss3.3b. For DSS-G, it looks like dss-g-2.2a is the latest full release with options that include 4.2.3-11 or 5.0.2-1. There are also separate DSS-G GPFS-only updates for 4.2.3-11 and 5.0.1-2. Regards, Andy Kurth / NCSU On Thu, Nov 15, 2018 at 12:01 AM Ryan Novosielski wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I know this might not be the perfect venue, but I know IBM developers > participate and will occasionally share this sort of thing: any idea > when a newer GSS software release than 3.3a will be released? We are > attempting to plan our maintenance schedule. At the moment, the DSS-G > software seems to be getting updated and we'd prefer to remain at the > same GPFS release on DSS-G and GSS. 
> > - -- > ____ > || \\UTGERS, |----------------------*O*------------------------ > ||_// the State | Ryan Novosielski - novosirj at rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus > || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark > `' > -----BEGIN PGP SIGNATURE----- > > iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl > qvcAn2ql2U8peuGuhptTIejVgnDFSWEf > =7Iue > -----END PGP SIGNATURE----- > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- *Andy Kurth* Research Storage Specialist NC State University Office of Information Technology P: 919-513-4090 311A Hillsborough Building Campus Box 7109 Raleigh, NC 27695 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Nov 15 20:35:29 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Nov 2018 21:35:29 +0100 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon In-Reply-To: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> Message-ID: An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Nov 16 02:22:55 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 16 Nov 2018 02:22:55 +0000 Subject: [gpfsug-discuss] Presentations - User Group Meeting at SC18 Message-ID: <917D0EB2-BE2C-4445-AE12-B68DA3D2B6F1@nuance.com> I?ve uploaded the first batch of presentation to the spectrumscale.org site - More coming once I receive them. https://www.spectrumscaleug.org/presentations/2018/ Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From LNakata at SLAC.STANFORD.EDU Fri Nov 16 03:06:46 2018 From: LNakata at SLAC.STANFORD.EDU (Lance Nakata) Date: Thu, 15 Nov 2018 19:06:46 -0800 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Message-ID: <20181116030646.GA28141@slac.stanford.edu> We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. 
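(For reference, each array was built with an mdadm invocation roughly along the following lines; the NVMe device names are illustrative placeholders only, not our exact layout:

# example only -- 11 active members (10 data + 1 parity equivalent) plus 1 hot spare
mdadm --create /dev/md101 --level=5 --raid-devices=11 --spare-devices=1 /dev/nvme{0..11}n1

/dev/md102 was created the same way from the remaining 12 drives.)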
The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory From Greg.Lehmann at csiro.au Fri Nov 16 03:46:01 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 16 Nov 2018 03:46:01 +0000 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: <20181116030646.GA28141@slac.stanford.edu> References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: <6be5c9834bc747b7b7145e884f98caa2@exch1-cdc.nexus.csiro.au> Hi Lance, We are doing it with beegfs (mdadm and NVMe drives in the same HW.) For GPFS have you updated the nsddevices sample script to look at the mdadm devices and put it in /var/mmfs/etc? BTW I'm interested to see how you go with that configuration. Cheers, Greg -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lance Nakata Sent: Friday, November 16, 2018 1:07 PM To: gpfsug-discuss at spectrumscale.org Cc: Jon L. Bergman Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. 
The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From alvise.dorigo at psi.ch Fri Nov 16 08:29:46 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Fri, 16 Nov 2018 08:29:46 +0000 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon In-Reply-To: References: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch>, Message-ID: <83A6EEB0EC738F459A39439733AE80452679A101@MBX214.d.ethz.ch> Indeed, I just realized that after last recent update to dssg-2.0a ntpd is crashing very frequently. Thanks for the hint. Alvise ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Olaf Weiser [olaf.weiser at de.ibm.com] Sent: Thursday, November 15, 2018 9:35 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Wrong behavior of mmperfmon ntp running / time correct ? From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/15/2018 04:30 PM Subject: [gpfsug-discuss] Wrong behavior of mmperfmon Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, I'm using mmperfmon to get writing stats on NSD during a write activity on a GPFS filesystem (Lenovo system with dss-g-2.0a). I use this command: # mmperfmon query 'sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --number-buckets 48 -b 1 to get the stats. 
What it returns is a list of valid values followed by a longer list of 'null' as shown below: Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp gpfs_nsdfs_bytes_written 1 2018-11-15-16:15:57 746586112 2 2018-11-15-16:15:58 704643072 3 2018-11-15-16:15:59 805306368 4 2018-11-15-16:16:00 754974720 5 2018-11-15-16:16:01 754974720 6 2018-11-15-16:16:02 763363328 7 2018-11-15-16:16:03 746586112 8 2018-11-15-16:16:04 746848256 9 2018-11-15-16:16:05 780140544 10 2018-11-15-16:16:06 679923712 11 2018-11-15-16:16:07 746618880 12 2018-11-15-16:16:08 780140544 13 2018-11-15-16:16:09 746586112 14 2018-11-15-16:16:10 763363328 15 2018-11-15-16:16:11 780173312 16 2018-11-15-16:16:12 721420288 17 2018-11-15-16:16:13 796917760 18 2018-11-15-16:16:14 763363328 19 2018-11-15-16:16:15 738197504 20 2018-11-15-16:16:16 738197504 21 2018-11-15-16:16:17 null 22 2018-11-15-16:16:18 null 23 2018-11-15-16:16:19 null 24 2018-11-15-16:16:20 null 25 2018-11-15-16:16:21 null 26 2018-11-15-16:16:22 null 27 2018-11-15-16:16:23 null 28 2018-11-15-16:16:24 null 29 2018-11-15-16:16:25 null 30 2018-11-15-16:16:26 null 31 2018-11-15-16:16:27 null 32 2018-11-15-16:16:28 null 33 2018-11-15-16:16:29 null 34 2018-11-15-16:16:30 null 35 2018-11-15-16:16:31 null 36 2018-11-15-16:16:32 null 37 2018-11-15-16:16:33 null 38 2018-11-15-16:16:34 null 39 2018-11-15-16:16:35 null 40 2018-11-15-16:16:36 null 41 2018-11-15-16:16:37 null 42 2018-11-15-16:16:38 null 43 2018-11-15-16:16:39 null 44 2018-11-15-16:16:40 null 45 2018-11-15-16:16:41 null 46 2018-11-15-16:16:42 null 47 2018-11-15-16:16:43 null 48 2018-11-15-16:16:44 null If I run again and again I still get the same pattern: valid data (even 0 in case of not write activity) followed by more null data. Is that normal ? If not, is there a way to get only non-null data by fine-tuning pmcollector's configuration file ? The corresponding ZiMon sensor (GPFSNSDFS) have period=1. The ZiMon version is 4.2.3-7. Thanks, Alvise _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Fri Nov 16 09:19:07 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 16 Nov 2018 10:19:07 +0100 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: <20181116030646.GA28141@slac.stanford.edu> References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. 
Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Fri Nov 16 09:35:25 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 16 Nov 2018 10:35:25 +0100 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Hi, Having mentioned nsddevices, I do not know how Scale treats different device types differently, so generic would be a fine choice unless development tells you differently. Currently known device types are listed in the comments of the script /usr/lpp/mmfs/bin/mmdevdiscover Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 
7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 16/11/2018 10:19 Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. 
Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Fri Nov 16 12:31:57 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 16 Nov 2018 07:31:57 -0500 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Note, RHEL 7.6 is not yet a supported platform for Spectrum Scale so you may want to use RHEL 7.5 or wait for RHEL 7.6 to be supported. Using "generic" for the device type should be the proper option here. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Uwe Falke" To: gpfsug main discussion list Date: 11/16/2018 04:35 AM Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having mentioned nsddevices, I do not know how Scale treats different device types differently, so generic would be a fine choice unless development tells you differently. Currently known device types are listed in the comments of the script /usr/lpp/mmfs/bin/mmdevdiscover Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 16/11/2018 10:19 Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From novosirj at rutgers.edu Thu Nov 15 17:17:15 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 15 Nov 2018 17:17:15 +0000 Subject: [gpfsug-discuss] [External] Re: GSS Software Release? (Ryan Novosielski) In-Reply-To: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> References: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> Message-ID: <28988E74-6BAC-47FB-AEE2-015D2B784A40@rutgers.edu> Thanks, all. I was looking around FlexNet this week and didn?t see it, but it?s good to know it exists/likely will appear soon. -- ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' > On Nov 15, 2018, at 10:56 AM, Andre Posthuma wrote: > > Hello, > > GSS 3.3b was released last week, with a number of Spectrum Scale versions available : > 5.0.1.2 > 4.2.3.11 > 4.1.1.20 > > DSS-G 2.2a was released yesterday, with 2 Spectrum Scale versions available : > > 5.0.2.1 > 4.2.3.11 > > Best Regards > > > Andre Posthuma > IT Specialist > HPC Services > Lenovo United Kingdom > +44 7841782363 > aposthuma at lenovo.com > > > Lenovo.com > Twitter | Facebook | Instagram | Blogs | Forums > > > > > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Carl Zetie > Sent: Thursday, November 15, 2018 2:01 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [External] Re: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) > > >> any idea when a newer GSS software release than 3.3a will be released? > > That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. > > > Carl Zetie > Program Director > Offering Management for Spectrum Scale, IBM > ---- > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From novosirj at rutgers.edu Thu Nov 15 18:33:12 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 15 Nov 2018 18:33:12 +0000 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu>, Message-ID: Thanks, Andy. I just realized our entitlement lapsed on GSS and that?s probably why I don?t see it there at the moment. Helpful to know what?s in there though for planning while that is worked out. Sent from my iPhone On Nov 15, 2018, at 13:29, Andy Kurth > wrote: Public information on GSS updates seems nonexistent. You can find some clues if you have access to Lenovo's restricted download site. It looks like gss3.3b was released in late September. There are gss3.3b download options that include either 4.2.3-9 or 4.1.1-20. Earlier this month they also released some GPFS-only updates for 4.3.2-11 and 5.0.1-2. It looks like these are meant to be applied on top of gss3.3b. For DSS-G, it looks like dss-g-2.2a is the latest full release with options that include 4.2.3-11 or 5.0.2-1. 
There are also separate DSS-G GPFS-only updates for 4.2.3-11 and 5.0.1-2. Regards, Andy Kurth / NCSU On Thu, Nov 15, 2018 at 12:01 AM Ryan Novosielski > wrote: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Andy Kurth Research Storage Specialist NC State University Office of Information Technology P: 919-513-4090 311A Hillsborough Building Campus Box 7109 Raleigh, NC 27695 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From andreas.mattsson at maxiv.lu.se Tue Nov 20 15:01:36 2018 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Tue, 20 Nov 2018 15:01:36 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS Message-ID: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> On one of our clusters, from time to time if users try to access files or folders via the direct full path over NFS, the NFS-client gets invalid information from the server. For instance, if I run "ls /gpfs/filesystem/test/test2/test3" over NFS-mount, result is just full of ???????? If I recurse through the path once, for instance by ls'ing or cd'ing through the folders one at a time or running ls -R, I can then access directly via the full path afterwards. This seem to be intermittent, and I haven't found how to reliably recreate the issue. Possibly, it can be connected to creating or changing files or folders via a GPFS mount, and then accessing them through NFS, but it doesn't happen consistently. Is this a known behaviour or bug, and does anyone know how to fix the issue? These NSD-servers currently run Scale 4.2.2.3, while the CES is on 5.0.1.1. GPFS clients run Scale 5.0.1.1, and NFS clients run CentOS 7.5. Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5610 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 4575 bytes Desc: not available URL: From valdis.kletnieks at vt.edu Tue Nov 20 15:25:16 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 20 Nov 2018 10:25:16 -0500 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: <17828.1542727516@turing-police.cc.vt.edu> On Tue, 20 Nov 2018 15:01:36 +0000, Andreas Mattsson said: > On one of our clusters, from time to time if users try to access files or > folders via the direct full path over NFS, the NFS-client gets invalid > information from the server. > > For instance, if I run "ls /gpfs/filesystem/test/test2/test3" over > NFS-mount, result is just full of ???????? I've seen the Ganesha server do this sort of thing once in a while. Never tracked it down, because it was always in the middle of bigger misbehaviors... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Tue Nov 20 23:39:37 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Tue, 20 Nov 2018 23:39:37 +0000 Subject: [gpfsug-discuss] mmfsd recording High CPU usage Message-ID: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a still happy day seeing we're still above ground! ? Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample: [~]# top -b -n 1 | grep mmfs PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 mmfsd 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 runmmfs Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate. Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Redhat 7.4, btw... Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From jjdoherty at yahoo.com Wed Nov 21 13:01:54 2018 From: jjdoherty at yahoo.com (Jim Doherty) Date: Wed, 21 Nov 2018 13:01:54 +0000 (UTC) Subject: [gpfsug-discuss] mmfsd recording High CPU usage In-Reply-To: References: Message-ID: <1913697205.666954.1542805314669@mail.yahoo.com> At a guess with no data ....?? if the application is opening more files than can fit in the maxFilesToCache (MFTC) objects? GPFS will expand the MFTC to support the open files,? but it will also scan to try and free any unused objects.??? If you can identify the user job that is causing this? you could monitor a system more closely. Jim On Wednesday, November 21, 2018, 2:10:45 AM EST, Saula, Oluwasijibomi wrote: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a?still happy day seeing we're still above ground!? 
Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample:

[~]# top -b -n 1 | grep mmfs
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
231898 root       0 -20 14.508g 4.272g  70168 S  93.8  6.8  69503:41 mmfsd
  4161 root       0 -20  121876   9412   1492 S   0.0  0.0   0:00.22 runmmfs

Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate. Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Red Hat 7.4, btw... Thanks, Siji Saula, HPC System Administrator, Center for Computationally Assisted Science & Technology, NORTH DAKOTA STATE UNIVERSITY, Research 2 Building, Room 220B, Dept 4100, PO Box 6050 / Fargo, ND 58108-6050, p: 701.231.7749, www.ccast.ndsu.edu | www.ndsu.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Nov 21 15:32:55 2018 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 21 Nov 2018 07:32:55 -0800 Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage In-Reply-To: References: Message-ID: Hi, the best way to debug something like that is to start with top. Start top, then press 1 and check if any of the cores has almost 0% idle while others have plenty of CPU left; if that is the case you have one very hot thread. To further isolate it you can press 1 again to collapse the cores, then press shift-H, which will break down each thread of a process and show them as individual lines. Now you either see one or many mmfsd threads causing CPU consumption; if it's many, your workload is just doing a lot of work. What is more concerning is if you have just 1 thread running at 90%+. If that's the case, write down the PID of the thread that runs so hot and run mmfsadm dump threads,kthreads >dum.
Here's a sample: > > > [~]# top -b -n 1 | grep mmfs > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > > > 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 > *mmfs*d > > 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 run > *mmfs* > > Obviously, this behavior was likely triggered by a not-so-convenient user > job that in most cases is long finished by the time we > investigate. Nevertheless, does anyone have an idea why this might be > happening? Any thoughts on preventive steps even? > > > This is GPFS v4.2.3 on Redhat 7.4, btw... > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > > Research 2 > Building > ? Room 220B > Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu | www.ndsu.edu > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bzhang at ca.ibm.com Wed Nov 21 18:52:12 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Wed, 21 Nov 2018 13:52:12 -0500 Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: Hi all, This is a reminder for our next week's technical webinar. Everyone is welcome to register and attend. Thanks, Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? ?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. From: "Bohai Zhang" To: gpfsug main discussion list Date: 2018/11/09 11:37 AM Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We are going to host our next technical webinar. everyone is welcome to register and attend. Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? 
?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16927775.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16361907.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16531853.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16209659.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16604524.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16509495.gif Type: image/gif Size: 1243 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Wed Nov 21 20:55:29 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Wed, 21 Nov 2018 20:55:29 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 82, Issue 31 In-Reply-To: References: Message-ID: Sven/Jim, Thanks for sharing your thoughts! - Currently, we have mFTC set as such: maxFilesToCache 4000 However, since we have a very diverse workload, we'd have to cycle through a vast majority of our apps to find the most fitting mFTC value as this page (https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaag/wecm/l0wecm00_maxfilestocache.htm) suggests. 
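For reference, a hedged sketch of how the value can be inspected and raised on just the client nodes; the node class name "compute" and the figure 50000 below are placeholders rather than recommendations, and larger values consume additional memory on the clients and on the token servers:

# What is configured vs. what the daemon is currently running with
mmlsconfig maxFilesToCache
mmdiag --config | grep -i maxFilesToCache

# Raise it for a subset of nodes only; the new value takes effect once the
# GPFS daemon is recycled on those nodes (mmshutdown/mmstartup).
mmchconfig maxFilesToCache=50000 -N compute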
In the meantime, I was able to gather some more info for the lone mmfsd thread (pid: 34096) running at high CPU utilization, and right away I can see the number of nonvoluntary_ctxt_switches is quite high, compared to the other threads in the dump; however, I think I need some help interpreting all of this. Although, I should add that heavy HPC workloads (i.e. vasp, ansys...) are running on these nodes and may be somewhat related to this issue: Scheduling info for kernel thread 34096 mmfsd (34096, #threads: 309) ------------------------------------------------------------------- se.exec_start : 8057632237.613486 se.vruntime : 4914854123.640008 se.sum_exec_runtime : 1042598557.420591 se.nr_migrations : 8337485 nr_switches : 15824325 nr_voluntary_switches : 4110 nr_involuntary_switches : 15820215 se.load.weight : 88761 policy : 0 prio : 100 clock-delta : 24 mm->numa_scan_seq : 88980 numa_migrations, 5216521 numa_faults_memory, 0, 0, 1, 1, 1 numa_faults_memory, 1, 0, 0, 1, 1030 numa_faults_memory, 0, 1, 0, 0, 1 numa_faults_memory, 1, 1, 0, 0, 1 Status for kernel thread 34096 Name: mmfsd Umask: 0022 State: R (running) Tgid: 58921 Ngid: 34395 Pid: 34096 PPid: 3941 TracerPid: 0 Uid: 0 0 0 0 Gid: 0 0 0 0 FDSize: 256 Groups: VmPeak: 15137612 kB VmSize: 15126340 kB VmLck: 4194304 kB VmPin: 8388712 kB VmHWM: 4424228 kB VmRSS: 4420420 kB RssAnon: 4350128 kB RssFile: 50512 kB RssShmem: 19780 kB VmData: 14843812 kB VmStk: 132 kB VmExe: 23672 kB VmLib: 121856 kB VmPTE: 9652 kB VmSwap: 0 kB Threads: 309 SigQ: 5/257225 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000010017a07 SigIgn: 0000000000000000 SigCgt: 0000000180015eef CapInh: 0000000000000000 CapPrm: 0000001fffffffff CapEff: 0000001fffffffff CapBnd: 0000001fffffffff CapAmb: 0000000000000000 Seccomp: 0 Cpus_allowed: ffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff Cpus_allowed_list: 0-239 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003 Mems_allowed_list: 0-1 voluntary_ctxt_switches: 4110 nonvoluntary_ctxt_switches: 15820215 Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Wednesday, November 21, 2018 9:33:10 AM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 82, Issue 31 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: mmfsd recording High CPU usage (Jim Doherty) 2. 
Re: mmfsd recording High CPU usage (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Wed, 21 Nov 2018 13:01:54 +0000 (UTC) From: Jim Doherty To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage Message-ID: <1913697205.666954.1542805314669 at mail.yahoo.com> Content-Type: text/plain; charset="utf-8" At a guess with no data ....?? if the application is opening more files than can fit in the maxFilesToCache (MFTC) objects? GPFS will expand the MFTC to support the open files,? but it will also scan to try and free any unused objects.??? If you can identify the user job that is causing this? you could monitor a system more closely. Jim On Wednesday, November 21, 2018, 2:10:45 AM EST, Saula, Oluwasijibomi wrote: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a?still happy day seeing we're still above ground!? Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample: [~]# top -b -n 1 | grep mmfs ? ?PID USER? ? ?PR? NI? ?VIRT? ? RES? ?SHR S? %CPU %MEM ? ? TIME+ COMMAND 231898 root ? ? ? 0 -20 14.508g 4.272g? 70168 S?93.8? 6.8?69503:41 mmfsd ?4161 root ? ? ? 0 -20?121876 ? 9412 ? 1492 S ? 0.0?0.0 ? 0:00.22 runmmfs Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate.?Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Redhat 7.4, btw... Thanks,?Siji SaulaHPC System AdministratorCenter for Computationally Assisted Science & TechnologyNORTH DAKOTA STATE UNIVERSITY? Research 2 Building???Room 220BDept 4100, PO Box 6050? / Fargo, ND 58108-6050p:701.231.7749www.ccast.ndsu.edu?|?www.ndsu.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Wed, 21 Nov 2018 07:32:55 -0800 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage Message-ID: Content-Type: text/plain; charset="utf-8" Hi, the best way to debug something like that is to start with top. start top then press 1 and check if any of the cores has almost 0% idle while others have plenty of CPU left. if that is the case you have one very hot thread. to further isolate it you can press 1 again to collapse the cores, now press shirt-h which will break down each thread of a process and show them as an individual line. now you either see one or many mmfsd's causing cpu consumption, if its many your workload is just doing a lot of work, what is more concerning is if you have just 1 thread running at the 90%+ . if thats the case write down the PID of the thread that runs so hot and run mmfsadm dump threads,kthreads >dum. 
you will see many entries in the file like : MMFSADMDumpCmdThread: desc 0x7FC84C002980 handle 0x4C0F02FA parm 0x7FC9700008C0 highStackP 0x7FC783F7E530 pthread 0x83F80700 kernel thread id 49878 (slot -1) pool 21 ThPoolCommands per-thread gbls: 0:0x0 1:0x0 2:0x0 3:0x3 4:0xFFFFFFFFFFFFFFFF 5:0x0 6:0x0 7:0x7FC98C0067B0 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x400000E 14:0x7FC98C004C10 15:0x0 16:0x4 17:0x0 18:0x0 find the pid behind 'thread id' and post that section, that would be the first indication on what that thread does ... sven On Tue, Nov 20, 2018 at 11:10 PM Saula, Oluwasijibomi < oluwasijibomi.saula at ndsu.edu> wrote: > Hello Scalers, > > > First, let me say Happy Thanksgiving to those of us in the US and to those > beyond, well, it's a still happy day seeing we're still above ground! ? > > > Now, what I have to discuss isn't anything extreme so don't skip the > turkey for this, but lately, on a few of our compute GPFS client nodes, > we've been noticing high CPU usage by the mmfsd process and are wondering > why. Here's a sample: > > > [~]# top -b -n 1 | grep mmfs > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > > > 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 > *mmfs*d > > 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 run > *mmfs* > > Obviously, this behavior was likely triggered by a not-so-convenient user > job that in most cases is long finished by the time we > investigate. Nevertheless, does anyone have an idea why this might be > happening? Any thoughts on preventive steps even? > > > This is GPFS v4.2.3 on Redhat 7.4, btw... > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > > Research 2 > Building > ? Room 220B > Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu | www.ndsu.edu > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 82, Issue 31 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From mnaineni at in.ibm.com Thu Nov 22 10:32:27 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Thu, 22 Nov 2018 10:32:27 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.image001.png at 01D480EA.4FF3B020.png Type: image/png Size: 5610 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Fri Nov 23 08:12:25 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Fri, 23 Nov 2018 08:12:25 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Message-ID: Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. 
The owning cluster is on 5.0.1.1 and the mounting cluster is on 5.0.2.1. On the owning cluster (a 3-node, 3-site cluster) we did a shutdown of the daemon, but the remote mount was panicked because of: "A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found." We had no FAL active, which is what the alert describes, and the owning cluster is not on the affected version. Any hint, please. Regards Renar

Renar Grunenberg Abteilung Informatik - Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterstützungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-Jürgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Herøy, Dr. Jörg Rheinländer (stv.), Sarah Rössler, Daniel Thomas. ======================================================================= Diese Nachricht enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. =======================================================================

From andreas.mattsson at maxiv.lu.se Fri Nov 23 13:41:37 2018 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Fri, 23 Nov 2018 13:41:37 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: <9456645b0a1f4b488b13874ea672b9b8@maxiv.lu.se> Yes, this is repeating. We've ascertained that it has nothing to do at all with file operations on the GPFS side. Randomly throughout the filesystem mounted via NFS, ls or file access will give "ls: reading directory /gpfs/filessystem/test/testdir: Invalid argument". Trying again later might work on that folder, but might fail somewhere else. We have tried exporting the same filesystem via a standard kernel NFS instead of the CES Ganesha-NFS, and then the problem doesn't exist. So it is definitely related to the Ganesha NFS server, or its interaction with the file system. Will see if I can get a tcpdump of the issue. Regards, Andreas _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Malahal R Naineni Sent: 22 November 2018 11:32 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Filesystem access issues via CES NFS We have seen empty lists (ls showing nothing).
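For anyone else chasing the same symptom, a rough sketch of the kind of capture Andreas mentions; the interface name eth0 and the CES address 192.0.2.10 are placeholders:

# On the NFS client, while reproducing the failing "ls" of the full path
tcpdump -i eth0 -s 0 -w /tmp/ces-readdir.pcap host 192.0.2.10 and port 2049

# Then inspect the READDIR/READDIRPLUS calls and their replies, e.g.
tshark -r /tmp/ces-readdir.pcap -Y nfs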
If this repeats, please take tcpdump from the client and we will investigate. Regards, Malahal. ----- Original message ----- From: Andreas Mattsson Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [gpfsug-discuss] Filesystem access issues via CES NFS Date: Tue, Nov 20, 2018 8:47 PM On one of our clusters, from time to time if users try to access files or folders via the direct full path over NFS, the NFS-client gets invalid information from the server. For instance, if I run ?ls /gpfs/filesystem/test/test2/test3? over NFS-mount, result is just full of ???????? If I recurse through the path once, for instance by ls?ing or cd?ing through the folders one at a time or running ls ?R, I can then access directly via the full path afterwards. This seem to be intermittent, and I haven?t found how to reliably recreate the issue. Possibly, it can be connected to creating or changing files or folders via a GPFS mount, and then accessing them through NFS, but it doesn?t happen consistently. Is this a known behaviour or bug, and does anyone know how to fix the issue? These NSD-servers currently run Scale 4.2.2.3, while the CES is on 5.0.1.1. GPFS clients run Scale 5.0.1.1, and NFS clients run CentOS 7.5. Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5610 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4575 bytes Desc: not available URL: From jtolson at us.ibm.com Mon Nov 26 14:31:29 2018 From: jtolson at us.ibm.com (John T Olson) Date: Mon, 26 Nov 2018 07:31:29 -0700 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. 
On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Mon Nov 26 14:55:06 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Mon, 26 Nov 2018 14:55:06 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: Hallo John, record is open, TS001631590. Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. 
Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. 
======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From alvise.dorigo at psi.ch Mon Nov 26 15:43:59 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 26 Nov 2018 15:43:59 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Message-ID: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Mon Nov 26 15:54:57 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Mon, 26 Nov 2018 15:54:57 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Try an dedicated extra ? -p ? foreach Attribute Von meinem iPhone gesendet > Am 26.11.2018 um 16:50 schrieb Dorigo Alvise (PSI) : > > Good evening, > I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm > to setup AFM parallel transfer. > > Why the following command (grabbed directly from the web page above) fires out that error ? > > [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome > mmcrfileset: Incorrect extra argument: ?p > Usage: > mmcrfileset Device FilesetName [-p afmAttribute=Value...] 
[-t Comment] > [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] > [--allow-permission-change PermissionChangeMode] > > The mapping was correctly created: > > [root at sf-export-3 ~]# mmafmconfig show > Map name: afmgw1 > Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch > > Is this a known bug ? > > Thanks, > Regards. > > Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Nov 26 16:33:58 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Mon, 26 Nov 2018 16:33:58 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Is that an 'ndash' rather than "-"? Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of alvise.dorigo at psi.ch [alvise.dorigo at psi.ch] Sent: 26 November 2018 15:43 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise From kenneth.waegeman at ugent.be Mon Nov 26 16:26:51 2018 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Mon, 26 Nov 2018 17:26:51 +0100 Subject: [gpfsug-discuss] mmfsck output Message-ID: Hi all, We had some leftover files with IO errors on a GPFS FS, so we ran a mmfsck. Does someone know what these mmfsck errors mean: Error in inode 38422 snap 0: has nlink field as 1 Error in inode 281057 snap 0: is unreferenced ?Attach inode to lost+found of fileset root filesetId 0? no Thanks! Kenneth From daniel.kidger at uk.ibm.com Mon Nov 26 17:03:14 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Mon, 26 Nov 2018 17:03:14 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: References: , <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: An HTML attachment was scrubbed... URL: From abhisdav at in.ibm.com Tue Nov 27 06:38:27 2018 From: abhisdav at in.ibm.com (Abhishek Dave) Date: Tue, 27 Nov 2018 12:08:27 +0530 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Hi, Looks like some issue with syntax. Please try below one. 
mmcrfileset ?p afmmode=,afmtarget=://// --inode-space new #mmcrfileset gpfs1 sw1 ?p afmmode=sw,afmtarget=gpfs://mapping1/gpfs/gpfs2/swhome --inode-space new #mmcrfileset gpfs1 ro1 ?p afmmode=ro,afmtarget=gpfs://mapping2/gpfs/gpfs2/swhome --inode-space new Thanks, Abhishek, Dave From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/26/2018 09:20 PM Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Sent by: gpfsug-discuss-bounces at spectrumscale.org Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes [:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 15:24:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 15:24:25 +0000 Subject: [gpfsug-discuss] Hanging file-systems Message-ID: <06FF0D9C-9ED7-434E-A7FF-C56518048E25@bham.ac.uk> I have a file-system which keeps hanging over the past few weeks. Right now, its offline and taken a bunch of services out with it. (I have a ticket with IBM open about this as well) We see for example: Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), re ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 and on that node: Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 8 (TokenCondvar), reason 'wait for SubToken to become stable' On this node, if you dump tscomm, you see entries like: Pending messages: msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1 this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) dest status pending , err 0, reply len 0 by TCP connection c0n9 is itself. This morning when this happened, the only way to get the FS back online was to shutdown the entire cluster. Any pointers for next place to look/how to fix? Simon -------------- next part -------------- An HTML attachment was scrubbed... 
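One quick way to narrow down a hang like this is to pull the waiters from every node and rank them by age, which usually points at the node (and thread) sitting at the head of the chain. A minimal sketch, assuming mmdsh can reach all nodes; the node class "all" and the temp file path are only placeholders:

#!/bin/sh
# Gather waiters cluster-wide and show the 20 longest ones.
MMFS=/usr/lpp/mmfs/bin
$MMFS/mmdsh -N all "$MMFS/mmdiag --waiters" 2>/dev/null > /tmp/waiters.all
# Lines look roughly like "node:  Waiting NNN.NNNN sec since ...";
# pull the wait time to the front so a numeric sort can rank them.
awk 'match($0, /[Ww]aiting [0-9.]+ sec/) {
       split(substr($0, RSTART, RLENGTH), w, " ")
       printf "%12.4f  %s\n", w[2], $0
     }' /tmp/waiters.all | sort -rn | head -20

The node that keeps turning up as the target of the oldest 'RPC wait' entries is normally the first one worth inspecting.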
URL: From Robert.Oesterlin at nuance.com Tue Nov 27 16:02:44 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 27 Nov 2018 16:02:44 +0000 Subject: [gpfsug-discuss] Hanging file-systems Message-ID: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> I have seen something like this in the past, and I have resorted to a cluster restart as well. :-( IBM and I could never really track it down, because I could not get a dump at the time of occurrence. However, you might take a look at your NSD servers, one at a time. As I recall, we thought it was a stuck thread on one of the NSD servers, and when we restarted the ?right? one it cleared the block. The other thing I?ve done in the past to isolate problems like this (since this is related to tokens) is to look at the ?token revokes? on each node, looking for ones that are sticking around for a long time. I tossed together a quick script and ran it via mmdsh on all the node. Not pretty, but it got the job done. Run this a few times, see if any of the revokes are sticking around for a long time #!/bin/sh rm -f /tmp/revokelist /usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > /tmp/revokelist 2> /dev/null if [ $? -eq 0 ]; then /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out for n in `cat /tmp/revokelist | grep msgHdr | awk '{print $5}'`; do grep $n /tmp/tscomm.out | tail -1 done rm -f /tmp/tscomm.out fi Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Simon Thompson Reply-To: gpfsug main discussion list Date: Tuesday, November 27, 2018 at 9:27 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] [gpfsug-discuss] Hanging file-systems I have a file-system which keeps hanging over the past few weeks. Right now, its offline and taken a bunch of services out with it. (I have a ticket with IBM open about this as well) We see for example: Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), re ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 and on that node: Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 8 (TokenCondvar), reason 'wait for SubToken to become stable' On this node, if you dump tscomm, you see entries like: Pending messages: msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1 this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) dest status pending , err 0, reply len 0 by TCP connection c0n9 is itself. This morning when this happened, the only way to get the FS back online was to shutdown the entire cluster. Any pointers for next place to look/how to fix? Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 16:14:20 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 08:14:20 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> Message-ID: if this happens you should check a couple of things : 1. are you under memory pressure or even worse started swapping . 2. is there any core running at ~ 0% idle - run top , press 1 and check the idle column. 3. is there any single thread running at ~100% - run top , press shift - h and check what the CPU % shows for the top 5 processes. 
if you want to go the extra mile, you could run perf top -p $PID_OF_MMFSD and check what the top cpu consumers are. confirming and providing data to any of the above being true could be the missing piece why nobody was able to find it, as this is stuff unfortunate nobody ever looks at. even a trace won't help if any of the above is true as all you see is that the system behaves correct according to the trace, its doesn't appear busy, Sven On Tue, Nov 27, 2018 at 8:03 AM Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: > I have seen something like this in the past, and I have resorted to a > cluster restart as well. :-( IBM and I could never really track it down, > because I could not get a dump at the time of occurrence. However, you > might take a look at your NSD servers, one at a time. As I recall, we > thought it was a stuck thread on one of the NSD servers, and when we > restarted the ?right? one it cleared the block. > > > > The other thing I?ve done in the past to isolate problems like this (since > this is related to tokens) is to look at the ?token revokes? on each node, > looking for ones that are sticking around for a long time. I tossed > together a quick script and ran it via mmdsh on all the node. Not pretty, > but it got the job done. Run this a few times, see if any of the revokes > are sticking around for a long time > > > > #!/bin/sh > > rm -f /tmp/revokelist > > /usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > > /tmp/revokelist 2> /dev/null > > if [ $? -eq 0 ]; then > > /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out > > for n in `cat /tmp/revokelist | grep msgHdr | awk '{print $5}'`; do > > grep $n /tmp/tscomm.out | tail -1 > > done > > rm -f /tmp/tscomm.out > > fi > > > > > > Bob Oesterlin > > Sr Principal Storage Engineer, Nuance > > > > > > > > *From: * on behalf of Simon > Thompson > *Reply-To: *gpfsug main discussion list > *Date: *Tuesday, November 27, 2018 at 9:27 AM > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[EXTERNAL] [gpfsug-discuss] Hanging file-systems > > > > I have a file-system which keeps hanging over the past few weeks. Right > now, its offline and taken a bunch of services out with it. > > > > (I have a ticket with IBM open about this as well) > > > > We see for example: > > Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 > SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 > (MsgRecordCondvar), re > > ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 > > > > and on that node: > > Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 > SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 > > 8 (TokenCondvar), reason 'wait for SubToken to become stable' > > > > On this node, if you dump tscomm, you see entries like: > > Pending messages: > > msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, > n_pending 1 > > this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec > > sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) > > dest status pending , err 0, reply len 0 by TCP > connection > > > > c0n9 is itself. > > > > This morning when this happened, the only way to get the FS back online > was to shutdown the entire cluster. > > > > Any pointers for next place to look/how to fix? 
> > > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 17:53:58 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 17:53:58 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> Message-ID: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Tue Nov 27 17:54:03 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 27 Nov 2018 23:24:03 +0530 Subject: [gpfsug-discuss] mmfsck output In-Reply-To: References: Message-ID: This means that the files having the below inode numbers 38422 and 281057 are orphan files (i.e. files not referenced by any directory/folder) and they will be moved to the lost+found folder of the fileset owning these files by mmfsck repair. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Kenneth Waegeman To: gpfsug main discussion list Date: 11/26/2018 10:10 PM Subject: [gpfsug-discuss] mmfsck output Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We had some leftover files with IO errors on a GPFS FS, so we ran a mmfsck. Does someone know what these mmfsck errors mean: Error in inode 38422 snap 0: has nlink field as 1 Error in inode 281057 snap 0: is unreferenced Attach inode to lost+found of fileset root filesetId 0? no Thanks! Kenneth _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=-J2C2ZYYUsp42fIyYHg3aYSR8wC5SKNhl6ZztfRJMvI&s=4OPQpDp8v56fvska0-O-pskIfONFMnZFydDo0T6KwJM&e= -------------- next part -------------- An HTML attachment was scrubbed... 
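For anyone who hits the same mmfsck messages later, the usual repair sequence looks roughly like the sketch below; "fs0" is a placeholder device name, and a full offline mmfsck needs the filesystem unmounted on every node first:

mmumount fs0 -a     # unmount everywhere so the check can run offline
mmfsck fs0 -n       # read-only pass: reports unreferenced/orphaned inodes, repairs nothing
mmfsck fs0 -y       # repair pass: answers yes to the prompts, reattaching orphans to lost+found
mmmount fs0 -a      # remount once the repair completes

After the repair the recovered files appear under the lost+found directory of the fileset that owns them (the root fileset in the case above), typically named by their inode numbers.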
URL: From oehmes at gmail.com Tue Nov 27 18:19:04 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 10:19:04 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson wrote: > Thanks Sven ? > > > > We found a node with kswapd running 100% (and swap was off)? > > > > Killing that node made access to the FS spring into life. > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 16:14 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
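For anyone wanting to check the same thing on their own system, the cache and memory state being discussed can be read off the suspect node with a few standard commands (a sketch; the paths are the usual install defaults):

/usr/lpp/mmfs/bin/mmdiag --stats     # mmfsd cache counters, including file and stat cache usage
/usr/lpp/mmfs/bin/mmdiag --memory    # heap and shared segment usage of the daemon
for a in pagepool maxFilesToCache maxStatCache; do
    /usr/lpp/mmfs/bin/mmlsconfig $a  # configured limits to compare against the counters above
done
grep -E 'MemFree|Slab|SReclaimable' /proc/meminfo   # kernel-side reclaimable memory

If the file/stat cache counts sit pinned at their configured limits while kswapd spins, that matches the dentry-steal behaviour described above.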
URL: From skylar2 at uw.edu Tue Nov 27 18:06:57 2018 From: skylar2 at uw.edu (Skylar Thompson) Date: Tue, 27 Nov 2018 18:06:57 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From Dwayne.Hart at med.mun.ca Tue Nov 27 19:25:08 2018 From: Dwayne.Hart at med.mun.ca (Dwayne.Hart at med.mun.ca) Date: Tue, 27 Nov 2018 19:25:08 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> , <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: Hi Simon, Was there a reason behind swap being disabled? Best, Dwayne ? Dwayne Hart | Systems Administrator IV CHIA, Faculty of Medicine Memorial University of Newfoundland 300 Prince Philip Drive St. John?s, Newfoundland | A1B 3V6 Craig L Dobbin Building | 4M409 T 709 864 6631 On Nov 27, 2018, at 2:24 PM, Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
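A quick way to confirm what the kernel is actually doing on the affected node, using standard Linux tools (sar needs the sysstat package installed):

swapon -s          # or: cat /proc/swaps ; no entries means swap really is off
vmstat 1 5         # si/so stay at 0 without swap; watch the free/cache columns churn instead
sar -B 1 5         # pgscank/s and pgsteal/s show how hard kswapd is scanning and reclaiming

High scan rates with very little actually freed would be consistent with the cache-steal behaviour described earlier in the thread.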
URL: From TOMP at il.ibm.com Tue Nov 27 19:35:36 2018 From: TOMP at il.ibm.com (Tomer Perry) Date: Tue, 27 Nov 2018 21:35:36 +0200 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com><4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Message-ID: "paging to disk" sometimes means mmap as well - there were several issues around that recently as well. Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Skylar Thompson To: gpfsug-discuss at spectrumscale.org Date: 27/11/2018 20:28 Subject: Re: [gpfsug-discuss] Hanging file-systems Sent by: gpfsug-discuss-bounces at spectrumscale.org Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 20:02:14 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 20:02:14 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Message-ID: Yes, but we?d upgraded all out HPC client nodes to 5.0.2-1 last week as well when this first happened ? Unless it?s necessary to upgrade the NSD servers as well for this? 
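A quick way to see exactly which daemon levels are in play before deciding (a sketch, assuming mmdsh works across the local cluster; the remote cluster has to be checked separately):

mmdsh -N all "rpm -q gpfs.base"                      # installed package level on every node
mmdsh -N all "/usr/lpp/mmfs/bin/mmdiag --version"    # build level of the running daemon
mmlsconfig minReleaseLevel                           # cluster-wide minimum release level in effect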
Simon From: on behalf of "TOMP at il.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 19:48 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems "paging to disk" sometimes means mmap as well - there were several issues around that recently as well. Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Skylar Thompson To: gpfsug-discuss at spectrumscale.org Date: 27/11/2018 20:28 Subject: Re: [gpfsug-discuss] Hanging file-systems Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 20:09:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 20:09:25 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 I think). So is this memory pressure on the NSD nodes then? I thought it was documented somewhere that GFPS won?t use more than 50% of the host memory. And actually if you look at the values for maxStatCache and maxFilesToCache, the memory footprint is quite small. Sure on these NSD servers we had a pretty big pagepool (which we?ve dropped by some), but there still should have been quite a lot of memory space on the nodes ? 
If only someone as going to do a talk in December at the CIUK SSUG on memory usage ? Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 18:19 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 20:43:04 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 12:43:04 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: was the node you rebooted a client or a server that was running kswapd at 100% ? sven On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson wrote: > The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 > I think). 
> > > > So is this memory pressure on the NSD nodes then? I thought it was > documented somewhere that GFPS won?t use more than 50% of the host memory. > > > > And actually if you look at the values for maxStatCache and > maxFilesToCache, the memory footprint is quite small. > > > > Sure on these NSD servers we had a pretty big pagepool (which we?ve > dropped by some), but there still should have been quite a lot of memory > space on the nodes ? > > > > If only someone as going to do a talk in December at the CIUK SSUG on > memory usage ? > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 18:19 > > > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > Hi, > > > > now i need to swap back in a lot of information about GPFS i tried to swap > out :-) > > > > i bet kswapd is not doing anything you think the name suggest here, which > is handling swap space. i claim the kswapd thread is trying to throw > dentries out of the cache and what it tries to actually get rid of are > entries of directories very high up in the tree which GPFS still has a > refcount on so it can't free it. when it does this there is a single thread > (unfortunate was never implemented with multiple threads) walking down the > tree to find some entries to steal, it it can't find any it goes to the > next , next , etc and on a bus system it can take forever to free anything > up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which > i pushed for the weeks before i left IBM. you never see this in a trace > with default traces which is why nobody would have ever suspected this, you > need to set special trace levels to even see this. > > i don't know the exact version the changes went into, but somewhere in the > 5.0.1.X timeframe. the change was separating the cache list to prefer > stealing files before directories, also keep a minimum percentages of > directories in the cache (10 % by default) before it would ever try to get > rid of a directory. it also tries to keep a list of free entries all the > time (means pro active cleaning them) and also allows to go over the hard > limit compared to just block as in previous versions. so i assume you run a > version prior to 5.0.1.x and what you see is kspwapd desperately get rid of > entries, but can't find one its already at the limit so it blocks and > doesn't allow a new entry to be created or promoted from the statcache . > > > > again all this is without source code access and speculation on my part > based on experience :-) > > > > what version are you running and also share mmdiag --stats of that node > > > > sven > > > > > > > > > > > > > > On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: > > Thanks Sven ? > > > > We found a node with kswapd running 100% (and swap was off)? > > > > Killing that node made access to the FS spring into life. > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 16:14 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > 1. are you under memory pressure or even worse started swapping . 
> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 20:44:26 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 12:44:26 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: and i already talk about NUMA stuff at the CIUK usergroup meeting, i won't volunteer for a 2nd advanced topic :-D On Tue, Nov 27, 2018 at 12:43 PM Sven Oehme wrote: > was the node you rebooted a client or a server that was running kswapd at > 100% ? > > sven > > > On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson > wrote: > >> The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 >> I think). >> >> >> >> So is this memory pressure on the NSD nodes then? I thought it was >> documented somewhere that GFPS won?t use more than 50% of the host memory. >> >> >> >> And actually if you look at the values for maxStatCache and >> maxFilesToCache, the memory footprint is quite small. >> >> >> >> Sure on these NSD servers we had a pretty big pagepool (which we?ve >> dropped by some), but there still should have been quite a lot of memory >> space on the nodes ? >> >> >> >> If only someone as going to do a talk in December at the CIUK SSUG on >> memory usage ? >> >> >> >> Simon >> >> >> >> *From: * on behalf of " >> oehmes at gmail.com" >> *Reply-To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Date: *Tuesday, 27 November 2018 at 18:19 >> >> >> *To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Subject: *Re: [gpfsug-discuss] Hanging file-systems >> >> >> >> Hi, >> >> >> >> now i need to swap back in a lot of information about GPFS i tried to >> swap out :-) >> >> >> >> i bet kswapd is not doing anything you think the name suggest here, which >> is handling swap space. i claim the kswapd thread is trying to throw >> dentries out of the cache and what it tries to actually get rid of are >> entries of directories very high up in the tree which GPFS still has a >> refcount on so it can't free it. when it does this there is a single thread >> (unfortunate was never implemented with multiple threads) walking down the >> tree to find some entries to steal, it it can't find any it goes to the >> next , next , etc and on a bus system it can take forever to free anything >> up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which >> i pushed for the weeks before i left IBM. you never see this in a trace >> with default traces which is why nobody would have ever suspected this, you >> need to set special trace levels to even see this. >> >> i don't know the exact version the changes went into, but somewhere in >> the 5.0.1.X timeframe. the change was separating the cache list to prefer >> stealing files before directories, also keep a minimum percentages of >> directories in the cache (10 % by default) before it would ever try to get >> rid of a directory. 
it also tries to keep a list of free entries all the >> time (means pro active cleaning them) and also allows to go over the hard >> limit compared to just block as in previous versions. so i assume you run a >> version prior to 5.0.1.x and what you see is kspwapd desperately get rid of >> entries, but can't find one its already at the limit so it blocks and >> doesn't allow a new entry to be created or promoted from the statcache . >> >> >> >> again all this is without source code access and speculation on my part >> based on experience :-) >> >> >> >> what version are you running and also share mmdiag --stats of that node >> >> >> >> sven >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson >> wrote: >> >> Thanks Sven ? >> >> >> >> We found a node with kswapd running 100% (and swap was off)? >> >> >> >> Killing that node made access to the FS spring into life. >> >> >> >> Simon >> >> >> >> *From: * on behalf of " >> oehmes at gmail.com" >> *Reply-To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Date: *Tuesday, 27 November 2018 at 16:14 >> *To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Subject: *Re: [gpfsug-discuss] Hanging file-systems >> >> >> >> 1. are you under memory pressure or even worse started swapping . >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From constance.rice at us.ibm.com Tue Nov 27 20:28:14 2018 From: constance.rice at us.ibm.com (Constance M Rice) Date: Tue, 27 Nov 2018 20:28:14 +0000 Subject: [gpfsug-discuss] Introduction Message-ID: Hello, I am a new member here. I work for IBM in the Washington System Center supporting Spectrum Scale and ESS across North America. I live in Leesburg, Virginia, USA northwest of Washington, DC. Connie Rice Storage Specialist Washington Systems Center Mobile: 202-821-6747 E-mail: constance.rice at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 56935 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 21:01:07 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 21:01:07 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: <66C52F6F-5193-4DD7-B87E-C88E9ADBB53D@bham.ac.uk> It was an NSD server ? we?d already shutdown all the clients in the remote clusters! And Tomer has already agreed to do a talk on memory ? (but I?m still looking for a user talk if anyone is interested!) 
Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 20:44 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems and i already talk about NUMA stuff at the CIUK usergroup meeting, i won't volunteer for a 2nd advanced topic :-D On Tue, Nov 27, 2018 at 12:43 PM Sven Oehme > wrote: was the node you rebooted a client or a server that was running kswapd at 100% ? sven On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson > wrote: The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 I think). So is this memory pressure on the NSD nodes then? I thought it was documented somewhere that GFPS won?t use more than 50% of the host memory. And actually if you look at the values for maxStatCache and maxFilesToCache, the memory footprint is quite small. Sure on these NSD servers we had a pretty big pagepool (which we?ve dropped by some), but there still should have been quite a lot of memory space on the nodes ? If only someone as going to do a talk in December at the CIUK SSUG on memory usage ? Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 18:19 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. 
Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Thu Nov 29 07:29:36 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Thu, 29 Nov 2018 07:29:36 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. 
November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. 
======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From TOMP at il.ibm.com Thu Nov 29 07:45:00 2018 From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 29 Nov 2018 09:45:00 +0200 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Message-ID: Hi, I remember there was some defect around tsctl and mixed domains - bot sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info ( I believe he had some sample script). Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" To: 'gpfsug main discussion list' Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. Von: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden.
=======================================================================
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/gif
Size: 105 bytes
Desc: not available
URL: 

From Renar.Grunenberg at huk-coburg.de  Thu Nov 29 08:03:34 2018
From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar)
Date: Thu, 29 Nov 2018 08:03:34 +0000
Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first
In-Reply-To: 
References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de>
Message-ID: <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de>

Hallo Tomer,
thanks for this info, but can you explain in which release all these points are fixed now?

Renar Grunenberg
Abteilung Informatik - Betrieb
HUK-COBURG
Bahnhofsplatz
96444 Coburg
Telefon: 09561 96-44110
Telefax: 09561 96-44104
E-Mail: Renar.Grunenberg at huk-coburg.de
Internet: www.huk.de
________________________________
HUK-COBURG Haftpflicht-Unterstützungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg
Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021
Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg
Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin.
Vorstand: Klaus-Jürgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Herøy, Dr. Jörg Rheinländer (stv.), Sarah Rössler, Daniel Thomas.
________________________________
Diese Nachricht enthält vertrauliche und/oder rechtlich geschützte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrtümlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet.
This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden.
________________________________
Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Tomer Perry
Gesendet: Donnerstag, 29. November 2018 08:45
An: gpfsug main discussion list ; Olaf Weiser
Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first

Hi,

I remember there was some defect around tsctl and mixed domains - not sure if it was fixed and in what version.
A workaround in the past was to "wrap" tsctl with a script that would strip those.
Olaf might be able to provide more info (I believe he had some sample script).
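A minimal sketch of that wrapper approach (this is not the sample script referred to above; the renamed binary name tsctl.bin and the stripped suffix remote.example.com are assumptions that have to be adapted to the local setup):

#!/bin/sh
# Hypothetical wrapper installed as /usr/lpp/mmfs/bin/tsctl after the real
# binary has been renamed to /usr/lpp/mmfs/bin/tsctl.bin. For "shownodes up"
# it strips the foreign domain suffix that ends up in the node names;
# every other invocation is passed through unchanged.
if [ "$1" = "shownodes" ] && [ "$2" = "up" ]; then
    /usr/lpp/mmfs/bin/tsctl.bin shownodes up | sed -e 's/\.remote\.example\.com//g'
else
    exec /usr/lpp/mmfs/bin/tsctl.bin "$@"
fi

As Mathias notes further down, the underlying defect (APAR IV93896) is fixed in 5.0.2, so a wrapper like this is only a stop-gap for older levels.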
Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" > To: 'gpfsug main discussion list' > Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list > Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. 
There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From olaf.weiser at de.ibm.com Thu Nov 29 08:39:01 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 29 Nov 2018 09:39:01 +0100 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From MDIETZ at de.ibm.com Thu Nov 29 10:45:25 2018 From: MDIETZ at de.ibm.com (Mathias Dietz) Date: Thu, 29 Nov 2018 11:45:25 +0100 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de><44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> Message-ID: Hi Renar, the tsctl problem is described in APAR IV93896 https://www-01.ibm.com/support/docview.wss?uid=isg1IV93896 You can easily find out if your system has the problem: Run "tsctl shownodes up" and check if the hostnames are valid, if the hostnames are wrong/mixed up then you are affected. This APAR has been fixed with 5.0.2 Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale Development - Release Lead Architect (4.2.x) Spectrum Scale RAS Architect --------------------------------------------------------------------------- IBM Deutschland Am Weiher 24 65451 Kelsterbach Phone: +49 70342744105 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ----------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk WittkoppSitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Olaf Weiser" To: "Grunenberg, Renar" Cc: gpfsug main discussion list Date: 29/11/2018 09:39 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org HI Tomer, send my work around wrapper to Renar.. I've seen to less data to be sure, that's the same (tsctl shownodes ...) issue but he'll try and let us know .. From: "Grunenberg, Renar" To: gpfsug main discussion list , "Olaf Weiser" Date: 11/29/2018 09:04 AM Subject: AW: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Hallo Tomer, thanks for this Info, but can you explain in witch release all these points fixed now? Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. 
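The check Mathias describes can be run directly; a sketch, assuming the standard /usr/lpp/mmfs/bin installation path:

# Compare the node names the daemon layer reports with what the cluster
# configuration knows about; names with a foreign domain suffix appended
# point to the tsctl problem from APAR IV93896.
/usr/lpp/mmfs/bin/tsctl shownodes up
/usr/lpp/mmfs/bin/mmlscluster

If the tsctl output shows mixed-up or doubled domain names, the wrapper sketched earlier or an upgrade to 5.0.2 applies.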
Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. Von: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Tomer Perry Gesendet: Donnerstag, 29. November 2018 08:45 An: gpfsug main discussion list ; Olaf Weiser Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Hi, I remember there was some defect around tsctl and mixed domains - bot sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info ( I believe he had some sample script). Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" To: 'gpfsug main discussion list' Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. 
Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. Von: gpfsug-discuss-bounces at spectrumscale.org[ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. 
This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden.
=======================================================================
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/gif
Size: 105 bytes
Desc: not available
URL: 

From spectrumscale at kiranghag.com  Thu Nov 29 15:42:48 2018
From: spectrumscale at kiranghag.com (KG)
Date: Thu, 29 Nov 2018 21:12:48 +0530
Subject: [gpfsug-discuss] high cpu usage by mmfsadm
Message-ID: 

One of our Scale nodes shows 30-50% CPU utilisation by mmfsadm while the filesystem is being accessed. Is this normal?
(The node is configured as a server node but not a manager node for any filesystem or NSD)
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From bbanister at jumptrading.com  Thu Nov 29 17:57:00 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Thu, 29 Nov 2018 17:57:00 +0000
Subject: [gpfsug-discuss] high cpu usage by mmfsadm
In-Reply-To: 
References: 
Message-ID: <671bbd4db92d496abbbceead1b9a7d5c@jumptrading.com>

I wouldn't call that normal... probably take a gpfs.snap and open a PMR to get the quickest answer from IBM support,
-B

From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of KG
Sent: Thursday, November 29, 2018 9:43 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] high cpu usage by mmfsadm

[EXTERNAL EMAIL]
One of our Scale nodes shows 30-50% CPU utilisation by mmfsadm while the filesystem is being accessed. Is this normal?
(The node is configured as a server node but not a manager node for any filesystem or NSD)
________________________________
Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes.
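For the mmfsadm question above: mmfsadm is normally a short-lived dump utility, so sustained CPU usage usually means something (often a monitoring component) keeps invoking it. A hedged sketch of data worth gathering alongside the gpfs.snap/PMR route Bryan suggests:

# Illustrative commands only - adjust to the environment.
ps -ef | grep '[m]mfsadm'            # which mmfsadm invocations are running, and from which parent
/usr/lpp/mmfs/bin/mmdiag --waiters   # long waiters can explain why the dumps are expensive
/usr/lpp/mmfs/bin/gpfs.snap          # the bundle IBM support will ask for with the PMR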
You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL: From TOMP at il.ibm.com Thu Nov 1 07:37:03 2018 From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 1 Nov 2018 09:37:03 +0200 Subject: [gpfsug-discuss] V5 client limit? In-Reply-To: References: Message-ID: Kristy, If you mean the maximum number of nodes that can mount a filesystem ( which implies on the number of nodes on related clusters) then the number haven't changed since 3.4.0.13 - and its still 16384. Just to clarify, this is the theoretical limit - I don't think anyone tried more then 14-15k nodes. Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Kristy Kallback-Rose To: gpfsug main discussion list Date: 31/10/2018 23:08 Subject: [gpfsug-discuss] V5 client limit? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Can someone tell me the max # of GPFS native clients under 5.x? Everything I can find is dated. Thanks Kristy _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kkr at lbl.gov Thu Nov 1 18:31:41 2018 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Thu, 1 Nov 2018 11:31:41 -0700 Subject: [gpfsug-discuss] V5 client limit? In-Reply-To: References: Message-ID: <58DAFCE0-DECF-4612-8704-81C025069584@lbl.gov> Yes, OK. I was wondering if there was an updated number with v5. That answers it. Thank you, Kristy > On Nov 1, 2018, at 12:37 AM, Tomer Perry wrote: > > Kristy, > > If you mean the maximum number of nodes that can mount a filesystem ( which implies on the number of nodes on related clusters) then the number haven't changed since 3.4.0.13 - and its still 16384. > Just to clarify, this is the theoretical limit - I don't think anyone tried more then 14-15k nodes. > > > Regards, > > Tomer Perry > Scalable I/O Development (Spectrum Scale) > email: tomp at il.ibm.com > 1 Azrieli Center, Tel Aviv 67021, Israel > Global Tel: +1 720 3422758 > Israel Tel: +972 3 9188625 > Mobile: +972 52 2554625 > > > > > From: Kristy Kallback-Rose > To: gpfsug main discussion list > Date: 31/10/2018 23:08 > Subject: [gpfsug-discuss] V5 client limit? > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > Can someone tell me the max # of GPFS native clients under 5.x? Everything I can find is dated. > > Thanks > Kristy > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Thu Nov 1 18:45:35 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 1 Nov 2018 18:45:35 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas Message-ID: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com> Anyone in the Dallas area/familiar that can suggest the best option here: Van shuttle, Train, Uber/Lyft? Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Thu Nov 1 22:40:21 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 1 Nov 2018 22:40:21 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas In-Reply-To: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com> References: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com> Message-ID: <3FE0DD08-C334-4243-B169-9DEBBEBE71EC@rutgers.edu> I?m not going this year, or local to Dallas, but I do travel and have a lot of experience traveling from airports to city centers. If I were going, I?d take the DART Orange Line. Looks like a 52 minute ride ? where you get off probably depends on your hotel, but I put in the convention center here: https://www.google.com/maps/dir/Kay+Bailey+Hutchison+Convention+Center+Dallas,+South+Griffin+Street,+Dallas,+TX/DFW+Terminal+A,+2040+S+International+Pkwy,+Irving,+TX+75063/@32.9109009,-97.0712812,13z/am=t/data=!4m14!4m13!1m5!1m1!1s0x864e991a403efaa9:0xae0261a23eab57d2!2m2!1d-96.8002849!2d32.7743895!1m5!1m1!1s0x864c2a4300afd38d:0x3e0ecb50c933781d!2m2!1d-97.0357045!2d32.9048736!3e3 I don?t personally do business with UBER or Lyft ? I feel like the ?gig economy? is just another way people are getting ripped off and don?t want to be a part of it. > On Nov 1, 2018, at 2:45 PM, Oesterlin, Robert wrote: > > Anyone in the Dallas area/familiar that can suggest the best option here: Van shuttle, Train, Uber/Lyft? > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance -- ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' From babbott at oarc.rutgers.edu Fri Nov 2 03:30:31 2018 From: babbott at oarc.rutgers.edu (Bill Abbott) Date: Fri, 2 Nov 2018 03:30:31 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas In-Reply-To: <3FE0DD08-C334-4243-B169-9DEBBEBE71EC@rutgers.edu> References: <40727979-984E-41C4-A12C-A962DA433D1F@nuance.com> <3FE0DD08-C334-4243-B169-9DEBBEBE71EC@rutgers.edu> Message-ID: <5BDBC4D8.1080003@oarc.rutgers.edu> SuperShuttle is $40-50 round trip, quick, reliable and in pretty much every city. Bill On 11/1/18 6:40 PM, Ryan Novosielski wrote: > I?m not going this year, or local to Dallas, but I do travel and have a lot of experience traveling from airports to city centers. If I were going, I?d take the DART Orange Line. Looks like a 52 minute ride ? 
where you get off probably depends on your hotel, but I put in the convention center here: > > https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.google.com%2Fmaps%2Fdir%2FKay%2BBailey%2BHutchison%2BConvention%2BCenter%2BDallas%2C%2BSouth%2BGriffin%2BStreet%2C%2BDallas%2C%2BTX%2FDFW%2BTerminal%2BA%2C%2B2040%2BS%2BInternational%2BPkwy%2C%2BIrving%2C%2BTX%2B75063%2F%4032.9109009%2C-97.0712812%2C13z%2Fam%3Dt%2Fdata%3D!4m14!4m13!1m5!1m1!1s0x864e991a403efaa9%3A0xae0261a23eab57d2!2m2!1d-96.8002849!2d32.7743895!1m5!1m1!1s0x864c2a4300afd38d%3A0x3e0ecb50c933781d!2m2!1d-97.0357045!2d32.9048736!3e3&data=02%7C01%7Cbabbott%40rutgers.edu%7Ce04f2c06af1440bdd05e08d6407132df%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636767252267150866&sdata=xvKo%2BtQo8sxoDqeOp2ZAtaIedHNu87r4QTuOIMXWoKA%3D&reserved=0 > > I don?t personally do business with UBER or Lyft ? I feel like the ?gig economy? is just another way people are getting ripped off and don?t want to be a part of it. > >> On Nov 1, 2018, at 2:45 PM, Oesterlin, Robert wrote: >> >> Anyone in the Dallas area/familiar that can suggest the best option here: Van shuttle, Train, Uber/Lyft? >> >> Bob Oesterlin >> Sr Principal Storage Engineer, Nuance > > -- > ____ > || \\UTGERS, |---------------------------*O*--------------------------- > ||_// the State | Ryan Novosielski - novosirj at rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus > || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark > `' > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7Cbabbott%40rutgers.edu%7Ce04f2c06af1440bdd05e08d6407132df%7Cb92d2b234d35447093ff69aca6632ffe%7C1%7C0%7C636767252267150866&sdata=CsnXrY0YwZAdQbuJ43GgH9P%2BEKQcWFm6xkg7jX5ySmE%3D&reserved=0 From chris.schlipalius at pawsey.org.au Fri Nov 2 09:37:44 2018 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Fri, 2 Nov 2018 17:37:44 +0800 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas Message-ID: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> Hi all, so I?ve used Super Shuttle booked online for both New Orleans SC round trip and Austin SC just to the hotel, travelling solo and a Sheraton hotel shuttle back to the airport (as a solo travel option, Super is a good price). In Austin for SC my boss actually took the bus to his hotel! For SC18 my colleagues and I will prob pre-book a van transfer as there?s a few of us. Some of the Aussie IBM staff are hiring a car to get to their hotel, so if theres a few who can share, that?s also a good share option if you can park or drop the rental car at or near your hotel. Regards, Chris > On 2 Nov 2018, at 4:02 pm, gpfsug-discuss-request at spectrumscale.org wrote: > > Re: Slightly OT: Getting from DFW to SC17 hotels in Dallas From mark.fellows at stfc.ac.uk Fri Nov 2 11:45:58 2018 From: mark.fellows at stfc.ac.uk (Mark Fellows - UKRI STFC) Date: Fri, 2 Nov 2018 11:45:58 +0000 Subject: [gpfsug-discuss] Hello Message-ID: Hi all, Just introducing myself as a new subscriber to the mailing list. I work at the Hartree Centre within the Science and Technology Facilities Council near Warrington, UK. Our role is to work with industry to promote the use of high performance technologies and data analytics to solve problems and deliver gains in productivity. 
We also support academic researchers in UK based and international science. We have Spectrum Scale installations on linux (x86 for data storage/Power for HPC clusters) and I've recently been involved with deploying and upgrading some small ESS systems. As a relatively new user of SS I may initially have more questions than answers but hope to be able to exchange some thoughts and ideas within the group. Best regards, Mark Mark Fellows HPC Systems Administrator Platforms and Infrastructure Group Telephone - 01925 603413 | Email - mark.fellows at stfc.ac.uk Hartree Centre, Science & Technology Facilities Council Daresbury Laboratory, Keckwick Lane, Daresbury, Warrington, WA4 4AD, UK -------------- next part -------------- An HTML attachment was scrubbed... URL: From damir.krstic at gmail.com Fri Nov 2 15:55:27 2018 From: damir.krstic at gmail.com (Damir Krstic) Date: Fri, 2 Nov 2018 10:55:27 -0500 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> Message-ID: Hi, Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: -V 15.01 (4.2.0.0) File system version I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. Damir On Tue, May 22, 2018 at 10:43 AM wrote: > Hello All, > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month > ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That > is we have not run the mmchconfig release=LATEST command) > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. > All the ?ps hangs? happen when jobs run related to a Java process or many > Java threads (example: GATK ) > The hangs are pretty random, and have no particular pattern except that we > know that it is related to just Java or some jobs reading from directories > with about 600000 files. > > I have raised an IBM critical service request about a month ago related to > this - PMR: 24090,L6Q,000. > However, According to the ticket - they seemed to feel that it might not > be related to GPFS. > Although, we are sure that these hangs started to appear only after we > upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > One of the other reasons we are not able to prove that it is GPFS is > because, we are unable to capture any logs/traces from GPFS once the hang > happens. > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting > difficult to get any dumps from GPFS. > > Also - According to the IBM ticket, they seemed to have a seen a ?ps > hang" issue and we have to run mmchconfig release=LATEST command, and that > will resolve the issue. > However we are not comfortable making the permanent change to Filesystem > version 5. and since we don?t see any near solution to these hangs - we are > thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know > the cluster was stable. > > Can downgrading GPFS take us back to exactly the previous GPFS config > state? > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i > reinstall all rpms to a previous version? 
or is there anything else that i > need to make sure with respect to GPFS configuration? > Because i think that GPFS 5.0 might have updated internal default GPFS > configuration parameters , and i am not sure if downgrading GPFS will > change them back to what they were in GPFS 4.2.3.2 > > Our previous state: > > 2 Storage clusters - 4.2.3.2 > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage clusters ) > > Our current state: > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > 1 Compute cluster - 5.0.0.2 > > Do i need to downgrade all the clusters to go to the previous state ? or > is it ok if we just downgrade the compute cluster to previous version? > > Any advice on the best steps forward, would greatly help. > > Thanks, > > Lohit > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From luke.raimbach at googlemail.com Fri Nov 2 16:24:07 2018 From: luke.raimbach at googlemail.com (Luke Raimbach) Date: Fri, 2 Nov 2018 16:24:07 +0000 Subject: [gpfsug-discuss] RFE: Inode Expansion Message-ID: Dear Spectrum Scale Experts, I would really like to have a callback made available for the file system manager executing an Inode Expansion event. You know, with all the nice variables output, etc. Kind Regards, Luke Raimbach -------------- next part -------------- An HTML attachment was scrubbed... URL: From sveta at cbio.mskcc.org Fri Nov 2 16:09:35 2018 From: sveta at cbio.mskcc.org (Mazurkova, Svetlana/Information Systems) Date: Fri, 2 Nov 2018 12:09:35 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> Message-ID: <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Hi Damir, It was related to specific user jobs and mmap (?). We opened PMR with IBM and have patch from IBM, since than we don?t see issue. Regards, Sveta. > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > Hi, > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > -V 15.01 (4.2.0.0) File system version > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > Damir > > On Tue, May 22, 2018 at 10:43 AM > wrote: > Hello All, > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > I have raised an IBM critical service request about a month ago related to this - PMR: 24090,L6Q,000. > However, According to the ticket - they seemed to feel that it might not be related to GPFS. 
> Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > Also - According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run mmchconfig release=LATEST command, and that will resolve the issue. > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > Our previous state: > > 2 Storage clusters - 4.2.3.2 > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage clusters ) > > Our current state: > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > 1 Compute cluster - 5.0.0.2 > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > Any advice on the best steps forward, would greatly help. > > Thanks, > > Lohit > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Nov 2 16:29:19 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 2 Nov 2018 12:29:19 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Message-ID: Yes, We have upgraded to 5.0.1-0.5, which has the patch for the issue. The related IBM case number was :?TS001010674 Regards, Lohit On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems , wrote: > Hi Damir, > > It was related to specific user jobs and mmap (?). We opened PMR with IBM and have patch from IBM, since than we don?t see issue. > > Regards, > > Sveta. > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > Hi, > > > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > > ?-V ? ? ? ? ? ? ? ? 15.01 (4.2.0.0) ? ? ? ? 
?File system version > > > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > > > Damir > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > Hello All, > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > > > > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > > > > > > > I have raised an IBM critical service request about a month ago related to this -?PMR: 24090,L6Q,000. > > > > However, According to the ticket ?- they seemed to feel that it might not be related to GPFS. > > > > Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > > > > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > > > > > > > Also ?- According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run ?mmchconfig release=LATEST command, and that will resolve the issue. > > > > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > > > > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > > > > > > > Our previous state: > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > 1 Compute cluster - 4.2.3.2 ?( remote mounts the above 2 storage clusters ) > > > > > > > > Our current state: > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > > > > > > > Any advice on the best steps forward, would greatly help. 
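For the version questions quoted above, a few read-only commands show where a cluster actually stands before deciding on mmchconfig release=LATEST or a downgrade (a sketch; gpfs0 is a placeholder device name):

/usr/lpp/mmfs/bin/mmdiag --version             # daemon build actually running on this node
/usr/lpp/mmfs/bin/mmlsconfig minReleaseLevel   # cluster-wide level, raised by mmchconfig release=LATEST
/usr/lpp/mmfs/bin/mmlsfs gpfs0 -V              # on-disk file system version, raised only by mmchfs -V

None of these commands change anything, so they are safe to run on a live cluster.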
> > > > > > > > Thanks, > > > > > > > > Lohit > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Nov 2 16:31:12 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 2 Nov 2018 12:31:12 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Message-ID: <5469f6aa-3f82-47b2-8b82-a599edfa2f16@Spark> Also - You could just upgrade one of the clients to this version, and test to see if the hang still occurs. You do not have to upgrade the NSD servers, to test. Regards, Lohit On Nov 2, 2018, 12:29 PM -0400, valleru at cbio.mskcc.org, wrote: > Yes, > > We have upgraded to 5.0.1-0.5, which has the patch for the issue. > The related IBM case number was :?TS001010674 > > Regards, > Lohit > > On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems , wrote: > > Hi Damir, > > > > It was related to specific user jobs and mmap (?). We opened PMR with IBM and have patch from IBM, since than we don?t see issue. > > > > Regards, > > > > Sveta. > > > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > > > Hi, > > > > > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > > > ?-V ? ? ? ? ? ? ? ? 15.01 (4.2.0.0) ? ? ? ? ?File system version > > > > > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > > > > > Damir > > > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > > Hello All, > > > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > > > > > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > > > > > > > > > I have raised an IBM critical service request about a month ago related to this -?PMR: 24090,L6Q,000. > > > > > However, According to the ticket ?- they seemed to feel that it might not be related to GPFS. > > > > > Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. 
> > > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > > > > > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > > > > > > > > > Also ?- According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run ?mmchconfig release=LATEST command, and that will resolve the issue. > > > > > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > > > > > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > > > > > > > > > Our previous state: > > > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > > 1 Compute cluster - 4.2.3.2 ?( remote mounts the above 2 storage clusters ) > > > > > > > > > > Our current state: > > > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > > > > > > > > > Any advice on the best steps forward, would greatly help. > > > > > > > > > > Thanks, > > > > > > > > > > Lohit > > > > > _______________________________________________ > > > > > gpfsug-discuss mailing list > > > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Sat Nov 3 20:21:50 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Sat, 3 Nov 2018 20:21:50 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas In-Reply-To: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> References: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> Message-ID: <09CC2A72-2D2C-4722-87CB-A4B1093D90BC@rutgers.edu> I took the bus back to the airport in Austin (the Airport Flyer). Was a good experience. If Austin is the city I?m thinking of, I took SuperShuttle to the hotel (I believe because I arrived late at night) and was the fourth hotel that got dropped off, which roughly doubled the trip time. There is that risk with the shared-ride shuttles. In recent years, the only location without a solid public transit option was New Orleans (I used it anyway). They have an express airport bus, but the hours and frequency are not ideal (there?s a local bus as well, which is quite a bit slower). 
SLC had good light rail service, Denver has good rail service from the airport to downtown, and Atlanta has good subway service (all of these I?ve used before). Typically the transit option is less than $10 round-trip (Denver?s is above-average at $9 each way), sometimes even less than $5. > On Nov 2, 2018, at 5:37 AM, Chris Schlipalius wrote: > > Hi all, so I?ve used Super Shuttle booked online for both New Orleans SC round trip and Austin SC just to the hotel, travelling solo and a Sheraton hotel shuttle back to the airport (as a solo travel option, Super is a good price). > In Austin for SC my boss actually took the bus to his hotel! > > For SC18 my colleagues and I will prob pre-book a van transfer as there?s a few of us. > Some of the Aussie IBM staff are hiring a car to get to their hotel, so if theres a few who can share, that?s also a good share option if you can park or drop the rental car at or near your hotel. > > Regards, Chris > >> On 2 Nov 2018, at 4:02 pm, gpfsug-discuss-request at spectrumscale.org wrote: >> >> Re: Slightly OT: Getting from DFW to SC17 hotels in Dallas > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From henrik.cednert at filmlance.se Tue Nov 6 06:23:44 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 06:23:44 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Message-ID: Hi there For some reason my mail didn?t get through. Trying again. Apologies if there's duplicates... The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. 
root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From henrik.cednert at filmlance.se Mon Nov 5 20:25:13 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Mon, 5 Nov 2018 20:25:13 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Message-ID: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Hi there The welcome mail said that a brief introduction might be in order. 
So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... 
Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From viccornell at gmail.com Tue Nov 6 09:35:27 2018 From: viccornell at gmail.com (Vic Cornell) Date: Tue, 6 Nov 2018 09:35:27 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> References: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell > On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) wrote: > > Hi there > > The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. > > But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. > > I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. 
> > I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( > > Versions involved: > Windows 10 Enterprise 2016 LTSB > IBM GPFS Express Edition 4.1.0.4 > IBM GPFS Express Edition License and Prerequisites 4.1 > IBM GPFS GSKit 8.0.0.32 > > Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. > > Cheers and many thanks in advance for helping me out here. I?m all ears. > > > root at M5-CLIPSTER02 ~ > $ cat /var/adm/ras/mmfs.log.latest > Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting > Removing old /var/adm/ras/mmfs.log.* files: > mmtrace: The tracefmt.exe or tracelog.exe command can not be found. > mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. > Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. > Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license > Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... > Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... > Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... > Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. > Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. > Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... > Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... > Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... > Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... > Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... > Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... > Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... > Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... > Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... > Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... > Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... > Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... > Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... > Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... > Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... > Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs > Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs > Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. > Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready > Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all > Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 > > > > > -- > Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance > Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? > > Disclaimer > > > The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. 
If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Tue Nov 6 10:08:03 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Tue, 6 Nov 2018 10:08:03 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: An HTML attachment was scrubbed... URL: From henrik.cednert at filmlance.se Mon Nov 5 20:21:10 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Mon, 5 Nov 2018 20:21:10 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Message-ID: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. 
{Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: From lgayne at us.ibm.com Tue Nov 6 13:46:34 2018 From: lgayne at us.ibm.com (Lyle Gayne) Date: Tue, 6 Nov 2018 08:46:34 -0500 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: Message-ID: Vipul or Heather should be able to assist. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/06/2018 07:00 AM Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? 
Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. 
Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ? \_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 6 13:52:03 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 6 Nov 2018 13:52:03 +0000 Subject: [gpfsug-discuss] Spectrum Scale and Firewalls In-Reply-To: References: <10239ED8-7E0D-4420-8BEC-F17F0606BE64@bham.ac.uk> Message-ID: Just to close the loop on this, IBM support confirmed it?s a bug in mmnetverify and will be fixed in a later PTF. (I didn?t feel the need for an EFIX for this) Simon From: on behalf of Simon Thompson Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 19 October 2018 at 14:39 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Spectrum Scale and Firewalls Yeah we have the perfmon ports open, and GUI ports open on the GUI nodes. But basically this is just a storage cluster and everything else (protocols etc) run in remote clusters. I?ve just opened a ticket ? no longer a PMR in the new support centre for Scale Simon From: on behalf of "knop at us.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 19 October 2018 at 14:05 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Spectrum Scale and Firewalls Simon, Depending on what functions are being used in Scale, other ports may also get used, as documented in https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_firewall.htm On the other hand, I'd initially speculate that you might be hitting a problem in mmnetverify itself. (perhaps some aspect in mmnetverify is not taking into account that ports other than 22, 1191, 60000-61000 may be getting blocked by the firewall) Could you open a PMR for this one? Thanks, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Simon Thompson ---10/19/2018 06:41:27 AM---Hi, We?re having some issues bringing up firewalls on som]Simon Thompson ---10/19/2018 06:41:27 AM---Hi, We?re having some issues bringing up firewalls on some of our NSD nodes. The problem I was actua From: Simon Thompson To: "gpfsug-discuss at spectrumscale.org" Date: 10/19/2018 06:41 AM Subject: [gpfsug-discuss] Spectrum Scale and Firewalls Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, We?re having some issues bringing up firewalls on some of our NSD nodes. 
The problem I was actually trying to diagnose I don't think is firewall related, but still ...

We have port 22 and 1191 open and also 60000-61000, and we also set:

# mmlsconfig tscTcpPort
tscTcpPort 1191
# mmlsconfig tscCmdPortRange
tscCmdPortRange 60000-61000

https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_firewallforinternalcommn.htm

Claims this is sufficient ...

Running mmnetverify:

# mmnetverify all --target-nodes rds-er-mgr01 rds-pg-mgr01
checking local configuration.
Operation interface: Success.
rds-pg-mgr01 checking communication with node rds-er-mgr01.
Operation resolution: Success.
Operation ping: Success.
Operation shell: Success.
Operation copy: Success.
Operation time: Success.
Operation daemon-port: Success.
Operation sdrserv-port: Success.
Operation tsccmd-port: Success.
Operation data-small: Success.
Operation data-medium: Success.
Operation data-large: Success.
Could not connect to port 46326 on node rds-pg-mgr01 (10.20.0.56): timed out.
This may indicate a firewall configuration issue.
Operation bandwidth-node: Fail.
rds-pg-mgr01 checking cluster communications.
Issues Found:
rds-er-mgr01 could not connect to rds-pg-mgr01 (TCP, port 46326).
mmnetverify: Command failed. Examine previous error messages to determine cause.

Note that the port number mentioned changes if we run mmnetverify multiple times. The two clients in this test are running 5.0.2 code. If I run in verbose mode I see:

Checking network communication with node rds-er-mgr01.
Port range restricted by cluster configuration: 60000 - 61000.
rds-er-mgr01: connecting to node rds-pg-mgr01.
rds-er-mgr01: exchanged 256.0M bytes with rds-pg-mgr01.
Write size: 16.0M bytes.
Network statistics for rds-er-mgr01 during data exchange:
packets sent: 68112
packets received: 72452
Network Traffic between rds-er-mgr01 and rds-pg-mgr01 port 60000 ok.
Operation data-large: Success.
Checking network bandwidth.
rds-er-mgr01: connecting to node rds-pg-mgr01.
Could not connect to port 36277 on node rds-pg-mgr01 (10.20.0.56): timed out.
This may indicate a firewall configuration issue.
Operation bandwidth-node: Fail.

So for many of the tests it looks like it's using port 60000 as expected. Is this just a bug in mmnetverify or am I doing something silly?

Thanks

Simon
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From henrik.cednert at filmlance.se Tue Nov 6 11:25:57 2018
From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance))
Date: Tue, 6 Nov 2018 11:25:57 +0000
Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts.
In-Reply-To:
References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se>
Message-ID: <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se>

Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what's included in maintenance and what's not.

--
Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance
Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ¯\_(ツ)_/¯
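
A quick way to confirm what a node and the cluster are actually running before and after that kind of upgrade is sketched below. It only uses standard mm commands; the file system device name ddnnas0 is taken from earlier in this thread, so adjust it to your own.

# Daemon build installed on this node
mmdiag --version
# Minimum release level the cluster is committed to (raised by mmchconfig release=LATEST)
mmlsconfig minReleaseLevel
# On-disk format version of the file system (raised separately, with mmchfs -V)
mmlsfs ddnnas0 -V

The last two values are worth checking before committing an upgrade, since raising them is effectively a one-way step.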
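
And as a footnote to the firewall exchange above, a minimal sketch of opening the ports that thread relies on. It assumes firewalld with its default zone (the thread does not say which firewall front end is in use) and simply mirrors the tscTcpPort and tscCmdPortRange values Simon quotes:

# Sketch only - firewalld and its default zone are assumptions
firewall-cmd --permanent --add-service=ssh
firewall-cmd --permanent --add-port=1191/tcp
firewall-cmd --permanent --add-port=60000-61000/tcp
firewall-cmd --reload
# Verify what is now open
firewall-cmd --list-ports

GUI, perfmon and protocol traffic need additional ports, as the Knowledge Center firewall page linked in that thread describes.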
On 6 Nov 2018, at 11:08, Daniel Kidger > wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) > wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. 
Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From lgayne at us.ibm.com Tue Nov 6 14:02:48 2018 From: lgayne at us.ibm.com (Lyle Gayne) Date: Tue, 6 Nov 2018 09:02:48 -0500 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com><45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> Message-ID: Yes, Henrik. For information on which OS levels are supported at which Spectrum Scale release levels, you should always consult our Spectrum Scale FAQ. This info is in Section 2 or 3 of the FAQ. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: gpfsug main discussion list Date: 11/06/2018 09:00 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ? \_(?)_/? On 6 Nov 2018, at 11:08, Daniel Kidger wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) < henrik.cednert at filmlance.se> wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. 
Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. 
Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From henrik.cednert at filmlance.se Tue Nov 6 14:12:27 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 14:12:27 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com><45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se>, Message-ID: Hello Ah yes, I never thought I was an issue since DDN sent me the v4 installer. Now I know better. Cheers -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? On 6 Nov 2018, at 15:02, Lyle Gayne > wrote: Yes, Henrik. For information on which OS levels are supported at which Spectrum Scale release levels, you should always consult our Spectrum Scale FAQ. This info is in Section 2 or 3 of the FAQ. Thanks, Lyle "Henrik Cednert (Filmlance)" ---11/06/2018 09:00:15 AM---Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to From: "Henrik Cednert (Filmlance)" > To: gpfsug main discussion list > Date: 11/06/2018 09:00 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? 
On 6 Nov 2018, at 11:08, Daniel Kidger > wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) > wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. 
Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From vpaul at us.ibm.com Tue Nov 6 16:54:38 2018 From: vpaul at us.ibm.com (Vipul Paul) Date: Tue, 6 Nov 2018 08:54:38 -0800 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: Message-ID: Hello Henrik, I see that you are trying GPFS 4.1.0.4 on Windows 10. This will not work. You need to upgrade to GPFS 5.0.2 as that is the first release that supports Windows 10. Please see the FAQ https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows "Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Thanks. -- Vipul Paul | IBM Spectrum Scale (GPFS) Development | vpaul at us.ibm.com | (503) 747-1389 (tie 997) From: Lyle Gayne/Poughkeepsie/IBM To: gpfsug main discussion list Cc: Vipul Paul/Portland/IBM, Heather J MacPherson/Beaverton/IBM at IBMUS Date: 11/06/2018 05:46 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Vipul or Heather should be able to assist. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/06/2018 07:00 AM Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. 
Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_( ?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. 
If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From henrik.cednert at filmlance.se Wed Nov 7 06:31:45 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Wed, 7 Nov 2018 06:31:45 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable Message-ID: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Hello I will try my luck here. Trying to monitor capacity on our gpfs system via observium. For some reason hrStorageTable doesn?t pick up that gpfs mount point though. In diskTable it?s visible but I cannot use diskTable when monitoring via observium, has to be hrStorageTable (I was told by observium dev). Output of a few snmpwalks and more at the bottom. Are there any obvious reasons for Centos 6.7 to not pick up a gpfs mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not sure if it?s even possible to in some way force it to include it in hrStorageTable?? Apologies if this isn?t the list for questions like this. But feels like there has to be one or two peeps here monitoring their systems here. =) All these commands ran on that host: df -h | grep ddnnas0 /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 mount | grep ddnnas0 /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) snmpwalk -v2c -c secret localhost hrStorageDescr HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy snmpwalk -v2c -c secret localhost dskPath UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 UCD-SNMP-MIB::dskPath.2 = STRING: / yum list | grep net-snmp Failed to set locale, defaulting to C net-snmp.x86_64 1:5.5-60.el6 @base net-snmp-libs.x86_64 1:5.5-60.el6 @base net-snmp-perl.x86_64 1:5.5-60.el6 @base net-snmp-utils.x86_64 1:5.5-60.el6 @base net-snmp-devel.i686 1:5.5-60.el6 base net-snmp-devel.x86_64 1:5.5-60.el6 base net-snmp-libs.i686 1:5.5-60.el6 base net-snmp-python.x86_64 1:5.5-60.el6 base Cheers and thanks -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. 
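An alternative that comes up in the replies below is the Spectrum Scale SNMP subagent, which publishes cluster, file system, NSD and disk status through its own GPFS MIB rather than through hrStorageTable. A minimal sketch of that setup, assuming a single Net-SNMP collector node and using a hypothetical node name ces-mon-01 (exact paths and options should be verified against the Knowledge Center for your release):

# allow AgentX subagents on the collector node, then restart snmpd
echo "master agentx" >> /etc/snmp/snmpd.conf
service snmpd restart

# make the GPFS MIB visible to management tools (usual location, verify on your install)
cp /usr/lpp/mmfs/data/GPFS-MIB.txt /usr/share/snmp/mibs/

# designate the collector node; this starts the GPFS SNMP subagent there
mmchnode --snmp-agent -N ces-mon-01

# remove it again with
mmchnode --nosnmp-agent -N ces-mon-01

Whether Observium can graph the GPFS MIB instead of hrStorageTable would still be a question for the Observium developers, so this is a possible workaround rather than a confirmed fix.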
-------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Wed Nov 7 08:13:04 2018 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 7 Nov 2018 08:13:04 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: An HTML attachment was scrubbed... URL: From janfrode at tanso.net Wed Nov 7 11:20:37 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 7 Nov 2018 12:20:37 +0100 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: Looking at the CHANGELOG for net-snmp, it seems it needs to know about each filesystem it's going to support, and I see no GPFS/mmfs. It has entries like: - Added simfs (OpenVZ filesystem) to hrStorageTable and hrFSTable. - Added CVFS (CentraVision File System) to hrStorageTable and - Added OCFS2 (Oracle Cluster FS) to hrStorageTable and hrFSTable - report gfs filesystems in hrStorageTable and hrFSTable. and also it didn't understand filesystems larger than 8 TB before version 5.7. I think your best option is to look at implementing the GPFS snmp agent agent https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_snmp.htm -- and see if it provides the data you need -- but it most likely won't affect the hrStorage table. And of course, please upgrade to something newer than v4.1.x. There's lots of improvements on monitoring in v4.2.3 and v5.x (but beware that v5 doesn't work with RHEL6). -jf On Wed, Nov 7, 2018 at 9:05 AM Henrik Cednert (Filmlance) < henrik.cednert at filmlance.se> wrote: > Hello > > I will try my luck here. Trying to monitor capacity on our gpfs system via > observium. For some reason hrStorageTable doesn?t pick up that gpfs mount > point though. In diskTable it?s visible but I cannot use diskTable when > monitoring via observium, has to be hrStorageTable (I was told by observium > dev). Output of a few snmpwalks and more at the bottom. > > Are there any obvious reasons for Centos 6.7 to not pick up a gpfs > mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not > sure if it?s even possible to in some way force it to include it in > hrStorageTable?? > > Apologies if this isn?t the list for questions like this. But feels like > there has to be one or two peeps here monitoring their systems here. 
=) > > > All these commands ran on that host: > > df -h | grep ddnnas0 > /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 > > > mount | grep ddnnas0 > /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) > > > snmpwalk -v2c -c secret localhost hrStorageDescr > HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory > HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory > HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers > HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory > HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space > HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / > HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm > HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot > HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy > HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash > HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy > HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var > HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy > > > snmpwalk -v2c -c secret localhost dskPath > UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 > UCD-SNMP-MIB::dskPath.2 = STRING: / > > > yum list | grep net-snmp > Failed to set locale, defaulting to C > net-snmp.x86_64 1:5.5-60.el6 > @base > net-snmp-libs.x86_64 1:5.5-60.el6 > @base > net-snmp-perl.x86_64 1:5.5-60.el6 > @base > net-snmp-utils.x86_64 1:5.5-60.el6 > @base > net-snmp-devel.i686 1:5.5-60.el6 > base > net-snmp-devel.x86_64 1:5.5-60.el6 > base > net-snmp-libs.i686 1:5.5-60.el6 > base > net-snmp-python.x86_64 1:5.5-60.el6 > base > > > Cheers and thanks > > -- > Henrik Cednert */ * + 46 704 71 89 54 */* CTO */ * *Filmlance* > Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. > ?\_(?)_/? > > *Disclaimer* > > The information contained in this communication from the sender is > confidential. It is intended solely for use by the recipient and others > authorized to receive it. If you are not the recipient, you are hereby > notified that any disclosure, copying, distribution or taking action in > relation of the contents of this information is strictly prohibited and may > be unlawful. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From janfrode at tanso.net Wed Nov 7 11:29:11 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 7 Nov 2018 12:29:11 +0100 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: Looks like this is all it should take to add GPFS support to net-snmp: $ git diff diff --git a/agent/mibgroup/hardware/fsys/fsys_mntent.c b/agent/mibgroup/hardware/fsys/fsys_mntent.c index 62e2953..4950879 100644 --- a/agent/mibgroup/hardware/fsys/fsys_mntent.c +++ b/agent/mibgroup/hardware/fsys/fsys_mntent.c @@ -136,6 +136,7 @@ _fsys_type( char *typename ) else if ( !strcmp(typename, MNTTYPE_TMPFS) || !strcmp(typename, MNTTYPE_GFS) || !strcmp(typename, MNTTYPE_GFS2) || + !strcmp(typename, MNTTYPE_GPFS) || !strcmp(typename, MNTTYPE_XFS) || !strcmp(typename, MNTTYPE_JFS) || !strcmp(typename, MNTTYPE_VXFS) || diff --git a/agent/mibgroup/hardware/fsys/mnttypes.h b/agent/mibgroup/hardware/fsys/mnttypes.h index bb1b401..d3f0c60 100644 --- a/agent/mibgroup/hardware/fsys/mnttypes.h +++ b/agent/mibgroup/hardware/fsys/mnttypes.h @@ -121,6 +121,9 @@ #ifndef MNTTYPE_GFS2 #define MNTTYPE_GFS2 "gfs2" #endif +#ifndef MNTTYPE_GPFS +#define MNTTYPE_GPFS "gpfs" +#endif #ifndef MNTTYPE_XFS #define MNTTYPE_XFS "xfs" #endif On Wed, Nov 7, 2018 at 12:20 PM Jan-Frode Myklebust wrote: > Looking at the CHANGELOG for net-snmp, it seems it needs to know about > each filesystem it's going to support, and I see no GPFS/mmfs. It has > entries like: > > - Added simfs (OpenVZ filesystem) to hrStorageTable and hrFSTable. > - Added CVFS (CentraVision File System) to hrStorageTable and > - Added OCFS2 (Oracle Cluster FS) to hrStorageTable and hrFSTable > - report gfs filesystems in hrStorageTable and hrFSTable. > > > and also it didn't understand filesystems larger than 8 TB before version > 5.7. > > I think your best option is to look at implementing the GPFS snmp agent > agent > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_snmp.htm > -- and see if it provides the data you need -- but it most likely won't > affect the hrStorage table. > > And of course, please upgrade to something newer than v4.1.x. There's lots > of improvements on monitoring in v4.2.3 and v5.x (but beware that v5 > doesn't work with RHEL6). > > > -jf > > On Wed, Nov 7, 2018 at 9:05 AM Henrik Cednert (Filmlance) < > henrik.cednert at filmlance.se> wrote: > >> Hello >> >> I will try my luck here. Trying to monitor capacity on our gpfs system >> via observium. For some reason hrStorageTable doesn?t pick up that gpfs >> mount point though. In diskTable it?s visible but I cannot use diskTable >> when monitoring via observium, has to be hrStorageTable (I was told by >> observium dev). Output of a few snmpwalks and more at the bottom. >> >> Are there any obvious reasons for Centos 6.7 to not pick up a gpfs >> mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not >> sure if it?s even possible to in some way force it to include it in >> hrStorageTable?? >> >> Apologies if this isn?t the list for questions like this. But feels like >> there has to be one or two peeps here monitoring their systems here. 
=) >> >> >> All these commands ran on that host: >> >> df -h | grep ddnnas0 >> /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 >> >> >> mount | grep ddnnas0 >> /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) >> >> >> snmpwalk -v2c -c secret localhost hrStorageDescr >> HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory >> HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory >> HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers >> HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory >> HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space >> HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / >> HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm >> HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot >> HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy >> HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash >> HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy >> HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var >> HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy >> >> >> snmpwalk -v2c -c secret localhost dskPath >> UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 >> UCD-SNMP-MIB::dskPath.2 = STRING: / >> >> >> yum list | grep net-snmp >> Failed to set locale, defaulting to C >> net-snmp.x86_64 1:5.5-60.el6 >> @base >> net-snmp-libs.x86_64 1:5.5-60.el6 >> @base >> net-snmp-perl.x86_64 1:5.5-60.el6 >> @base >> net-snmp-utils.x86_64 1:5.5-60.el6 >> @base >> net-snmp-devel.i686 1:5.5-60.el6 >> base >> net-snmp-devel.x86_64 1:5.5-60.el6 >> base >> net-snmp-libs.i686 1:5.5-60.el6 >> base >> net-snmp-python.x86_64 1:5.5-60.el6 >> base >> >> >> Cheers and thanks >> >> -- >> Henrik Cednert */ * + 46 704 71 89 54 */* CTO */ * *Filmlance* >> Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. >> ?\_(?)_/? >> >> *Disclaimer* >> >> The information contained in this communication from the sender is >> confidential. It is intended solely for use by the recipient and others >> authorized to receive it. If you are not the recipient, you are hereby >> notified that any disclosure, copying, distribution or taking action in >> relation of the contents of this information is strictly prohibited and may >> be unlawful. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Nov 7 13:02:32 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 7 Nov 2018 13:02:32 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: <5da641cc-a171-2d9b-f917-a4470279237f@strath.ac.uk> On 07/11/2018 11:20, Jan-Frode Myklebust wrote: [SNIP] > > And of course, please upgrade to something newer than v4.1.x. There's > lots of improvements on monitoring in v4.2.3 and v5.x (but beware that > v5 doesn't work with RHEL6). > I would suggest that getting off CentOS 6.7 to more recent release should also be a priority. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From aaron.s.knister at nasa.gov Wed Nov 7 23:37:37 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 7 Nov 2018 18:37:37 -0500 Subject: [gpfsug-discuss] Unexpected data in message/Bad message Message-ID: We're experiencing client nodes falling out of the cluster with errors that look like this: ?Tue Nov 6 15:10:34.939 2018: [E] Unexpected data in message. Header dump: 00000000 0000 0000 00000047 00000000 00 00 0000 00000000 00000000 0000 0000 Tue Nov 6 15:10:34.942 2018: [E] [0/0] 512 more bytes were available: Tue Nov 6 15:10:34.965 2018: [N] Close connection to 10.100.X.X nsdserver1 (Unexpected error 120) Tue Nov 6 15:10:34.966 2018: [E] Network error on 10.100.X.X nsdserver1 , Check connectivity Tue Nov 6 15:10:36.726 2018: [N] Restarting mmsdrserv Tue Nov 6 15:10:38.850 2018: [E] Bad message Tue Nov 6 15:10:38.851 2018: [X] The mmfs daemon is shutting down abnormally. Tue Nov 6 15:10:38.852 2018: [N] mmfsd is shutting down. Tue Nov 6 15:10:38.853 2018: [N] Reason for shutdown: LOGSHUTDOWN called The cluster is running various PTF Levels of 4.1.1. Has anyone seen this before? I'm struggling to understand what it means from a technical point of view. Was GPFS expecting a larger message than it received? Did it receive all of the bytes it expected and some of it was corrupt? It says "512 more bytes were available" but then doesn't show any additional bytes. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Robert.Oesterlin at nuance.com Thu Nov 8 20:40:05 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 8 Nov 2018 20:40:05 +0000 Subject: [gpfsug-discuss] SSUG @ SC18 - Location details Message-ID: <07491620-1D44-4A9C-9C92-A7DA634304CE@nuance.com> Location: Omni Dallas Hotel 555 S Lamar Dallas, Texas 75202 United States The Omni is connected to Kay Bailey Convention Center via skybridge on 2nd Floor. Dallas Ballroom A - 3rd Floor IBM Spectrum Scale User Group Meeting Sunday, November 11, 2018 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From heiner.billich at psi.ch Fri Nov 9 12:46:31 2018 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Fri, 9 Nov 2018 12:46:31 +0000 Subject: [gpfsug-discuss] CES - samba - how can I disable shadow_copy2, i.e. snapshots Message-ID: Hello, we run CES with smbd on a filesystem _without_ snapshots. I would like to completely remove the shadow_copy2 vfs object in samba which exposes the snapshots to windows clients: We don't offer snapshots as service to clients and if I create a snapshot I don't want it to be exposed to clients. I'm also not sure how much additional directory traversals this vfs object causes, shadow_copy2 has to search for the snapshot directories again and again, just to learn that there are no snapshots available. Now the file samba_registry.def (/usr/lpp/mmfs/share/samba/samba_registry.def) doesn't allow to change the settings for shadow_config2 in samba's configuration. Hm, is it o.k. to edit samba_registry.def? That's probably not what IBM intended. But with mmsnapdir I can change the name of the snapshot directories, which would require me to edit the locked settings, too, so it seems a bit restrictive. I didn?t search all documentation, if there is an option do disable shadow_copy2 with some command I would be happy to learn. Any comments or ideas are welcome. 
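For reference, a rough way to double-check what the CES Samba layer is actually loading before touching anything; the command names below are from memory and should be verified against the 5.0.1 documentation, so treat this as a sketch rather than a procedure:

# list the global SMB options CES has registered
mmsmb config list

# the registry configuration can also be dumped with the Samba net tool shipped with CES
/usr/lpp/mmfs/bin/net conf list | grep -i 'vfs objects'

# query the current snapshot directory name and placement for a file system
mmsnapdir <filesystem> -q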
Also if you think I should just create a bogus .snapdirs at root level to get rid of the error messages and that's it, please let me know. we run scale 5.0.1-1 on RHEL4 x86_64. We will upgrade to 5.0.2-1 soon, but I didn?t' check that version yet. Cheers Heiner Billich What I would like to change in samba's configuration: 52c52 < vfs objects = syncops gpfs fileid time_audit --- > vfs objects = shadow_copy2 syncops gpfs fileid time_audit 72a73,76 > shadow:snapdir = .snapshots > shadow:fixinodes = yes > shadow:snapdirseverywhere = yes > shadow:sort = desc -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch From jonathan.buzzard at strath.ac.uk Fri Nov 9 13:26:50 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 9 Nov 2018 13:26:50 +0000 Subject: [gpfsug-discuss] CES - samba - how can I disable shadow_copy2, i.e. snapshots In-Reply-To: References: Message-ID: <82e1aa3d-566a-3a4a-e841-9f92f30546c6@strath.ac.uk> On 09/11/2018 12:46, Billich Heinrich Rainer (PSI) wrote: > Hello, > > we run CES with smbd on a filesystem _without_ snapshots. I would > like to completely remove the shadow_copy2 vfs object in samba which > exposes the snapshots to windows clients: > > We don't offer snapshots as service to clients and if I create a > snapshot I don't want it to be exposed to clients. I'm also not sure > how much additional directory traversals this vfs object causes, > shadow_copy2 has to search for the snapshot directories again and > again, just to learn that there are no snapshots available. > The shadow_copy2 VFS only exposes snapshots to clients if they are in a very specific format. The chances of you doing this with "management" snapshots you are creating are about ?, assuming you are using the command line. If you are using the GUI then all bets are off. Perhaps someone with experience of the GUI can add their wisdom here. The VFS even if loaded will only create I/O on the server if the client clicks on previous versions tab in Windows Explorer. Given that you don't offer previous version snapshots, then there will be very little of this going on and even if they do then the initial amount of I/O will be limited to basically the equivalent of an `ls` in the shadow copy snapshot directory. So absolutely nothing to get worked up about. With the proviso about doing snapshots from the GUI (never used the new GUI in GPFS, only played with the old one, and don't trust IBM to change it again) you are completely overthinking this. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Robert.Oesterlin at nuance.com Fri Nov 9 14:07:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 9 Nov 2018 14:07:01 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Message-ID: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. 
Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Nov 9 14:13:19 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 9 Nov 2018 09:13:19 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- Stephen > On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: > > Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. > > ESS 5.3.1.1 > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 17.0 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Fri Nov 9 16:02:12 2018 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 9 Nov 2018 11:02:12 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 ("4.2.3" may not get recognized by the command) The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Stephen Ulmer To: gpfsug main discussion list Date: 11/09/2018 09:13 AM Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Sent by: gpfsug-discuss-bounces at spectrumscale.org It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- Stephen On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. 
ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From ulmer at ulmer.org Fri Nov 9 16:08:17 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 9 Nov 2018 11:08:17 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: <8EDF9168-698B-4EA7-9D2A-F33D7B8AF265@ulmer.org> You rock. -- Stephen > On Nov 9, 2018, at 11:02 AM, Felipe Knop > wrote: > > Stephen, Bob, > > A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: > 4.2.3.0 and 4.2.3.9 > > ("4.2.3" may not get recognized by the command) > > The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) > > Felipe > > ---- > Felipe Knop knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 T/L 293-9314 > > > > Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 11/09/2018 09:13 AM > Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > It had better work ? I?m literally going to be doing exactly the same thing in two weeks? > > -- > Stephen > > > On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: > > Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. > > ESS 5.3.1.1 > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 > mmvdisk: Creating file system 'test'. 
> mmvdisk: (mmcrfs) Incorrect option: --version 17.0 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Nov 9 16:11:05 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 9 Nov 2018 16:11:05 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Message-ID: <6951D57C-3714-4F26-A7AD-92B8D79501EC@nuance.com> That did it, thanks. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Felipe Knop Reply-To: gpfsug main discussion list Date: Friday, November 9, 2018 at 10:04 AM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Nov 9 16:12:02 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 9 Nov 2018 16:12:02 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: Looking in ?mmprodname? looks like if you wanted to use 17, it would be 1700 (for 1709 based on what Felipe mentions below). (I wonder what 99.0.0.0 does ?) Simon From: on behalf of "knop at us.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 9 November 2018 at 16:02 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 ("4.2.3" may not get recognized by the command) The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exac]Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- From: Stephen Ulmer To: gpfsug main discussion list Date: 11/09/2018 09:13 AM Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ It had better work ? I?m literally going to be doing exactly the same thing in two weeks? 
-- Stephen On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 106 bytes Desc: image001.gif URL: From bzhang at ca.ibm.com Fri Nov 9 16:37:08 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Fri, 9 Nov 2018 11:37:08 -0500 Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: Hi all, We are going to host our next technical webinar. everyone is welcome to register and attend. Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? ?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B383659.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B754293.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B720798.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B162231.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B680907.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B309846.gif Type: image/gif Size: 1243 bytes Desc: not available URL: From jonbernard at gmail.com Sat Nov 10 20:37:35 2018 From: jonbernard at gmail.com (Jon Bernard) Date: Sat, 10 Nov 2018 14:37:35 -0600 Subject: [gpfsug-discuss] If you're attending KubeCon'18 In-Reply-To: References: Message-ID: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov Folks, > > Please let me know if anyone is attending KubeCon'18 in Seattle this > December (via private e-mail). We will be there and would like to meet in > person with people that already use or consider using > Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, > problems, visions. > > P.S. If you are not attending KubeCon, but are interested in the topic, > shoot me an e-mail anyway. > > Best, > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Sun Nov 11 18:07:17 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Sun, 11 Nov 2018 13:07:17 -0500 Subject: [gpfsug-discuss] Unexpected data in message/Bad message In-Reply-To: References: Message-ID: Hi Aaron, The header dump shows all zeroes were received for the header. So no valid magic, version, originator, etc. The "512 more bytes" would have been the meat after the header. Very unexpected hence the shutdown. Logs around that event involving the machines noted in that trace would be required to evaluate further. This is not common. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 11/07/2018 06:38 PM Subject: [gpfsug-discuss] Unexpected data in message/Bad message Sent by: gpfsug-discuss-bounces at spectrumscale.org We're experiencing client nodes falling out of the cluster with errors that look like this: Tue Nov 6 15:10:34.939 2018: [E] Unexpected data in message. Header dump: 00000000 0000 0000 00000047 00000000 00 00 0000 00000000 00000000 0000 0000 Tue Nov 6 15:10:34.942 2018: [E] [0/0] 512 more bytes were available: Tue Nov 6 15:10:34.965 2018: [N] Close connection to 10.100.X.X nsdserver1 (Unexpected error 120) Tue Nov 6 15:10:34.966 2018: [E] Network error on 10.100.X.X nsdserver1 , Check connectivity Tue Nov 6 15:10:36.726 2018: [N] Restarting mmsdrserv Tue Nov 6 15:10:38.850 2018: [E] Bad message Tue Nov 6 15:10:38.851 2018: [X] The mmfs daemon is shutting down abnormally. Tue Nov 6 15:10:38.852 2018: [N] mmfsd is shutting down. Tue Nov 6 15:10:38.853 2018: [N] Reason for shutdown: LOGSHUTDOWN called The cluster is running various PTF Levels of 4.1.1. Has anyone seen this before? I'm struggling to understand what it means from a technical point of view. Was GPFS expecting a larger message than it received? Did it receive all of the bytes it expected and some of it was corrupt? It says "512 more bytes were available" but then doesn't show any additional bytes. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From madhu at corehive.com Sun Nov 11 19:58:27 2018 From: madhu at corehive.com (Madhu Konidena) Date: Sun, 11 Nov 2018 14:58:27 -0500 Subject: [gpfsug-discuss] If you're attending KubeCon'18 In-Reply-To: References: Message-ID: <2a3e90be-92bd-489d-a9bc-c1f6b6eae5de@corehive.com> I will be there at both. Please stop by our booth at SC18 for a quick chat. ? Madhu Konidena Madhu at CoreHive.com? On Nov 10, 2018, at 3:37 PM, Jon Bernard wrote: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: >Hi Vasily, > >I will be at Kubecon with colleagues from Tower Research Capital (and >at >SC). We have a few hundred nodes across several Kubernetes clusters, >most >of them mounting Scale from the host. > >Jon > >On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: > >> Folks, >> >> Please let me know if anyone is attending KubeCon'18 in Seattle this >> December (via private e-mail). We will be there and would like to >meet in >> person with people that already use or consider using >> Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, >> problems, visions. >> >> P.S. If you are not attending KubeCon, but are interested in the >topic, >> shoot me an e-mail anyway. 
>> >> Best, >> -- >> Vasily Tarasov, >> Research Staff Member, >> Storage Systems Research, >> IBM Research - Almaden >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 102.png Type: image/png Size: 18340 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: sc18-campaignasset_10.png Type: image/png Size: 354312 bytes Desc: not available URL: From heiner.billich at psi.ch Wed Nov 14 16:20:12 2018 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Wed, 14 Nov 2018 16:20:12 +0000 Subject: [gpfsug-discuss] CES - suspend a node and don't start smb/nfs at mmstartup/boot Message-ID: Hello, how can I prevent smb, ctdb, nfs (and object) to start when I reboot the node or restart gpfs on a suspended ces node? Being able to do this would make updates much easier With # mmces node suspend ?stop I can move all IPs to other CES nodes and stop all CES services, what also releases the ces-shared-root-directory and allows to unmount the underlying filesystem. But after a reboot/restart only the IPs stay on the on the other nodes, the CES services start up. Hm, sometimes I would very much prefer the services to stay down as long as the nodes is suspended and to keep the node out of the CES cluster as much as possible. I did not try rough things like just renaming smbd, this seems likely to create unwanted issues. Thank you, Cheers, Heiner Billich -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch From: on behalf of Madhu Konidena Reply-To: gpfsug main discussion list Date: Sunday 11 November 2018 at 22:06 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] If you're attending KubeCon'18 I will be there at both. Please stop by our booth at SC18 for a quick chat. Madhu Konidena [cid:ii_d4d3894a4c2f4773] Madhu at CoreHive.com On Nov 10, 2018, at 3:37 PM, Jon Bernard > wrote: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: Folks, Please let me know if anyone is attending KubeCon'18 in Seattle this December (via private e-mail). We will be there and would like to meet in person with people that already use or consider using Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, problems, visions. P.S. If you are not attending KubeCon, but are interested in the topic, shoot me an e-mail anyway. 
Best, -- Vasily Tarasov, Research Staff Member, Storage Systems Research, IBM Research - Almaden _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 18341 bytes Desc: image001.png URL: From skylar2 at uw.edu Wed Nov 14 16:27:31 2018 From: skylar2 at uw.edu (Skylar Thompson) Date: Wed, 14 Nov 2018 16:27:31 +0000 Subject: [gpfsug-discuss] CES - suspend a node and don't start smb/nfs at mmstartup/boot In-Reply-To: References: Message-ID: <20181114162731.a7etjs4g3gftgsyv@utumno.gs.washington.edu> Hi Heiner, Try doing "mmces service stop -N " and/or "mmces service disable -N ". You'll definitely want the node suspended first, since I don't think the service commands do an address migration first. On Wed, Nov 14, 2018 at 04:20:12PM +0000, Billich Heinrich Rainer (PSI) wrote: > Hello, > > how can I prevent smb, ctdb, nfs (and object) to start when I reboot the node or restart gpfs on a suspended ces node? Being able to do this would make updates much easier > > With > > # mmces node suspend ???stop > > I can move all IPs to other CES nodes and stop all CES services, what also releases the ces-shared-root-directory and allows to unmount the underlying filesystem. > But after a reboot/restart only the IPs stay on the on the other nodes, the CES services start up. Hm, sometimes I would very much prefer the services to stay down as long as the nodes is suspended and to keep the node out of the CES cluster as much as possible. > > I did not try rough things like just renaming smbd, this seems likely to create unwanted issues. > > Thank you, > > Cheers, > > Heiner Billich > -- > Paul Scherrer Institut > Heiner Billich > System Engineer Scientific Computing > Science IT / High Performance Computing > WHGA/106 > Forschungsstrasse 111 > 5232 Villigen PSI > Switzerland > > Phone +41 56 310 36 02 > heiner.billich at psi.ch > https://www.psi.ch > > > > From: on behalf of Madhu Konidena > Reply-To: gpfsug main discussion list > Date: Sunday 11 November 2018 at 22:06 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] If you're attending KubeCon'18 > > I will be there at both. Please stop by our booth at SC18 for a quick chat. > > Madhu Konidena > [cid:ii_d4d3894a4c2f4773] > Madhu at CoreHive.com > > > > On Nov 10, 2018, at 3:37 PM, Jon Bernard > wrote: > Hi Vasily, > I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. > Jon > On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: > Folks, Please let me know if anyone is attending KubeCon'18 in Seattle this December (via private e-mail). We will be there and would like to meet in person with people that already use or consider using Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, problems, visions. P.S. If you are not attending KubeCon, but are interested in the topic, shoot me an e-mail anyway. 
Best, -- Vasily Tarasov, Research Staff Member, Storage Systems Research, IBM Research - Almaden > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From novosirj at rutgers.edu Wed Nov 14 15:28:31 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Wed, 14 Nov 2018 15:28:31 +0000 Subject: [gpfsug-discuss] GSS Software Release? Message-ID: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- From scale at us.ibm.com Thu Nov 15 13:26:18 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Thu, 15 Nov 2018 08:26:18 -0500 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> Message-ID: AFAIK GSS/DSS are handled by Lenovo not IBM so you would need to contact them for release plans. I do not know which version of GPFS was included in GSS 3.3a but I can tell you that GPFS 3.5 is out of service and GPFS 4.1.x will be end of service in April 2019. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Ryan Novosielski To: "gpfsug-discuss at spectrumscale.org" Date: 11/15/2018 12:03 AM Subject: [gpfsug-discuss] GSS Software Release? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Thu Nov 15 14:01:28 2018 From: carlz at us.ibm.com (Carl Zetie) Date: Thu, 15 Nov 2018 14:01:28 +0000 Subject: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) Message-ID: >any idea when a newer GSS software release than 3.3a will be released? That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. Carl Zetie Program Director Offering Management for Spectrum Scale, IBM ---- (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com From alvise.dorigo at psi.ch Thu Nov 15 15:22:25 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Thu, 15 Nov 2018 15:22:25 +0000 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon Message-ID: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> Hello, I'm using mmperfmon to get writing stats on NSD during a write activity on a GPFS filesystem (Lenovo system with dss-g-2.0a). I use this command: # mmperfmon query 'sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --number-buckets 48 -b 1 to get the stats. 
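The output that follows shows a run of null buckets at the end of the query window; two quick cross-checks that might help narrow that down are sketched here. mmperfmon config show is a documented subcommand, but its output format varies by release, so the grep is only illustrative:

# confirm the period and any restrict clause the collector really has for the GPFSNSDFS sensor
mmperfmon config show | grep -A 5 GPFSNSDFS

# repeat the query with coarser buckets to see whether the trailing nulls track
# wall-clock time (suggesting a sensor-to-collector lag) or the bucket count
mmperfmon query 'sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --number-buckets 6 -b 10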
What it returns is a list of valid values followed by a longer list of 'null' as shown below: Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp gpfs_nsdfs_bytes_written 1 2018-11-15-16:15:57 746586112 2 2018-11-15-16:15:58 704643072 3 2018-11-15-16:15:59 805306368 4 2018-11-15-16:16:00 754974720 5 2018-11-15-16:16:01 754974720 6 2018-11-15-16:16:02 763363328 7 2018-11-15-16:16:03 746586112 8 2018-11-15-16:16:04 746848256 9 2018-11-15-16:16:05 780140544 10 2018-11-15-16:16:06 679923712 11 2018-11-15-16:16:07 746618880 12 2018-11-15-16:16:08 780140544 13 2018-11-15-16:16:09 746586112 14 2018-11-15-16:16:10 763363328 15 2018-11-15-16:16:11 780173312 16 2018-11-15-16:16:12 721420288 17 2018-11-15-16:16:13 796917760 18 2018-11-15-16:16:14 763363328 19 2018-11-15-16:16:15 738197504 20 2018-11-15-16:16:16 738197504 21 2018-11-15-16:16:17 null 22 2018-11-15-16:16:18 null 23 2018-11-15-16:16:19 null 24 2018-11-15-16:16:20 null 25 2018-11-15-16:16:21 null 26 2018-11-15-16:16:22 null 27 2018-11-15-16:16:23 null 28 2018-11-15-16:16:24 null 29 2018-11-15-16:16:25 null 30 2018-11-15-16:16:26 null 31 2018-11-15-16:16:27 null 32 2018-11-15-16:16:28 null 33 2018-11-15-16:16:29 null 34 2018-11-15-16:16:30 null 35 2018-11-15-16:16:31 null 36 2018-11-15-16:16:32 null 37 2018-11-15-16:16:33 null 38 2018-11-15-16:16:34 null 39 2018-11-15-16:16:35 null 40 2018-11-15-16:16:36 null 41 2018-11-15-16:16:37 null 42 2018-11-15-16:16:38 null 43 2018-11-15-16:16:39 null 44 2018-11-15-16:16:40 null 45 2018-11-15-16:16:41 null 46 2018-11-15-16:16:42 null 47 2018-11-15-16:16:43 null 48 2018-11-15-16:16:44 null If I run again and again I still get the same pattern: valid data (even 0 in case of not write activity) followed by more null data. Is that normal ? If not, is there a way to get only non-null data by fine-tuning pmcollector's configuration file ? The corresponding ZiMon sensor (GPFSNSDFS) have period=1. The ZiMon version is 4.2.3-7. Thanks, Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From aposthuma at lenovo.com Thu Nov 15 15:56:44 2018 From: aposthuma at lenovo.com (Andre Posthuma) Date: Thu, 15 Nov 2018 15:56:44 +0000 Subject: [gpfsug-discuss] [External] Re: GSS Software Release? (Ryan Novosielski) In-Reply-To: References: Message-ID: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> Hello, GSS 3.3b was released last week, with a number of Spectrum Scale versions available : 5.0.1.2 4.2.3.11 4.1.1.20 DSS-G 2.2a was released yesterday, with 2 Spectrum Scale versions available : 5.0.2.1 4.2.3.11 Best Regards Andre Posthuma IT Specialist HPC Services Lenovo United Kingdom +44 7841782363 aposthuma at lenovo.com ? Lenovo.com Twitter | Facebook | Instagram | Blogs | Forums -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Carl Zetie Sent: Thursday, November 15, 2018 2:01 PM To: gpfsug-discuss at spectrumscale.org Subject: [External] Re: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) >any idea when a newer GSS software release than 3.3a will be released? That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. 
Carl Zetie Program Director Offering Management for Spectrum Scale, IBM ---- (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From matthew.robinson02 at gmail.com Thu Nov 15 17:53:14 2018 From: matthew.robinson02 at gmail.com (Matthew Robinson) Date: Thu, 15 Nov 2018 12:53:14 -0500 Subject: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) In-Reply-To: References: Message-ID: Hi Ryan, As an Ex-GSS PE guy for Lenovo, a new GSS update could almost be expected every 3-4 months in a year. I would not be surprised if Lenovo GSS-DSS development started to not update the GSS solution and only focused on DSS updates. That is just my best guess from this point. I agree with Carl this should be a quick open and close case for the Lenovo product engineer that still works on the GSS solution. Kind regards, MattRob On Thu, Nov 15, 2018 at 9:02 AM Carl Zetie wrote: > > >any idea when a newer GSS software release than 3.3a will be released? > > That is definitely a question only our friends at Lenovo can answer. If > you don't get a response here (I'm not sure if any Lenovites are active on > the list), you'll need to address it directly to Lenovo, e.g. your account > team. > > > Carl Zetie > Program Director > Offering Management for Spectrum Scale, IBM > ---- > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Matthew Robinson Comptia A+, Net+ 919.909.0494 matthew.robinson02 at gmail.com The greatest discovery of my generation is that man can alter his life simply by altering his attitude of mind. - William James, Harvard Psychologist. -------------- next part -------------- An HTML attachment was scrubbed... URL: From andy_kurth at ncsu.edu Thu Nov 15 18:28:46 2018 From: andy_kurth at ncsu.edu (Andy Kurth) Date: Thu, 15 Nov 2018 13:28:46 -0500 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> Message-ID: Public information on GSS updates seems nonexistent. You can find some clues if you have access to Lenovo's restricted download site . It looks like gss3.3b was released in late September. There are gss3.3b download options that include either 4.2.3-9 or 4.1.1-20. Earlier this month they also released some GPFS-only updates for 4.3.2-11 and 5.0.1-2. It looks like these are meant to be applied on top of gss3.3b. For DSS-G, it looks like dss-g-2.2a is the latest full release with options that include 4.2.3-11 or 5.0.2-1. There are also separate DSS-G GPFS-only updates for 4.2.3-11 and 5.0.1-2. Regards, Andy Kurth / NCSU On Thu, Nov 15, 2018 at 12:01 AM Ryan Novosielski wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I know this might not be the perfect venue, but I know IBM developers > participate and will occasionally share this sort of thing: any idea > when a newer GSS software release than 3.3a will be released? We are > attempting to plan our maintenance schedule. At the moment, the DSS-G > software seems to be getting updated and we'd prefer to remain at the > same GPFS release on DSS-G and GSS. 
> > - -- > ____ > || \\UTGERS, |----------------------*O*------------------------ > ||_// the State | Ryan Novosielski - novosirj at rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus > || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark > `' > -----BEGIN PGP SIGNATURE----- > > iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl > qvcAn2ql2U8peuGuhptTIejVgnDFSWEf > =7Iue > -----END PGP SIGNATURE----- > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- *Andy Kurth* Research Storage Specialist NC State University Office of Information Technology P: 919-513-4090 311A Hillsborough Building Campus Box 7109 Raleigh, NC 27695 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Nov 15 20:35:29 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Nov 2018 21:35:29 +0100 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon In-Reply-To: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> Message-ID: An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Nov 16 02:22:55 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 16 Nov 2018 02:22:55 +0000 Subject: [gpfsug-discuss] Presentations - User Group Meeting at SC18 Message-ID: <917D0EB2-BE2C-4445-AE12-B68DA3D2B6F1@nuance.com> I?ve uploaded the first batch of presentation to the spectrumscale.org site - More coming once I receive them. https://www.spectrumscaleug.org/presentations/2018/ Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From LNakata at SLAC.STANFORD.EDU Fri Nov 16 03:06:46 2018 From: LNakata at SLAC.STANFORD.EDU (Lance Nakata) Date: Thu, 15 Nov 2018 19:06:46 -0800 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Message-ID: <20181116030646.GA28141@slac.stanford.edu> We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. 
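(For readers who want to picture the layout being described, a rough sketch of the commands involved is below. Treat it purely as an illustration: the device names, server name, NSD names, failure group and pool are guesses for the sake of the example, not Lance's actual values, and the mdadm geometry is left at defaults.)

  # Hypothetical example only - 12 NVMe drives per array: 10 data + 1 parity + 1 hot spare
  mdadm --create /dev/md101 --level=5 --raid-devices=11 --spare-devices=1 \
        /dev/nvme[0-9]n1 /dev/nvme10n1 /dev/nvme11n1

  # A stanza file of roughly this shape could then be fed to mmcrnsd
  # (NSD names, server and failure group below are made up):
  %nsd: device=/dev/md101 nsd=md101nsd servers=host2 usage=dataAndMetadata failureGroup=102 pool=system
  %nsd: device=/dev/md102 nsd=md102nsd servers=host2 usage=dataAndMetadata failureGroup=102 pool=system

  mmcrnsd -F /tmp/nvme.stanza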
The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank:

host2:~> sudo mmlsnsd -X

 Disk name    NSD volume ID      Device      Devtype  Node name                 Remarks
 ---------------------------------------------------------------------------------------------------
 nsd0001      864FD12858A36E79   /dev/sdb    generic  host1.slac.stanford.edu   server node
 nsd0002      864FD12858A36E7A   /dev/sdc    generic  host1.slac.stanford.edu   server node
 nsd0021      864FD1285956B0A7   /dev/sdd    generic  host1.slac.stanford.edu   server node
 nsd0251a     864FD1545BD0CCDF   /dev/dm-9   dmm      host2.slac.stanford.edu   server node
 nsd0251b     864FD1545BD0CCE0   /dev/dm-11  dmm      host2.slac.stanford.edu   server node
 nsd0252a     864FD1545BD0CCE1   /dev/dm-10  dmm      host2.slac.stanford.edu   server node
 nsd0252b     864FD1545BD0CCE2   /dev/dm-8   dmm      host2.slac.stanford.edu   server node
 nsd02nvme1   864FD1545BEC5D72   -           -        host2.slac.stanford.edu   (not found) server node
 nsd02nvme2   864FD1545BEC5D73   -           -        host2.slac.stanford.edu   (not found) server node

I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space.

Software Environment:
  RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64
  Spectrum Scale 4.2.3.10

Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working?

Thank you,

Lance Nakata
SLAC National Accelerator Laboratory

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From Greg.Lehmann at csiro.au Fri Nov 16 03:46:01 2018
From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au)
Date: Fri, 16 Nov 2018 03:46:01 +0000
Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10?
In-Reply-To: <20181116030646.GA28141@slac.stanford.edu>
References: <20181116030646.GA28141@slac.stanford.edu>
Message-ID: <6be5c9834bc747b7b7145e884f98caa2@exch1-cdc.nexus.csiro.au>

Hi Lance,

We are doing it with beegfs (mdadm and NVMe drives in the same HW.) For GPFS have you updated the nsddevices sample script to look at the mdadm devices and put it in /var/mmfs/etc? BTW I'm interested to see how you go with that configuration.

Cheers,

Greg

-----Original Message-----
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lance Nakata
Sent: Friday, November 16, 2018 1:07 PM
To: gpfsug-discuss at spectrumscale.org
Cc: Jon L. Bergman
Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10?

We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors.
The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From alvise.dorigo at psi.ch Fri Nov 16 08:29:46 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Fri, 16 Nov 2018 08:29:46 +0000 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon In-Reply-To: References: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch>, Message-ID: <83A6EEB0EC738F459A39439733AE80452679A101@MBX214.d.ethz.ch> Indeed, I just realized that after last recent update to dssg-2.0a ntpd is crashing very frequently. Thanks for the hint. Alvise ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Olaf Weiser [olaf.weiser at de.ibm.com] Sent: Thursday, November 15, 2018 9:35 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Wrong behavior of mmperfmon ntp running / time correct ? From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/15/2018 04:30 PM Subject: [gpfsug-discuss] Wrong behavior of mmperfmon Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, I'm using mmperfmon to get writing stats on NSD during a write activity on a GPFS filesystem (Lenovo system with dss-g-2.0a). I use this command: # mmperfmon query 'sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --number-buckets 48 -b 1 to get the stats. 
What it returns is a list of valid values followed by a longer list of 'null' as shown below: Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp gpfs_nsdfs_bytes_written 1 2018-11-15-16:15:57 746586112 2 2018-11-15-16:15:58 704643072 3 2018-11-15-16:15:59 805306368 4 2018-11-15-16:16:00 754974720 5 2018-11-15-16:16:01 754974720 6 2018-11-15-16:16:02 763363328 7 2018-11-15-16:16:03 746586112 8 2018-11-15-16:16:04 746848256 9 2018-11-15-16:16:05 780140544 10 2018-11-15-16:16:06 679923712 11 2018-11-15-16:16:07 746618880 12 2018-11-15-16:16:08 780140544 13 2018-11-15-16:16:09 746586112 14 2018-11-15-16:16:10 763363328 15 2018-11-15-16:16:11 780173312 16 2018-11-15-16:16:12 721420288 17 2018-11-15-16:16:13 796917760 18 2018-11-15-16:16:14 763363328 19 2018-11-15-16:16:15 738197504 20 2018-11-15-16:16:16 738197504 21 2018-11-15-16:16:17 null 22 2018-11-15-16:16:18 null 23 2018-11-15-16:16:19 null 24 2018-11-15-16:16:20 null 25 2018-11-15-16:16:21 null 26 2018-11-15-16:16:22 null 27 2018-11-15-16:16:23 null 28 2018-11-15-16:16:24 null 29 2018-11-15-16:16:25 null 30 2018-11-15-16:16:26 null 31 2018-11-15-16:16:27 null 32 2018-11-15-16:16:28 null 33 2018-11-15-16:16:29 null 34 2018-11-15-16:16:30 null 35 2018-11-15-16:16:31 null 36 2018-11-15-16:16:32 null 37 2018-11-15-16:16:33 null 38 2018-11-15-16:16:34 null 39 2018-11-15-16:16:35 null 40 2018-11-15-16:16:36 null 41 2018-11-15-16:16:37 null 42 2018-11-15-16:16:38 null 43 2018-11-15-16:16:39 null 44 2018-11-15-16:16:40 null 45 2018-11-15-16:16:41 null 46 2018-11-15-16:16:42 null 47 2018-11-15-16:16:43 null 48 2018-11-15-16:16:44 null If I run again and again I still get the same pattern: valid data (even 0 in case of not write activity) followed by more null data. Is that normal ? If not, is there a way to get only non-null data by fine-tuning pmcollector's configuration file ? The corresponding ZiMon sensor (GPFSNSDFS) have period=1. The ZiMon version is 4.2.3-7. Thanks, Alvise _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Fri Nov 16 09:19:07 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 16 Nov 2018 10:19:07 +0100 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: <20181116030646.GA28141@slac.stanford.edu> References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. 
Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Fri Nov 16 09:35:25 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 16 Nov 2018 10:35:25 +0100 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Hi, Having mentioned nsddevices, I do not know how Scale treats different device types differently, so generic would be a fine choice unless development tells you differently. Currently known device types are listed in the comments of the script /usr/lpp/mmfs/bin/mmdevdiscover Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 
7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 16/11/2018 10:19 Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. 
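To make the nsddevices suggestion from Greg and Uwe concrete: a minimal user exit, modelled on /usr/lpp/mmfs/samples/nsddevices.sample, could look like the sketch below. It is untested here and only a starting point; the md101/md102 names come from the arrays described above, the "generic" type follows the advice in this thread, and the sample script's comments document the exact return-code convention (it controls whether GPFS's built-in discovery also runs).

  #!/bin/ksh
  # /var/mmfs/etc/nsddevices - user exit consulted during NSD device discovery.
  # It prints "deviceName deviceType" pairs, with names relative to /dev.
  for dev in md101 md102
  do
      [ -b /dev/$dev ] && echo "$dev generic"
  done
  # Check nsddevices.sample for whether to return 0 or 1 here so that the
  # built-in discovery also runs; 0 is used in this sketch.
  exit 0

Once the script is made executable on the NSD servers, a rerun of mmlsnsd -X would be expected to show the md devices instead of "(not found)", though that is an expectation rather than something verified here.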
Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Fri Nov 16 12:31:57 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 16 Nov 2018 07:31:57 -0500 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Note, RHEL 7.6 is not yet a supported platform for Spectrum Scale so you may want to use RHEL 7.5 or wait for RHEL 7.6 to be supported. Using "generic" for the device type should be the proper option here. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Uwe Falke" To: gpfsug main discussion list Date: 11/16/2018 04:35 AM Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having mentioned nsddevices, I do not know how Scale treats different device types differently, so generic would be a fine choice unless development tells you differently. Currently known device types are listed in the comments of the script /usr/lpp/mmfs/bin/mmdevdiscover Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 16/11/2018 10:19 Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From novosirj at rutgers.edu Thu Nov 15 17:17:15 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 15 Nov 2018 17:17:15 +0000 Subject: [gpfsug-discuss] [External] Re: GSS Software Release? (Ryan Novosielski) In-Reply-To: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> References: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> Message-ID: <28988E74-6BAC-47FB-AEE2-015D2B784A40@rutgers.edu> Thanks, all. I was looking around FlexNet this week and didn?t see it, but it?s good to know it exists/likely will appear soon. -- ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' > On Nov 15, 2018, at 10:56 AM, Andre Posthuma wrote: > > Hello, > > GSS 3.3b was released last week, with a number of Spectrum Scale versions available : > 5.0.1.2 > 4.2.3.11 > 4.1.1.20 > > DSS-G 2.2a was released yesterday, with 2 Spectrum Scale versions available : > > 5.0.2.1 > 4.2.3.11 > > Best Regards > > > Andre Posthuma > IT Specialist > HPC Services > Lenovo United Kingdom > +44 7841782363 > aposthuma at lenovo.com > > > Lenovo.com > Twitter | Facebook | Instagram | Blogs | Forums > > > > > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Carl Zetie > Sent: Thursday, November 15, 2018 2:01 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [External] Re: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) > > >> any idea when a newer GSS software release than 3.3a will be released? > > That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. > > > Carl Zetie > Program Director > Offering Management for Spectrum Scale, IBM > ---- > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From novosirj at rutgers.edu Thu Nov 15 18:33:12 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 15 Nov 2018 18:33:12 +0000 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu>, Message-ID: Thanks, Andy. I just realized our entitlement lapsed on GSS and that?s probably why I don?t see it there at the moment. Helpful to know what?s in there though for planning while that is worked out. Sent from my iPhone On Nov 15, 2018, at 13:29, Andy Kurth > wrote: Public information on GSS updates seems nonexistent. You can find some clues if you have access to Lenovo's restricted download site. It looks like gss3.3b was released in late September. There are gss3.3b download options that include either 4.2.3-9 or 4.1.1-20. Earlier this month they also released some GPFS-only updates for 4.3.2-11 and 5.0.1-2. It looks like these are meant to be applied on top of gss3.3b. For DSS-G, it looks like dss-g-2.2a is the latest full release with options that include 4.2.3-11 or 5.0.2-1. 
There are also separate DSS-G GPFS-only updates for 4.2.3-11 and 5.0.1-2. Regards, Andy Kurth / NCSU On Thu, Nov 15, 2018 at 12:01 AM Ryan Novosielski > wrote: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Andy Kurth Research Storage Specialist NC State University Office of Information Technology P: 919-513-4090 311A Hillsborough Building Campus Box 7109 Raleigh, NC 27695 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From andreas.mattsson at maxiv.lu.se Tue Nov 20 15:01:36 2018 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Tue, 20 Nov 2018 15:01:36 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS Message-ID: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> On one of our clusters, from time to time if users try to access files or folders via the direct full path over NFS, the NFS-client gets invalid information from the server. For instance, if I run "ls /gpfs/filesystem/test/test2/test3" over NFS-mount, result is just full of ???????? If I recurse through the path once, for instance by ls'ing or cd'ing through the folders one at a time or running ls -R, I can then access directly via the full path afterwards. This seem to be intermittent, and I haven't found how to reliably recreate the issue. Possibly, it can be connected to creating or changing files or folders via a GPFS mount, and then accessing them through NFS, but it doesn't happen consistently. Is this a known behaviour or bug, and does anyone know how to fix the issue? These NSD-servers currently run Scale 4.2.2.3, while the CES is on 5.0.1.1. GPFS clients run Scale 5.0.1.1, and NFS clients run CentOS 7.5. Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5610 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: smime.p7s Type: application/pkcs7-signature Size: 4575 bytes Desc: not available URL: From valdis.kletnieks at vt.edu Tue Nov 20 15:25:16 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 20 Nov 2018 10:25:16 -0500 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: <17828.1542727516@turing-police.cc.vt.edu> On Tue, 20 Nov 2018 15:01:36 +0000, Andreas Mattsson said: > On one of our clusters, from time to time if users try to access files or > folders via the direct full path over NFS, the NFS-client gets invalid > information from the server. > > For instance, if I run "ls /gpfs/filesystem/test/test2/test3" over > NFS-mount, result is just full of ???????? I've seen the Ganesha server do this sort of thing once in a while. Never tracked it down, because it was always in the middle of bigger misbehaviors... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Tue Nov 20 23:39:37 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Tue, 20 Nov 2018 23:39:37 +0000 Subject: [gpfsug-discuss] mmfsd recording High CPU usage Message-ID: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a still happy day seeing we're still above ground! ? Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample: [~]# top -b -n 1 | grep mmfs PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 mmfsd 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 runmmfs Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate. Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Redhat 7.4, btw... Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From jjdoherty at yahoo.com Wed Nov 21 13:01:54 2018 From: jjdoherty at yahoo.com (Jim Doherty) Date: Wed, 21 Nov 2018 13:01:54 +0000 (UTC) Subject: [gpfsug-discuss] mmfsd recording High CPU usage In-Reply-To: References: Message-ID: <1913697205.666954.1542805314669@mail.yahoo.com> At a guess with no data ....?? if the application is opening more files than can fit in the maxFilesToCache (MFTC) objects? GPFS will expand the MFTC to support the open files,? but it will also scan to try and free any unused objects.??? If you can identify the user job that is causing this? you could monitor a system more closely. Jim On Wednesday, November 21, 2018, 2:10:45 AM EST, Saula, Oluwasijibomi wrote: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a?still happy day seeing we're still above ground!? 
Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample: [~]# top -b -n 1 | grep mmfs ? ?PID USER? ? ?PR? NI? ?VIRT? ? RES? ?SHR S? %CPU %MEM ? ? TIME+ COMMAND 231898 root ? ? ? 0 -20 14.508g 4.272g? 70168 S?93.8? 6.8?69503:41 mmfsd ?4161 root ? ? ? 0 -20?121876 ? 9412 ? 1492 S ? 0.0?0.0 ? 0:00.22 runmmfs Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate.?Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Redhat 7.4, btw... Thanks,?Siji SaulaHPC System AdministratorCenter for Computationally Assisted Science & TechnologyNORTH DAKOTA STATE UNIVERSITY? Research 2 Building???Room 220BDept 4100, PO Box 6050? / Fargo, ND 58108-6050p:701.231.7749www.ccast.ndsu.edu?|?www.ndsu.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Nov 21 15:32:55 2018 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 21 Nov 2018 07:32:55 -0800 Subject: [gpfsug-discuss] mmfsd recording High CPU usage In-Reply-To: References: Message-ID: Hi, the best way to debug something like that is to start with top. start top then press 1 and check if any of the cores has almost 0% idle while others have plenty of CPU left. if that is the case you have one very hot thread. to further isolate it you can press 1 again to collapse the cores, now press shirt-h which will break down each thread of a process and show them as an individual line. now you either see one or many mmfsd's causing cpu consumption, if its many your workload is just doing a lot of work, what is more concerning is if you have just 1 thread running at the 90%+ . if thats the case write down the PID of the thread that runs so hot and run mmfsadm dump threads,kthreads >dum. you will see many entries in the file like : MMFSADMDumpCmdThread: desc 0x7FC84C002980 handle 0x4C0F02FA parm 0x7FC9700008C0 highStackP 0x7FC783F7E530 pthread 0x83F80700 kernel thread id 49878 (slot -1) pool 21 ThPoolCommands per-thread gbls: 0:0x0 1:0x0 2:0x0 3:0x3 4:0xFFFFFFFFFFFFFFFF 5:0x0 6:0x0 7:0x7FC98C0067B0 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x400000E 14:0x7FC98C004C10 15:0x0 16:0x4 17:0x0 18:0x0 find the pid behind 'thread id' and post that section, that would be the first indication on what that thread does ... sven On Tue, Nov 20, 2018 at 11:10 PM Saula, Oluwasijibomi < oluwasijibomi.saula at ndsu.edu> wrote: > Hello Scalers, > > > First, let me say Happy Thanksgiving to those of us in the US and to those > beyond, well, it's a still happy day seeing we're still above ground! ? > > > Now, what I have to discuss isn't anything extreme so don't skip the > turkey for this, but lately, on a few of our compute GPFS client nodes, > we've been noticing high CPU usage by the mmfsd process and are wondering > why. 
Here's a sample: > > > [~]# top -b -n 1 | grep mmfs > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > > > 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 > *mmfs*d > > 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 run > *mmfs* > > Obviously, this behavior was likely triggered by a not-so-convenient user > job that in most cases is long finished by the time we > investigate. Nevertheless, does anyone have an idea why this might be > happening? Any thoughts on preventive steps even? > > > This is GPFS v4.2.3 on Redhat 7.4, btw... > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > > Research 2 > Building > ? Room 220B > Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu | www.ndsu.edu > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bzhang at ca.ibm.com Wed Nov 21 18:52:12 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Wed, 21 Nov 2018 13:52:12 -0500 Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: Hi all, This is a reminder for our next week's technical webinar. Everyone is welcome to register and attend. Thanks, Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? ?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. From: "Bohai Zhang" To: gpfsug main discussion list Date: 2018/11/09 11:37 AM Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We are going to host our next technical webinar. everyone is welcome to register and attend. Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? 
?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16927775.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16361907.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16531853.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16209659.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16604524.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16509495.gif Type: image/gif Size: 1243 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Wed Nov 21 20:55:29 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Wed, 21 Nov 2018 20:55:29 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 82, Issue 31 In-Reply-To: References: Message-ID: Sven/Jim, Thanks for sharing your thoughts! - Currently, we have mFTC set as such: maxFilesToCache 4000 However, since we have a very diverse workload, we'd have to cycle through a vast majority of our apps to find the most fitting mFTC value as this page (https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaag/wecm/l0wecm00_maxfilestocache.htm) suggests. 
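For anyone following along, a quick way to check the cache setting and to capture the per-thread data Sven asked for might look like the sketch below. The paths are the standard Scale ones; the thread id 34096 is simply the example from this thread, so substitute whatever your own top sample shows as the hot thread.

  # Current cache settings on the affected client
  mmlsconfig maxFilesToCache
  mmlsconfig maxStatCache

  # Per-thread view of one top sample for the mmfsd process, highest %CPU first
  top -b -H -n 1 -p $(pidof mmfsd) | sort -k 9 -nr | head -5

  # Dump the daemon's thread tables and pull out the section for the hot thread
  /usr/lpp/mmfs/bin/mmfsadm dump threads,kthreads > /tmp/mmfsd.threads
  grep -A 4 "kernel thread id 34096" /tmp/mmfsd.threads

Note that a larger maxFilesToCache is, as far as I recall, not picked up on the fly; the daemon on the affected nodes has to be recycled before the new value takes effect, so that is worth planning together with the job scheduler.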
In the meantime, I was able to gather some more info for the lone mmfsd thread (pid: 34096) running at high CPU utilization, and right away I can see the number of nonvoluntary_ctxt_switches is quite high, compared to the other threads in the dump; however, I think I need some help interpreting all of this. Although, I should add that heavy HPC workloads (i.e. vasp, ansys...) are running on these nodes and may be somewhat related to this issue: Scheduling info for kernel thread 34096 mmfsd (34096, #threads: 309) ------------------------------------------------------------------- se.exec_start : 8057632237.613486 se.vruntime : 4914854123.640008 se.sum_exec_runtime : 1042598557.420591 se.nr_migrations : 8337485 nr_switches : 15824325 nr_voluntary_switches : 4110 nr_involuntary_switches : 15820215 se.load.weight : 88761 policy : 0 prio : 100 clock-delta : 24 mm->numa_scan_seq : 88980 numa_migrations, 5216521 numa_faults_memory, 0, 0, 1, 1, 1 numa_faults_memory, 1, 0, 0, 1, 1030 numa_faults_memory, 0, 1, 0, 0, 1 numa_faults_memory, 1, 1, 0, 0, 1 Status for kernel thread 34096 Name: mmfsd Umask: 0022 State: R (running) Tgid: 58921 Ngid: 34395 Pid: 34096 PPid: 3941 TracerPid: 0 Uid: 0 0 0 0 Gid: 0 0 0 0 FDSize: 256 Groups: VmPeak: 15137612 kB VmSize: 15126340 kB VmLck: 4194304 kB VmPin: 8388712 kB VmHWM: 4424228 kB VmRSS: 4420420 kB RssAnon: 4350128 kB RssFile: 50512 kB RssShmem: 19780 kB VmData: 14843812 kB VmStk: 132 kB VmExe: 23672 kB VmLib: 121856 kB VmPTE: 9652 kB VmSwap: 0 kB Threads: 309 SigQ: 5/257225 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000010017a07 SigIgn: 0000000000000000 SigCgt: 0000000180015eef CapInh: 0000000000000000 CapPrm: 0000001fffffffff CapEff: 0000001fffffffff CapBnd: 0000001fffffffff CapAmb: 0000000000000000 Seccomp: 0 Cpus_allowed: ffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff Cpus_allowed_list: 0-239 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003 Mems_allowed_list: 0-1 voluntary_ctxt_switches: 4110 nonvoluntary_ctxt_switches: 15820215 Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Wednesday, November 21, 2018 9:33:10 AM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 82, Issue 31 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: mmfsd recording High CPU usage (Jim Doherty) 2. 
Re: mmfsd recording High CPU usage (Sven Oehme)

----------------------------------------------------------------------

Message: 1
Date: Wed, 21 Nov 2018 13:01:54 +0000 (UTC)
From: Jim Doherty 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage
Message-ID: <1913697205.666954.1542805314669 at mail.yahoo.com>
Content-Type: text/plain; charset="utf-8"

At a guess with no data .... if the application is opening more files than can fit in the maxFilesToCache (MFTC) objects, GPFS will expand the MFTC to support the open files, but it will also scan to try and free any unused objects. If you can identify the user job that is causing this you could monitor a system more closely.

Jim

On Wednesday, November 21, 2018, 2:10:45 AM EST, Saula, Oluwasijibomi wrote:

Hello Scalers,

First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a still happy day seeing we're still above ground!

Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample:

[~]# top -b -n 1 | grep mmfs
   PID USER   PR  NI    VIRT    RES   SHR S  %CPU %MEM     TIME+ COMMAND
231898 root    0 -20 14.508g 4.272g 70168 S  93.8  6.8  69503:41 mmfsd
  4161 root    0 -20  121876   9412  1492 S   0.0  0.0   0:00.22 runmmfs

Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate. Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even?

This is GPFS v4.2.3 on Redhat 7.4, btw...

Thanks,
Siji Saula
HPC System Administrator
Center for Computationally Assisted Science & Technology
NORTH DAKOTA STATE UNIVERSITY
Research 2 Building - Room 220B
Dept 4100, PO Box 6050 / Fargo, ND 58108-6050
p:701.231.7749
www.ccast.ndsu.edu | www.ndsu.edu

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

------------------------------

Message: 2
Date: Wed, 21 Nov 2018 07:32:55 -0800
From: Sven Oehme 
To: gpfsug main discussion list 
Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage
Message-ID: 
Content-Type: text/plain; charset="utf-8"

Hi,

the best way to debug something like that is to start with top. start top then press 1 and check if any of the cores has almost 0% idle while others have plenty of CPU left. if that is the case you have one very hot thread. to further isolate it you can press 1 again to collapse the cores, now press shift-H which will break down each thread of a process and show them as an individual line. now you either see one or many mmfsd's causing cpu consumption, if it's many your workload is just doing a lot of work, what is more concerning is if you have just 1 thread running at 90%+. if that's the case write down the PID of the thread that runs so hot and run mmfsadm dump threads,kthreads >dum.
you will see many entries in the file like : MMFSADMDumpCmdThread: desc 0x7FC84C002980 handle 0x4C0F02FA parm 0x7FC9700008C0 highStackP 0x7FC783F7E530 pthread 0x83F80700 kernel thread id 49878 (slot -1) pool 21 ThPoolCommands per-thread gbls: 0:0x0 1:0x0 2:0x0 3:0x3 4:0xFFFFFFFFFFFFFFFF 5:0x0 6:0x0 7:0x7FC98C0067B0 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x400000E 14:0x7FC98C004C10 15:0x0 16:0x4 17:0x0 18:0x0 find the pid behind 'thread id' and post that section, that would be the first indication on what that thread does ... sven On Tue, Nov 20, 2018 at 11:10 PM Saula, Oluwasijibomi < oluwasijibomi.saula at ndsu.edu> wrote: > Hello Scalers, > > > First, let me say Happy Thanksgiving to those of us in the US and to those > beyond, well, it's a still happy day seeing we're still above ground! ? > > > Now, what I have to discuss isn't anything extreme so don't skip the > turkey for this, but lately, on a few of our compute GPFS client nodes, > we've been noticing high CPU usage by the mmfsd process and are wondering > why. Here's a sample: > > > [~]# top -b -n 1 | grep mmfs > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > > > 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 > *mmfs*d > > 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 run > *mmfs* > > Obviously, this behavior was likely triggered by a not-so-convenient user > job that in most cases is long finished by the time we > investigate. Nevertheless, does anyone have an idea why this might be > happening? Any thoughts on preventive steps even? > > > This is GPFS v4.2.3 on Redhat 7.4, btw... > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > > Research 2 > Building > ? Room 220B > Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu | www.ndsu.edu > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 82, Issue 31 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From mnaineni at in.ibm.com Thu Nov 22 10:32:27 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Thu, 22 Nov 2018 10:32:27 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.image001.png at 01D480EA.4FF3B020.png Type: image/png Size: 5610 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Fri Nov 23 08:12:25 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Fri, 23 Nov 2018 08:12:25 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Message-ID: Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. 
The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= From andreas.mattsson at maxiv.lu.se Fri Nov 23 13:41:37 2018 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Fri, 23 Nov 2018 13:41:37 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: <9456645b0a1f4b488b13874ea672b9b8@maxiv.lu.se> Yes, this is repeating. We?ve ascertained that it has nothing to do at all with file operations on the GPFS side. Randomly throughout the filesystem mounted via NFS, ls or file access will give ? > ls: reading directory /gpfs/filessystem/test/testdir: Invalid argument ? Trying again later might work on that folder, but might fail somewhere else. We have tried exporting the same filesystem via a standard kernel NFS instead of the CES Ganesha-NFS, and then the problem doesn?t exist. So it is definitely related to the Ganesha NFS server, or its interaction with the file system. Will see if I can get a tcpdump of the issue. Regards, Andreas _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se Fr?n: gpfsug-discuss-bounces at spectrumscale.org F?r Malahal R Naineni Skickat: den 22 november 2018 11:32 Till: gpfsug-discuss at spectrumscale.org Kopia: gpfsug-discuss at spectrumscale.org ?mne: Re: [gpfsug-discuss] Filesystem access issues via CES NFS We have seen empty lists (ls showing nothing). 
If this repeats, please take tcpdump from the client and we will investigate. Regards, Malahal. ----- Original message ----- From: Andreas Mattsson Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [gpfsug-discuss] Filesystem access issues via CES NFS Date: Tue, Nov 20, 2018 8:47 PM On one of our clusters, from time to time if users try to access files or folders via the direct full path over NFS, the NFS-client gets invalid information from the server. For instance, if I run ?ls /gpfs/filesystem/test/test2/test3? over NFS-mount, result is just full of ???????? If I recurse through the path once, for instance by ls?ing or cd?ing through the folders one at a time or running ls ?R, I can then access directly via the full path afterwards. This seem to be intermittent, and I haven?t found how to reliably recreate the issue. Possibly, it can be connected to creating or changing files or folders via a GPFS mount, and then accessing them through NFS, but it doesn?t happen consistently. Is this a known behaviour or bug, and does anyone know how to fix the issue? These NSD-servers currently run Scale 4.2.2.3, while the CES is on 5.0.1.1. GPFS clients run Scale 5.0.1.1, and NFS clients run CentOS 7.5. Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5610 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4575 bytes Desc: not available URL: From jtolson at us.ibm.com Mon Nov 26 14:31:29 2018 From: jtolson at us.ibm.com (John T Olson) Date: Mon, 26 Nov 2018 07:31:29 -0700 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. 
On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Mon Nov 26 14:55:06 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Mon, 26 Nov 2018 14:55:06 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: Hallo John, record is open, TS001631590. Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. 
Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. 
======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From alvise.dorigo at psi.ch Mon Nov 26 15:43:59 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 26 Nov 2018 15:43:59 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Message-ID: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Mon Nov 26 15:54:57 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Mon, 26 Nov 2018 15:54:57 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Try an dedicated extra ? -p ? foreach Attribute Von meinem iPhone gesendet > Am 26.11.2018 um 16:50 schrieb Dorigo Alvise (PSI) : > > Good evening, > I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm > to setup AFM parallel transfer. > > Why the following command (grabbed directly from the web page above) fires out that error ? > > [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome > mmcrfileset: Incorrect extra argument: ?p > Usage: > mmcrfileset Device FilesetName [-p afmAttribute=Value...] 
[-t Comment] > [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] > [--allow-permission-change PermissionChangeMode] > > The mapping was correctly created: > > [root at sf-export-3 ~]# mmafmconfig show > Map name: afmgw1 > Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch > > Is this a known bug ? > > Thanks, > Regards. > > Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Nov 26 16:33:58 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Mon, 26 Nov 2018 16:33:58 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Is that an 'ndash' rather than "-"? Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of alvise.dorigo at psi.ch [alvise.dorigo at psi.ch] Sent: 26 November 2018 15:43 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise From kenneth.waegeman at ugent.be Mon Nov 26 16:26:51 2018 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Mon, 26 Nov 2018 17:26:51 +0100 Subject: [gpfsug-discuss] mmfsck output Message-ID: Hi all, We had some leftover files with IO errors on a GPFS FS, so we ran a mmfsck. Does someone know what these mmfsck errors mean: Error in inode 38422 snap 0: has nlink field as 1 Error in inode 281057 snap 0: is unreferenced ?Attach inode to lost+found of fileset root filesetId 0? no Thanks! Kenneth From daniel.kidger at uk.ibm.com Mon Nov 26 17:03:14 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Mon, 26 Nov 2018 17:03:14 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: References: , <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: An HTML attachment was scrubbed... URL: From abhisdav at in.ibm.com Tue Nov 27 06:38:27 2018 From: abhisdav at in.ibm.com (Abhishek Dave) Date: Tue, 27 Nov 2018 12:08:27 +0530 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Hi, Looks like some issue with syntax. Please try below one. 
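Also make sure the -p is typed as a plain ASCII hyphen. As Simon suggested earlier in the thread, the "?p" in the pasted command looks like it came through as an en-dash from the web page, and that is what mmcrfileset rejects with "Incorrect extra argument: ?p". Using your device, fileset and mapping names from the output above, and with the dash retyped by hand, something like this should go through:

mmcrfileset RAW test1 -p afmmode=sw,afmtarget=gpfs://afmgw1/gpfs/gpfs2/swhome --inode-space new

The generic form and a couple of worked examples (same note about the dash applies):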
mmcrfileset ?p afmmode=,afmtarget=://// --inode-space new #mmcrfileset gpfs1 sw1 ?p afmmode=sw,afmtarget=gpfs://mapping1/gpfs/gpfs2/swhome --inode-space new #mmcrfileset gpfs1 ro1 ?p afmmode=ro,afmtarget=gpfs://mapping2/gpfs/gpfs2/swhome --inode-space new Thanks, Abhishek, Dave From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/26/2018 09:20 PM Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Sent by: gpfsug-discuss-bounces at spectrumscale.org Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes [:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 15:24:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 15:24:25 +0000 Subject: [gpfsug-discuss] Hanging file-systems Message-ID: <06FF0D9C-9ED7-434E-A7FF-C56518048E25@bham.ac.uk> I have a file-system which keeps hanging over the past few weeks. Right now, its offline and taken a bunch of services out with it. (I have a ticket with IBM open about this as well) We see for example: Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), re ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 and on that node: Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 8 (TokenCondvar), reason 'wait for SubToken to become stable' On this node, if you dump tscomm, you see entries like: Pending messages: msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1 this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) dest status pending , err 0, reply len 0 by TCP connection c0n9 is itself. This morning when this happened, the only way to get the FS back online was to shutdown the entire cluster. Any pointers for next place to look/how to fix? Simon -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Tue Nov 27 16:02:44 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 27 Nov 2018 16:02:44 +0000 Subject: [gpfsug-discuss] Hanging file-systems Message-ID: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> I have seen something like this in the past, and I have resorted to a cluster restart as well. :-( IBM and I could never really track it down, because I could not get a dump at the time of occurrence. However, you might take a look at your NSD servers, one at a time. As I recall, we thought it was a stuck thread on one of the NSD servers, and when we restarted the ?right? one it cleared the block. The other thing I?ve done in the past to isolate problems like this (since this is related to tokens) is to look at the ?token revokes? on each node, looking for ones that are sticking around for a long time. I tossed together a quick script and ran it via mmdsh on all the node. Not pretty, but it got the job done. Run this a few times, see if any of the revokes are sticking around for a long time #!/bin/sh rm -f /tmp/revokelist /usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > /tmp/revokelist 2> /dev/null if [ $? -eq 0 ]; then /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out for n in `cat /tmp/revokelist | grep msgHdr | awk '{print $5}'`; do grep $n /tmp/tscomm.out | tail -1 done rm -f /tmp/tscomm.out fi Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Simon Thompson Reply-To: gpfsug main discussion list Date: Tuesday, November 27, 2018 at 9:27 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] [gpfsug-discuss] Hanging file-systems I have a file-system which keeps hanging over the past few weeks. Right now, its offline and taken a bunch of services out with it. (I have a ticket with IBM open about this as well) We see for example: Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), re ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 and on that node: Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 8 (TokenCondvar), reason 'wait for SubToken to become stable' On this node, if you dump tscomm, you see entries like: Pending messages: msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1 this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) dest status pending , err 0, reply len 0 by TCP connection c0n9 is itself. This morning when this happened, the only way to get the FS back online was to shutdown the entire cluster. Any pointers for next place to look/how to fix? Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 16:14:20 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 08:14:20 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> Message-ID: if this happens you should check a couple of things : 1. are you under memory pressure or even worse started swapping . 2. is there any core running at ~ 0% idle - run top , press 1 and check the idle column. 3. is there any single thread running at ~100% - run top , press shift - h and check what the CPU % shows for the top 5 processes. 
if you want to go the extra mile, you could run perf top -p $PID_OF_MMFSD and check what the top cpu consumers are. confirming and providing data to any of the above being true could be the missing piece why nobody was able to find it, as this is stuff unfortunate nobody ever looks at. even a trace won't help if any of the above is true as all you see is that the system behaves correct according to the trace, its doesn't appear busy, Sven On Tue, Nov 27, 2018 at 8:03 AM Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: > I have seen something like this in the past, and I have resorted to a > cluster restart as well. :-( IBM and I could never really track it down, > because I could not get a dump at the time of occurrence. However, you > might take a look at your NSD servers, one at a time. As I recall, we > thought it was a stuck thread on one of the NSD servers, and when we > restarted the ?right? one it cleared the block. > > > > The other thing I?ve done in the past to isolate problems like this (since > this is related to tokens) is to look at the ?token revokes? on each node, > looking for ones that are sticking around for a long time. I tossed > together a quick script and ran it via mmdsh on all the node. Not pretty, > but it got the job done. Run this a few times, see if any of the revokes > are sticking around for a long time > > > > #!/bin/sh > > rm -f /tmp/revokelist > > /usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > > /tmp/revokelist 2> /dev/null > > if [ $? -eq 0 ]; then > > /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out > > for n in `cat /tmp/revokelist | grep msgHdr | awk '{print $5}'`; do > > grep $n /tmp/tscomm.out | tail -1 > > done > > rm -f /tmp/tscomm.out > > fi > > > > > > Bob Oesterlin > > Sr Principal Storage Engineer, Nuance > > > > > > > > *From: * on behalf of Simon > Thompson > *Reply-To: *gpfsug main discussion list > *Date: *Tuesday, November 27, 2018 at 9:27 AM > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[EXTERNAL] [gpfsug-discuss] Hanging file-systems > > > > I have a file-system which keeps hanging over the past few weeks. Right > now, its offline and taken a bunch of services out with it. > > > > (I have a ticket with IBM open about this as well) > > > > We see for example: > > Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 > SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 > (MsgRecordCondvar), re > > ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 > > > > and on that node: > > Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 > SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 > > 8 (TokenCondvar), reason 'wait for SubToken to become stable' > > > > On this node, if you dump tscomm, you see entries like: > > Pending messages: > > msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, > n_pending 1 > > this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec > > sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) > > dest status pending , err 0, reply len 0 by TCP > connection > > > > c0n9 is itself. > > > > This morning when this happened, the only way to get the FS back online > was to shutdown the entire cluster. > > > > Any pointers for next place to look/how to fix? 
> > > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 17:53:58 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 17:53:58 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> Message-ID: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Tue Nov 27 17:54:03 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 27 Nov 2018 23:24:03 +0530 Subject: [gpfsug-discuss] mmfsck output In-Reply-To: References: Message-ID: This means that the files having the below inode numbers 38422 and 281057 are orphan files (i.e. files not referenced by any directory/folder) and they will be moved to the lost+found folder of the fileset owning these files by mmfsck repair. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Kenneth Waegeman To: gpfsug main discussion list Date: 11/26/2018 10:10 PM Subject: [gpfsug-discuss] mmfsck output Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We had some leftover files with IO errors on a GPFS FS, so we ran a mmfsck. Does someone know what these mmfsck errors mean: Error in inode 38422 snap 0: has nlink field as 1 Error in inode 281057 snap 0: is unreferenced Attach inode to lost+found of fileset root filesetId 0? no Thanks! Kenneth _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=-J2C2ZYYUsp42fIyYHg3aYSR8wC5SKNhl6ZztfRJMvI&s=4OPQpDp8v56fvska0-O-pskIfONFMnZFydDo0T6KwJM&e= -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From oehmes at gmail.com Tue Nov 27 18:19:04 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 10:19:04 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson wrote: > Thanks Sven ? > > > > We found a node with kswapd running 100% (and swap was off)? > > > > Killing that node made access to the FS spring into life. > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 16:14 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From skylar2 at uw.edu Tue Nov 27 18:06:57 2018 From: skylar2 at uw.edu (Skylar Thompson) Date: Tue, 27 Nov 2018 18:06:57 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From Dwayne.Hart at med.mun.ca Tue Nov 27 19:25:08 2018 From: Dwayne.Hart at med.mun.ca (Dwayne.Hart at med.mun.ca) Date: Tue, 27 Nov 2018 19:25:08 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> , <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: Hi Simon, Was there a reason behind swap being disabled? Best, Dwayne ? Dwayne Hart | Systems Administrator IV CHIA, Faculty of Medicine Memorial University of Newfoundland 300 Prince Philip Drive St. John?s, Newfoundland | A1B 3V6 Craig L Dobbin Building | 4M409 T 709 864 6631 On Nov 27, 2018, at 2:24 PM, Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From TOMP at il.ibm.com Tue Nov 27 19:35:36 2018 From: TOMP at il.ibm.com (Tomer Perry) Date: Tue, 27 Nov 2018 21:35:36 +0200 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com><4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Message-ID: "paging to disk" sometimes means mmap as well - there were several issues around that recently as well. Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Skylar Thompson To: gpfsug-discuss at spectrumscale.org Date: 27/11/2018 20:28 Subject: Re: [gpfsug-discuss] Hanging file-systems Sent by: gpfsug-discuss-bounces at spectrumscale.org Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 20:02:14 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 20:02:14 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Message-ID: Yes, but we?d upgraded all out HPC client nodes to 5.0.2-1 last week as well when this first happened ? Unless it?s necessary to upgrade the NSD servers as well for this? 
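For what it's worth, this is roughly how I'm double-checking what is actually installed where, and what level the cluster is committed to (paths assume a default install):

# installed daemon version on every node
mmdsh -N all '/usr/lpp/mmfs/bin/mmdiag --version'
# what the cluster as a whole is committed to
/usr/lpp/mmfs/bin/mmlsconfig minReleaseLevel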
Simon From: on behalf of "TOMP at il.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 19:48 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems "paging to disk" sometimes means mmap as well - there were several issues around that recently as well. Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Skylar Thompson To: gpfsug-discuss at spectrumscale.org Date: 27/11/2018 20:28 Subject: Re: [gpfsug-discuss] Hanging file-systems Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 20:09:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 20:09:25 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 I think). So is this memory pressure on the NSD nodes then? I thought it was documented somewhere that GFPS won?t use more than 50% of the host memory. And actually if you look at the values for maxStatCache and maxFilesToCache, the memory footprint is quite small. Sure on these NSD servers we had a pretty big pagepool (which we?ve dropped by some), but there still should have been quite a lot of memory space on the nodes ? 
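Next time it wedges I'll grab the numbers Sven asked for from the NSD server that's showing the problem, roughly:

# the stats Sven asked for
mmdiag --stats
# memory actually in use by mmfsd (heap, pools, token memory)
mmdiag --memory
# the cache settings as the daemon sees them
mmdiag --config | grep -iE 'pagepool|maxFilesToCache|maxStatCache'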
If only someone as going to do a talk in December at the CIUK SSUG on memory usage ? Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 18:19 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 20:43:04 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 12:43:04 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: was the node you rebooted a client or a server that was running kswapd at 100% ? sven On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson wrote: > The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 > I think). 
> > > > So is this memory pressure on the NSD nodes then? I thought it was > documented somewhere that GFPS won?t use more than 50% of the host memory. > > > > And actually if you look at the values for maxStatCache and > maxFilesToCache, the memory footprint is quite small. > > > > Sure on these NSD servers we had a pretty big pagepool (which we?ve > dropped by some), but there still should have been quite a lot of memory > space on the nodes ? > > > > If only someone as going to do a talk in December at the CIUK SSUG on > memory usage ? > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 18:19 > > > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > Hi, > > > > now i need to swap back in a lot of information about GPFS i tried to swap > out :-) > > > > i bet kswapd is not doing anything you think the name suggest here, which > is handling swap space. i claim the kswapd thread is trying to throw > dentries out of the cache and what it tries to actually get rid of are > entries of directories very high up in the tree which GPFS still has a > refcount on so it can't free it. when it does this there is a single thread > (unfortunate was never implemented with multiple threads) walking down the > tree to find some entries to steal, it it can't find any it goes to the > next , next , etc and on a bus system it can take forever to free anything > up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which > i pushed for the weeks before i left IBM. you never see this in a trace > with default traces which is why nobody would have ever suspected this, you > need to set special trace levels to even see this. > > i don't know the exact version the changes went into, but somewhere in the > 5.0.1.X timeframe. the change was separating the cache list to prefer > stealing files before directories, also keep a minimum percentages of > directories in the cache (10 % by default) before it would ever try to get > rid of a directory. it also tries to keep a list of free entries all the > time (means pro active cleaning them) and also allows to go over the hard > limit compared to just block as in previous versions. so i assume you run a > version prior to 5.0.1.x and what you see is kspwapd desperately get rid of > entries, but can't find one its already at the limit so it blocks and > doesn't allow a new entry to be created or promoted from the statcache . > > > > again all this is without source code access and speculation on my part > based on experience :-) > > > > what version are you running and also share mmdiag --stats of that node > > > > sven > > > > > > > > > > > > > > On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: > > Thanks Sven ? > > > > We found a node with kswapd running 100% (and swap was off)? > > > > Killing that node made access to the FS spring into life. > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 16:14 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > 1. are you under memory pressure or even worse started swapping . 
> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 20:44:26 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 12:44:26 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: and i already talk about NUMA stuff at the CIUK usergroup meeting, i won't volunteer for a 2nd advanced topic :-D On Tue, Nov 27, 2018 at 12:43 PM Sven Oehme wrote: > was the node you rebooted a client or a server that was running kswapd at > 100% ? > > sven > > > On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson > wrote: > >> The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 >> I think). >> >> >> >> So is this memory pressure on the NSD nodes then? I thought it was >> documented somewhere that GFPS won?t use more than 50% of the host memory. >> >> >> >> And actually if you look at the values for maxStatCache and >> maxFilesToCache, the memory footprint is quite small. >> >> >> >> Sure on these NSD servers we had a pretty big pagepool (which we?ve >> dropped by some), but there still should have been quite a lot of memory >> space on the nodes ? >> >> >> >> If only someone as going to do a talk in December at the CIUK SSUG on >> memory usage ? >> >> >> >> Simon >> >> >> >> *From: * on behalf of " >> oehmes at gmail.com" >> *Reply-To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Date: *Tuesday, 27 November 2018 at 18:19 >> >> >> *To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Subject: *Re: [gpfsug-discuss] Hanging file-systems >> >> >> >> Hi, >> >> >> >> now i need to swap back in a lot of information about GPFS i tried to >> swap out :-) >> >> >> >> i bet kswapd is not doing anything you think the name suggest here, which >> is handling swap space. i claim the kswapd thread is trying to throw >> dentries out of the cache and what it tries to actually get rid of are >> entries of directories very high up in the tree which GPFS still has a >> refcount on so it can't free it. when it does this there is a single thread >> (unfortunate was never implemented with multiple threads) walking down the >> tree to find some entries to steal, it it can't find any it goes to the >> next , next , etc and on a bus system it can take forever to free anything >> up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which >> i pushed for the weeks before i left IBM. you never see this in a trace >> with default traces which is why nobody would have ever suspected this, you >> need to set special trace levels to even see this. >> >> i don't know the exact version the changes went into, but somewhere in >> the 5.0.1.X timeframe. the change was separating the cache list to prefer >> stealing files before directories, also keep a minimum percentages of >> directories in the cache (10 % by default) before it would ever try to get >> rid of a directory. 
it also tries to keep a list of free entries all the >> time (means pro active cleaning them) and also allows to go over the hard >> limit compared to just block as in previous versions. so i assume you run a >> version prior to 5.0.1.x and what you see is kspwapd desperately get rid of >> entries, but can't find one its already at the limit so it blocks and >> doesn't allow a new entry to be created or promoted from the statcache . >> >> >> >> again all this is without source code access and speculation on my part >> based on experience :-) >> >> >> >> what version are you running and also share mmdiag --stats of that node >> >> >> >> sven >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson >> wrote: >> >> Thanks Sven ? >> >> >> >> We found a node with kswapd running 100% (and swap was off)? >> >> >> >> Killing that node made access to the FS spring into life. >> >> >> >> Simon >> >> >> >> *From: * on behalf of " >> oehmes at gmail.com" >> *Reply-To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Date: *Tuesday, 27 November 2018 at 16:14 >> *To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Subject: *Re: [gpfsug-discuss] Hanging file-systems >> >> >> >> 1. are you under memory pressure or even worse started swapping . >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From constance.rice at us.ibm.com Tue Nov 27 20:28:14 2018 From: constance.rice at us.ibm.com (Constance M Rice) Date: Tue, 27 Nov 2018 20:28:14 +0000 Subject: [gpfsug-discuss] Introduction Message-ID: Hello, I am a new member here. I work for IBM in the Washington System Center supporting Spectrum Scale and ESS across North America. I live in Leesburg, Virginia, USA northwest of Washington, DC. Connie Rice Storage Specialist Washington Systems Center Mobile: 202-821-6747 E-mail: constance.rice at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 56935 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 21:01:07 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 21:01:07 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: <66C52F6F-5193-4DD7-B87E-C88E9ADBB53D@bham.ac.uk> It was an NSD server ? we?d already shutdown all the clients in the remote clusters! And Tomer has already agreed to do a talk on memory ? (but I?m still looking for a user talk if anyone is interested!) 
Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 20:44 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems and i already talk about NUMA stuff at the CIUK usergroup meeting, i won't volunteer for a 2nd advanced topic :-D On Tue, Nov 27, 2018 at 12:43 PM Sven Oehme > wrote: was the node you rebooted a client or a server that was running kswapd at 100% ? sven On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson > wrote: The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 I think). So is this memory pressure on the NSD nodes then? I thought it was documented somewhere that GFPS won?t use more than 50% of the host memory. And actually if you look at the values for maxStatCache and maxFilesToCache, the memory footprint is quite small. Sure on these NSD servers we had a pretty big pagepool (which we?ve dropped by some), but there still should have been quite a lot of memory space on the nodes ? If only someone as going to do a talk in December at the CIUK SSUG on memory usage ? Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 18:19 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. 
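For anyone following this thread in the archive: a rough first-pass check on a suspect NSD server might look like the sketch below. All of the commands are standard Scale/Linux admin tools; the idea of correlating kswapd CPU with the dentry/stat cache behaviour comes from Sven's explanation above, and none of the values you get back should be read against any fixed threshold.

  # Is kswapd spinning, and is swap actually configured?
  ps -eo pid,pcpu,comm | grep kswapd
  free -m
  cat /proc/swaps

  # Which code level is this node on (the cache-steal behaviour changed in 5.0.1.x)?
  /usr/lpp/mmfs/bin/mmdiag --version

  # Cache-related settings and what the daemon has actually allocated
  /usr/lpp/mmfs/bin/mmlsconfig pagepool
  /usr/lpp/mmfs/bin/mmlsconfig maxFilesToCache
  /usr/lpp/mmfs/bin/mmlsconfig maxStatCache
  /usr/lpp/mmfs/bin/mmdiag --memory

  # The statistics Sven asked for
  /usr/lpp/mmfs/bin/mmdiag --stats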
Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Thu Nov 29 07:29:36 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Thu, 29 Nov 2018 07:29:36 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. 
November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. 
======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From TOMP at il.ibm.com Thu Nov 29 07:45:00 2018 From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 29 Nov 2018 09:45:00 +0200 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Message-ID: Hi, I remember there was some defect around tsctl and mixed domains - bot sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info ( I believe he had some sample script). Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" To: 'gpfsug main discussion list' Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. Von: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. 
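A concrete way to test the suspicion that tsctl is producing wrong node names is to compare what the daemon itself reports against the cluster definitions; a local hostname carrying the other cluster's domain suffix is the telltale. The commands below are standard, only the interpretation of "wrong" is an assumption to be confirmed on the affected nodes.

  # Node names as the local daemon resolves them - look for hostnames
  # that carry the remote cluster's domain suffix instead of the local one
  /usr/lpp/mmfs/bin/tsctl shownodes up

  # What the local cluster and remote-cluster definitions actually contain
  /usr/lpp/mmfs/bin/mmlscluster
  /usr/lpp/mmfs/bin/mmremotecluster show all

  # Current daemon-to-daemon connections and their resolved names
  /usr/lpp/mmfs/bin/mmdiag --network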
If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Thu Nov 29 08:03:34 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Thu, 29 Nov 2018 08:03:34 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Message-ID: <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> Hallo Tomer, thanks for this Info, but can you explain in witch release all these points fixed now? Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Tomer Perry Gesendet: Donnerstag, 29. November 2018 08:45 An: gpfsug main discussion list ; Olaf Weiser Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Hi, I remember there was some defect around tsctl and mixed domains - bot sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info ( I believe he had some sample script). 
Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" > To: 'gpfsug main discussion list' > Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list > Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. 
There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From olaf.weiser at de.ibm.com Thu Nov 29 08:39:01 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 29 Nov 2018 09:39:01 +0100 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From MDIETZ at de.ibm.com Thu Nov 29 10:45:25 2018 From: MDIETZ at de.ibm.com (Mathias Dietz) Date: Thu, 29 Nov 2018 11:45:25 +0100 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de><44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> Message-ID: Hi Renar, the tsctl problem is described in APAR IV93896 https://www-01.ibm.com/support/docview.wss?uid=isg1IV93896 You can easily find out if your system has the problem: Run "tsctl shownodes up" and check if the hostnames are valid, if the hostnames are wrong/mixed up then you are affected. This APAR has been fixed with 5.0.2 Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale Development - Release Lead Architect (4.2.x) Spectrum Scale RAS Architect --------------------------------------------------------------------------- IBM Deutschland Am Weiher 24 65451 Kelsterbach Phone: +49 70342744105 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ----------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk WittkoppSitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: "Olaf Weiser" To: "Grunenberg, Renar" Cc: gpfsug main discussion list Date: 29/11/2018 09:39 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org HI Tomer, send my work around wrapper to Renar.. I've seen to less data to be sure, that's the same (tsctl shownodes ...) issue but he'll try and let us know .. From: "Grunenberg, Renar" To: gpfsug main discussion list , "Olaf Weiser" Date: 11/29/2018 09:04 AM Subject: AW: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Hallo Tomer, thanks for this Info, but can you explain in witch release all these points fixed now? Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. 
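Olaf's wrapper itself went off-list, so the following is only a sketch of the general idea, with invented names: it assumes the shipped binary has been moved aside to tsctl.bin and that ".remote.example.com" stands in for whichever wrong suffix "tsctl shownodes up" prints on the affected cluster. It is a stop-gap to apply only under IBM guidance, and it becomes unnecessary once the cluster runs a level containing the IV93896 fix (5.0.2 or later, as Mathias notes).

  #!/bin/sh
  # Hypothetical wrapper installed in place of /usr/lpp/mmfs/bin/tsctl
  # after the original binary has been renamed to tsctl.bin (names are
  # illustrative). Only the "shownodes up" output is rewritten, stripping
  # the wrongly appended remote-cluster domain; everything else is passed
  # straight through to the real binary.
  REAL=/usr/lpp/mmfs/bin/tsctl.bin
  BADSUFFIX=".remote.example.com"   # placeholder - use the suffix seen locally
  if [ "$1" = "shownodes" ] && [ "$2" = "up" ]; then
      "$REAL" "$@" | sed "s/${BADSUFFIX}//g"
  else
      exec "$REAL" "$@"
  fi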
Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. Von: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Tomer Perry Gesendet: Donnerstag, 29. November 2018 08:45 An: gpfsug main discussion list ; Olaf Weiser Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Hi, I remember there was some defect around tsctl and mixed domains - bot sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info ( I believe he had some sample script). Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" To: 'gpfsug main discussion list' Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. 
Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. Von: gpfsug-discuss-bounces at spectrumscale.org[ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. 
This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From spectrumscale at kiranghag.com Thu Nov 29 15:42:48 2018 From: spectrumscale at kiranghag.com (KG) Date: Thu, 29 Nov 2018 21:12:48 +0530 Subject: [gpfsug-discuss] high cpu usage by mmfsadm Message-ID: One of our scale node shows 30-50% CPU utilisation by mmfsadm while filesystem is being accessed. Is this normal? (The node is configured as server node but not a manager node for any filesystem or NSD) -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Thu Nov 29 17:57:00 2018 From: bbanister at jumptrading.com (Bryan Banister) Date: Thu, 29 Nov 2018 17:57:00 +0000 Subject: [gpfsug-discuss] high cpu usage by mmfsadm In-Reply-To: References: Message-ID: <671bbd4db92d496abbbceead1b9a7d5c@jumptrading.com> I wouldn?t call that normal? probably take a gpfs.snap and open a PMR to get the quickest answer from IBM support, -B From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of KG Sent: Thursday, November 29, 2018 9:43 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] high cpu usage by mmfsadm [EXTERNAL EMAIL] One of our scale node shows 30-50% CPU utilisation by mmfsadm while filesystem is being accessed. Is this normal? (The node is configured as server node but not a manager node for any filesystem or NSD) ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. 
You may have rights regarding your personal data; for information on exercising these rights or the Company?s treatment of personal data, please email datarequests at jumptrading.com. -------------- next part -------------- An HTML attachment was scrubbed... URL:
> Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > Also - According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run mmchconfig release=LATEST command, and that will resolve the issue. > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > Our previous state: > > 2 Storage clusters - 4.2.3.2 > 1 Compute cluster - 4.2.3.2 ( remote mounts the above 2 storage clusters ) > > Our current state: > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > 1 Compute cluster - 5.0.0.2 > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > Any advice on the best steps forward, would greatly help. > > Thanks, > > Lohit > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Nov 2 16:29:19 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 2 Nov 2018 12:29:19 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Message-ID: Yes, We have upgraded to 5.0.1-0.5, which has the patch for the issue. The related IBM case number was :?TS001010674 Regards, Lohit On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems , wrote: > Hi Damir, > > It was related to specific user jobs and mmap (?). We opened PMR with IBM and have patch from IBM, since than we don?t see issue. > > Regards, > > Sveta. > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > Hi, > > > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > > ?-V ? ? ? ? ? ? ? ? 15.01 (4.2.0.0) ? ? ? ? 
?File system version > > > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > > > Damir > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > Hello All, > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > > > > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > > > > > > > I have raised an IBM critical service request about a month ago related to this -?PMR: 24090,L6Q,000. > > > > However, According to the ticket ?- they seemed to feel that it might not be related to GPFS. > > > > Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > > > > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > > > > > > > Also ?- According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run ?mmchconfig release=LATEST command, and that will resolve the issue. > > > > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > > > > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > > > > > > > Our previous state: > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > 1 Compute cluster - 4.2.3.2 ?( remote mounts the above 2 storage clusters ) > > > > > > > > Our current state: > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > > > > > > > Any advice on the best steps forward, would greatly help. 
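For the downgrade question quoted above, one thing worth confirming before any roll-back is whether the cluster release level or the file system format was ever committed to 5.x. A minimal sketch of the checks, run from an NSD server ("gpfs0" is a placeholder device name, not one taken from this thread):

# mmlsconfig minReleaseLevel      # stays at the 4.2.x value until "mmchconfig release=LATEST" is run
# mmlsfs gpfs0 -V                 # file system format; stays at 4.2.2.2 until "mmchfs gpfs0 -V full" is run

As long as both still report the 4.2.x levels, nothing on disk has been committed to the 5.x format, which is what the original poster was trying to preserve.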
> > > > > > > > Thanks, > > > > > > > > Lohit > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From valleru at cbio.mskcc.org Fri Nov 2 16:31:12 2018 From: valleru at cbio.mskcc.org (valleru at cbio.mskcc.org) Date: Fri, 2 Nov 2018 12:31:12 -0400 Subject: [gpfsug-discuss] Critical Hang issues with GPFS 5.0. Downgrading from GPFS 5.0.0-2 to GPFS 4.2.3.2 In-Reply-To: References: <7eb36288-4a26-4322-8161-6a2c3fbdec41@Spark> <8BD6C3CE-8B9C-4B8C-8074-4F7F0E92C5FC@cbio.mskcc.org> Message-ID: <5469f6aa-3f82-47b2-8b82-a599edfa2f16@Spark> Also - You could just upgrade one of the clients to this version, and test to see if the hang still occurs. You do not have to upgrade the NSD servers, to test. Regards, Lohit On Nov 2, 2018, 12:29 PM -0400, valleru at cbio.mskcc.org, wrote: > Yes, > > We have upgraded to 5.0.1-0.5, which has the patch for the issue. > The related IBM case number was :?TS001010674 > > Regards, > Lohit > > On Nov 2, 2018, 12:27 PM -0400, Mazurkova, Svetlana/Information Systems , wrote: > > Hi Damir, > > > > It was related to specific user jobs and mmap (?). We opened PMR with IBM and have patch from IBM, since than we don?t see issue. > > > > Regards, > > > > Sveta. > > > > > On Nov 2, 2018, at 11:55 AM, Damir Krstic wrote: > > > > > > Hi, > > > > > > Did you ever figure out the root cause of the issue? We have recently (end of the June) upgraded our storage to: gpfs.base-5.0.0-1.1.3.ppc64 > > > > > > In the last few weeks we have seen an increasing number of ps hangs across compute and login nodes on our cluster. The filesystem version (of all filesystems on our cluster) is: > > > ?-V ? ? ? ? ? ? ? ? 15.01 (4.2.0.0) ? ? ? ? ?File system version > > > > > > I am just wondering if anyone has seen this type of issue since you first reported it and if there is a known fix for it. > > > > > > Damir > > > > > > > On Tue, May 22, 2018 at 10:43 AM wrote: > > > > > Hello All, > > > > > > > > > > We have recently upgraded from GPFS 4.2.3.2 to GPFS 5.0.0-2 about a month ago. We have not yet converted the 4.2.2.2 filesystem version to 5. ( That is we have not run the mmchconfig release=LATEST command) > > > > > Right after the upgrade, we are seeing many ?ps hangs" across the cluster. All the ?ps hangs? happen when jobs run related to a Java process or many Java threads (example: GATK ) > > > > > The hangs are pretty random, and have no particular pattern except that we know that it is related to just Java or some jobs reading from directories with about 600000 files. > > > > > > > > > > I have raised an IBM critical service request about a month ago related to this -?PMR: 24090,L6Q,000. > > > > > However, According to the ticket ?- they seemed to feel that it might not be related to GPFS. > > > > > Although, we are sure that these hangs started to appear only after we upgraded GPFS to GPFS 5.0.0.2 from 4.2.3.2. 
> > > > > > > > > > One of the other reasons we are not able to prove that it is GPFS is because, we are unable to capture any logs/traces from GPFS once the hang happens. > > > > > Even GPFS trace commands hang, once ?ps hangs? and thus it is getting difficult to get any dumps from GPFS. > > > > > > > > > > Also ?- According to the IBM ticket, they seemed to have a seen a ?ps hang" issue and we have to run ?mmchconfig release=LATEST command, and that will resolve the issue. > > > > > However we are not comfortable making the permanent change to Filesystem version 5. and since we don?t see any near solution to these hangs - we are thinking of downgrading to GPFS 4.2.3.2 or the previous state that we know the cluster was stable. > > > > > > > > > > Can downgrading GPFS take us back to exactly the previous GPFS config state? > > > > > With respect to downgrading from 5 to 4.2.3.2 -> is it just that i reinstall all rpms to a previous version? or is there anything else that i need to make sure with respect to GPFS configuration? > > > > > Because i think that GPFS 5.0 might have updated internal default GPFS configuration parameters , and i am not sure if downgrading GPFS will change them back to what they were in GPFS 4.2.3.2 > > > > > > > > > > Our previous state: > > > > > > > > > > 2 Storage clusters - 4.2.3.2 > > > > > 1 Compute cluster - 4.2.3.2 ?( remote mounts the above 2 storage clusters ) > > > > > > > > > > Our current state: > > > > > > > > > > 2 Storage clusters - 5.0.0.2 ( filesystem version - 4.2.2.2) > > > > > 1 Compute cluster - 5.0.0.2 > > > > > > > > > > Do i need to downgrade all the clusters to go to the previous state ? or is it ok if we just downgrade the compute cluster to previous version? > > > > > > > > > > Any advice on the best steps forward, would greatly help. > > > > > > > > > > Thanks, > > > > > > > > > > Lohit > > > > > _______________________________________________ > > > > > gpfsug-discuss mailing list > > > > > gpfsug-discuss at spectrumscale.org > > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From novosirj at rutgers.edu Sat Nov 3 20:21:50 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Sat, 3 Nov 2018 20:21:50 +0000 Subject: [gpfsug-discuss] Slightly OT: Getting from DFW to SC17 hotels in Dallas In-Reply-To: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> References: <45EB6C38-DEB4-4884-AC9A-5D8D2CBD93E6@pawsey.org.au> Message-ID: <09CC2A72-2D2C-4722-87CB-A4B1093D90BC@rutgers.edu> I took the bus back to the airport in Austin (the Airport Flyer). Was a good experience. If Austin is the city I?m thinking of, I took SuperShuttle to the hotel (I believe because I arrived late at night) and was the fourth hotel that got dropped off, which roughly doubled the trip time. There is that risk with the shared-ride shuttles. In recent years, the only location without a solid public transit option was New Orleans (I used it anyway). They have an express airport bus, but the hours and frequency are not ideal (there?s a local bus as well, which is quite a bit slower). 
SLC had good light rail service, Denver has good rail service from the airport to downtown, and Atlanta has good subway service (all of these I?ve used before). Typically the transit option is less than $10 round-trip (Denver?s is above-average at $9 each way), sometimes even less than $5. > On Nov 2, 2018, at 5:37 AM, Chris Schlipalius wrote: > > Hi all, so I?ve used Super Shuttle booked online for both New Orleans SC round trip and Austin SC just to the hotel, travelling solo and a Sheraton hotel shuttle back to the airport (as a solo travel option, Super is a good price). > In Austin for SC my boss actually took the bus to his hotel! > > For SC18 my colleagues and I will prob pre-book a van transfer as there?s a few of us. > Some of the Aussie IBM staff are hiring a car to get to their hotel, so if theres a few who can share, that?s also a good share option if you can park or drop the rental car at or near your hotel. > > Regards, Chris > >> On 2 Nov 2018, at 4:02 pm, gpfsug-discuss-request at spectrumscale.org wrote: >> >> Re: Slightly OT: Getting from DFW to SC17 hotels in Dallas > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From henrik.cednert at filmlance.se Tue Nov 6 06:23:44 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 06:23:44 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Message-ID: Hi there For some reason my mail didn?t get through. Trying again. Apologies if there's duplicates... The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. 
root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. -------------- next part -------------- An HTML attachment was scrubbed... URL: 
From viccornell at gmail.com Tue Nov 6 09:35:27 2018 From: viccornell at gmail.com (Vic Cornell) Date: Tue, 6 Nov 2018 09:35:27 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> References: <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell > On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) wrote: > > Hi there > > The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. > > But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. > > I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck.
> > I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( > > Versions involved: > Windows 10 Enterprise 2016 LTSB > IBM GPFS Express Edition 4.1.0.4 > IBM GPFS Express Edition License and Prerequisites 4.1 > IBM GPFS GSKit 8.0.0.32 > > Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. > > Cheers and many thanks in advance for helping me out here. I?m all ears. > > > root at M5-CLIPSTER02 ~ > $ cat /var/adm/ras/mmfs.log.latest > Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting > Removing old /var/adm/ras/mmfs.log.* files: > mmtrace: The tracefmt.exe or tracelog.exe command can not be found. > mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. > Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. > Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license > Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... > Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... > Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... > Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. > Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. > Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... > Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... > Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... > Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... > Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... > Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... > Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... > Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... > Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... > Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... > Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... > Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... > Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... > Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... > Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... > Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs > Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs > Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. > Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready > Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all > Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 > > > > > -- > Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance > Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? > > Disclaimer > > > The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. 
If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From daniel.kidger at uk.ibm.com Tue Nov 6 10:08:03 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Tue, 6 Nov 2018 10:08:03 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: An HTML attachment was scrubbed... URL: 
From lgayne at us.ibm.com Tue Nov 6 13:46:34 2018 From: lgayne at us.ibm.com (Lyle Gayne) Date: Tue, 6 Nov 2018 08:46:34 -0500 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: Message-ID: Vipul or Heather should be able to assist. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/06/2018 07:00 AM Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But?
Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. 
Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ? \_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 6 13:52:03 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 6 Nov 2018 13:52:03 +0000 Subject: [gpfsug-discuss] Spectrum Scale and Firewalls In-Reply-To: References: <10239ED8-7E0D-4420-8BEC-F17F0606BE64@bham.ac.uk> Message-ID: Just to close the loop on this, IBM support confirmed it?s a bug in mmnetverify and will be fixed in a later PTF. (I didn?t feel the need for an EFIX for this) Simon From: on behalf of Simon Thompson Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 19 October 2018 at 14:39 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Spectrum Scale and Firewalls Yeah we have the perfmon ports open, and GUI ports open on the GUI nodes. But basically this is just a storage cluster and everything else (protocols etc) run in remote clusters. I?ve just opened a ticket ? no longer a PMR in the new support centre for Scale Simon From: on behalf of "knop at us.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 19 October 2018 at 14:05 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Spectrum Scale and Firewalls Simon, Depending on what functions are being used in Scale, other ports may also get used, as documented in https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_firewall.htm On the other hand, I'd initially speculate that you might be hitting a problem in mmnetverify itself. (perhaps some aspect in mmnetverify is not taking into account that ports other than 22, 1191, 60000-61000 may be getting blocked by the firewall) Could you open a PMR for this one? Thanks, Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Simon Thompson ---10/19/2018 06:41:27 AM---Hi, We?re having some issues bringing up firewalls on som]Simon Thompson ---10/19/2018 06:41:27 AM---Hi, We?re having some issues bringing up firewalls on some of our NSD nodes. The problem I was actua From: Simon Thompson To: "gpfsug-discuss at spectrumscale.org" Date: 10/19/2018 06:41 AM Subject: [gpfsug-discuss] Spectrum Scale and Firewalls Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi, We?re having some issues bringing up firewalls on some of our NSD nodes. 
The problem I was actually trying to diagnose I don?t think is firewall related but still ? We have port 22 and 1191 open and also 60000-61000, we also set: # mmlsconfig tscTcpPort tscTcpPort 1191 # mmlsconfig tscCmdPortRange tscCmdPortRange 60000-61000 https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.0/com.ibm.spectrum.scale.v5r00.doc/bl1adv_firewallforinternalcommn.htm Claims this is sufficient ? Running mmnetverify: # mmnetverify all --target-nodes rds-er-mgr01 rds-pg-mgr01 checking local configuration. Operation interface: Success. rds-pg-mgr01 checking communication with node rds-er-mgr01. Operation resolution: Success. Operation ping: Success. Operation shell: Success. Operation copy: Success. Operation time: Success. Operation daemon-port: Success. Operation sdrserv-port: Success. Operation tsccmd-port: Success. Operation data-small: Success. Operation data-medium: Success. Operation data-large: Success. Could not connect to port 46326 on node rds-pg-mgr01 (10.20.0.56): timed out. This may indicate a firewall configuration issue. Operation bandwidth-node: Fail. rds-pg-mgr01 checking cluster communications. Issues Found: rds-er-mgr01 could not connect to rds-pg-mgr01 (TCP, port 46326). mmnetverify: Command failed. Examine previous error messages to determine cause. Note that the port number mentioned changes if we run mmnetverify multiple times. The two clients in this test are running 5.0.2 code. If I run in verbose mode I see: Checking network communication with node rds-er-mgr01. Port range restricted by cluster configuration: 60000 - 61000. rds-er-mgr01: connecting to node rds-pg-mgr01. rds-er-mgr01: exchanged 256.0M bytes with rds-pg-mgr01. Write size: 16.0M bytes. Network statistics for rds-er-mgr01 during data exchange: packets sent: 68112 packets received: 72452 Network Traffic between rds-er-mgr01 and rds-pg-mgr01 port 60000 ok. Operation data-large: Success. Checking network bandwidth. rds-er-mgr01: connecting to node rds-pg-mgr01. Could not connect to port 36277 on node rds-pg-mgr01 (10.20.0.56): timed out. This may indicate a firewall configuration issue. Operation bandwidth-node: Fail. So for many of the tests it looks like its using port 60000 as expected, is this just a bug in mmnetverify or am I doing something silly? Thanks Simon_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 107 bytes Desc: image001.gif URL: From henrik.cednert at filmlance.se Tue Nov 6 11:25:57 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 11:25:57 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com> <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? 
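For reference on the Spectrum Scale firewall thread above: with the tscTcpPort and tscCmdPortRange values shown there, a plain firewalld layout on the NSD servers would look roughly like the sketch below (zone choice and any extra ports for GUI, perfmon or protocol nodes are site-specific and are covered by the knowledge center page linked in that thread):

# firewall-cmd --permanent --add-service=ssh
# firewall-cmd --permanent --add-port=1191/tcp
# firewall-cmd --permanent --add-port=60000-61000/tcp
# firewall-cmd --reload
# mmnetverify all --target-nodes rds-er-mgr01     # repeat the check once the rules are live

The bandwidth-node failures in the output above mention ports such as 46326 and 36277, well outside tscCmdPortRange, which is consistent with the mmnetverify bug IBM confirmed at the top of that thread.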
On 6 Nov 2018, at 11:08, Daniel Kidger > wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) > wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. 
Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
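Given the FAQ requirement quoted in this thread (Scale 5.0.2 or later, Secure Boot disabled on Windows 10), a two-line pre-flight check on the client can save a reinstall cycle. A hedged sketch, run from an elevated Cygwin shell on the Windows node; the behaviour of the cmdlet on legacy-BIOS machines is an assumption.

# Installed daemon level - must report 5.0.2 or later before a Windows 10 node will work:
mmdiag --version
# Secure Boot state - must come back False; on a legacy-BIOS box the cmdlet errors out,
# which is also acceptable since Secure Boot cannot be enabled there:
powershell -Command Confirm-SecureBootUEFI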
URL: From lgayne at us.ibm.com Tue Nov 6 14:02:48 2018 From: lgayne at us.ibm.com (Lyle Gayne) Date: Tue, 6 Nov 2018 09:02:48 -0500 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com><45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se> Message-ID: Yes, Henrik. For information on which OS levels are supported at which Spectrum Scale release levels, you should always consult our Spectrum Scale FAQ. This info is in Section 2 or 3 of the FAQ. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: gpfsug main discussion list Date: 11/06/2018 09:00 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ? \_(?)_/? On 6 Nov 2018, at 11:08, Daniel Kidger wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) < henrik.cednert at filmlance.se> wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. 
Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. 
Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From henrik.cednert at filmlance.se Tue Nov 6 14:12:27 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Tue, 6 Nov 2018 14:12:27 +0000 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com><45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> <911C9601-096C-455C-8916-96DB15B5A92E@filmlance.se>, Message-ID: Hello Ah yes, I never thought I was an issue since DDN sent me the v4 installer. Now I know better. Cheers -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? On 6 Nov 2018, at 15:02, Lyle Gayne > wrote: Yes, Henrik. For information on which OS levels are supported at which Spectrum Scale release levels, you should always consult our Spectrum Scale FAQ. This info is in Section 2 or 3 of the FAQ. Thanks, Lyle "Henrik Cednert (Filmlance)" ---11/06/2018 09:00:15 AM---Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to From: "Henrik Cednert (Filmlance)" > To: gpfsug main discussion list > Date: 11/06/2018 09:00 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thanks to all of you that on list and off list have replied. So it seems like we need to upgrade to V5 to be able to run Win10 clients. I have started the discussion with DDN. Not sure what?s included in maintenance and what?s not. -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? 
On 6 Nov 2018, at 11:08, Daniel Kidger > wrote: Henrik, Note too that Spectrum Scale 4.1.x being almost 4 years old is close to retirement : GA. 19-Jun-2015, 215-147 EOM. 19-Jan-2018, 917-114 EOS. 30-Apr-2019, 917-114 ref. https://www-01.ibm.com/software/support/lifecycleapp/PLCDetail.wss?q45=G222117W12805R88 Daniel _________________________________________________________ Daniel Kidger IBM Technical Sales Specialist Spectrum Scale, Spectrum NAS and IBM Cloud Object Store +44-(0)7818 522 266 daniel.kidger at uk.ibm.com ----- Original message ----- From: Vic Cornell > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug main discussion list > Cc: Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Date: Tue, Nov 6, 2018 9:35 AM Hi Cedric, Welcome to the mailing list! Looking at the Spectrum Scale FAQ: https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows Windows 10 support doesn't start until Spectrum Scale V5: " Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Looks like you have GPFS 4.1.0.4. If you want to upgrade to V5, please contact DDN support. I am also not sure if the DDN windows installer supports Win10 yet. (I work for DDN) Best Regards, Vic Cornell On 5 Nov 2018, at 20:25, Henrik Cednert (Filmlance) > wrote: Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. 
Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: graycol.gif URL: From vpaul at us.ibm.com Tue Nov 6 16:54:38 2018 From: vpaul at us.ibm.com (Vipul Paul) Date: Tue, 6 Nov 2018 08:54:38 -0800 Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. In-Reply-To: References: Message-ID: Hello Henrik, I see that you are trying GPFS 4.1.0.4 on Windows 10. This will not work. You need to upgrade to GPFS 5.0.2 as that is the first release that supports Windows 10. Please see the FAQ https://www.ibm.com/support/knowledgecenter/en/STXKQY/gpfsclustersfaq.html#windows "Starting with V5.0.2, IBM Spectrum Scale additionally supports Windows 10 (Pro and Enterprise editions) in both heterogeneous and homogeneous clusters. At this time, Secure Boot must be disabled on Windows 10 nodes for IBM Spectrum Scale to install and function." Thanks. -- Vipul Paul | IBM Spectrum Scale (GPFS) Development | vpaul at us.ibm.com | (503) 747-1389 (tie 997) From: Lyle Gayne/Poughkeepsie/IBM To: gpfsug main discussion list Cc: Vipul Paul/Portland/IBM, Heather J MacPherson/Beaverton/IBM at IBMUS Date: 11/06/2018 05:46 AM Subject: Re: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Vipul or Heather should be able to assist. Thanks, Lyle From: "Henrik Cednert (Filmlance)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/06/2018 07:00 AM Subject: [gpfsug-discuss] Problems with GPFS on windows 10, stuck at mount. Never mounts. Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi there The welcome mail said that a brief introduction might be in order. So let?s start with that, just jump over this paragraph if that?s not of interest. So, Henrik is the name. I?m the CTO at a Swedish film production company with our own internal post production operation. At the heart of that we have a 1.8PB DDN MediaScaler with GPFS in a mixed macOS and Windows environment. The mac clients just use NFS but the Windows clients use the native GPFS client. We have a couple of successful windows deployments. But? Now when we?re replacing one of those windows clients I?m running into some seriously frustrating ball busting shenanigans. Basically this client gets stuck on (from mmfs.log.latest) "PM: mounting /dev/ddnnas0? . Nothing more happens. Just stuck and it never mounts. I have reinstalled it multiple times with full clean between, removed cygwin and the root user account and everything. I have verified that keyless ssh works between server and all nodes, including this node. With my limited experience I don?t find enough to go on in the logs nor windows event viewer. I?m honestly totally stuck. I?m using the same version on this clients as on the others. DDN have their own installer which installs cygwin and the gpfs packages. 
Have worked fine on other clients but not on this sucker. =( Versions involved: Windows 10 Enterprise 2016 LTSB IBM GPFS Express Edition 4.1.0.4 IBM GPFS Express Edition License and Prerequisites 4.1 IBM GPFS GSKit 8.0.0.32 Below is the log from the client. I don? find much useful at the server, point me to specific log file if you have a good idea of where I can find errors of this. Cheers and many thanks in advance for helping me out here. I?m all ears. root at M5-CLIPSTER02 ~ $ cat /var/adm/ras/mmfs.log.latest Mon, Nov 5, 2018 8:39:18 PM: runmmfs starting Removing old /var/adm/ras/mmfs.log.* files: mmtrace: The tracefmt.exe or tracelog.exe command can not be found. mmtrace: 6027-1639 Command failed. Examine previous error messages to determine cause. Mon Nov 05 20:39:50.045 2018: GPFS: 6027-1550 [W] Error: Unable to establish a session with an Active Directory server. ID remapping via Microsoft Identity Management for Unix will be unavailable. Mon Nov 05 20:39:50.046 2018: [W] This node does not have a valid extended license Mon Nov 05 20:39:50.047 2018: GPFS: 6027-310 [I] mmfsd initializing. {Version: 4.1.0.4 Built: Oct 28 2014 16:28:19} ... Mon Nov 05 20:39:50.048 2018: [I] Cleaning old shared memory ... Mon Nov 05 20:39:50.049 2018: [I] First pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.001 2018: [I] Enabled automated deadlock detection. Mon Nov 05 20:39:51.002 2018: [I] Enabled automated deadlock debug data collection. Mon Nov 05 20:39:51.003 2018: [I] Initializing the main process ... Mon Nov 05 20:39:51.004 2018: [I] Second pass parsing mmfs.cfg ... Mon Nov 05 20:39:51.005 2018: [I] Initializing the page pool ... Mon Nov 05 20:40:00.003 2018: [I] Initializing the mailbox message system ... Mon Nov 05 20:40:00.004 2018: [I] Initializing encryption ... Mon Nov 05 20:40:00.005 2018: [I] Initializing the thread system ... Mon Nov 05 20:40:00.006 2018: [I] Creating threads ... Mon Nov 05 20:40:00.007 2018: [I] Initializing inter-node communication ... Mon Nov 05 20:40:00.008 2018: [I] Creating the main SDR server object ... Mon Nov 05 20:40:00.009 2018: [I] Initializing the sdrServ library ... Mon Nov 05 20:40:00.010 2018: [I] Initializing the ccrServ library ... Mon Nov 05 20:40:00.011 2018: [I] Initializing the cluster manager ... Mon Nov 05 20:40:25.016 2018: [I] Initializing the token manager ... Mon Nov 05 20:40:25.017 2018: [I] Initializing network shared disks ... Mon Nov 05 20:41:06.001 2018: [I] Start the ccrServ ... Mon Nov 05 20:41:07.008 2018: GPFS: 6027-1710 [N] Connecting to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.009 2018: GPFS: 6027-1711 [I] Connected to 192.168.45.200 DDN-0-0-gpfs Mon Nov 05 20:41:07.010 2018: GPFS: 6027-2750 [I] Node 192.168.45.200 (DDN-0-0-gpfs) is now the Group Leader. Mon Nov 05 20:41:08.000 2018: GPFS: 6027-300 [N] mmfsd ready Mon, Nov 5, 2018 8:41:10 PM: mmcommon mmfsup invoked. Parameters: 192.168.45.144 192.168.45.200 all Mon, Nov 5, 2018 8:41:29 PM: mounting /dev/ddnnas0 -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_( ?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. 
If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From henrik.cednert at filmlance.se Wed Nov 7 06:31:45 2018 From: henrik.cednert at filmlance.se (Henrik Cednert (Filmlance)) Date: Wed, 7 Nov 2018 06:31:45 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable Message-ID: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Hello I will try my luck here. Trying to monitor capacity on our gpfs system via observium. For some reason hrStorageTable doesn?t pick up that gpfs mount point though. In diskTable it?s visible but I cannot use diskTable when monitoring via observium, has to be hrStorageTable (I was told by observium dev). Output of a few snmpwalks and more at the bottom. Are there any obvious reasons for Centos 6.7 to not pick up a gpfs mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not sure if it?s even possible to in some way force it to include it in hrStorageTable?? Apologies if this isn?t the list for questions like this. But feels like there has to be one or two peeps here monitoring their systems here. =) All these commands ran on that host: df -h | grep ddnnas0 /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 mount | grep ddnnas0 /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) snmpwalk -v2c -c secret localhost hrStorageDescr HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy snmpwalk -v2c -c secret localhost dskPath UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 UCD-SNMP-MIB::dskPath.2 = STRING: / yum list | grep net-snmp Failed to set locale, defaulting to C net-snmp.x86_64 1:5.5-60.el6 @base net-snmp-libs.x86_64 1:5.5-60.el6 @base net-snmp-perl.x86_64 1:5.5-60.el6 @base net-snmp-utils.x86_64 1:5.5-60.el6 @base net-snmp-devel.i686 1:5.5-60.el6 base net-snmp-devel.x86_64 1:5.5-60.el6 base net-snmp-libs.i686 1:5.5-60.el6 base net-snmp-python.x86_64 1:5.5-60.el6 base Cheers and thanks -- Henrik Cednert / + 46 704 71 89 54 / CTO / Filmlance Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. ?\_(?)_/? Disclaimer The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful. 
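One observation on the walks above: the UCD disk table does already see /ddnnas0, so the used-percentage is retrievable today even though hrStorageTable skips the mount. A small sketch reusing the community string and index from the output above; note that the classic 32-bit kB counters (dskTotal/dskUsed) cannot represent 1.7 PB on this net-snmp 5.5 build, so the percentage is the value worth graphing.

# Index 1 is /ddnnas0 per dskPath.1 above; dskPercent is a plain percentage and does not
# overflow on very large filesystems:
snmpget -v2c -c secret localhost UCD-SNMP-MIB::dskPercent.1 UCD-SNMP-MIB::dskPath.1

Whether observium can be pointed at the UCD table instead of hrStorageTable is a question for its developers, but the raw number is at least exposed.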
-------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Wed Nov 7 08:13:04 2018 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Wed, 7 Nov 2018 08:13:04 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: An HTML attachment was scrubbed... URL: From janfrode at tanso.net Wed Nov 7 11:20:37 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 7 Nov 2018 12:20:37 +0100 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: Looking at the CHANGELOG for net-snmp, it seems it needs to know about each filesystem it's going to support, and I see no GPFS/mmfs. It has entries like: - Added simfs (OpenVZ filesystem) to hrStorageTable and hrFSTable. - Added CVFS (CentraVision File System) to hrStorageTable and - Added OCFS2 (Oracle Cluster FS) to hrStorageTable and hrFSTable - report gfs filesystems in hrStorageTable and hrFSTable. and also it didn't understand filesystems larger than 8 TB before version 5.7. I think your best option is to look at implementing the GPFS snmp agent agent https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_snmp.htm -- and see if it provides the data you need -- but it most likely won't affect the hrStorage table. And of course, please upgrade to something newer than v4.1.x. There's lots of improvements on monitoring in v4.2.3 and v5.x (but beware that v5 doesn't work with RHEL6). -jf On Wed, Nov 7, 2018 at 9:05 AM Henrik Cednert (Filmlance) < henrik.cednert at filmlance.se> wrote: > Hello > > I will try my luck here. Trying to monitor capacity on our gpfs system via > observium. For some reason hrStorageTable doesn?t pick up that gpfs mount > point though. In diskTable it?s visible but I cannot use diskTable when > monitoring via observium, has to be hrStorageTable (I was told by observium > dev). Output of a few snmpwalks and more at the bottom. > > Are there any obvious reasons for Centos 6.7 to not pick up a gpfs > mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not > sure if it?s even possible to in some way force it to include it in > hrStorageTable?? > > Apologies if this isn?t the list for questions like this. But feels like > there has to be one or two peeps here monitoring their systems here. 
=) > > > All these commands ran on that host: > > df -h | grep ddnnas0 > /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 > > > mount | grep ddnnas0 > /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) > > > snmpwalk -v2c -c secret localhost hrStorageDescr > HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory > HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory > HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers > HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory > HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space > HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / > HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm > HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot > HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy > HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash > HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy > HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var > HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy > > > snmpwalk -v2c -c secret localhost dskPath > UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 > UCD-SNMP-MIB::dskPath.2 = STRING: / > > > yum list | grep net-snmp > Failed to set locale, defaulting to C > net-snmp.x86_64 1:5.5-60.el6 > @base > net-snmp-libs.x86_64 1:5.5-60.el6 > @base > net-snmp-perl.x86_64 1:5.5-60.el6 > @base > net-snmp-utils.x86_64 1:5.5-60.el6 > @base > net-snmp-devel.i686 1:5.5-60.el6 > base > net-snmp-devel.x86_64 1:5.5-60.el6 > base > net-snmp-libs.i686 1:5.5-60.el6 > base > net-snmp-python.x86_64 1:5.5-60.el6 > base > > > Cheers and thanks > > -- > Henrik Cednert */ * + 46 704 71 89 54 */* CTO */ * *Filmlance* > Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. > ?\_(?)_/? > > *Disclaimer* > > The information contained in this communication from the sender is > confidential. It is intended solely for use by the recipient and others > authorized to receive it. If you are not the recipient, you are hereby > notified that any disclosure, copying, distribution or taking action in > relation of the contents of this information is strictly prohibited and may > be unlawful. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
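For anyone who wants to try the bundled SNMP route mentioned above before patching net-snmp, the rough sequence below is untested here: it assumes a stock Net-SNMP master agent on the chosen collector node, the node name is a placeholder, and the exact table carrying capacity should be checked against the shipped MIB.

# 1. Let the master snmpd accept AgentX subagent connections, then restart it:
echo "master agentx" >> /etc/snmp/snmpd.conf
service snmpd restart
# 2. Make the shipped GPFS MIB resolvable for snmpwalk and monitoring tools:
cp /usr/lpp/mmfs/data/GPFS-MIB.txt /usr/share/snmp/mibs/
# 3. Designate the collector node so GPFS starts its SNMP subagent there:
mmchnode --snmp-agent -N monitor-node-01
# 4. Sanity check; capacity figures live in the GPFS tables (file system / storage pool),
#    not in hrStorageTable:
snmpwalk -v2c -c secret -m +GPFS-MIB localhost GPFS-MIB::gpfsFileSystemStatusTable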
URL: From janfrode at tanso.net Wed Nov 7 11:29:11 2018 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 7 Nov 2018 12:29:11 +0100 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: Looks like this is all it should take to add GPFS support to net-snmp: $ git diff diff --git a/agent/mibgroup/hardware/fsys/fsys_mntent.c b/agent/mibgroup/hardware/fsys/fsys_mntent.c index 62e2953..4950879 100644 --- a/agent/mibgroup/hardware/fsys/fsys_mntent.c +++ b/agent/mibgroup/hardware/fsys/fsys_mntent.c @@ -136,6 +136,7 @@ _fsys_type( char *typename ) else if ( !strcmp(typename, MNTTYPE_TMPFS) || !strcmp(typename, MNTTYPE_GFS) || !strcmp(typename, MNTTYPE_GFS2) || + !strcmp(typename, MNTTYPE_GPFS) || !strcmp(typename, MNTTYPE_XFS) || !strcmp(typename, MNTTYPE_JFS) || !strcmp(typename, MNTTYPE_VXFS) || diff --git a/agent/mibgroup/hardware/fsys/mnttypes.h b/agent/mibgroup/hardware/fsys/mnttypes.h index bb1b401..d3f0c60 100644 --- a/agent/mibgroup/hardware/fsys/mnttypes.h +++ b/agent/mibgroup/hardware/fsys/mnttypes.h @@ -121,6 +121,9 @@ #ifndef MNTTYPE_GFS2 #define MNTTYPE_GFS2 "gfs2" #endif +#ifndef MNTTYPE_GPFS +#define MNTTYPE_GPFS "gpfs" +#endif #ifndef MNTTYPE_XFS #define MNTTYPE_XFS "xfs" #endif On Wed, Nov 7, 2018 at 12:20 PM Jan-Frode Myklebust wrote: > Looking at the CHANGELOG for net-snmp, it seems it needs to know about > each filesystem it's going to support, and I see no GPFS/mmfs. It has > entries like: > > - Added simfs (OpenVZ filesystem) to hrStorageTable and hrFSTable. > - Added CVFS (CentraVision File System) to hrStorageTable and > - Added OCFS2 (Oracle Cluster FS) to hrStorageTable and hrFSTable > - report gfs filesystems in hrStorageTable and hrFSTable. > > > and also it didn't understand filesystems larger than 8 TB before version > 5.7. > > I think your best option is to look at implementing the GPFS snmp agent > agent > https://www.ibm.com/support/knowledgecenter/en/STXKQY_5.0.2/com.ibm.spectrum.scale.v5r02.doc/bl1adv_snmp.htm > -- and see if it provides the data you need -- but it most likely won't > affect the hrStorage table. > > And of course, please upgrade to something newer than v4.1.x. There's lots > of improvements on monitoring in v4.2.3 and v5.x (but beware that v5 > doesn't work with RHEL6). > > > -jf > > On Wed, Nov 7, 2018 at 9:05 AM Henrik Cednert (Filmlance) < > henrik.cednert at filmlance.se> wrote: > >> Hello >> >> I will try my luck here. Trying to monitor capacity on our gpfs system >> via observium. For some reason hrStorageTable doesn?t pick up that gpfs >> mount point though. In diskTable it?s visible but I cannot use diskTable >> when monitoring via observium, has to be hrStorageTable (I was told by >> observium dev). Output of a few snmpwalks and more at the bottom. >> >> Are there any obvious reasons for Centos 6.7 to not pick up a gpfs >> mountpoint in hrStorageTable? I?m not that snmp, nor gpfs, savvy so not >> sure if it?s even possible to in some way force it to include it in >> hrStorageTable?? >> >> Apologies if this isn?t the list for questions like this. But feels like >> there has to be one or two peeps here monitoring their systems here. 
=) >> >> >> All these commands ran on that host: >> >> df -h | grep ddnnas0 >> /dev/ddnnas0 1.7P 913T 739T 56% /ddnnas0 >> >> >> mount | grep ddnnas0 >> /dev/ddnnas0 on /ddnnas0 type gpfs (rw,relatime,mtime,nfssync,dev=ddnnas0) >> >> >> snmpwalk -v2c -c secret localhost hrStorageDescr >> HOST-RESOURCES-MIB::hrStorageDescr.1 = STRING: Physical memory >> HOST-RESOURCES-MIB::hrStorageDescr.3 = STRING: Virtual memory >> HOST-RESOURCES-MIB::hrStorageDescr.6 = STRING: Memory buffers >> HOST-RESOURCES-MIB::hrStorageDescr.7 = STRING: Cached memory >> HOST-RESOURCES-MIB::hrStorageDescr.10 = STRING: Swap space >> HOST-RESOURCES-MIB::hrStorageDescr.31 = STRING: / >> HOST-RESOURCES-MIB::hrStorageDescr.35 = STRING: /dev/shm >> HOST-RESOURCES-MIB::hrStorageDescr.36 = STRING: /boot >> HOST-RESOURCES-MIB::hrStorageDescr.37 = STRING: /boot-rcvy >> HOST-RESOURCES-MIB::hrStorageDescr.38 = STRING: /crash >> HOST-RESOURCES-MIB::hrStorageDescr.39 = STRING: /rcvy >> HOST-RESOURCES-MIB::hrStorageDescr.40 = STRING: /var >> HOST-RESOURCES-MIB::hrStorageDescr.41 = STRING: /var-rcvy >> >> >> snmpwalk -v2c -c secret localhost dskPath >> UCD-SNMP-MIB::dskPath.1 = STRING: /ddnnas0 >> UCD-SNMP-MIB::dskPath.2 = STRING: / >> >> >> yum list | grep net-snmp >> Failed to set locale, defaulting to C >> net-snmp.x86_64 1:5.5-60.el6 >> @base >> net-snmp-libs.x86_64 1:5.5-60.el6 >> @base >> net-snmp-perl.x86_64 1:5.5-60.el6 >> @base >> net-snmp-utils.x86_64 1:5.5-60.el6 >> @base >> net-snmp-devel.i686 1:5.5-60.el6 >> base >> net-snmp-devel.x86_64 1:5.5-60.el6 >> base >> net-snmp-libs.i686 1:5.5-60.el6 >> base >> net-snmp-python.x86_64 1:5.5-60.el6 >> base >> >> >> Cheers and thanks >> >> -- >> Henrik Cednert */ * + 46 704 71 89 54 */* CTO */ * *Filmlance* >> Disclaimer, the hideous bs disclaimer at the bottom is forced, sorry. >> ?\_(?)_/? >> >> *Disclaimer* >> >> The information contained in this communication from the sender is >> confidential. It is intended solely for use by the recipient and others >> authorized to receive it. If you are not the recipient, you are hereby >> notified that any disclosure, copying, distribution or taking action in >> relation of the contents of this information is strictly prohibited and may >> be unlawful. >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Wed Nov 7 13:02:32 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Wed, 7 Nov 2018 13:02:32 +0000 Subject: [gpfsug-discuss] gpfs mount point not visible in snmp hrStorageTable In-Reply-To: References: <2F3636D5-3201-49D1-BF0B-CB07F6D01986@filmlance.se> Message-ID: <5da641cc-a171-2d9b-f917-a4470279237f@strath.ac.uk> On 07/11/2018 11:20, Jan-Frode Myklebust wrote: [SNIP] > > And of course, please upgrade to something newer than v4.1.x. There's > lots of improvements on monitoring in v4.2.3 and v5.x (but beware that > v5 doesn't work with RHEL6). > I would suggest that getting off CentOS 6.7 to more recent release should also be a priority. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From aaron.s.knister at nasa.gov Wed Nov 7 23:37:37 2018 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 7 Nov 2018 18:37:37 -0500 Subject: [gpfsug-discuss] Unexpected data in message/Bad message Message-ID: We're experiencing client nodes falling out of the cluster with errors that look like this: ?Tue Nov 6 15:10:34.939 2018: [E] Unexpected data in message. Header dump: 00000000 0000 0000 00000047 00000000 00 00 0000 00000000 00000000 0000 0000 Tue Nov 6 15:10:34.942 2018: [E] [0/0] 512 more bytes were available: Tue Nov 6 15:10:34.965 2018: [N] Close connection to 10.100.X.X nsdserver1 (Unexpected error 120) Tue Nov 6 15:10:34.966 2018: [E] Network error on 10.100.X.X nsdserver1 , Check connectivity Tue Nov 6 15:10:36.726 2018: [N] Restarting mmsdrserv Tue Nov 6 15:10:38.850 2018: [E] Bad message Tue Nov 6 15:10:38.851 2018: [X] The mmfs daemon is shutting down abnormally. Tue Nov 6 15:10:38.852 2018: [N] mmfsd is shutting down. Tue Nov 6 15:10:38.853 2018: [N] Reason for shutdown: LOGSHUTDOWN called The cluster is running various PTF Levels of 4.1.1. Has anyone seen this before? I'm struggling to understand what it means from a technical point of view. Was GPFS expecting a larger message than it received? Did it receive all of the bytes it expected and some of it was corrupt? It says "512 more bytes were available" but then doesn't show any additional bytes. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Robert.Oesterlin at nuance.com Thu Nov 8 20:40:05 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 8 Nov 2018 20:40:05 +0000 Subject: [gpfsug-discuss] SSUG @ SC18 - Location details Message-ID: <07491620-1D44-4A9C-9C92-A7DA634304CE@nuance.com> Location: Omni Dallas Hotel 555 S Lamar Dallas, Texas 75202 United States The Omni is connected to Kay Bailey Convention Center via skybridge on 2nd Floor. Dallas Ballroom A - 3rd Floor IBM Spectrum Scale User Group Meeting Sunday, November 11, 2018 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From heiner.billich at psi.ch Fri Nov 9 12:46:31 2018 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Fri, 9 Nov 2018 12:46:31 +0000 Subject: [gpfsug-discuss] CES - samba - how can I disable shadow_copy2, i.e. snapshots Message-ID: Hello, we run CES with smbd on a filesystem _without_ snapshots. I would like to completely remove the shadow_copy2 vfs object in samba which exposes the snapshots to windows clients: We don't offer snapshots as service to clients and if I create a snapshot I don't want it to be exposed to clients. I'm also not sure how much additional directory traversals this vfs object causes, shadow_copy2 has to search for the snapshot directories again and again, just to learn that there are no snapshots available. Now the file samba_registry.def (/usr/lpp/mmfs/share/samba/samba_registry.def) doesn't allow to change the settings for shadow_config2 in samba's configuration. Hm, is it o.k. to edit samba_registry.def? That's probably not what IBM intended. But with mmsnapdir I can change the name of the snapshot directories, which would require me to edit the locked settings, too, so it seems a bit restrictive. I didn?t search all documentation, if there is an option do disable shadow_copy2 with some command I would be happy to learn. Any comments or ideas are welcome. 
Also if you think I should just create a bogus .snapdirs at root level to get rid of the error messages and that's it, please let me know. we run scale 5.0.1-1 on RHEL4 x86_64. We will upgrade to 5.0.2-1 soon, but I didn?t' check that version yet. Cheers Heiner Billich What I would like to change in samba's configuration: 52c52 < vfs objects = syncops gpfs fileid time_audit --- > vfs objects = shadow_copy2 syncops gpfs fileid time_audit 72a73,76 > shadow:snapdir = .snapshots > shadow:fixinodes = yes > shadow:snapdirseverywhere = yes > shadow:sort = desc -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch From jonathan.buzzard at strath.ac.uk Fri Nov 9 13:26:50 2018 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 9 Nov 2018 13:26:50 +0000 Subject: [gpfsug-discuss] CES - samba - how can I disable shadow_copy2, i.e. snapshots In-Reply-To: References: Message-ID: <82e1aa3d-566a-3a4a-e841-9f92f30546c6@strath.ac.uk> On 09/11/2018 12:46, Billich Heinrich Rainer (PSI) wrote: > Hello, > > we run CES with smbd on a filesystem _without_ snapshots. I would > like to completely remove the shadow_copy2 vfs object in samba which > exposes the snapshots to windows clients: > > We don't offer snapshots as service to clients and if I create a > snapshot I don't want it to be exposed to clients. I'm also not sure > how much additional directory traversals this vfs object causes, > shadow_copy2 has to search for the snapshot directories again and > again, just to learn that there are no snapshots available. > The shadow_copy2 VFS only exposes snapshots to clients if they are in a very specific format. The chances of you doing this with "management" snapshots you are creating are about ?, assuming you are using the command line. If you are using the GUI then all bets are off. Perhaps someone with experience of the GUI can add their wisdom here. The VFS even if loaded will only create I/O on the server if the client clicks on previous versions tab in Windows Explorer. Given that you don't offer previous version snapshots, then there will be very little of this going on and even if they do then the initial amount of I/O will be limited to basically the equivalent of an `ls` in the shadow copy snapshot directory. So absolutely nothing to get worked up about. With the proviso about doing snapshots from the GUI (never used the new GUI in GPFS, only played with the old one, and don't trust IBM to change it again) you are completely overthinking this. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Robert.Oesterlin at nuance.com Fri Nov 9 14:07:01 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 9 Nov 2018 14:07:01 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Message-ID: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. 
Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Nov 9 14:13:19 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 9 Nov 2018 09:13:19 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- Stephen > On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: > > Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. > > ESS 5.3.1.1 > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 17.0 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Fri Nov 9 16:02:12 2018 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 9 Nov 2018 11:02:12 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 ("4.2.3" may not get recognized by the command) The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Stephen Ulmer To: gpfsug main discussion list Date: 11/09/2018 09:13 AM Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Sent by: gpfsug-discuss-bounces at spectrumscale.org It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- Stephen On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. 
ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From ulmer at ulmer.org Fri Nov 9 16:08:17 2018 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 9 Nov 2018 11:08:17 -0500 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: <8EDF9168-698B-4EA7-9D2A-F33D7B8AF265@ulmer.org> You rock. -- Stephen > On Nov 9, 2018, at 11:02 AM, Felipe Knop > wrote: > > Stephen, Bob, > > A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: > 4.2.3.0 and 4.2.3.9 > > ("4.2.3" may not get recognized by the command) > > The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) > > Felipe > > ---- > Felipe Knop knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 T/L 293-9314 > > > > Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- > > From: Stephen Ulmer > > To: gpfsug main discussion list > > Date: 11/09/2018 09:13 AM > Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > It had better work ? I?m literally going to be doing exactly the same thing in two weeks? > > -- > Stephen > > > On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: > > Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. > > ESS 5.3.1.1 > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 > mmvdisk: Creating file system 'test'. > mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 > mmvdisk: Creating file system 'test'. 
> mmvdisk: (mmcrfs) Incorrect option: --version 17.0 > mmvdisk: Error creating file system. > mmvdisk: Command failed. Examine previous error messages to determine cause. > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Nov 9 16:11:05 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 9 Nov 2018 16:11:05 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Message-ID: <6951D57C-3714-4F26-A7AD-92B8D79501EC@nuance.com> That did it, thanks. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Felipe Knop Reply-To: gpfsug main discussion list Date: Friday, November 9, 2018 at 10:04 AM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Nov 9 16:12:02 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 9 Nov 2018 16:12:02 +0000 Subject: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails In-Reply-To: References: Message-ID: Looking in ?mmprodname? looks like if you wanted to use 17, it would be 1700 (for 1709 based on what Felipe mentions below). (I wonder what 99.0.0.0 does ?) Simon From: on behalf of "knop at us.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Friday, 9 November 2018 at 16:02 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Stephen, Bob, A colleague has suggested that the version number should be of the form (of 4-number) V.R.M.F. I believe two values are available for 4.2.3: 4.2.3.0 and 4.2.3.9 ("4.2.3" may not get recognized by the command) The latter would be recommended since it includes a FS format 'tweak' to allow GPFS to fix a complex deadlock. (I'm searching for a publication on that) Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 [Inactive hide details for Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exac]Stephen Ulmer ---11/09/2018 09:13:32 AM---It had better work ? I?m literally going to be doing exactly the same thing in two weeks? -- From: Stephen Ulmer To: gpfsug main discussion list Date: 11/09/2018 09:13 AM Subject: Re: [gpfsug-discuss] ESS: mmvdisk filesystem with "--version" fails Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ It had better work ? I?m literally going to be doing exactly the same thing in two weeks? 
-- Stephen On Nov 9, 2018, at 9:07 AM, Oesterlin, Robert > wrote: Why doesn?t this work? I want to create a file system with an older version level that my remote cluster running 4.2.3 can use. ESS 5.3.1.1 [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 4.2.3 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 4.2.3 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. [root at ess01node1 ~]# mmvdisk filesystem create --file-system test --vdisk-set test1 --mmcrfs -A yes -Q yes -T /gpfs/test --version 17.0 mmvdisk: Creating file system 'test'. mmvdisk: (mmcrfs) Incorrect option: --version 17.0 mmvdisk: Error creating file system. mmvdisk: Command failed. Examine previous error messages to determine cause. Bob Oesterlin Sr Principal Storage Engineer, Nuance _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 106 bytes Desc: image001.gif URL: From bzhang at ca.ibm.com Fri Nov 9 16:37:08 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Fri, 9 Nov 2018 11:37:08 -0500 Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: Hi all, We are going to host our next technical webinar. everyone is welcome to register and attend. Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? ?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B383659.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B754293.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B720798.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B162231.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B680907.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1B309846.gif Type: image/gif Size: 1243 bytes Desc: not available URL: From jonbernard at gmail.com Sat Nov 10 20:37:35 2018 From: jonbernard at gmail.com (Jon Bernard) Date: Sat, 10 Nov 2018 14:37:35 -0600 Subject: [gpfsug-discuss] If you're attending KubeCon'18 In-Reply-To: References: Message-ID: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov Folks, > > Please let me know if anyone is attending KubeCon'18 in Seattle this > December (via private e-mail). We will be there and would like to meet in > person with people that already use or consider using > Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, > problems, visions. > > P.S. If you are not attending KubeCon, but are interested in the topic, > shoot me an e-mail anyway. > > Best, > -- > Vasily Tarasov, > Research Staff Member, > Storage Systems Research, > IBM Research - Almaden > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Sun Nov 11 18:07:17 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Sun, 11 Nov 2018 13:07:17 -0500 Subject: [gpfsug-discuss] Unexpected data in message/Bad message In-Reply-To: References: Message-ID: Hi Aaron, The header dump shows all zeroes were received for the header. So no valid magic, version, originator, etc. The "512 more bytes" would have been the meat after the header. Very unexpected hence the shutdown. Logs around that event involving the machines noted in that trace would be required to evaluate further. This is not common. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. 
The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 11/07/2018 06:38 PM Subject: [gpfsug-discuss] Unexpected data in message/Bad message Sent by: gpfsug-discuss-bounces at spectrumscale.org We're experiencing client nodes falling out of the cluster with errors that look like this: Tue Nov 6 15:10:34.939 2018: [E] Unexpected data in message. Header dump: 00000000 0000 0000 00000047 00000000 00 00 0000 00000000 00000000 0000 0000 Tue Nov 6 15:10:34.942 2018: [E] [0/0] 512 more bytes were available: Tue Nov 6 15:10:34.965 2018: [N] Close connection to 10.100.X.X nsdserver1 (Unexpected error 120) Tue Nov 6 15:10:34.966 2018: [E] Network error on 10.100.X.X nsdserver1 , Check connectivity Tue Nov 6 15:10:36.726 2018: [N] Restarting mmsdrserv Tue Nov 6 15:10:38.850 2018: [E] Bad message Tue Nov 6 15:10:38.851 2018: [X] The mmfs daemon is shutting down abnormally. Tue Nov 6 15:10:38.852 2018: [N] mmfsd is shutting down. Tue Nov 6 15:10:38.853 2018: [N] Reason for shutdown: LOGSHUTDOWN called The cluster is running various PTF Levels of 4.1.1. Has anyone seen this before? I'm struggling to understand what it means from a technical point of view. Was GPFS expecting a larger message than it received? Did it receive all of the bytes it expected and some of it was corrupt? It says "512 more bytes were available" but then doesn't show any additional bytes. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From madhu at corehive.com Sun Nov 11 19:58:27 2018 From: madhu at corehive.com (Madhu Konidena) Date: Sun, 11 Nov 2018 14:58:27 -0500 Subject: [gpfsug-discuss] If you're attending KubeCon'18 In-Reply-To: References: Message-ID: <2a3e90be-92bd-489d-a9bc-c1f6b6eae5de@corehive.com> I will be there at both. Please stop by our booth at SC18 for a quick chat. ? Madhu Konidena Madhu at CoreHive.com? On Nov 10, 2018, at 3:37 PM, Jon Bernard wrote: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: >Hi Vasily, > >I will be at Kubecon with colleagues from Tower Research Capital (and >at >SC). We have a few hundred nodes across several Kubernetes clusters, >most >of them mounting Scale from the host. > >Jon > >On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: > >> Folks, >> >> Please let me know if anyone is attending KubeCon'18 in Seattle this >> December (via private e-mail). We will be there and would like to >meet in >> person with people that already use or consider using >> Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, >> problems, visions. >> >> P.S. If you are not attending KubeCon, but are interested in the >topic, >> shoot me an e-mail anyway. 
>> >> Best, >> -- >> Vasily Tarasov, >> Research Staff Member, >> Storage Systems Research, >> IBM Research - Almaden >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > >------------------------------------------------------------------------ > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 102.png Type: image/png Size: 18340 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: sc18-campaignasset_10.png Type: image/png Size: 354312 bytes Desc: not available URL: From heiner.billich at psi.ch Wed Nov 14 16:20:12 2018 From: heiner.billich at psi.ch (Billich Heinrich Rainer (PSI)) Date: Wed, 14 Nov 2018 16:20:12 +0000 Subject: [gpfsug-discuss] CES - suspend a node and don't start smb/nfs at mmstartup/boot Message-ID: Hello, how can I prevent smb, ctdb, nfs (and object) to start when I reboot the node or restart gpfs on a suspended ces node? Being able to do this would make updates much easier With # mmces node suspend ?stop I can move all IPs to other CES nodes and stop all CES services, what also releases the ces-shared-root-directory and allows to unmount the underlying filesystem. But after a reboot/restart only the IPs stay on the on the other nodes, the CES services start up. Hm, sometimes I would very much prefer the services to stay down as long as the nodes is suspended and to keep the node out of the CES cluster as much as possible. I did not try rough things like just renaming smbd, this seems likely to create unwanted issues. Thank you, Cheers, Heiner Billich -- Paul Scherrer Institut Heiner Billich System Engineer Scientific Computing Science IT / High Performance Computing WHGA/106 Forschungsstrasse 111 5232 Villigen PSI Switzerland Phone +41 56 310 36 02 heiner.billich at psi.ch https://www.psi.ch From: on behalf of Madhu Konidena Reply-To: gpfsug main discussion list Date: Sunday 11 November 2018 at 22:06 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] If you're attending KubeCon'18 I will be there at both. Please stop by our booth at SC18 for a quick chat. Madhu Konidena [cid:ii_d4d3894a4c2f4773] Madhu at CoreHive.com On Nov 10, 2018, at 3:37 PM, Jon Bernard > wrote: Hi Vasily, I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. Jon On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: Folks, Please let me know if anyone is attending KubeCon'18 in Seattle this December (via private e-mail). We will be there and would like to meet in person with people that already use or consider using Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, problems, visions. P.S. If you are not attending KubeCon, but are interested in the topic, shoot me an e-mail anyway. 
Best, -- Vasily Tarasov, Research Staff Member, Storage Systems Research, IBM Research - Almaden _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 18341 bytes Desc: image001.png URL: From skylar2 at uw.edu Wed Nov 14 16:27:31 2018 From: skylar2 at uw.edu (Skylar Thompson) Date: Wed, 14 Nov 2018 16:27:31 +0000 Subject: [gpfsug-discuss] CES - suspend a node and don't start smb/nfs at mmstartup/boot In-Reply-To: References: Message-ID: <20181114162731.a7etjs4g3gftgsyv@utumno.gs.washington.edu> Hi Heiner, Try doing "mmces service stop -N " and/or "mmces service disable -N ". You'll definitely want the node suspended first, since I don't think the service commands do an address migration first. On Wed, Nov 14, 2018 at 04:20:12PM +0000, Billich Heinrich Rainer (PSI) wrote: > Hello, > > how can I prevent smb, ctdb, nfs (and object) to start when I reboot the node or restart gpfs on a suspended ces node? Being able to do this would make updates much easier > > With > > # mmces node suspend ???stop > > I can move all IPs to other CES nodes and stop all CES services, what also releases the ces-shared-root-directory and allows to unmount the underlying filesystem. > But after a reboot/restart only the IPs stay on the on the other nodes, the CES services start up. Hm, sometimes I would very much prefer the services to stay down as long as the nodes is suspended and to keep the node out of the CES cluster as much as possible. > > I did not try rough things like just renaming smbd, this seems likely to create unwanted issues. > > Thank you, > > Cheers, > > Heiner Billich > -- > Paul Scherrer Institut > Heiner Billich > System Engineer Scientific Computing > Science IT / High Performance Computing > WHGA/106 > Forschungsstrasse 111 > 5232 Villigen PSI > Switzerland > > Phone +41 56 310 36 02 > heiner.billich at psi.ch > https://www.psi.ch > > > > From: on behalf of Madhu Konidena > Reply-To: gpfsug main discussion list > Date: Sunday 11 November 2018 at 22:06 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] If you're attending KubeCon'18 > > I will be there at both. Please stop by our booth at SC18 for a quick chat. > > Madhu Konidena > [cid:ii_d4d3894a4c2f4773] > Madhu at CoreHive.com > > > > On Nov 10, 2018, at 3:37 PM, Jon Bernard > wrote: > Hi Vasily, > I will be at Kubecon with colleagues from Tower Research Capital (and at SC). We have a few hundred nodes across several Kubernetes clusters, most of them mounting Scale from the host. > Jon > On Fri, Oct 26, 2018, 5:58 PM Vasily Tarasov wrote: > Folks, Please let me know if anyone is attending KubeCon'18 in Seattle this December (via private e-mail). We will be there and would like to meet in person with people that already use or consider using Kubernetes/Swarm/Mesos with Scale. The goal is to share experiences, problems, visions. P.S. If you are not attending KubeCon, but are interested in the topic, shoot me an e-mail anyway. 
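Putting Skylar's suggestion together with Heiner's starting point, the sequence would look roughly like the sketch below. "cesnode1" is a placeholder, the "?stop" in Heiner's mail is presumably a mangled "--stop", and the exact option syntax should be checked against the mmces man page for your release:

# drain the node: move its IPs away, then stop its protocol services
mmces node suspend -N cesnode1 --stop

# or stop individual services on just that node (Skylar also mentions
# "mmces service disable" if they should stay off)
mmces service stop nfs -N cesnode1
mmces service stop smb -N cesnode1

# after maintenance, bring it back
mmces service start nfs -N cesnode1
mmces service start smb -N cesnode1
mmces node resume -N cesnode1

As Skylar notes, suspend first: he doesn't think the service stop/start commands migrate addresses on their own.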
Best, -- Vasily Tarasov, Research Staff Member, Storage Systems Research, IBM Research - Almaden > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From novosirj at rutgers.edu Wed Nov 14 15:28:31 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Wed, 14 Nov 2018 15:28:31 +0000 Subject: [gpfsug-discuss] GSS Software Release? Message-ID: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- From scale at us.ibm.com Thu Nov 15 13:26:18 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Thu, 15 Nov 2018 08:26:18 -0500 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> Message-ID: AFAIK GSS/DSS are handled by Lenovo not IBM so you would need to contact them for release plans. I do not know which version of GPFS was included in GSS 3.3a but I can tell you that GPFS 3.5 is out of service and GPFS 4.1.x will be end of service in April 2019. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Ryan Novosielski To: "gpfsug-discuss at spectrumscale.org" Date: 11/15/2018 12:03 AM Subject: [gpfsug-discuss] GSS Software Release? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From carlz at us.ibm.com Thu Nov 15 14:01:28 2018 From: carlz at us.ibm.com (Carl Zetie) Date: Thu, 15 Nov 2018 14:01:28 +0000 Subject: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) Message-ID: >any idea when a newer GSS software release than 3.3a will be released? That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. Carl Zetie Program Director Offering Management for Spectrum Scale, IBM ---- (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com From alvise.dorigo at psi.ch Thu Nov 15 15:22:25 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Thu, 15 Nov 2018 15:22:25 +0000 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon Message-ID: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> Hello, I'm using mmperfmon to get writing stats on NSD during a write activity on a GPFS filesystem (Lenovo system with dss-g-2.0a). I use this command: # mmperfmon query 'sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --number-buckets 48 -b 1 to get the stats. 
What it returns is a list of valid values followed by a longer list of 'null' as shown below: Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp gpfs_nsdfs_bytes_written 1 2018-11-15-16:15:57 746586112 2 2018-11-15-16:15:58 704643072 3 2018-11-15-16:15:59 805306368 4 2018-11-15-16:16:00 754974720 5 2018-11-15-16:16:01 754974720 6 2018-11-15-16:16:02 763363328 7 2018-11-15-16:16:03 746586112 8 2018-11-15-16:16:04 746848256 9 2018-11-15-16:16:05 780140544 10 2018-11-15-16:16:06 679923712 11 2018-11-15-16:16:07 746618880 12 2018-11-15-16:16:08 780140544 13 2018-11-15-16:16:09 746586112 14 2018-11-15-16:16:10 763363328 15 2018-11-15-16:16:11 780173312 16 2018-11-15-16:16:12 721420288 17 2018-11-15-16:16:13 796917760 18 2018-11-15-16:16:14 763363328 19 2018-11-15-16:16:15 738197504 20 2018-11-15-16:16:16 738197504 21 2018-11-15-16:16:17 null 22 2018-11-15-16:16:18 null 23 2018-11-15-16:16:19 null 24 2018-11-15-16:16:20 null 25 2018-11-15-16:16:21 null 26 2018-11-15-16:16:22 null 27 2018-11-15-16:16:23 null 28 2018-11-15-16:16:24 null 29 2018-11-15-16:16:25 null 30 2018-11-15-16:16:26 null 31 2018-11-15-16:16:27 null 32 2018-11-15-16:16:28 null 33 2018-11-15-16:16:29 null 34 2018-11-15-16:16:30 null 35 2018-11-15-16:16:31 null 36 2018-11-15-16:16:32 null 37 2018-11-15-16:16:33 null 38 2018-11-15-16:16:34 null 39 2018-11-15-16:16:35 null 40 2018-11-15-16:16:36 null 41 2018-11-15-16:16:37 null 42 2018-11-15-16:16:38 null 43 2018-11-15-16:16:39 null 44 2018-11-15-16:16:40 null 45 2018-11-15-16:16:41 null 46 2018-11-15-16:16:42 null 47 2018-11-15-16:16:43 null 48 2018-11-15-16:16:44 null If I run again and again I still get the same pattern: valid data (even 0 in case of not write activity) followed by more null data. Is that normal ? If not, is there a way to get only non-null data by fine-tuning pmcollector's configuration file ? The corresponding ZiMon sensor (GPFSNSDFS) have period=1. The ZiMon version is 4.2.3-7. Thanks, Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From aposthuma at lenovo.com Thu Nov 15 15:56:44 2018 From: aposthuma at lenovo.com (Andre Posthuma) Date: Thu, 15 Nov 2018 15:56:44 +0000 Subject: [gpfsug-discuss] [External] Re: GSS Software Release? (Ryan Novosielski) In-Reply-To: References: Message-ID: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> Hello, GSS 3.3b was released last week, with a number of Spectrum Scale versions available : 5.0.1.2 4.2.3.11 4.1.1.20 DSS-G 2.2a was released yesterday, with 2 Spectrum Scale versions available : 5.0.2.1 4.2.3.11 Best Regards Andre Posthuma IT Specialist HPC Services Lenovo United Kingdom +44 7841782363 aposthuma at lenovo.com ? Lenovo.com Twitter | Facebook | Instagram | Blogs | Forums -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Carl Zetie Sent: Thursday, November 15, 2018 2:01 PM To: gpfsug-discuss at spectrumscale.org Subject: [External] Re: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) >any idea when a newer GSS software release than 3.3a will be released? That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. 
Carl Zetie Program Director Offering Management for Spectrum Scale, IBM ---- (540) 882 9353 ][ Research Triangle Park carlz at us.ibm.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From matthew.robinson02 at gmail.com Thu Nov 15 17:53:14 2018 From: matthew.robinson02 at gmail.com (Matthew Robinson) Date: Thu, 15 Nov 2018 12:53:14 -0500 Subject: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) In-Reply-To: References: Message-ID: Hi Ryan, As an Ex-GSS PE guy for Lenovo, a new GSS update could almost be expected every 3-4 months in a year. I would not be surprised if Lenovo GSS-DSS development started to not update the GSS solution and only focused on DSS updates. That is just my best guess from this point. I agree with Carl this should be a quick open and close case for the Lenovo product engineer that still works on the GSS solution. Kind regards, MattRob On Thu, Nov 15, 2018 at 9:02 AM Carl Zetie wrote: > > >any idea when a newer GSS software release than 3.3a will be released? > > That is definitely a question only our friends at Lenovo can answer. If > you don't get a response here (I'm not sure if any Lenovites are active on > the list), you'll need to address it directly to Lenovo, e.g. your account > team. > > > Carl Zetie > Program Director > Offering Management for Spectrum Scale, IBM > ---- > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Matthew Robinson Comptia A+, Net+ 919.909.0494 matthew.robinson02 at gmail.com The greatest discovery of my generation is that man can alter his life simply by altering his attitude of mind. - William James, Harvard Psychologist. -------------- next part -------------- An HTML attachment was scrubbed... URL: From andy_kurth at ncsu.edu Thu Nov 15 18:28:46 2018 From: andy_kurth at ncsu.edu (Andy Kurth) Date: Thu, 15 Nov 2018 13:28:46 -0500 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu> Message-ID: Public information on GSS updates seems nonexistent. You can find some clues if you have access to Lenovo's restricted download site . It looks like gss3.3b was released in late September. There are gss3.3b download options that include either 4.2.3-9 or 4.1.1-20. Earlier this month they also released some GPFS-only updates for 4.3.2-11 and 5.0.1-2. It looks like these are meant to be applied on top of gss3.3b. For DSS-G, it looks like dss-g-2.2a is the latest full release with options that include 4.2.3-11 or 5.0.2-1. There are also separate DSS-G GPFS-only updates for 4.2.3-11 and 5.0.1-2. Regards, Andy Kurth / NCSU On Thu, Nov 15, 2018 at 12:01 AM Ryan Novosielski wrote: > -----BEGIN PGP SIGNED MESSAGE----- > Hash: SHA1 > > I know this might not be the perfect venue, but I know IBM developers > participate and will occasionally share this sort of thing: any idea > when a newer GSS software release than 3.3a will be released? We are > attempting to plan our maintenance schedule. At the moment, the DSS-G > software seems to be getting updated and we'd prefer to remain at the > same GPFS release on DSS-G and GSS. 
> > - -- > ____ > || \\UTGERS, |----------------------*O*------------------------ > ||_// the State | Ryan Novosielski - novosirj at rutgers.edu > || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus > || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark > `' > -----BEGIN PGP SIGNATURE----- > > iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl > qvcAn2ql2U8peuGuhptTIejVgnDFSWEf > =7Iue > -----END PGP SIGNATURE----- > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- *Andy Kurth* Research Storage Specialist NC State University Office of Information Technology P: 919-513-4090 311A Hillsborough Building Campus Box 7109 Raleigh, NC 27695 -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Nov 15 20:35:29 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 15 Nov 2018 21:35:29 +0100 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon In-Reply-To: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch> Message-ID: An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Nov 16 02:22:55 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 16 Nov 2018 02:22:55 +0000 Subject: [gpfsug-discuss] Presentations - User Group Meeting at SC18 Message-ID: <917D0EB2-BE2C-4445-AE12-B68DA3D2B6F1@nuance.com> I?ve uploaded the first batch of presentation to the spectrumscale.org site - More coming once I receive them. https://www.spectrumscaleug.org/presentations/2018/ Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From LNakata at SLAC.STANFORD.EDU Fri Nov 16 03:06:46 2018 From: LNakata at SLAC.STANFORD.EDU (Lance Nakata) Date: Thu, 15 Nov 2018 19:06:46 -0800 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Message-ID: <20181116030646.GA28141@slac.stanford.edu> We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. 
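For reference, the array layout described above (10 data + 1 parity + 1 hot spare per dozen drives) would be built with something along these lines; the NVMe device names are illustrative and the chunk size is left at the mdadm default:

# 12 NVMe devices -> RAID 5 with 11 active members and 1 hot spare
mdadm --create /dev/md101 --level=5 --raid-devices=11 --spare-devices=1 /dev/nvme{0..11}n1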
The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory From Greg.Lehmann at csiro.au Fri Nov 16 03:46:01 2018 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Fri, 16 Nov 2018 03:46:01 +0000 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: <20181116030646.GA28141@slac.stanford.edu> References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: <6be5c9834bc747b7b7145e884f98caa2@exch1-cdc.nexus.csiro.au> Hi Lance, We are doing it with beegfs (mdadm and NVMe drives in the same HW.) For GPFS have you updated the nsddevices sample script to look at the mdadm devices and put it in /var/mmfs/etc? BTW I'm interested to see how you go with that configuration. Cheers, Greg -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Lance Nakata Sent: Friday, November 16, 2018 1:07 PM To: gpfsug-discuss at spectrumscale.org Cc: Jon L. Bergman Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. 
The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From alvise.dorigo at psi.ch Fri Nov 16 08:29:46 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Fri, 16 Nov 2018 08:29:46 +0000 Subject: [gpfsug-discuss] Wrong behavior of mmperfmon In-Reply-To: References: <83A6EEB0EC738F459A39439733AE80452679A021@MBX214.d.ethz.ch>, Message-ID: <83A6EEB0EC738F459A39439733AE80452679A101@MBX214.d.ethz.ch> Indeed, I just realized that after last recent update to dssg-2.0a ntpd is crashing very frequently. Thanks for the hint. Alvise ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Olaf Weiser [olaf.weiser at de.ibm.com] Sent: Thursday, November 15, 2018 9:35 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Wrong behavior of mmperfmon ntp running / time correct ? From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/15/2018 04:30 PM Subject: [gpfsug-discuss] Wrong behavior of mmperfmon Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, I'm using mmperfmon to get writing stats on NSD during a write activity on a GPFS filesystem (Lenovo system with dss-g-2.0a). I use this command: # mmperfmon query 'sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written' --number-buckets 48 -b 1 to get the stats. 
What it returns is a list of valid values followed by a longer list of 'null' as shown below: Legend: 1: sf-dssio-1.psi.ch|GPFSNSDFS|RAW|gpfs_nsdfs_bytes_written Row Timestamp gpfs_nsdfs_bytes_written 1 2018-11-15-16:15:57 746586112 2 2018-11-15-16:15:58 704643072 3 2018-11-15-16:15:59 805306368 4 2018-11-15-16:16:00 754974720 5 2018-11-15-16:16:01 754974720 6 2018-11-15-16:16:02 763363328 7 2018-11-15-16:16:03 746586112 8 2018-11-15-16:16:04 746848256 9 2018-11-15-16:16:05 780140544 10 2018-11-15-16:16:06 679923712 11 2018-11-15-16:16:07 746618880 12 2018-11-15-16:16:08 780140544 13 2018-11-15-16:16:09 746586112 14 2018-11-15-16:16:10 763363328 15 2018-11-15-16:16:11 780173312 16 2018-11-15-16:16:12 721420288 17 2018-11-15-16:16:13 796917760 18 2018-11-15-16:16:14 763363328 19 2018-11-15-16:16:15 738197504 20 2018-11-15-16:16:16 738197504 21 2018-11-15-16:16:17 null 22 2018-11-15-16:16:18 null 23 2018-11-15-16:16:19 null 24 2018-11-15-16:16:20 null 25 2018-11-15-16:16:21 null 26 2018-11-15-16:16:22 null 27 2018-11-15-16:16:23 null 28 2018-11-15-16:16:24 null 29 2018-11-15-16:16:25 null 30 2018-11-15-16:16:26 null 31 2018-11-15-16:16:27 null 32 2018-11-15-16:16:28 null 33 2018-11-15-16:16:29 null 34 2018-11-15-16:16:30 null 35 2018-11-15-16:16:31 null 36 2018-11-15-16:16:32 null 37 2018-11-15-16:16:33 null 38 2018-11-15-16:16:34 null 39 2018-11-15-16:16:35 null 40 2018-11-15-16:16:36 null 41 2018-11-15-16:16:37 null 42 2018-11-15-16:16:38 null 43 2018-11-15-16:16:39 null 44 2018-11-15-16:16:40 null 45 2018-11-15-16:16:41 null 46 2018-11-15-16:16:42 null 47 2018-11-15-16:16:43 null 48 2018-11-15-16:16:44 null If I run again and again I still get the same pattern: valid data (even 0 in case of not write activity) followed by more null data. Is that normal ? If not, is there a way to get only non-null data by fine-tuning pmcollector's configuration file ? The corresponding ZiMon sensor (GPFSNSDFS) have period=1. The ZiMon version is 4.2.3-7. Thanks, Alvise _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Fri Nov 16 09:19:07 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 16 Nov 2018 10:19:07 +0100 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: <20181116030646.GA28141@slac.stanford.edu> References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. 
Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From UWEFALKE at de.ibm.com Fri Nov 16 09:35:25 2018 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Fri, 16 Nov 2018 10:35:25 +0100 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Hi, Having mentioned nsddevices, I do not know how Scale treats different device types differently, so generic would be a fine choice unless development tells you differently. Currently known device types are listed in the comments of the script /usr/lpp/mmfs/bin/mmdevdiscover Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 
7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 16/11/2018 10:19 Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. 
Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Fri Nov 16 12:31:57 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 16 Nov 2018 07:31:57 -0500 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? In-Reply-To: References: <20181116030646.GA28141@slac.stanford.edu> Message-ID: Note, RHEL 7.6 is not yet a supported platform for Spectrum Scale so you may want to use RHEL 7.5 or wait for RHEL 7.6 to be supported. Using "generic" for the device type should be the proper option here. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Uwe Falke" To: gpfsug main discussion list Date: 11/16/2018 04:35 AM Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having mentioned nsddevices, I do not know how Scale treats different device types differently, so generic would be a fine choice unless development tells you differently. Currently known device types are listed in the comments of the script /usr/lpp/mmfs/bin/mmdevdiscover Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Uwe Falke/Germany/IBM To: gpfsug main discussion list Date: 16/11/2018 10:19 Subject: Re: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Hi Lance, you might need to use /var/mmfs/etc/nsddevices to tell GPFS about these devices (template in /usr/lpp/mmfs/samples/nsddevices.sample) Mit freundlichen Gr??en / Kind regards Dr. 
Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Thomas Wolter, Sven Schoo? Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Lance Nakata To: gpfsug-discuss at spectrumscale.org Cc: "Jon L. Bergman" Date: 16/11/2018 04:14 Subject: [gpfsug-discuss] How to use RHEL 7 mdadm NVMe devices with Spectrum Scale 4.2.3.10? Sent by: gpfsug-discuss-bounces at spectrumscale.org We have a Dell R740xd with 24 x 1TB NVMe SSDs in the internal slots. Since PERC RAID cards don't see these devices, we are using mdadm software RAID to build NSDs. We took 12 NVMe SSDs and used mdadm to create a 10 + 1 + 1 hot spare RAID 5 stripe named /dev/md101. We took the other 12 NVMe SSDs and created a similar /dev/md102. mmcrnsd worked without errors. The problem is that Spectrum Scale does not see the /dev/md10x devices as proper NSDs; the Device and Devtype columns are blank: host2:~> sudo mmlsnsd -X Disk name NSD volume ID Device Devtype Node name Remarks --------------------------------------------------------------------------------------------------- nsd0001 864FD12858A36E79 /dev/sdb generic host1.slac.stanford.edu server node nsd0002 864FD12858A36E7A /dev/sdc generic host1.slac.stanford.edu server node nsd0021 864FD1285956B0A7 /dev/sdd generic host1.slac.stanford.edu server node nsd0251a 864FD1545BD0CCDF /dev/dm-9 dmm host2.slac.stanford.edu server node nsd0251b 864FD1545BD0CCE0 /dev/dm-11 dmm host2.slac.stanford.edu server node nsd0252a 864FD1545BD0CCE1 /dev/dm-10 dmm host2.slac.stanford.edu server node nsd0252b 864FD1545BD0CCE2 /dev/dm-8 dmm host2.slac.stanford.edu server node nsd02nvme1 864FD1545BEC5D72 - - host2.slac.stanford.edu (not found) server node nsd02nvme2 864FD1545BEC5D73 - - host2.slac.stanford.edu (not found) server node I know we can access the internal NVMe devices by their individual /dev/nvmeXX paths, but non-ESS-based Spectrum Scale does not have built-in RAID functionality. Hence, the only option in that scenario is replication, which is expensive and won't give us enough usable space. Software Environment: RHEL 7.6 with kernel 3.10.0-862.14.4.el7.x86_64 Spectrum Scale 4.2.3.10 Spectrum Scale Support has implied we can't use mdadm for NVMe devices. Is that really true? Does anyone use an mdadm-based NVMe config? If so, did you have to do some kind of customization to get it working? Thank you, Lance Nakata SLAC National Accelerator Laboratory _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
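To make the nsddevices suggestion concrete, a minimal sketch of the user exit is below. It is modelled on /usr/lpp/mmfs/samples/nsddevices.sample, assumes the md arrays are the only extra devices that need exposing, and must be copied to /var/mmfs/etc/nsddevices and made executable on every NSD server that serves those disks:

#!/bin/ksh
# /var/mmfs/etc/nsddevices -- user exit invoked by mmdevdiscover.
# Print one "deviceName deviceType" pair per line, names relative to /dev.
osName=$(/bin/uname -s)
if [[ $osName = Linux ]]
then
    # report every md array listed in /proc/mdstat as type "generic",
    # the device type recommended earlier in this thread
    awk '/^md/ { print $1 " generic" }' /proc/mdstat
fi
# The sample's trailing comments give the return-code convention:
# return 0 to use only the list printed above, return 1 to also run the
# normal GPFS discovery. Returning 1 keeps the existing dm-/sd- NSDs visible.
return 1

After installing it and re-running device discovery (e.g. mmnsddiscover -a, or a daemon restart), mmlsnsd -X should show the md NSDs with a Device and Devtype instead of "(not found)". Treat the return-code comment as an assumption to verify against the sample shipped with your level.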
URL: From novosirj at rutgers.edu Thu Nov 15 17:17:15 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 15 Nov 2018 17:17:15 +0000 Subject: [gpfsug-discuss] [External] Re: GSS Software Release? (Ryan Novosielski) In-Reply-To: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> References: <91CD73DD10338C4E833129D08C24BB04ACF56861@usmailmbx04> Message-ID: <28988E74-6BAC-47FB-AEE2-015D2B784A40@rutgers.edu> Thanks, all. I was looking around FlexNet this week and didn?t see it, but it?s good to know it exists/likely will appear soon. -- ____ || \\UTGERS, |---------------------------*O*--------------------------- ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 (2x0922) ~*~ RBHS Campus || \\ of NJ | Office of Advanced Research Computing - MSB C630, Newark `' > On Nov 15, 2018, at 10:56 AM, Andre Posthuma wrote: > > Hello, > > GSS 3.3b was released last week, with a number of Spectrum Scale versions available : > 5.0.1.2 > 4.2.3.11 > 4.1.1.20 > > DSS-G 2.2a was released yesterday, with 2 Spectrum Scale versions available : > > 5.0.2.1 > 4.2.3.11 > > Best Regards > > > Andre Posthuma > IT Specialist > HPC Services > Lenovo United Kingdom > +44 7841782363 > aposthuma at lenovo.com > > > Lenovo.com > Twitter | Facebook | Instagram | Blogs | Forums > > > > > > > > -----Original Message----- > From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Carl Zetie > Sent: Thursday, November 15, 2018 2:01 PM > To: gpfsug-discuss at spectrumscale.org > Subject: [External] Re: [gpfsug-discuss] GSS Software Release? (Ryan Novosielski) > > >> any idea when a newer GSS software release than 3.3a will be released? > > That is definitely a question only our friends at Lenovo can answer. If you don't get a response here (I'm not sure if any Lenovites are active on the list), you'll need to address it directly to Lenovo, e.g. your account team. > > > Carl Zetie > Program Director > Offering Management for Spectrum Scale, IBM > ---- > (540) 882 9353 ][ Research Triangle Park > carlz at us.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From novosirj at rutgers.edu Thu Nov 15 18:33:12 2018 From: novosirj at rutgers.edu (Ryan Novosielski) Date: Thu, 15 Nov 2018 18:33:12 +0000 Subject: [gpfsug-discuss] GSS Software Release? In-Reply-To: References: <41a5c465-b025-f0b1-1f25-65678f8665b0@rutgers.edu>, Message-ID: Thanks, Andy. I just realized our entitlement lapsed on GSS and that?s probably why I don?t see it there at the moment. Helpful to know what?s in there though for planning while that is worked out. Sent from my iPhone On Nov 15, 2018, at 13:29, Andy Kurth > wrote: Public information on GSS updates seems nonexistent. You can find some clues if you have access to Lenovo's restricted download site. It looks like gss3.3b was released in late September. There are gss3.3b download options that include either 4.2.3-9 or 4.1.1-20. Earlier this month they also released some GPFS-only updates for 4.3.2-11 and 5.0.1-2. It looks like these are meant to be applied on top of gss3.3b. For DSS-G, it looks like dss-g-2.2a is the latest full release with options that include 4.2.3-11 or 5.0.2-1. 
There are also separate DSS-G GPFS-only updates for 4.2.3-11 and 5.0.1-2. Regards, Andy Kurth / NCSU On Thu, Nov 15, 2018 at 12:01 AM Ryan Novosielski > wrote: -----BEGIN PGP SIGNED MESSAGE----- Hash: SHA1 I know this might not be the perfect venue, but I know IBM developers participate and will occasionally share this sort of thing: any idea when a newer GSS software release than 3.3a will be released? We are attempting to plan our maintenance schedule. At the moment, the DSS-G software seems to be getting updated and we'd prefer to remain at the same GPFS release on DSS-G and GSS. - -- ____ || \\UTGERS, |----------------------*O*------------------------ ||_// the State | Ryan Novosielski - novosirj at rutgers.edu || \\ University | Sr. Technologist - 973/972.0922 ~*~ RBHS Campus || \\ of NJ | Office of Advanced Res. Comp. - MSB C630, Newark `' -----BEGIN PGP SIGNATURE----- iEYEARECAAYFAlvsPxcACgkQmb+gadEcsb46oQCgoeSgzV6HaJ6NzNSgFZQSzMDl qvcAn2ql2U8peuGuhptTIejVgnDFSWEf =7Iue -----END PGP SIGNATURE----- _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Andy Kurth Research Storage Specialist NC State University Office of Information Technology P: 919-513-4090 311A Hillsborough Building Campus Box 7109 Raleigh, NC 27695 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From andreas.mattsson at maxiv.lu.se Tue Nov 20 15:01:36 2018 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Tue, 20 Nov 2018 15:01:36 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS Message-ID: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> On one of our clusters, from time to time if users try to access files or folders via the direct full path over NFS, the NFS-client gets invalid information from the server. For instance, if I run "ls /gpfs/filesystem/test/test2/test3" over NFS-mount, result is just full of ???????? If I recurse through the path once, for instance by ls'ing or cd'ing through the folders one at a time or running ls -R, I can then access directly via the full path afterwards. This seem to be intermittent, and I haven't found how to reliably recreate the issue. Possibly, it can be connected to creating or changing files or folders via a GPFS mount, and then accessing them through NFS, but it doesn't happen consistently. Is this a known behaviour or bug, and does anyone know how to fix the issue? These NSD-servers currently run Scale 4.2.2.3, while the CES is on 5.0.1.1. GPFS clients run Scale 5.0.1.1, and NFS clients run CentOS 7.5. Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5610 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
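One way to gather hard evidence for the intermittent NFS failure described above is a packet capture on an NFS client while the bad listing is reproduced, so that the READDIR reply from the Ganesha server ends up in the trace. A sketch, assuming NFS over TCP port 2049; the interface and host names are placeholders:

# eth0 and ces-node.example.com are placeholders - use the client's real
# interface and the CES address it mounts from.  -s 0 keeps full packets.
tcpdump -i eth0 -s 0 -w /tmp/ces-nfs-invalid-arg.pcap \
    host ces-node.example.com and tcp port 2049 &
CAPTURE_PID=$!

# Reproduce the failure while the capture is running
ls /gpfs/filesystem/test/test2/test3

kill "$CAPTURE_PID"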
Name: smime.p7s Type: application/pkcs7-signature Size: 4575 bytes Desc: not available URL: From valdis.kletnieks at vt.edu Tue Nov 20 15:25:16 2018 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Tue, 20 Nov 2018 10:25:16 -0500 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: <17828.1542727516@turing-police.cc.vt.edu> On Tue, 20 Nov 2018 15:01:36 +0000, Andreas Mattsson said: > On one of our clusters, from time to time if users try to access files or > folders via the direct full path over NFS, the NFS-client gets invalid > information from the server. > > For instance, if I run "ls /gpfs/filesystem/test/test2/test3" over > NFS-mount, result is just full of ???????? I've seen the Ganesha server do this sort of thing once in a while. Never tracked it down, because it was always in the middle of bigger misbehaviors... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Tue Nov 20 23:39:37 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Tue, 20 Nov 2018 23:39:37 +0000 Subject: [gpfsug-discuss] mmfsd recording High CPU usage Message-ID: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a still happy day seeing we're still above ground! ? Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample: [~]# top -b -n 1 | grep mmfs PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 mmfsd 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 runmmfs Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate. Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Redhat 7.4, btw... Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu -------------- next part -------------- An HTML attachment was scrubbed... URL: From jjdoherty at yahoo.com Wed Nov 21 13:01:54 2018 From: jjdoherty at yahoo.com (Jim Doherty) Date: Wed, 21 Nov 2018 13:01:54 +0000 (UTC) Subject: [gpfsug-discuss] mmfsd recording High CPU usage In-Reply-To: References: Message-ID: <1913697205.666954.1542805314669@mail.yahoo.com> At a guess with no data ....?? if the application is opening more files than can fit in the maxFilesToCache (MFTC) objects? GPFS will expand the MFTC to support the open files,? but it will also scan to try and free any unused objects.??? If you can identify the user job that is causing this? you could monitor a system more closely. Jim On Wednesday, November 21, 2018, 2:10:45 AM EST, Saula, Oluwasijibomi wrote: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a?still happy day seeing we're still above ground!? 
Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample: [~]# top -b -n 1 | grep mmfs ? ?PID USER? ? ?PR? NI? ?VIRT? ? RES? ?SHR S? %CPU %MEM ? ? TIME+ COMMAND 231898 root ? ? ? 0 -20 14.508g 4.272g? 70168 S?93.8? 6.8?69503:41 mmfsd ?4161 root ? ? ? 0 -20?121876 ? 9412 ? 1492 S ? 0.0?0.0 ? 0:00.22 runmmfs Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate.?Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Redhat 7.4, btw... Thanks,?Siji SaulaHPC System AdministratorCenter for Computationally Assisted Science & TechnologyNORTH DAKOTA STATE UNIVERSITY? Research 2 Building???Room 220BDept 4100, PO Box 6050? / Fargo, ND 58108-6050p:701.231.7749www.ccast.ndsu.edu?|?www.ndsu.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Nov 21 15:32:55 2018 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 21 Nov 2018 07:32:55 -0800 Subject: [gpfsug-discuss] mmfsd recording High CPU usage In-Reply-To: References: Message-ID: Hi, the best way to debug something like that is to start with top. start top then press 1 and check if any of the cores has almost 0% idle while others have plenty of CPU left. if that is the case you have one very hot thread. to further isolate it you can press 1 again to collapse the cores, now press shirt-h which will break down each thread of a process and show them as an individual line. now you either see one or many mmfsd's causing cpu consumption, if its many your workload is just doing a lot of work, what is more concerning is if you have just 1 thread running at the 90%+ . if thats the case write down the PID of the thread that runs so hot and run mmfsadm dump threads,kthreads >dum. you will see many entries in the file like : MMFSADMDumpCmdThread: desc 0x7FC84C002980 handle 0x4C0F02FA parm 0x7FC9700008C0 highStackP 0x7FC783F7E530 pthread 0x83F80700 kernel thread id 49878 (slot -1) pool 21 ThPoolCommands per-thread gbls: 0:0x0 1:0x0 2:0x0 3:0x3 4:0xFFFFFFFFFFFFFFFF 5:0x0 6:0x0 7:0x7FC98C0067B0 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x400000E 14:0x7FC98C004C10 15:0x0 16:0x4 17:0x0 18:0x0 find the pid behind 'thread id' and post that section, that would be the first indication on what that thread does ... sven On Tue, Nov 20, 2018 at 11:10 PM Saula, Oluwasijibomi < oluwasijibomi.saula at ndsu.edu> wrote: > Hello Scalers, > > > First, let me say Happy Thanksgiving to those of us in the US and to those > beyond, well, it's a still happy day seeing we're still above ground! ? > > > Now, what I have to discuss isn't anything extreme so don't skip the > turkey for this, but lately, on a few of our compute GPFS client nodes, > we've been noticing high CPU usage by the mmfsd process and are wondering > why. 
Here's a sample: > > > [~]# top -b -n 1 | grep mmfs > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > > > 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 > *mmfs*d > > 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 run > *mmfs* > > Obviously, this behavior was likely triggered by a not-so-convenient user > job that in most cases is long finished by the time we > investigate. Nevertheless, does anyone have an idea why this might be > happening? Any thoughts on preventive steps even? > > > This is GPFS v4.2.3 on Redhat 7.4, btw... > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > > Research 2 > Building > ? Room 220B > Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu | www.ndsu.edu > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From bzhang at ca.ibm.com Wed Nov 21 18:52:12 2018 From: bzhang at ca.ibm.com (Bohai Zhang) Date: Wed, 21 Nov 2018 13:52:12 -0500 Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) In-Reply-To: References: <970D2C02-7381-4938-9BAF-EBEEFC5DE1A0@gmail.com>, <45118669-0B80-4F6D-A6C5-5A1B702D3C34@filmlance.se> Message-ID: Hi all, This is a reminder for our next week's technical webinar. Everyone is welcome to register and attend. Thanks, Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? ?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. From: "Bohai Zhang" To: gpfsug main discussion list Date: 2018/11/09 11:37 AM Subject: [gpfsug-discuss] IBM Spectrum Scale Webinars: Debugging Network Related Issues (Nov 27/29) Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We are going to host our next technical webinar. everyone is welcome to register and attend. Information: IBM Spectrum Scale Webinars: Debugging Network Related Issues? ?? 
?IBM Spectrum Scale Webinars are hosted by IBM Spectrum Scale Support to share expertise and knowledge of the Spectrum Scale product, as well as product updates and best practices.? ?? ?This webinar will focus on debugging the network component of two general classes of Spectrum Scale issues. Please see the agenda below for more details. Note that our webinars are free of charge and will be held online via WebEx.? ?? ?Agenda? 1) Expels and related network flows? 2) Recent enhancements to Spectrum Scale code? 3) Performance issues related to Spectrum Scale's use of the network? ?4) FAQs? ?? ?NA/EU Session? Date: Nov 27, 2018? Time: 10 AM - 11AM EST (3PM GMT)? Register at: https://ibm.biz/BdYJY6? Audience: Spectrum Scale administrators? ?? ?AP/JP Session? Date: Nov 29, 2018? Time: 10AM - 11AM Beijing Time (11AM Tokyo Time)? Register at: https://ibm.biz/BdYJYU? Audience: Spectrum Scale administrators? Regards, IBM Spectrum Computing Bohai Zhang Critical Senior Technical Leader, IBM Systems Situation Tel: 1-905-316-2727 Resolver Mobile: 1-416-897-7488 Expert Badge Email: bzhang at ca.ibm.com 3600 STEELES AVE EAST, MARKHAM, ON, L3R 9Z7, Canada Live Chat at IBMStorageSuptMobile Apps Support Portal | Fix Central | Knowledge Center | Request for Enhancement | Product SMC IBM | dWA We meet our service commitment only when you are very satisfied and EXTREMELY LIKELY to recommend IBM. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16927775.gif Type: image/gif Size: 2665 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16361907.gif Type: image/gif Size: 275 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16531853.gif Type: image/gif Size: 305 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16209659.gif Type: image/gif Size: 331 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16604524.gif Type: image/gif Size: 3621 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 16509495.gif Type: image/gif Size: 1243 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From oluwasijibomi.saula at ndsu.edu Wed Nov 21 20:55:29 2018 From: oluwasijibomi.saula at ndsu.edu (Saula, Oluwasijibomi) Date: Wed, 21 Nov 2018 20:55:29 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 82, Issue 31 In-Reply-To: References: Message-ID: Sven/Jim, Thanks for sharing your thoughts! - Currently, we have mFTC set as such: maxFilesToCache 4000 However, since we have a very diverse workload, we'd have to cycle through a vast majority of our apps to find the most fitting mFTC value as this page (https://www.ibm.com/support/knowledgecenter/en/linuxonibm/liaag/wecm/l0wecm00_maxfilestocache.htm) suggests. 
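If cache pressure does turn out to be the culprit, maxFilesToCache can be raised for just the affected clients rather than cluster-wide. A hedged sketch follows: the node class name compute_clients and the value 131072 are placeholders, not recommendations, and the change only takes effect after the daemon is restarted on those nodes. Each cached object also costs memory and token state on the token servers, so large increases are best rolled out gradually.

# Show the current value and where it is set
mmlsconfig maxFilesToCache

# Raise it for a hypothetical node class of compute clients
mmchconfig maxFilesToCache=131072 -N compute_clients

# Not a dynamic setting - restart GPFS on those nodes to apply it
mmshutdown -N compute_clients
mmstartup -N compute_clients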
In the meantime, I was able to gather some more info for the lone mmfsd thread (pid: 34096) running at high CPU utilization, and right away I can see the number of nonvoluntary_ctxt_switches is quite high, compared to the other threads in the dump; however, I think I need some help interpreting all of this. Although, I should add that heavy HPC workloads (i.e. vasp, ansys...) are running on these nodes and may be somewhat related to this issue: Scheduling info for kernel thread 34096 mmfsd (34096, #threads: 309) ------------------------------------------------------------------- se.exec_start : 8057632237.613486 se.vruntime : 4914854123.640008 se.sum_exec_runtime : 1042598557.420591 se.nr_migrations : 8337485 nr_switches : 15824325 nr_voluntary_switches : 4110 nr_involuntary_switches : 15820215 se.load.weight : 88761 policy : 0 prio : 100 clock-delta : 24 mm->numa_scan_seq : 88980 numa_migrations, 5216521 numa_faults_memory, 0, 0, 1, 1, 1 numa_faults_memory, 1, 0, 0, 1, 1030 numa_faults_memory, 0, 1, 0, 0, 1 numa_faults_memory, 1, 1, 0, 0, 1 Status for kernel thread 34096 Name: mmfsd Umask: 0022 State: R (running) Tgid: 58921 Ngid: 34395 Pid: 34096 PPid: 3941 TracerPid: 0 Uid: 0 0 0 0 Gid: 0 0 0 0 FDSize: 256 Groups: VmPeak: 15137612 kB VmSize: 15126340 kB VmLck: 4194304 kB VmPin: 8388712 kB VmHWM: 4424228 kB VmRSS: 4420420 kB RssAnon: 4350128 kB RssFile: 50512 kB RssShmem: 19780 kB VmData: 14843812 kB VmStk: 132 kB VmExe: 23672 kB VmLib: 121856 kB VmPTE: 9652 kB VmSwap: 0 kB Threads: 309 SigQ: 5/257225 SigPnd: 0000000000000000 ShdPnd: 0000000000000000 SigBlk: 0000000010017a07 SigIgn: 0000000000000000 SigCgt: 0000000180015eef CapInh: 0000000000000000 CapPrm: 0000001fffffffff CapEff: 0000001fffffffff CapBnd: 0000001fffffffff CapAmb: 0000000000000000 Seccomp: 0 Cpus_allowed: ffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff Cpus_allowed_list: 0-239 Mems_allowed: 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003 Mems_allowed_list: 0-1 voluntary_ctxt_switches: 4110 nonvoluntary_ctxt_switches: 15820215 Thanks, Siji Saula HPC System Administrator Center for Computationally Assisted Science & Technology NORTH DAKOTA STATE UNIVERSITY Research 2 Building ? Room 220B Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 p:701.231.7749 www.ccast.ndsu.edu | www.ndsu.edu ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Wednesday, November 21, 2018 9:33:10 AM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 82, Issue 31 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: mmfsd recording High CPU usage (Jim Doherty) 2. 
Re: mmfsd recording High CPU usage (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Wed, 21 Nov 2018 13:01:54 +0000 (UTC) From: Jim Doherty To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage Message-ID: <1913697205.666954.1542805314669 at mail.yahoo.com> Content-Type: text/plain; charset="utf-8" At a guess with no data ....?? if the application is opening more files than can fit in the maxFilesToCache (MFTC) objects? GPFS will expand the MFTC to support the open files,? but it will also scan to try and free any unused objects.??? If you can identify the user job that is causing this? you could monitor a system more closely. Jim On Wednesday, November 21, 2018, 2:10:45 AM EST, Saula, Oluwasijibomi wrote: Hello Scalers, First, let me say Happy Thanksgiving to those of us in the US and to those beyond, well, it's a?still happy day seeing we're still above ground!? Now, what I have to discuss isn't anything extreme so don't skip the turkey for this, but lately, on a few of our compute GPFS client nodes, we've been noticing high CPU usage by the mmfsd process and are wondering why. Here's a sample: [~]# top -b -n 1 | grep mmfs ? ?PID USER? ? ?PR? NI? ?VIRT? ? RES? ?SHR S? %CPU %MEM ? ? TIME+ COMMAND 231898 root ? ? ? 0 -20 14.508g 4.272g? 70168 S?93.8? 6.8?69503:41 mmfsd ?4161 root ? ? ? 0 -20?121876 ? 9412 ? 1492 S ? 0.0?0.0 ? 0:00.22 runmmfs Obviously, this behavior was likely triggered by a not-so-convenient user job that in most cases is long finished by the time we investigate.?Nevertheless, does anyone have an idea why this might be happening? Any thoughts on preventive steps even? This is GPFS v4.2.3 on Redhat 7.4, btw... Thanks,?Siji SaulaHPC System AdministratorCenter for Computationally Assisted Science & TechnologyNORTH DAKOTA STATE UNIVERSITY? Research 2 Building???Room 220BDept 4100, PO Box 6050? / Fargo, ND 58108-6050p:701.231.7749www.ccast.ndsu.edu?|?www.ndsu.edu _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Wed, 21 Nov 2018 07:32:55 -0800 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmfsd recording High CPU usage Message-ID: Content-Type: text/plain; charset="utf-8" Hi, the best way to debug something like that is to start with top. start top then press 1 and check if any of the cores has almost 0% idle while others have plenty of CPU left. if that is the case you have one very hot thread. to further isolate it you can press 1 again to collapse the cores, now press shirt-h which will break down each thread of a process and show them as an individual line. now you either see one or many mmfsd's causing cpu consumption, if its many your workload is just doing a lot of work, what is more concerning is if you have just 1 thread running at the 90%+ . if thats the case write down the PID of the thread that runs so hot and run mmfsadm dump threads,kthreads >dum. 
you will see many entries in the file like : MMFSADMDumpCmdThread: desc 0x7FC84C002980 handle 0x4C0F02FA parm 0x7FC9700008C0 highStackP 0x7FC783F7E530 pthread 0x83F80700 kernel thread id 49878 (slot -1) pool 21 ThPoolCommands per-thread gbls: 0:0x0 1:0x0 2:0x0 3:0x3 4:0xFFFFFFFFFFFFFFFF 5:0x0 6:0x0 7:0x7FC98C0067B0 8:0x0 9:0x0 10:0x0 11:0x0 12:0x0 13:0x400000E 14:0x7FC98C004C10 15:0x0 16:0x4 17:0x0 18:0x0 find the pid behind 'thread id' and post that section, that would be the first indication on what that thread does ... sven On Tue, Nov 20, 2018 at 11:10 PM Saula, Oluwasijibomi < oluwasijibomi.saula at ndsu.edu> wrote: > Hello Scalers, > > > First, let me say Happy Thanksgiving to those of us in the US and to those > beyond, well, it's a still happy day seeing we're still above ground! ? > > > Now, what I have to discuss isn't anything extreme so don't skip the > turkey for this, but lately, on a few of our compute GPFS client nodes, > we've been noticing high CPU usage by the mmfsd process and are wondering > why. Here's a sample: > > > [~]# top -b -n 1 | grep mmfs > > PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ > COMMAND > > > 231898 root 0 -20 14.508g 4.272g 70168 S 93.8 6.8 69503:41 > *mmfs*d > > 4161 root 0 -20 121876 9412 1492 S 0.0 0.0 0:00.22 run > *mmfs* > > Obviously, this behavior was likely triggered by a not-so-convenient user > job that in most cases is long finished by the time we > investigate. Nevertheless, does anyone have an idea why this might be > happening? Any thoughts on preventive steps even? > > > This is GPFS v4.2.3 on Redhat 7.4, btw... > > > Thanks, > > Siji Saula > HPC System Administrator > Center for Computationally Assisted Science & Technology > *NORTH DAKOTA STATE UNIVERSITY* > > > Research 2 > Building > ? Room 220B > Dept 4100, PO Box 6050 / Fargo, ND 58108-6050 > p:701.231.7749 > www.ccast.ndsu.edu | www.ndsu.edu > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 82, Issue 31 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: From mnaineni at in.ibm.com Thu Nov 22 10:32:27 2018 From: mnaineni at in.ibm.com (Malahal R Naineni) Date: Thu, 22 Nov 2018 10:32:27 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.image001.png at 01D480EA.4FF3B020.png Type: image/png Size: 5610 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Fri Nov 23 08:12:25 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Fri, 23 Nov 2018 08:12:25 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Message-ID: Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. 
The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= From andreas.mattsson at maxiv.lu.se Fri Nov 23 13:41:37 2018 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Fri, 23 Nov 2018 13:41:37 +0000 Subject: [gpfsug-discuss] Filesystem access issues via CES NFS In-Reply-To: References: <717f49aade0b439eb1b99fc620a21cac@maxiv.lu.se> Message-ID: <9456645b0a1f4b488b13874ea672b9b8@maxiv.lu.se> Yes, this is repeating. We?ve ascertained that it has nothing to do at all with file operations on the GPFS side. Randomly throughout the filesystem mounted via NFS, ls or file access will give ? > ls: reading directory /gpfs/filessystem/test/testdir: Invalid argument ? Trying again later might work on that folder, but might fail somewhere else. We have tried exporting the same filesystem via a standard kernel NFS instead of the CES Ganesha-NFS, and then the problem doesn?t exist. So it is definitely related to the Ganesha NFS server, or its interaction with the file system. Will see if I can get a tcpdump of the issue. Regards, Andreas _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se Fr?n: gpfsug-discuss-bounces at spectrumscale.org F?r Malahal R Naineni Skickat: den 22 november 2018 11:32 Till: gpfsug-discuss at spectrumscale.org Kopia: gpfsug-discuss at spectrumscale.org ?mne: Re: [gpfsug-discuss] Filesystem access issues via CES NFS We have seen empty lists (ls showing nothing). 
If this repeats, please take tcpdump from the client and we will investigate. Regards, Malahal. ----- Original message ----- From: Andreas Mattsson Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "gpfsug-discuss at spectrumscale.org" Cc: Subject: [gpfsug-discuss] Filesystem access issues via CES NFS Date: Tue, Nov 20, 2018 8:47 PM On one of our clusters, from time to time if users try to access files or folders via the direct full path over NFS, the NFS-client gets invalid information from the server. For instance, if I run ?ls /gpfs/filesystem/test/test2/test3? over NFS-mount, result is just full of ???????? If I recurse through the path once, for instance by ls?ing or cd?ing through the folders one at a time or running ls ?R, I can then access directly via the full path afterwards. This seem to be intermittent, and I haven?t found how to reliably recreate the issue. Possibly, it can be connected to creating or changing files or folders via a GPFS mount, and then accessing them through NFS, but it doesn?t happen consistently. Is this a known behaviour or bug, and does anyone know how to fix the issue? These NSD-servers currently run Scale 4.2.2.3, while the CES is on 5.0.1.1. GPFS clients run Scale 5.0.1.1, and NFS clients run CentOS 7.5. Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 224 84 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.lu.se www.maxiv.se _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5610 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 4575 bytes Desc: not available URL: From jtolson at us.ibm.com Mon Nov 26 14:31:29 2018 From: jtolson at us.ibm.com (John T Olson) Date: Mon, 26 Nov 2018 07:31:29 -0700 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. 
On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Mon Nov 26 14:55:06 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Mon, 26 Nov 2018 14:55:06 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: Hallo John, record is open, TS001631590. Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. 
Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. 
======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From alvise.dorigo at psi.ch Mon Nov 26 15:43:59 2018 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 26 Nov 2018 15:43:59 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Message-ID: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Mon Nov 26 15:54:57 2018 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Mon, 26 Nov 2018 15:54:57 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Try an dedicated extra ? -p ? foreach Attribute Von meinem iPhone gesendet > Am 26.11.2018 um 16:50 schrieb Dorigo Alvise (PSI) : > > Good evening, > I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm > to setup AFM parallel transfer. > > Why the following command (grabbed directly from the web page above) fires out that error ? > > [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome > mmcrfileset: Incorrect extra argument: ?p > Usage: > mmcrfileset Device FilesetName [-p afmAttribute=Value...] 
[-t Comment] > [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] > [--allow-permission-change PermissionChangeMode] > > The mapping was correctly created: > > [root at sf-export-3 ~]# mmafmconfig show > Map name: afmgw1 > Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch > > Is this a known bug ? > > Thanks, > Regards. > > Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Nov 26 16:33:58 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Mon, 26 Nov 2018 16:33:58 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Is that an 'ndash' rather than "-"? Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of alvise.dorigo at psi.ch [alvise.dorigo at psi.ch] Sent: 26 November 2018 15:43 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes[:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise From kenneth.waegeman at ugent.be Mon Nov 26 16:26:51 2018 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Mon, 26 Nov 2018 17:26:51 +0100 Subject: [gpfsug-discuss] mmfsck output Message-ID: Hi all, We had some leftover files with IO errors on a GPFS FS, so we ran a mmfsck. Does someone know what these mmfsck errors mean: Error in inode 38422 snap 0: has nlink field as 1 Error in inode 281057 snap 0: is unreferenced ?Attach inode to lost+found of fileset root filesetId 0? no Thanks! Kenneth From daniel.kidger at uk.ibm.com Mon Nov 26 17:03:14 2018 From: daniel.kidger at uk.ibm.com (Daniel Kidger) Date: Mon, 26 Nov 2018 17:03:14 +0000 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: References: , <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: An HTML attachment was scrubbed... URL: From abhisdav at in.ibm.com Tue Nov 27 06:38:27 2018 From: abhisdav at in.ibm.com (Abhishek Dave) Date: Tue, 27 Nov 2018 12:08:27 +0530 Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping In-Reply-To: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> References: <83A6EEB0EC738F459A39439733AE8045267A6DE1@MBX114.d.ethz.ch> Message-ID: Hi, Looks like some issue with syntax. Please try below one. 
mmcrfileset ?p afmmode=,afmtarget=://// --inode-space new #mmcrfileset gpfs1 sw1 ?p afmmode=sw,afmtarget=gpfs://mapping1/gpfs/gpfs2/swhome --inode-space new #mmcrfileset gpfs1 ro1 ?p afmmode=ro,afmtarget=gpfs://mapping2/gpfs/gpfs2/swhome --inode-space new Thanks, Abhishek, Dave From: "Dorigo Alvise (PSI)" To: "gpfsug-discuss at spectrumscale.org" Date: 11/26/2018 09:20 PM Subject: [gpfsug-discuss] Error with AFM fileset creation with mapping Sent by: gpfsug-discuss-bounces at spectrumscale.org Good evening, I'm following this guide: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_paralleldatatransfersafm.htm to setup AFM parallel transfer. Why the following command (grabbed directly from the web page above) fires out that error ? [root at sf-export-3 ~]# mmcrfileset RAW test1 --inode-space new ?p afmmode=sw,afmtarget=afmgw1://gpfs/gpfs2/swhome mmcrfileset: Incorrect extra argument: ?p Usage: mmcrfileset Device FilesetName [-p afmAttribute=Value...] [-t Comment] [--inode-space {new [--inode-limit MaxNumInodes [:NumInodesToPreallocate]] | ExistingFileset}] [--allow-permission-change PermissionChangeMode] The mapping was correctly created: [root at sf-export-3 ~]# mmafmconfig show Map name: afmgw1 Export server map: 172.16.1.2/sf-export-2.psi.ch,172.16.1.3/sf-export-3.psi.ch Is this a known bug ? Thanks, Regards. Alvise_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 15:24:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 15:24:25 +0000 Subject: [gpfsug-discuss] Hanging file-systems Message-ID: <06FF0D9C-9ED7-434E-A7FF-C56518048E25@bham.ac.uk> I have a file-system which keeps hanging over the past few weeks. Right now, its offline and taken a bunch of services out with it. (I have a ticket with IBM open about this as well) We see for example: Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), re ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 and on that node: Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 8 (TokenCondvar), reason 'wait for SubToken to become stable' On this node, if you dump tscomm, you see entries like: Pending messages: msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1 this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) dest status pending , err 0, reply len 0 by TCP connection c0n9 is itself. This morning when this happened, the only way to get the FS back online was to shutdown the entire cluster. Any pointers for next place to look/how to fix? Simon -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Tue Nov 27 16:02:44 2018 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Tue, 27 Nov 2018 16:02:44 +0000 Subject: [gpfsug-discuss] Hanging file-systems Message-ID: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> I have seen something like this in the past, and I have resorted to a cluster restart as well. :-( IBM and I could never really track it down, because I could not get a dump at the time of occurrence. However, you might take a look at your NSD servers, one at a time. As I recall, we thought it was a stuck thread on one of the NSD servers, and when we restarted the ?right? one it cleared the block. The other thing I?ve done in the past to isolate problems like this (since this is related to tokens) is to look at the ?token revokes? on each node, looking for ones that are sticking around for a long time. I tossed together a quick script and ran it via mmdsh on all the node. Not pretty, but it got the job done. Run this a few times, see if any of the revokes are sticking around for a long time #!/bin/sh rm -f /tmp/revokelist /usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > /tmp/revokelist 2> /dev/null if [ $? -eq 0 ]; then /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out for n in `cat /tmp/revokelist | grep msgHdr | awk '{print $5}'`; do grep $n /tmp/tscomm.out | tail -1 done rm -f /tmp/tscomm.out fi Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Simon Thompson Reply-To: gpfsug main discussion list Date: Tuesday, November 27, 2018 at 9:27 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] [gpfsug-discuss] Hanging file-systems I have a file-system which keeps hanging over the past few weeks. Right now, its offline and taken a bunch of services out with it. (I have a ticket with IBM open about this as well) We see for example: Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 (MsgRecordCondvar), re ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 and on that node: Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 8 (TokenCondvar), reason 'wait for SubToken to become stable' On this node, if you dump tscomm, you see entries like: Pending messages: msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, n_pending 1 this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) dest status pending , err 0, reply len 0 by TCP connection c0n9 is itself. This morning when this happened, the only way to get the FS back online was to shutdown the entire cluster. Any pointers for next place to look/how to fix? Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 16:14:20 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 08:14:20 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> Message-ID: if this happens you should check a couple of things : 1. are you under memory pressure or even worse started swapping . 2. is there any core running at ~ 0% idle - run top , press 1 and check the idle column. 3. is there any single thread running at ~100% - run top , press shift - h and check what the CPU % shows for the top 5 processes. 
if you want to go the extra mile, you could run perf top -p $PID_OF_MMFSD and check what the top cpu consumers are. confirming and providing data to any of the above being true could be the missing piece why nobody was able to find it, as this is stuff unfortunate nobody ever looks at. even a trace won't help if any of the above is true as all you see is that the system behaves correct according to the trace, its doesn't appear busy, Sven On Tue, Nov 27, 2018 at 8:03 AM Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: > I have seen something like this in the past, and I have resorted to a > cluster restart as well. :-( IBM and I could never really track it down, > because I could not get a dump at the time of occurrence. However, you > might take a look at your NSD servers, one at a time. As I recall, we > thought it was a stuck thread on one of the NSD servers, and when we > restarted the ?right? one it cleared the block. > > > > The other thing I?ve done in the past to isolate problems like this (since > this is related to tokens) is to look at the ?token revokes? on each node, > looking for ones that are sticking around for a long time. I tossed > together a quick script and ran it via mmdsh on all the node. Not pretty, > but it got the job done. Run this a few times, see if any of the revokes > are sticking around for a long time > > > > #!/bin/sh > > rm -f /tmp/revokelist > > /usr/lpp/mmfs/bin/mmfsadm dump tokenmgr | grep -A 2 'revokeReq list' > > /tmp/revokelist 2> /dev/null > > if [ $? -eq 0 ]; then > > /usr/lpp/mmfs/bin/mmfsadm dump tscomm > /tmp/tscomm.out > > for n in `cat /tmp/revokelist | grep msgHdr | awk '{print $5}'`; do > > grep $n /tmp/tscomm.out | tail -1 > > done > > rm -f /tmp/tscomm.out > > fi > > > > > > Bob Oesterlin > > Sr Principal Storage Engineer, Nuance > > > > > > > > *From: * on behalf of Simon > Thompson > *Reply-To: *gpfsug main discussion list > *Date: *Tuesday, November 27, 2018 at 9:27 AM > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[EXTERNAL] [gpfsug-discuss] Hanging file-systems > > > > I have a file-system which keeps hanging over the past few weeks. Right > now, its offline and taken a bunch of services out with it. > > > > (I have a ticket with IBM open about this as well) > > > > We see for example: > > Waiting 305.0391 sec since 15:17:02, monitored, thread 24885 > SharedHashTabFetchHandlerThread: on ThCond 0x7FE30000B408 > (MsgRecordCondvar), re > > ason 'RPC wait' for tmMsgTellAcquire1 on node 10.10.12.42 > > > > and on that node: > > Waiting 292.4581 sec since 15:17:22, monitored, thread 20368 > SharedHashTabFetchHandlerThread: on ThCond 0x7F3C2929719 > > 8 (TokenCondvar), reason 'wait for SubToken to become stable' > > > > On this node, if you dump tscomm, you see entries like: > > Pending messages: > > msg_id 376617, service 13.1, msg_type 20 'tmMsgTellAcquire1', n_dest 1, > n_pending 1 > > this 0x7F3CD800B930, n_xhold 1, cl 0, cbFn 0x0, age 303 sec > > sent by 'SharedHashTabFetchHandlerThread' (0x7F3DD800A6C0) > > dest status pending , err 0, reply len 0 by TCP > connection > > > > c0n9 is itself. > > > > This morning when this happened, the only way to get the FS back online > was to shutdown the entire cluster. > > > > Any pointers for next place to look/how to fix? 
> > > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 17:53:58 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 17:53:58 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> Message-ID: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Tue Nov 27 17:54:03 2018 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 27 Nov 2018 23:24:03 +0530 Subject: [gpfsug-discuss] mmfsck output In-Reply-To: References: Message-ID: This means that the files having the below inode numbers 38422 and 281057 are orphan files (i.e. files not referenced by any directory/folder) and they will be moved to the lost+found folder of the fileset owning these files by mmfsck repair. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Kenneth Waegeman To: gpfsug main discussion list Date: 11/26/2018 10:10 PM Subject: [gpfsug-discuss] mmfsck output Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi all, We had some leftover files with IO errors on a GPFS FS, so we ran a mmfsck. Does someone know what these mmfsck errors mean: Error in inode 38422 snap 0: has nlink field as 1 Error in inode 281057 snap 0: is unreferenced Attach inode to lost+found of fileset root filesetId 0? no Thanks! Kenneth _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwIGaQ&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=-J2C2ZYYUsp42fIyYHg3aYSR8wC5SKNhl6ZztfRJMvI&s=4OPQpDp8v56fvska0-O-pskIfONFMnZFydDo0T6KwJM&e= -------------- next part -------------- An HTML attachment was scrubbed... 
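For reference, a hedged sketch of what a check-then-repair sequence might look like; "gpfs0" is only a placeholder device name, and the offline repair pass requires the file system to be unmounted on every node first.

mmumount gpfs0 -a   # unmount everywhere before an offline repair
mmfsck gpfs0 -n     # read-only pass: report what would be fixed
mmfsck gpfs0 -y     # repair pass: orphaned inodes are reattached under lost+found
mmmount gpfs0 -a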
URL: From oehmes at gmail.com Tue Nov 27 18:19:04 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 10:19:04 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson wrote: > Thanks Sven ? > > > > We found a node with kswapd running 100% (and swap was off)? > > > > Killing that node made access to the FS spring into life. > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 16:14 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
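A small sketch of how the details requested above could be collected from the node that showed the symptom, assuming a default install under /usr/lpp/mmfs:

/usr/lpp/mmfs/bin/mmdiag --version   # exact Scale build running on this node
/usr/lpp/mmfs/bin/mmdiag --stats     # the cache statistics Sven asks to see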
URL: From skylar2 at uw.edu Tue Nov 27 18:06:57 2018 From: skylar2 at uw.edu (Skylar Thompson) Date: Tue, 27 Nov 2018 18:06:57 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine From Dwayne.Hart at med.mun.ca Tue Nov 27 19:25:08 2018 From: Dwayne.Hart at med.mun.ca (Dwayne.Hart at med.mun.ca) Date: Tue, 27 Nov 2018 19:25:08 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> , <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: Hi Simon, Was there a reason behind swap being disabled? Best, Dwayne ? Dwayne Hart | Systems Administrator IV CHIA, Faculty of Medicine Memorial University of Newfoundland 300 Prince Philip Drive St. John?s, Newfoundland | A1B 3V6 Craig L Dobbin Building | 4M409 T 709 864 6631 On Nov 27, 2018, at 2:24 PM, Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
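Both questions, whether swap is configured at all and whether kswapd is busy reclaiming rather than swapping, can be answered quickly from the shell; a sketch (sar needs the sysstat package):

swapon -s                   # no devices listed => swap is not configured
grep -E 'MemFree|MemAvailable|Slab|SReclaimable' /proc/meminfo
sar -B 1 5                  # pgscank/s and pgsteal/s show background page reclaim activity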
URL: From TOMP at il.ibm.com Tue Nov 27 19:35:36 2018 From: TOMP at il.ibm.com (Tomer Perry) Date: Tue, 27 Nov 2018 21:35:36 +0200 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com><4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Message-ID: "paging to disk" sometimes means mmap as well - there were several issues around that recently as well. Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Skylar Thompson To: gpfsug-discuss at spectrumscale.org Date: 27/11/2018 20:28 Subject: Re: [gpfsug-discuss] Hanging file-systems Sent by: gpfsug-discuss-bounces at spectrumscale.org Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 20:02:14 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 20:02:14 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <20181127180657.ihwuv6jismgghrvc@utumno.gs.washington.edu> Message-ID: Yes, but we?d upgraded all out HPC client nodes to 5.0.2-1 last week as well when this first happened ? Unless it?s necessary to upgrade the NSD servers as well for this? 
Simon From: on behalf of "TOMP at il.ibm.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 19:48 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems "paging to disk" sometimes means mmap as well - there were several issues around that recently as well. Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Skylar Thompson To: gpfsug-discuss at spectrumscale.org Date: 27/11/2018 20:28 Subject: Re: [gpfsug-discuss] Hanging file-systems Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Despite its name, kswapd isn't directly involved in paging to disk; it's the kernel process that's involved in finding committed memory that can be reclaimed for use (either immediately, or possibly by flushing dirty pages to disk). If kswapd is using a lot of CPU, it's a sign that the kernel is spending a lot of time to find free pages to allocate to processes. On Tue, Nov 27, 2018 at 05:53:58PM +0000, Simon Thompson wrote: > Thanks Sven ??? > > We found a node with kswapd running 100% (and swap was off)??? > > Killing that node made access to the FS spring into life. > > Simon > > From: on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems > > 1. are you under memory pressure or even worse started swapping . > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -- -- Skylar Thompson (skylar2 at u.washington.edu) -- Genome Sciences Department, System Administrator -- Foege Building S046, (206)-685-7354 -- University of Washington School of Medicine _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=mLPyKeOa1gNDrORvEXBgMw&m=sgaWNOJnHka2HBtMNNXBur4p2KbQ8q786tWza40tcLQ&s=CWkCUHu4-uwZQ6r1x_VFAGqQ5FFSBGXMSVa5t2pk424&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 20:09:25 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 20:09:25 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> Message-ID: <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 I think). So is this memory pressure on the NSD nodes then? I thought it was documented somewhere that GFPS won?t use more than 50% of the host memory. And actually if you look at the values for maxStatCache and maxFilesToCache, the memory footprint is quite small. Sure on these NSD servers we had a pretty big pagepool (which we?ve dropped by some), but there still should have been quite a lot of memory space on the nodes ? 
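One way to sanity-check that is to put the configured cache settings next to what mmfsd is actually holding on an NSD server; a sketch, assuming a default install path:

for p in pagepool maxFilesToCache maxStatCache; do
    /usr/lpp/mmfs/bin/mmlsconfig $p
done
/usr/lpp/mmfs/bin/mmdiag --memory   # pagepool plus shared/token heap actually in use
free -g                             # total RAM on the node, for comparison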
If only someone as going to do a talk in December at the CIUK SSUG on memory usage ? Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 18:19 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 20:43:04 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 12:43:04 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: was the node you rebooted a client or a server that was running kswapd at 100% ? sven On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson wrote: > The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 > I think). 
> > > > So is this memory pressure on the NSD nodes then? I thought it was > documented somewhere that GFPS won?t use more than 50% of the host memory. > > > > And actually if you look at the values for maxStatCache and > maxFilesToCache, the memory footprint is quite small. > > > > Sure on these NSD servers we had a pretty big pagepool (which we?ve > dropped by some), but there still should have been quite a lot of memory > space on the nodes ? > > > > If only someone as going to do a talk in December at the CIUK SSUG on > memory usage ? > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 18:19 > > > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > Hi, > > > > now i need to swap back in a lot of information about GPFS i tried to swap > out :-) > > > > i bet kswapd is not doing anything you think the name suggest here, which > is handling swap space. i claim the kswapd thread is trying to throw > dentries out of the cache and what it tries to actually get rid of are > entries of directories very high up in the tree which GPFS still has a > refcount on so it can't free it. when it does this there is a single thread > (unfortunate was never implemented with multiple threads) walking down the > tree to find some entries to steal, it it can't find any it goes to the > next , next , etc and on a bus system it can take forever to free anything > up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which > i pushed for the weeks before i left IBM. you never see this in a trace > with default traces which is why nobody would have ever suspected this, you > need to set special trace levels to even see this. > > i don't know the exact version the changes went into, but somewhere in the > 5.0.1.X timeframe. the change was separating the cache list to prefer > stealing files before directories, also keep a minimum percentages of > directories in the cache (10 % by default) before it would ever try to get > rid of a directory. it also tries to keep a list of free entries all the > time (means pro active cleaning them) and also allows to go over the hard > limit compared to just block as in previous versions. so i assume you run a > version prior to 5.0.1.x and what you see is kspwapd desperately get rid of > entries, but can't find one its already at the limit so it blocks and > doesn't allow a new entry to be created or promoted from the statcache . > > > > again all this is without source code access and speculation on my part > based on experience :-) > > > > what version are you running and also share mmdiag --stats of that node > > > > sven > > > > > > > > > > > > > > On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: > > Thanks Sven ? > > > > We found a node with kswapd running 100% (and swap was off)? > > > > Killing that node made access to the FS spring into life. > > > > Simon > > > > *From: * on behalf of " > oehmes at gmail.com" > *Reply-To: *"gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > *Date: *Tuesday, 27 November 2018 at 16:14 > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *Re: [gpfsug-discuss] Hanging file-systems > > > > 1. are you under memory pressure or even worse started swapping . 
> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Tue Nov 27 20:44:26 2018 From: oehmes at gmail.com (Sven Oehme) Date: Tue, 27 Nov 2018 12:44:26 -0800 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: and i already talk about NUMA stuff at the CIUK usergroup meeting, i won't volunteer for a 2nd advanced topic :-D On Tue, Nov 27, 2018 at 12:43 PM Sven Oehme wrote: > was the node you rebooted a client or a server that was running kswapd at > 100% ? > > sven > > > On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson > wrote: > >> The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 >> I think). >> >> >> >> So is this memory pressure on the NSD nodes then? I thought it was >> documented somewhere that GFPS won?t use more than 50% of the host memory. >> >> >> >> And actually if you look at the values for maxStatCache and >> maxFilesToCache, the memory footprint is quite small. >> >> >> >> Sure on these NSD servers we had a pretty big pagepool (which we?ve >> dropped by some), but there still should have been quite a lot of memory >> space on the nodes ? >> >> >> >> If only someone as going to do a talk in December at the CIUK SSUG on >> memory usage ? >> >> >> >> Simon >> >> >> >> *From: * on behalf of " >> oehmes at gmail.com" >> *Reply-To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Date: *Tuesday, 27 November 2018 at 18:19 >> >> >> *To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Subject: *Re: [gpfsug-discuss] Hanging file-systems >> >> >> >> Hi, >> >> >> >> now i need to swap back in a lot of information about GPFS i tried to >> swap out :-) >> >> >> >> i bet kswapd is not doing anything you think the name suggest here, which >> is handling swap space. i claim the kswapd thread is trying to throw >> dentries out of the cache and what it tries to actually get rid of are >> entries of directories very high up in the tree which GPFS still has a >> refcount on so it can't free it. when it does this there is a single thread >> (unfortunate was never implemented with multiple threads) walking down the >> tree to find some entries to steal, it it can't find any it goes to the >> next , next , etc and on a bus system it can take forever to free anything >> up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which >> i pushed for the weeks before i left IBM. you never see this in a trace >> with default traces which is why nobody would have ever suspected this, you >> need to set special trace levels to even see this. >> >> i don't know the exact version the changes went into, but somewhere in >> the 5.0.1.X timeframe. the change was separating the cache list to prefer >> stealing files before directories, also keep a minimum percentages of >> directories in the cache (10 % by default) before it would ever try to get >> rid of a directory. 
it also tries to keep a list of free entries all the >> time (means pro active cleaning them) and also allows to go over the hard >> limit compared to just block as in previous versions. so i assume you run a >> version prior to 5.0.1.x and what you see is kspwapd desperately get rid of >> entries, but can't find one its already at the limit so it blocks and >> doesn't allow a new entry to be created or promoted from the statcache . >> >> >> >> again all this is without source code access and speculation on my part >> based on experience :-) >> >> >> >> what version are you running and also share mmdiag --stats of that node >> >> >> >> sven >> >> >> >> >> >> >> >> >> >> >> >> >> >> On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson >> wrote: >> >> Thanks Sven ? >> >> >> >> We found a node with kswapd running 100% (and swap was off)? >> >> >> >> Killing that node made access to the FS spring into life. >> >> >> >> Simon >> >> >> >> *From: * on behalf of " >> oehmes at gmail.com" >> *Reply-To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Date: *Tuesday, 27 November 2018 at 16:14 >> *To: *"gpfsug-discuss at spectrumscale.org" < >> gpfsug-discuss at spectrumscale.org> >> *Subject: *Re: [gpfsug-discuss] Hanging file-systems >> >> >> >> 1. are you under memory pressure or even worse started swapping . >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From constance.rice at us.ibm.com Tue Nov 27 20:28:14 2018 From: constance.rice at us.ibm.com (Constance M Rice) Date: Tue, 27 Nov 2018 20:28:14 +0000 Subject: [gpfsug-discuss] Introduction Message-ID: Hello, I am a new member here. I work for IBM in the Washington System Center supporting Spectrum Scale and ESS across North America. I live in Leesburg, Virginia, USA northwest of Washington, DC. Connie Rice Storage Specialist Washington Systems Center Mobile: 202-821-6747 E-mail: constance.rice at us.ibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 56935 bytes Desc: not available URL: From S.J.Thompson at bham.ac.uk Tue Nov 27 21:01:07 2018 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Tue, 27 Nov 2018 21:01:07 +0000 Subject: [gpfsug-discuss] Hanging file-systems In-Reply-To: References: <2CDEFE00-54F1-4037-A4EC-561C65D70566@nuance.com> <4A4E68C3-D654-48CF-A2F7-C60F50CF4644@bham.ac.uk> <80A9C5FE-F5DD-4740-A83E-5730ADE8CA81@bham.ac.uk> Message-ID: <66C52F6F-5193-4DD7-B87E-C88E9ADBB53D@bham.ac.uk> It was an NSD server ? we?d already shutdown all the clients in the remote clusters! And Tomer has already agreed to do a talk on memory ? (but I?m still looking for a user talk if anyone is interested!) 
Simon From: on behalf of "oehmes at gmail.com" Reply-To: "gpfsug-discuss at spectrumscale.org" Date: Tuesday, 27 November 2018 at 20:44 To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Hanging file-systems and i already talk about NUMA stuff at the CIUK usergroup meeting, i won't volunteer for a 2nd advanced topic :-D On Tue, Nov 27, 2018 at 12:43 PM Sven Oehme > wrote: was the node you rebooted a client or a server that was running kswapd at 100% ? sven On Tue, Nov 27, 2018 at 12:09 PM Simon Thompson > wrote: The nsd nodes were running 5.0.1-2 (though we just now rolling to 5.0.2-1 I think). So is this memory pressure on the NSD nodes then? I thought it was documented somewhere that GFPS won?t use more than 50% of the host memory. And actually if you look at the values for maxStatCache and maxFilesToCache, the memory footprint is quite small. Sure on these NSD servers we had a pretty big pagepool (which we?ve dropped by some), but there still should have been quite a lot of memory space on the nodes ? If only someone as going to do a talk in December at the CIUK SSUG on memory usage ? Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 18:19 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems Hi, now i need to swap back in a lot of information about GPFS i tried to swap out :-) i bet kswapd is not doing anything you think the name suggest here, which is handling swap space. i claim the kswapd thread is trying to throw dentries out of the cache and what it tries to actually get rid of are entries of directories very high up in the tree which GPFS still has a refcount on so it can't free it. when it does this there is a single thread (unfortunate was never implemented with multiple threads) walking down the tree to find some entries to steal, it it can't find any it goes to the next , next , etc and on a bus system it can take forever to free anything up. there have been multiple fixes in this area in 5.0.1.x and 5.0.2 which i pushed for the weeks before i left IBM. you never see this in a trace with default traces which is why nobody would have ever suspected this, you need to set special trace levels to even see this. i don't know the exact version the changes went into, but somewhere in the 5.0.1.X timeframe. the change was separating the cache list to prefer stealing files before directories, also keep a minimum percentages of directories in the cache (10 % by default) before it would ever try to get rid of a directory. it also tries to keep a list of free entries all the time (means pro active cleaning them) and also allows to go over the hard limit compared to just block as in previous versions. so i assume you run a version prior to 5.0.1.x and what you see is kspwapd desperately get rid of entries, but can't find one its already at the limit so it blocks and doesn't allow a new entry to be created or promoted from the statcache . again all this is without source code access and speculation on my part based on experience :-) what version are you running and also share mmdiag --stats of that node sven On Tue, Nov 27, 2018 at 9:54 AM Simon Thompson > wrote: Thanks Sven ? We found a node with kswapd running 100% (and swap was off)? Killing that node made access to the FS spring into life. 
Simon From: > on behalf of "oehmes at gmail.com" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Tuesday, 27 November 2018 at 16:14 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Hanging file-systems 1. are you under memory pressure or even worse started swapping . _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Thu Nov 29 07:29:36 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Thu, 29 Nov 2018 07:29:36 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: Message-ID: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. 
November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. 
======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From TOMP at il.ibm.com Thu Nov 29 07:45:00 2018 From: TOMP at il.ibm.com (Tomer Perry) Date: Thu, 29 Nov 2018 09:45:00 +0200 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Message-ID: Hi, I remember there was some defect around tsctl and mixed domains - bot sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info ( I believe he had some sample script). Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" To: 'gpfsug main discussion list' Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. Von: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" To: "gpfsug-discuss at spectrumscale.org" Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. 
If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From Renar.Grunenberg at huk-coburg.de Thu Nov 29 08:03:34 2018 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Thu, 29 Nov 2018 08:03:34 +0000 Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first In-Reply-To: References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> Message-ID: <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de> Hallo Tomer, thanks for this Info, but can you explain in witch release all these points fixed now? Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Tomer Perry Gesendet: Donnerstag, 29. November 2018 08:45 An: gpfsug main discussion list ; Olaf Weiser Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Hi, I remember there was some defect around tsctl and mixed domains - bot sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info ( I believe he had some sample script). 
Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: "Grunenberg, Renar" > To: 'gpfsug main discussion list' > Date: 29/11/2018 09:29 Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, in this relation to the Alert, i had some question about experiences to establish remote cluster with different FQDN?s. What we see here that the owning (5.0.1.1) and the local Cluster (5.0.2.1) has different Domain-Names and both are connected to a firewall. Icmp,1191 and ephemeral Ports ports are open. If we dump the tscomm component of both daemons, we see connections to nodes that are named [hostname+ FGDN localCluster+ FGDN remote Cluster]. We analyzed nscd, DNS and make some tcp-dumps and so on and come to the conclusion that tsctl generate this wrong nodename and then if a Cluster Manager takeover are happening, because of a shutdown of these daemon (at Owning Cluster side), the join protocol rejected these connection. Are there any comparable experiences in the field. And if yes what are the solution of that? Thanks Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John T Olson Gesendet: Montag, 26. November 2018 15:31 An: gpfsug main discussion list > Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first This sounds like a different issue because the alert was only for clusters with file audit logging enabled due to an incompatibility with the policy rules that are used in file audit logging. I would suggest opening a problem ticket. Thanks, John John T. Olson, Ph.D., MI.C., K.EY. Master Inventor, Software Defined Storage 957/9032-1 Tucson, AZ, 85744 (520) 799-5185, tie 321-5185 (FAX: 520-799-4237) Email: jtolson at us.ibm.com "Do or do not. 
There is no try." - Yoda Olson's Razor: Any situation that we, as humans, can encounter in life can be modeled by either an episode of The Simpsons or Seinfeld. [Inactive hide details for "Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in wi]"Grunenberg, Renar" ---11/23/2018 01:22:05 AM---Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday From: "Grunenberg, Renar" > To: "gpfsug-discuss at spectrumscale.org" > Date: 11/23/2018 01:22 AM Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hallo All, are there any news about these Alert in witch Version will it be fixed. We had yesterday this problem but with a reziproke szenario. The owning Cluster are on 5.0.1.1 an the mounting Cluster has 5.0.2.1. On the Owning Cluster(3Node 3 Site Cluster) we do a shutdown of the deamon. But the Remote mount was paniced because of: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found. We had no FAL active, what the Alert says, and the owning Cluster are not on the affected Version. Any Hint, please. Regards Renar Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ======================================================================= HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas. ======================================================================= Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ======================================================================= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
From olaf.weiser at de.ibm.com Thu Nov 29 08:39:01 2018
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 29 Nov 2018 09:39:01 +0100
Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first
In-Reply-To: <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de>
References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de>
Message-ID:

An HTML attachment was scrubbed...
URL:

From MDIETZ at de.ibm.com Thu Nov 29 10:45:25 2018
From: MDIETZ at de.ibm.com (Mathias Dietz)
Date: Thu, 29 Nov 2018 11:45:25 +0100
Subject: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first
In-Reply-To:
References: <81d5be3544d349649b1458b0ad05bce4@SMXRF105.msg.hukrf.de> <44a1c2f8541b410ea02f73548d58ccfa@SMXRF105.msg.hukrf.de>
Message-ID:

Hi Renar,

the tsctl problem is described in APAR IV93896:
https://www-01.ibm.com/support/docview.wss?uid=isg1IV93896

You can easily find out whether your system has the problem: run "tsctl shownodes up" and check whether the hostnames are valid. If the hostnames are wrong or mixed up, then you are affected. This APAR has been fixed with 5.0.2.

Mit freundlichen Grüßen / Kind regards

Mathias Dietz

Spectrum Scale Development - Release Lead Architect (4.2.x)
Spectrum Scale RAS Architect
---------------------------------------------------------------------------
IBM Deutschland
Am Weiher 24
65451 Kelsterbach
Phone: +49 70342744105
Mobile: +49-15152801035
E-Mail: mdietz at de.ibm.com
-----------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294
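[Editorial aside] A minimal sketch of the check Mathias describes, assuming the usual comma-separated output of "tsctl shownodes up"; EXPECTED_SUFFIX is a placeholder for the DNS suffix your daemon node names should carry.

    # Placeholder: the suffix every node name in this cluster should end with.
    EXPECTED_SUFFIX="gpfs.example.com"

    # Split the (typically comma-separated) list into one name per line.
    /usr/lpp/mmfs/bin/tsctl shownodes up | tr ',' '\n' | sort > /tmp/tsctl_names.txt

    # Names that do not end with the expected suffix point at the
    # mixed-domain problem from APAR IV93896.
    grep -v "${EXPECTED_SUFFIX}\$" /tmp/tsctl_names.txt \
        && echo "suspicious names found" \
        || echo "hostnames look consistent"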
From: "Olaf Weiser"
To: "Grunenberg, Renar"
Cc: gpfsug main discussion list
Date: 29/11/2018 09:39
Subject: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Hi Tomer, I sent my workaround wrapper to Renar. I have seen too little data to be sure that it is the same (tsctl shownodes ...) issue, but he'll try it and let us know.

From: "Grunenberg, Renar"
To: gpfsug main discussion list, "Olaf Weiser"
Date: 11/29/2018 09:04 AM
Subject: AW: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first

Hallo Tomer, thanks for this info, but can you explain in which release all these points are fixed now?

Renar

Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Tomer Perry
Gesendet: Donnerstag, 29. November 2018 08:45
An: gpfsug main discussion list; Olaf Weiser
Betreff: Re: [gpfsug-discuss] Status for Alert: remotely mounted filesystem panic on accessing cluster after upgrading the owning cluster first

Hi,

I remember there was some defect around tsctl and mixed domains - not sure if it was fixed and in what version. A workaround in the past was to "wrap" tsctl with a script that would strip those. Olaf might be able to provide more info (I believe he had some sample script).

Regards, Tomer Perry
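[Editorial aside] The wrapper idea Tomer mentions looks roughly like the sketch below. This is not the actual script Olaf sent (that script is not included in the archive); it only illustrates stripping a foreign domain suffix from the "shownodes up" output, with placeholder names and paths.

    #!/bin/bash
    # Illustrative sketch only - not the script referenced above.
    # Idea: move the real binary aside (e.g. to tsctl.orig), install this
    # wrapper as /usr/lpp/mmfs/bin/tsctl, and strip the suffix that gets
    # mixed into the node names. "remote.example.com" is a placeholder.
    REAL=/usr/lpp/mmfs/bin/tsctl.orig
    STRIP_DOMAIN="remote.example.com"

    if [ "$1" = "shownodes" ] && [ "$2" = "up" ]; then
        "$REAL" "$@" | sed "s/\.${STRIP_DOMAIN}//g"
    else
        exec "$REAL" "$@"
    fi

Since Mathias notes the underlying APAR is fixed in 5.0.2, upgrading is the cleaner route; a wrapper of this kind was only the interim workaround for older levels.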
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From spectrumscale at kiranghag.com Thu Nov 29 15:42:48 2018
From: spectrumscale at kiranghag.com (KG)
Date: Thu, 29 Nov 2018 21:12:48 +0530
Subject: [gpfsug-discuss] high cpu usage by mmfsadm
Message-ID:

One of our Scale nodes shows 30-50% CPU utilisation by mmfsadm while the filesystem is being accessed. Is this normal? (The node is configured as a server node, but not as a manager node for any filesystem or NSD.)

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From bbanister at jumptrading.com Thu Nov 29 17:57:00 2018
From: bbanister at jumptrading.com (Bryan Banister)
Date: Thu, 29 Nov 2018 17:57:00 +0000
Subject: [gpfsug-discuss] high cpu usage by mmfsadm
In-Reply-To:
References:
Message-ID: <671bbd4db92d496abbbceead1b9a7d5c@jumptrading.com>

I wouldn't call that normal... probably take a gpfs.snap and open a PMR to get the quickest answer from IBM support,
-B

From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of KG
Sent: Thursday, November 29, 2018 9:43 AM
To: gpfsug main discussion list
Subject: [gpfsug-discuss] high cpu usage by mmfsadm

[EXTERNAL EMAIL]
One of our Scale nodes shows 30-50% CPU utilisation by mmfsadm while the filesystem is being accessed. Is this normal? (The node is configured as a server node, but not as a manager node for any filesystem or NSD.)
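[Editorial aside] For anyone who wants to gather a bit more context before opening the PMR Bryan suggests, a rough way to capture which mmfsadm command lines are burning CPU, and which parent process keeps spawning them, is shown below. Standard Linux tools only; the output file path and the sampling interval are arbitrary choices.

    # Sample the mmfsadm processes a few times; the grep pattern excludes
    # the grep process itself.
    for i in 1 2 3 4 5 6; do
        date
        ps -eo pid,ppid,pcpu,etime,args | grep '[m]mfsadm'
        sleep 10
    done > /tmp/mmfsadm_cpu.txt

    # The PPID column shows the caller (for example, a monitoring or
    # callback script); attach /tmp/mmfsadm_cpu.txt to the PMR along
    # with the gpfs.snap.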
________________________________

Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential, or privileged information and/or personal data. If you are not the intended recipient, you are hereby notified that any review, dissemination, or copying of this email is strictly prohibited, and requested to notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request, or solicitation of any kind to buy, sell, subscribe, redeem, or perform any type of transaction of a financial product. Personal data, as defined by applicable data privacy laws, contained in this email may be processed by the Company, and any of its affiliated or related companies, for potential ongoing compliance and/or business-related purposes. You may have rights regarding your personal data; for information on exercising these rights or the Company's treatment of personal data, please email datarequests at jumptrading.com.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: