From douglasof at us.ibm.com Thu Jul 1 03:28:26 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Thu, 1 Jul 2021 02:28:26 +0000 Subject: [gpfsug-discuss] SuperPOD and GDS Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Jul 1 11:07:46 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 1 Jul 2021 11:07:46 +0100 Subject: [gpfsug-discuss] PVU question In-Reply-To: References:

Message-ID: <8be6c60f-8c01-d498-a9d3-1d9d25b82682@strath.ac.uk> On 29/06/2021 15:41, IBM Spectrum Scale wrote: > My suggestion for this question is that it should be directed to your > IBM sales team and not the Spectrum Scale support team. ?My reading of > the information you provided is that your processor counts as 2 cores. > ?As for the PVU value my guess is that at a minimum it is 50 but again > that should be a question for your IBM sales team. But that would require either being able to call a sales representative who understands what you are talking about or for a sales representative to call you back. Both options seem to be next to impossible hence my question. > > One other option is to switch from processor based licensing for Scale > to storage (TB) based licensing. ?I think one of the reasons for storage > based licensing was to avoid issues like the one you are raising. > Techincally it's for a Spectrum Protect license for the node that backs up the Spectrum Scale system. The DSS-G is on disk based licensing so that's not a problem. However the PVU per machine is the same between the two and given the difficulties actually getting someone in sales to talk on the subject I thought I might ask here. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From douglasof at us.ibm.com Thu Jul 1 18:45:10 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Thu, 1 Jul 2021 13:45:10 -0400 Subject: [gpfsug-discuss] SuperPOD and GDS In-Reply-To: References: Message-ID: I saw that the short URL didn't go through https://community.ibm.com/community/user/storage/blogs/douglas-oflaherty1/2021/06/22/ibm-nvidia-team-on-supercomputing-scalability hopefully, this one works ok... or community.ibm.com/community/user/storage/blogs/douglas-oflaherty1/2021/06/22/ibm-nvidia-team-on-supercomputing-scalability thanks, doug Douglas O'Flaherty douglasof at us.ibm.com From: Douglas O'flaherty/Waltham/IBM To: gpfsug-discuss at spectrumscale.org Date: 06/30/2021 10:28 PM Subject: SuperPOD and GDS Greetings: Highlighting the announcements about upcoming SuperPOD offerings, GPUDirect Storage going GA from NVIDIA, and our latest with Tech Preview GDS read support in Spectrum Scale 5.1.1 http://ibm.biz/IBMStorageandNVIDIA I am looking for those who have test cases of CUDA code with GDS. It is supported on any A100 GPU. Reach out off list. doug Douglas O'Flaherty Global Ecosystems Leader douglasofusibm.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Sat Jul 3 14:20:54 2021 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Sat, 3 Jul 2021 09:20:54 -0400 Subject: [gpfsug-discuss] GUI refresh task error In-Reply-To: References: <72d50b96-c6a3-f075-8f47-84bf2346f0ae@docum.org> <975f874a066c4ba6a45c62f9b280efa2@postbank.de> Message-ID: Ed, I have not received any feedback about your inquiry. Could you please open a help case with Scale support to have the matter fully investigated. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Wahl, Edward" To: gpfsug main discussion list Date: 06/28/2021 05:04 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] GUI refresh task error Sent by: gpfsug-discuss-bounces at spectrumscale.org Curious if this was ever fixed or someone has an APAR # ? I'm still running into it on 5.0.5.6 Ed Wahl OSC -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Stef Coene Sent: Thursday, July 16, 2020 9:47 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] GUI refresh task error Ok, thanx for the answer. I will wait for the fix. Stef On 2020-07-16 15:25, Roland Schuemann wrote: > Hi Stef, > > we already recognized this error too and opened a PMR/Case at IBM. > You can set this task to inactive, but this is not persistent. After gui restart it comes again. > > This was the answer from IBM Support. >>>>>>>>>>>>>>>>>> > This will be fixed in the next release of 5.0.5.2, right now there is no work-around but will not cause issue besides the cosmetic task failed message. > Is this OK for you? >>>>>>>>>>>>>>>>>> > > So we ignore (Gui is still degraded) it and wait for the fix. > > Kind regards > Roland Sch?mann > > > Freundliche Gr??e / Kind regards > Roland Sch?mann > > ____________________________________________ > > Roland Sch?mann > Infrastructure Engineering (BTE) > CIO PB Germany > > Deutsche Bank I Technology, Data and Innovation Postbank Systems AG > > > -----Urspr?ngliche Nachricht----- > Von: gpfsug-discuss-bounces at spectrumscale.org > Im Auftrag von Stef Coene > Gesendet: Donnerstag, 16. Juli 2020 15:14 > An: gpfsug main discussion list > Betreff: [gpfsug-discuss] GUI refresh task error > > Hi, > > On brand new 5.0.5 cluster we have the following errors on all nodes: > "The following GUI refresh task(s) failed: WATCHFOLDER" > > It also says > "Failure reason: Command mmwatch all functional --list-clustered-status > failed" > > Running mmwatch manually gives: > mmwatch: The Clustered Watch Folder function is only available in the IBM Spectrum Scale Advanced Edition or the Data Management Edition. > mmwatch: Command failed. Examine previous error messages to determine cause. > > How can I get rid of this error? > > I tried to disable the task with: > chtask WATCHFOLDER --inactive > EFSSG1811C The task with the name WATCHFOLDER is already not scheduled. > > > Stef > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > INVALID URI REMOVED > discuss__;!!KGKeukY!iZZSS4baXvM4hp_EgmAlElMFeU23jbACq1CMPtkf-Q5ShrsQv_ > gi9hZJP8mT$ Die Europ?ische Kommission hat unter > http://ec.europa.eu/consumers/odr/ eine Europ?ische Online-Streitbeilegungsplattform (OS-Plattform) errichtet. Verbraucher k?nnen die OS-Plattform f?r die au?ergerichtliche Beilegung von Streitigkeiten aus Online-Vertr?gen mit in der EU niedergelassenen Unternehmen nutzen. > > Informationen (einschlie?lich Pflichtangaben) zu einzelnen, innerhalb der EU t?tigen Gesellschaften und Zweigniederlassungen des Konzerns Deutsche Bank finden Sie unter https://www.deutsche-bank.de/Pflichtangaben . Diese E-Mail enth?lt vertrauliche und/ oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese E-Mail irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese E-Mail. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser E-Mail ist nicht gestattet. > > The European Commission has established a European online dispute resolution platform (OS platform) under http://ec.europa.eu/consumers/odr/ . Consumers may use the OS platform to resolve disputes arising from online contracts with providers established in the EU. > > Please refer to https://www.db.com/disclosures for information (including mandatory corporate particulars) on selected Deutsche Bank branches and group companies registered or incorporated in the European Union. This e-mail may contain confidential and/or privileged information. If you are not the intended recipient (or have received this e-mail in error) please notify the sender immediately and delete this e-mail. Any unauthorized copying, disclosure or distribution of the material in this e-mail is strictly forbidden. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > INVALID URI REMOVED > discuss__;!!KGKeukY!iZZSS4baXvM4hp_EgmAlElMFeU23jbACq1CMPtkf-Q5ShrsQv_ > gi9hZJP8mT$ > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From rp2927 at gsb.columbia.edu Wed Jul 7 19:57:21 2021 From: rp2927 at gsb.columbia.edu (Popescu, Razvan) Date: Wed, 7 Jul 2021 18:57:21 +0000 Subject: [gpfsug-discuss] Thousands of empty ccrChangedCallback folders Message-ID: <338E0C2C-FE8D-4E97-A4F6-1545422071A8@gsb.columbia.edu> Hi, I have a month and a half old case with IBM Support that seems to go nowhere (!!) and I thought that maybe some of you might have seen or heard of something similar?. I thank you in advance for any clue, tip, solution, or recommendation you might have for me, as apparently IBM is chasing its tail on this matter (although escalated to Sev 2)?. On one of our Scale NSD servers, which is also our GUI master, I have ~50,000 (fifty thousand!!) empty folders of each: /var/mmfs/ssl/keyServ/tmp/ccrChangedCallback_421.sh.NNNNN /var/mmfs/tmp/cmdTmpDir.ccrChangedCallback_421.sh.NNNNN (NNNNN is a 5 digit number/counter) The count varies, I?ve seen it over 100 thousand (!) at times, but never under 30k or so. I asked why these empty (temp) folders are not cleaned up, given their excessively high count, but IBM is still struggling to understand their origin. It appears that I started to have these folders after we upgrade our system to 5.1.0.3 (The folders trip our backup monitor so that?s how we even discovered them) Any idea? Many thanks, Razvan -- Razvan N. Popescu Research Computing Director Office: (212) 851-9298 razvan.popescu at columbia.edu Columbia Business School At the Very Center of Business -------------- next part -------------- An HTML attachment was scrubbed... URL: From cabrillo at ifca.unican.es Fri Jul 9 12:19:07 2021 From: cabrillo at ifca.unican.es (Iban Cabrillo) Date: Fri, 9 Jul 2021 13:19:07 +0200 (CEST) Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR Message-ID: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es> Dear, Since a couple of hours we are seen lots off IB error at GPFS logs, on every IB node (gpfs version is 5.0.4-3 ): 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.73 (node157) on mlx5_0 port 1 fabnum 0 index 251 cookie 648 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.18 (node102) on mlx5_0 port 1 fabnum 0 index 227 cookie 687 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.17 (node101) on mlx5_0 port 1 fabnum 0 index 298 cookie 693 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.151.6 (node6) on mlx5_0 port 1 fabnum 0 index 18 cookie 696 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.152.46 (node130) on mlx5_0 port 1 fabnum 0 index 254 cookie 680 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.151.81 (node81) on mlx5_0 port 1 fabnum 0 index 289 cookie 679 RDMA read error IBV_WC_RETRY_EXC_ERR and ofcourse long waiters: === mmdiag: waiters === Waiting 34.8493 sec since 13:11:35, ignored, thread 2935 VerbsReconnectThread: delaying for 25.150686000 more seconds, reason: delaying for next reconnect attempt Waiting 34.6249 sec since 13:11:35, ignored, thread 10198 VerbsReconnectThread: delaying for 25.375072000 more seconds, reason: delaying for next reconnect attempt Waiting 27.0957 sec since 13:11:43, ignored, thread 10052 VerbsReconnectThread: delaying for 32.904264000 more seconds, reason: delaying for next reconnect attempt Waiting 14.8909 sec since 13:11:55, monitored, thread 23135 NSDThread: for RDMA write completion fast on node 10.10.151.65 Waiting 14.8891 sec since 13:11:55, monitored, thread 23109 NSDThread: for RDMA write completion fast on node 10.10.152.32 Waiting 14.8865 sec since 13:11:55, monitored, thread 23302 NSDThread: for RDMA write completion fast on node 10.10.150.1 [common] verbsRdma enable verbsPorts mlx4_0/1/0 [gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08] verbsPorts mlx5_0/1/0 [gpfs01] verbsPorts mlx5_1/1/0 [gpfs03] verbsPorts mlx5_0/1/0 mlx5_1/1/0 [common] verbsRdma enable verbsPorts mlx4_0/1/0 [gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08,wngpu001,wngpu002,wngpu003,wngpu004,wngpu005] verbsPorts mlx5_0/1/0 [gpfs01] verbsPorts mlx5_1/1/0 [gpfs03] verbsPorts mlx5_0/1/0 mlx5_1/1/0 Any advise is welcomed regards, I -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Jul 9 12:36:26 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 9 Jul 2021 11:36:26 +0000 Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR In-Reply-To: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es> References: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es> Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Jul 9 16:00:37 2021 From: S.J.Thompson at bham.ac.uk (Simon Thompson) Date: Fri, 9 Jul 2021 15:00:37 +0000 Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR In-Reply-To: References: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es> Message-ID: If you have multiple switches, this could be a faulty ISL (or to your NSDs). So I would look for SYMBOL errors on the ports, high churning numbers indicates a cable fault. Simon From: on behalf of "olaf.weiser at de.ibm.com" Reply to: "gpfsug-discuss at spectrumscale.org" Date: Friday, 9 July 2021 at 12:36 To: "gpfsug-discuss at spectrumscale.org" Cc: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR smells like a network problem .. IBV_WC_RETRY_EXC_ERR comes from OFED and clearly says that the data didn't get through successfully, further help .. check ibstat iblinkinfo ibdiagnet and the sminfo .. (should be the same on all members) ----- Urspr?ngliche Nachricht ----- Von: "Iban Cabrillo" Gesendet von: gpfsug-discuss-bounces at spectrumscale.org An: "gpfsug-discuss" CC: Betreff: [EXTERNAL] [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR Datum: Fr, 9. Jul 2021 13:29 Dear, Since a couple of hours we are seen lots off IB error at GPFS logs, on every IB node (gpfs version is 5.0.4-3): 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.73 (node157) on mlx5_0 port 1 fabnum 0 index 251 cookie 648 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.18 (node102) on mlx5_0 port 1 fabnum 0 index 227 cookie 687 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.17 (node101) on mlx5_0 port 1 fabnum 0 index 298 cookie 693 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.151.6 (node6) on mlx5_0 port 1 fabnum 0 index 18 cookie 696 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.152.46 (node130) on mlx5_0 port 1 fabnum 0 index 254 cookie 680 RDMA write error IBV_WC_RETRY_EXC_ERR 2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.151.81 (node81) on mlx5_0 port 1 fabnum 0 index 289 cookie 679 RDMA read error IBV_WC_RETRY_EXC_ERR and ofcourse long waiters: === mmdiag: waiters === Waiting 34.8493 sec since 13:11:35, ignored, thread 2935 VerbsReconnectThread: delaying for 25.150686000 more seconds, reason: delaying for next reconnect attempt Waiting 34.6249 sec since 13:11:35, ignored, thread 10198 VerbsReconnectThread: delaying for 25.375072000 more seconds, reason: delaying for next reconnect attempt Waiting 27.0957 sec since 13:11:43, ignored, thread 10052 VerbsReconnectThread: delaying for 32.904264000 more seconds, reason: delaying for next reconnect attempt Waiting 14.8909 sec since 13:11:55, monitored, thread 23135 NSDThread: for RDMA write completion fast on node 10.10.151.65 Waiting 14.8891 sec since 13:11:55, monitored, thread 23109 NSDThread: for RDMA write completion fast on node 10.10.152.32 Waiting 14.8865 sec since 13:11:55, monitored, thread 23302 NSDThread: for RDMA write completion fast on node 10.10.150.1 [common] verbsRdma enable verbsPorts mlx4_0/1/0 [gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08] verbsPorts mlx5_0/1/0 [gpfs01] verbsPorts mlx5_1/1/0 [gpfs03] verbsPorts mlx5_0/1/0 mlx5_1/1/0 [common] verbsRdma enable verbsPorts mlx4_0/1/0 [gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08,wngpu001,wngpu002,wngpu003,wngpu004,wngpu005] verbsPorts mlx5_0/1/0 [gpfs01] verbsPorts mlx5_1/1/0 [gpfs03] verbsPorts mlx5_0/1/0 mlx5_1/1/0 Any advise is welcomed regards, I _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From cabrillo at ifca.unican.es Fri Jul 9 16:56:57 2021 From: cabrillo at ifca.unican.es (Iban Cabrillo) Date: Fri, 9 Jul 2021 17:56:57 +0200 (CEST) Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR In-Reply-To: References: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es>

Message-ID: <2124627715.8924801.1625846217069.JavaMail.zimbra@ifca.unican.es> Thanks both of you, for your fast answer, I just restart the server with the biggest waiters, and seems that every thing is working now Using de ib diag command I see these errors: -E- lid=0x0380 dev=4115 gpfs03/U1/P1 Performance Monitor counter : Value port_xmit_discard : 65535 (overflow) -E- lid=0x0ed0 dev=4115 gpfs01/U2/P1 Performance Monitor counter : Value port_xmit_discard : 65535 (overflow) ...... -E- Link: ib2s5/U1/P6<-->node152/U1/P1 - Unexpected actual link speed 10 (enable_speed1="2.5 or 5 or 10 or FDR10", enable_speed2="2.5 or 5 or 10 or FDR10" ther efore final speed should be FDR10) Regards, I From ewahl at osc.edu Fri Jul 9 19:30:51 2021 From: ewahl at osc.edu (Wahl, Edward) Date: Fri, 9 Jul 2021 18:30:51 +0000 Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR In-Reply-To: <2124627715.8924801.1625846217069.JavaMail.zimbra@ifca.unican.es> References: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es>

<2124627715.8924801.1625846217069.JavaMail.zimbra@ifca.unican.es> Message-ID: >-E- Link: ib2s5/U1/P6<-->node152/U1/P1 - Unexpected actual link speed 10 This looks like a bad cable (or port). Trying re-seating the cable on both ends, or replacing it to get to full Link Speed. Re-run ibdiagnet to confirm or use something like 'ibportstate' to check it. Ed Wahl OSC -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Iban Cabrillo Sent: Friday, July 9, 2021 11:57 AM To: gpfsug-discuss Subject: Re: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR Thanks both of you, for your fast answer, I just restart the server with the biggest waiters, and seems that every thing is working now Using de ib diag command I see these errors: -E- lid=0x0380 dev=4115 gpfs03/U1/P1 Performance Monitor counter : Value port_xmit_discard : 65535 (overflow) -E- lid=0x0ed0 dev=4115 gpfs01/U2/P1 Performance Monitor counter : Value port_xmit_discard : 65535 (overflow) ...... -E- Link: ib2s5/U1/P6<-->node152/U1/P1 - Unexpected actual link speed 10 (enable_speed1="2.5 or 5 or 10 or FDR10", enable_speed2="2.5 or 5 or 10 or FDR10" ther efore final speed should be FDR10) Regards, I From YARD at il.ibm.com Sun Jul 11 10:47:56 2021 From: YARD at il.ibm.com (Yaron Daniel) Date: Sun, 11 Jul 2021 12:47:56 +0300 Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR In-Reply-To: References: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es>

<2124627715.8924801.1625846217069.JavaMail.zimbra@ifca.unican.es> Message-ID: Hi Did u upgrade OFED version in some of the servers to v5.x ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Lab Services Consultant ? Storage and Cloud Petach Tiqva, 49527 IBM Global Markets, Systems HW Sales Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com Webex: https://ibm.webex.com/meet/yard IBM Israel From: "Wahl, Edward" To: "gpfsug main discussion list" Date: 07/09/2021 10:21 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR Sent by: gpfsug-discuss-bounces at spectrumscale.org >-E- Link: ib2s5/U1/P6<-->node152/U1/P1 - Unexpected actual link speed 10 This looks like a bad cable (or port). Trying re-seating the cable on both ends, or replacing it to get to full Link Speed. Re-run ibdiagnet to confirm or use something like 'ibportstate' to check it. Ed Wahl OSC -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Iban Cabrillo Sent: Friday, July 9, 2021 11:57 AM To: gpfsug-discuss Subject: Re: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR Thanks both of you, for your fast answer, I just restart the server with the biggest waiters, and seems that every thing is working now Using de ib diag command I see these errors: -E- lid=0x0380 dev=4115 gpfs03/U1/P1 Performance Monitor counter : Value port_xmit_discard : 65535 (overflow) -E- lid=0x0ed0 dev=4115 gpfs01/U2/P1 Performance Monitor counter : Value port_xmit_discard : 65535 (overflow) ...... -E- Link: ib2s5/U1/P6<-->node152/U1/P1 - Unexpected actual link speed 10 (enable_speed1="2.5 or 5 or 10 or FDR10", enable_speed2="2.5 or 5 or 10 or FDR10" ther efore final speed should be FDR10) Regards, I _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1114 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 8361 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 5211 bytes Desc: not available URL: From cabrillo at ifca.unican.es Mon Jul 12 15:24:52 2021 From: cabrillo at ifca.unican.es (Iban Cabrillo) Date: Mon, 12 Jul 2021 16:24:52 +0200 (CEST) Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR In-Reply-To: References: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es>

<2124627715.8924801.1625846217069.JavaMail.zimbra@ifca.unican.es>

Message-ID: <1015759757.9165103.1626099892933.JavaMail.zimbra@ifca.unican.es> Hi, old servers: [root at gpfs01 ~]# rpm -qa| grep ofed ofed-scripts-4.3-OFED.4.3.1.0.1.x86_64 mlnxofed-docs-4.3-1.0.1.0.noarch and newest servers: [root at gpfs08 ~]# rpm -qa| grep ofed ofed-scripts-5.0-OFED.5.0.2.1.8.x86_64 mlnxofed-docs-5.0-2.1.8.0.noarch regards, I From YARD at il.ibm.com Mon Jul 12 15:44:54 2021 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 12 Jul 2021 17:44:54 +0300 Subject: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR In-Reply-To: <1015759757.9165103.1626099892933.JavaMail.zimbra@ifca.unican.es> References: <241624317.8906867.1625829547638.JavaMail.zimbra@ifca.unican.es>

<2124627715.8924801.1625846217069.JavaMail.zimbra@ifca.unican.es>

<1015759757.9165103.1626099892933.JavaMail.zimbra@ifca.unican.es> Message-ID: Hi I had this error is such mix env, does new servers can run OFed v4.9.x ? In parallel - please open case in Mellanox, since it might be also firmware/driver issue with Ofed - or HCA which is not supported with Ofed 5.x. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Lab Services Consultant ? Storage and Cloud Petach Tiqva, 49527 IBM Global Markets, Systems HW Sales Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com Webex: https://ibm.webex.com/meet/yard IBM Israel From: "Iban Cabrillo" To: "gpfsug-discuss" Date: 07/12/2021 05:25 PM Subject: [EXTERNAL] Re: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, old servers: [root at gpfs01 ~]# rpm -qa| grep ofed ofed-scripts-4.3-OFED.4.3.1.0.1.x86_64 mlnxofed-docs-4.3-1.0.1.0.noarch and newest servers: [root at gpfs08 ~]# rpm -qa| grep ofed ofed-scripts-5.0-OFED.5.0.2.1.8.x86_64 mlnxofed-docs-5.0-2.1.8.0.noarch regards, I_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1114 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 8361 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 5211 bytes Desc: not available URL: From Robert.Oesterlin at nuance.com Mon Jul 19 17:16:59 2021 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 19 Jul 2021 16:16:59 +0000 Subject: [gpfsug-discuss] 30 second survey - User Group meeting at SC21 Message-ID: Spectrum Scale users: Give us 30 seconds of your time! We really need an accurate headcount for a possible user group meeting at SC21. Take a quick survey: https://www.surveymonkey.com/r/RRPYSGY Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From jeff.bernstein at vcinity.io Tue Jul 20 22:22:33 2021 From: jeff.bernstein at vcinity.io (Jeff Bernstein) Date: Tue, 20 Jul 2021 21:22:33 +0000 Subject: [gpfsug-discuss] Newbie Message-ID: Hello Everyone!! I am Jeff Bernstein and I work for Vcinity where we can stretch GPFS over a WAN at line speed with sustained 95% utilization. Thus, we turn your WAN into a SAN! Now you can stretch the cluster or use AFM, in a Hub and Spoke configuration, anywhere in the world. Imagine if your WAN performed like Fibre Channel or Infiniband. That?s what we do. Thanks for the add! Jeff Bernstein, Director of Media & Entertainment 2055 Gateway Pl. #650 San Jose, CA 95110 cell: 310.927.2089 [signature_20664221] This correspondence, and any attachments or files transmitted with this correspondence, contains information which may be confidential and privileged and is intended solely for the use of the addressee. Unless you are the addressee or are authorized to receive messages for the addressee, you may not use, copy, disseminate, or disclose this correspondence or any information contained in this correspondence to any third party. If you have received this correspondence in error, please notify the sender immediately and delete this correspondence and any attachments or files transmitted with this correspondence from your system, and destroy any and all copies thereof, electronic or otherwise. Your cooperation and understanding are greatly appreciated. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 27729 bytes Desc: image001.png URL: From committee at io500.org Wed Jul 21 17:40:39 2021 From: committee at io500.org (IO500 Committee) Date: Wed, 21 Jul 2021 10:40:39 -0600 Subject: [gpfsug-discuss] Call for Information IO500 Future Directions Message-ID: <2ff6cd6703f8851fa89c3a5cdf8b50f1@io500.org> The IO500 Foundation requests your help with determining the future direction for the IO500 lists and data repositories. We ask you complete a short survey that will take less than 5 minutes. The survey is here: https://forms.gle/cFMV4sA3iDUBuQ73A Deadline for responses is 27 August 2021 to allow time for us to potentially incorporate changes in time for the SC21 submission season. Thank you for your time and support. -- The IO500 Committee From jonathan.buzzard at strath.ac.uk Fri Jul 23 13:50:54 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 23 Jul 2021 13:50:54 +0100 Subject: [gpfsug-discuss] CVE-2021-33909 and 3.10.0-1160.36.2.el7.x86_64 Message-ID: <88017177-aee9-c779-1bc2-8907a90145c6@strath.ac.uk> Anyone know what GPFS versions will work with kernel version 3.10.0-1160.36.2 on RHEL7 rebuilds to patch for the above local privilege escalation bug? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From scale at us.ibm.com Fri Jul 23 20:03:54 2021 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Sat, 24 Jul 2021 03:03:54 +0800 Subject: [gpfsug-discuss] CVE-2021-33909 and 3.10.0-1160.36.2.el7.x86_64 In-Reply-To: <88017177-aee9-c779-1bc2-8907a90145c6@strath.ac.uk> References: <88017177-aee9-c779-1bc2-8907a90145c6@strath.ac.uk> Message-ID: Jonathan, CVE-2021-33909 and 3.10.0-1160.36.2.el7.x86_64 was published on July 20, 2021. GPFS has not been tested on this RHEL kernel yet per our FAQ https://www.ibm.com/docs/en/spectrum-scale/5.1.1?topic=spectrum-scale-faq. For both IBM Spectrum Scale 5.1.1.2 and IBM Spectrum Scale 5.0.5.8, The latest tested RHEL kernel is 3.10.0-1160.31.1.el7 (RHEL 7.9) tile now. 3.10.0-1160.36.2.el7.x86_64 is a kernel errata of 3.10.0-1160. According to IBM Spectrum Scale FAQ, it's a supported kernel version (IBM will update kernel support list if incompatibility issues were found in subsequent tests) https://www.ibm.com/docs/en/spectrum-scale/5.1.1?topic=spectrum-scale-faq. Kernel errata can be applied to the current kernel version unless they are explicitly listed in the FAQ as not supported. Always validate kernel changes including errata with IBM Spectrum Scale in a test environment before rolling out to production. Always rebuild the portability layer after any kernel changes. See also https://www.ibm.com/support/pages/full-story-ibm-spectrum-scale-and-linux-version-compatibility Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Jonathan Buzzard To: gpfsug main discussion list Date: 2021/07/23 08:51 PM Subject: [EXTERNAL] [gpfsug-discuss] CVE-2021-33909 and 3.10.0-1160.36.2.el7.x86_64 Sent by: gpfsug-discuss-bounces at spectrumscale.org Anyone know what GPFS versions will work with kernel version 3.10.0-1160.36.2 on RHEL7 rebuilds to patch for the above local privilege escalation bug? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From pinto at scinet.utoronto.ca Fri Jul 30 05:16:41 2021 From: pinto at scinet.utoronto.ca (Jaime Pinto) Date: Fri, 30 Jul 2021 00:16:41 -0400 Subject: [gpfsug-discuss] kernel 3.10.0-1160.36.2.el7.x86_64 (CVE-2021-33909) not compatible with DB2 (for TSM, HPSS, possibly other IBM apps) Message-ID: <91a8e18b-4737-d4c0-9d68-d69648b9698f@scinet.utoronto.ca> Alert related to sysadmins managing TSM/DB2 servers and those responsible for applying security patches, in particular kernel 3.10.0-1160.36.2.el7.x86_64, despite security concerns raised by CVE-2021-33909: Please hold off on upgrading your RedHat systems (possibly centos too). I just found out the hard way that kernel 3.10.0-1160.36.2.el7.x86_64 is not compatible with DB2, and after the node reboot DB2 would not work anymore, not only on TSM, but neither on HPSS. I had to revert the kernel to 3.10.0-1062.18.1.el7.x86_64 to get DB2 working properly again. --- Jaime Pinto - Storage Analyst SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 From jonathan.buzzard at strath.ac.uk Fri Jul 30 12:27:49 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 30 Jul 2021 12:27:49 +0100 Subject: [gpfsug-discuss] kernel 3.10.0-1160.36.2.el7.x86_64 (CVE-2021-33909) not compatible with DB2 (for TSM, HPSS, possibly other IBM apps) In-Reply-To: <91a8e18b-4737-d4c0-9d68-d69648b9698f@scinet.utoronto.ca> References: <91a8e18b-4737-d4c0-9d68-d69648b9698f@scinet.utoronto.ca> Message-ID: <455b7bc4-f28e-c8f3-0da4-b40a05726bd4@strath.ac.uk> On 30/07/2021 05:16, Jaime Pinto wrote: > > Alert related to sysadmins managing TSM/DB2 servers and those > responsible for applying security patches, in particular kernel > 3.10.0-1160.36.2.el7.x86_64, despite security concerns raised by > CVE-2021-33909: > > Please hold off on upgrading your RedHat systems (possibly centos too). > I just found out the hard way that kernel 3.10.0-1160.36.2.el7.x86_64 is > not compatible with DB2, and after the node reboot DB2 would not work > anymore, not only on TSM, but neither on HPSS. I had to revert the > kernel to 3.10.0-1062.18.1.el7.x86_64 to get DB2 working properly again. > For the record I have been running Spectrum Protect Extended Edition 8.1.12 on 3.10.0-1160.31.1 (genuine RHEL 7.9) since the 11th of June this year. I would say therefore there is no need to roll back quite so far as 3.10.0-1062.18.1 which is quite ancient now. Can't test anything newer as I am literally in the middle of migrating our TSM server to new hardware and a RHEL 8.4 install. Spent yesterday in the data centre re-cabling the disk arrays to the new server; neat, tidy and labelled this time :-) JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From pinto at scinet.utoronto.ca Fri Jul 30 15:11:45 2021 From: pinto at scinet.utoronto.ca (Jaime Pinto) Date: Fri, 30 Jul 2021 10:11:45 -0400 Subject: [gpfsug-discuss] kernel 3.10.0-1160.36.2.el7.x86_64 (CVE-2021-33909) not compatible with DB2 (for TSM, HPSS, possibly other IBM apps) In-Reply-To: <455b7bc4-f28e-c8f3-0da4-b40a05726bd4@strath.ac.uk> References: <91a8e18b-4737-d4c0-9d68-d69648b9698f@scinet.utoronto.ca> <455b7bc4-f28e-c8f3-0da4-b40a05726bd4@strath.ac.uk> Message-ID: <634b12c2-27b4-eac0-c583-2fbf79ee63c7@scinet.utoronto.ca> Hey Jonathan 3.10.0-1160.31.1 seems to be one of the last kernel releases prior to the CVE-2021-33909 exploit. 3.10.0-1160.36.2.el7.x86_64 seems to be the first on the Redhat repo that fixes the exploit, but it's not working for our combination of TSM/DB2 versions: * TSM 8.1.8 * DB2 v11.1.4.4 I'll just keep one eye on the repo for the next kernel available and try it again. Until then I'll stick with 3.10.0-1062.18.1 On the HPSS side 3.10.0-1160.36.2.el7.x86_64 worked fine with DB2 11.5, but not with 10.5 Thanks Jaime On 7/30/2021 07:27:49, Jonathan Buzzard wrote: > On 30/07/2021 05:16, Jaime Pinto wrote: >> >> Alert related to sysadmins managing TSM/DB2 servers and those responsible for applying security patches, in particular kernel 3.10.0-1160.36.2.el7.x86_64, despite security concerns raised by CVE-2021-33909: >> >> Please hold off on upgrading your RedHat systems (possibly centos too). I just found out the hard way that kernel 3.10.0-1160.36.2.el7.x86_64 is not compatible with DB2, and after the node reboot DB2 would not work anymore, not only on TSM, but neither on HPSS. I had to revert the kernel to 3.10.0-1062.18.1.el7.x86_64 to get DB2 working properly again. >> > > For the record I have been running Spectrum Protect Extended Edition 8.1.12 on 3.10.0-1160.31.1 (genuine RHEL 7.9) since the 11th of June this year. > > I would say therefore there is no need to roll back quite so far as 3.10.0-1062.18.1 which is quite ancient now. > > Can't test anything newer as I am literally in the middle of migrating our TSM server to new hardware and a RHEL 8.4 install. Spent yesterday in the data centre re-cabling the disk arrays to the new server; neat, tidy and labelled this time :-) > > > JAB. > --- Jaime Pinto - Storage Analyst SciNet HPC Consortium - Compute/Calcul Canada www.scinet.utoronto.ca - www.computecanada.ca University of Toronto 661 University Ave. (MaRS), Suite 1140 Toronto, ON, M5G1M1 P: 416-978-2755 C: 416-505-1477 From jonathan.buzzard at strath.ac.uk Sat Jul 31 23:47:10 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Sat, 31 Jul 2021 23:47:10 +0100 Subject: [gpfsug-discuss] kernel 3.10.0-1160.36.2.el7.x86_64 (CVE-2021-33909) not compatible with DB2 (for TSM, HPSS, possibly other IBM apps) In-Reply-To: <634b12c2-27b4-eac0-c583-2fbf79ee63c7@scinet.utoronto.ca> References: <91a8e18b-4737-d4c0-9d68-d69648b9698f@scinet.utoronto.ca> <455b7bc4-f28e-c8f3-0da4-b40a05726bd4@strath.ac.uk> <634b12c2-27b4-eac0-c583-2fbf79ee63c7@scinet.utoronto.ca> Message-ID: On 30/07/2021 15:11, Jaime Pinto wrote: > Hey Jonathan > > 3.10.0-1160.31.1 seems to be one of the last kernel releases prior to > the CVE-2021-33909 exploit. It is the release immediately prior to 3.10.0-1160.31.2. To be fair I didn't consider it important to install 3.10.0-1160.31.2 on our TSM server because the only people able to log onto it can all get root anyway. So a local privilege escalation bug is like meh to begin with and the replacement hardware for migrating to a fully patched RHEL 8.4 server was ready and waiting to go in the rack. Now on the nodes in the HPC cluster any privilege escalation bug is an issue as the unwashed masses have access to that. > 3.10.0-1160.36.2.el7.x86_64 seems to be the first on the Redhat repo > that fixes the exploit, but it's not working for our combination of > TSM/DB2 versions: > * TSM 8.1.8 > * DB2 v11.1.4.4 Well yikes you need to upgrade your TSM server ASAP as 8.1.8 has a number of security holes. My TSM is my get of jail card should we be hit by ransomware, which seems to the most likely "disaster" these days, so patch, patch, patch is my moto. Besides I am not allowed to run a version that is riddled with security issues. Being public sector and funded by the Scottish government we have to be CyberEssentials compliant :-) Basically you are supposed to apply security patches within 10 days of availability. > I'll just keep one eye on the repo for the next kernel available and try > it again. Until then I'll stick with 3.10.0-1062.18.1 Which has a whole slew of bugs too. See above I don't get to run such old versions :-) > On the HPSS side 3.10.0-1160.36.2.el7.x86_64 worked fine with DB2 11.5, > but not with 10.5 > Only DB2 usage I have is on our TSM server. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From douglasof at us.ibm.com Thu Jul 1 03:28:26 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Thu, 1 Jul 2021 02:28:26 +0000 Subject: [gpfsug-discuss] SuperPOD and GDS Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Jul 1 11:07:46 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 1 Jul 2021 11:07:46 +0100 Subject: [gpfsug-discuss] PVU question In-Reply-To: References:

<2124627715.8924801.1625846217069.JavaMail.zimbra@ifca.unican.es>