From alvise.dorigo at psi.ch Tue Dec 7 13:44:24 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Tue, 7 Dec 2021 13:44:24 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files Message-ID: Dear users/developers/support, I'd like to ask if there is a fast way to manipulate the permission mask of many files (millions). I tried on 900k files and a recursive chmod (chmod 0### -R path) takes about 1000s, with about 50% usage of mmfsd daemon. I tried with the perl's internal function chmod that can operate on an array of files, and it takes about 1/3 of the previous method. Which is already a good result. I've seen the possibility to run a policy to execute commands, but I would avoid to execute external commands through mmxargs, 1M of times; would you ? Does anybody have any suggestion to do this operation with minimum disruption on the system ? Thank you, Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Tue Dec 7 14:01:41 2021 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 7 Dec 2021 14:01:41 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Tue Dec 7 14:10:20 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Tue, 7 Dec 2021 14:10:20 +0000 Subject: [gpfsug-discuss] R: Question on changing mode on many files In-Reply-To: References: Message-ID: I have 5.0.4 for the moment (planned to be updated next year) and what I see is: [root at sf-dss-1 tmp]# locate mmfind /usr/lpp/mmfs/samples/ilm/mmfind /usr/lpp/mmfs/samples/ilm/mmfind.README /usr/lpp/mmfs/samples/ilm/mmfindUtil_processOutputFile.c /usr/lpp/mmfs/samples/ilm/mmfindUtil_processOutputFile.sampleMakefile Is that what you are talking about ? Thanks, Alvise Da: gpfsug-discuss-bounces at spectrumscale.org Per conto di Frederick Stock Inviato: marted? 7 dicembre 2021 15:02 A: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Oggetto: Re: [gpfsug-discuss] Question on changing mode on many files If you are running on a more recent version of Scale you might want to look at the mmfind command. It provides a find-like wrapper around the execution of policy rules. Fred _______________________________________________________ Fred Stock | Spectrum Scale Development Advocacy | 720-430-8821 stockf at us.ibm.com ----- Original message ----- From: "Dorigo Alvise (PSI)" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: "'gpfsug main discussion list'" > Cc: Subject: [EXTERNAL] [gpfsug-discuss] Question on changing mode on many files Date: Tue, Dec 7, 2021 8:53 AM Dear users/developers/support, I?d like to ask if there is a fast way to manipulate the permission mask of many files (millions). I tried on 900k files and a recursive chmod (chmod 0### -R path) takes about 1000s, with about 50% usage of mmfsd daemon. I tried with the perl?s internal function chmod that can operate on an array of files, and it takes about 1/3 of the previous method. Which is already a good result. I?ve seen the possibility to run a policy to execute commands, but I would avoid to execute external commands through mmxargs, 1M of times; would you ? Does anybody have any suggestion to do this operation with minimum disruption on the system ? 
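For anyone who wants to try the batched approach described above without writing a full script, a rough sketch follows; the path, mode and batch size are placeholders invented for illustration, not taken from the post:

    # one perl invocation per batch of file names, instead of one
    # /bin/chmod exec per file
    find /gpfs/fs1/target -type f -print0 | \
        xargs -0 -n 10000 perl -e 'chmod 0750, @ARGV'

The same idea works if the file list comes from the policy engine rather than from find.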
Thank you, Alvise _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Tue Dec 7 14:19:42 2021 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 7 Dec 2021 14:19:42 +0000 Subject: [gpfsug-discuss] R: Question on changing mode on many files In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Tue Dec 7 14:28:54 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 7 Dec 2021 14:28:54 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: Message-ID: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> On 07/12/2021 14:01, Frederick Stock wrote: > If you are running on a more recent version of Scale you might want to > look at the mmfind command.? It provides a find-like wrapper around the > execution of policy rules. > I am not sure that will be any faster than a "chmod -R" as it will exec millions of instances of chmod. What you gain on the swings you are going to loose on the roundabouts. TL;DR is you want to change permissions on millions of files expect it to take a considerable period of time. Even a modern NVMe SSD probably does around 50k IOPS per second, so best case scenario is one million files taking 40 seconds, at one read and one write per file and that is frankly unlikely. Also get ready to back them up again. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From s.j.thompson at bham.ac.uk Tue Dec 7 14:55:15 2021 From: s.j.thompson at bham.ac.uk (Simon Thompson) Date: Tue, 7 Dec 2021 14:55:15 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: Or add: UPDATECTIME yes SKIPACLUPDATECHECK yes To you dsm.opt file to skip checking for those updates and don?t back them up again. Actually I thought TSM only updated the metadata if the mode/owner changed, not re-backed the file? Simon From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Jonathan Buzzard Date: Tuesday, 7 December 2021 at 14:29 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Question on changing mode on many files On 07/12/2021 14:01, Frederick Stock wrote: > If you are running on a more recent version of Scale you might want to > look at the mmfind command. It provides a find-like wrapper around the > execution of policy rules. > I am not sure that will be any faster than a "chmod -R" as it will exec millions of instances of chmod. What you gain on the swings you are going to loose on the roundabouts. TL;DR is you want to change permissions on millions of files expect it to take a considerable period of time. Even a modern NVMe SSD probably does around 50k IOPS per second, so best case scenario is one million files taking 40 seconds, at one read and one write per file and that is frankly unlikely. Also get ready to back them up again. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Tue Dec 7 15:42:58 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 7 Dec 2021 15:42:58 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: On 07/12/2021 14:55, Simon Thompson wrote: > > Or add: > ? UPDATECTIME?????????????? yes > ? SKIPACLUPDATECHECK??????? yes > > To you dsm.opt file to skip checking for those updates and don?t back > them up again. Yeah, but then a restore gives you potentially an unusable file system as the ownership of the files and ACL's are all wrong. Better to bite the bullet and back them up again IMHO. > > Actually I thought TSM only updated the metadata if the mode/owner > changed, not re-backed the file? That was my understanding but I have seen TSM rebacked up large amounts of data where the owner of the file changed in the past, so your mileage may vary. Also ACL's are stored in extended attributes which are stored with the files and changes will definitely cause the file to be backed up again. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Walter.Sklenka at EDV-Design.at Thu Dec 9 09:26:40 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 09:26:40 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Message-ID: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ... connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,... to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Visio-eodc-2-fabs.pdf Type: application/pdf Size: 35768 bytes Desc: Visio-eodc-2-fabs.pdf URL: From janfrode at tanso.net Thu Dec 9 10:25:17 2021 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 9 Dec 2021 11:25:17 +0100 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: I believe this should be a fully working solution. I see no problem enabling RDMA between a subset of nodes -- just disable verbsRdma on the nodes you want to use plain IP. -jf On Thu, Dec 9, 2021 at 11:04 AM Walter Sklenka wrote: > Dear spectrum scale users! > > May I ask you a design question? > > We have an IB environment which is very mixed at the moment ( connecX3 ? > connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also > HDR100 and HDR switches. We still have some big troubles in this fabric > when using RDMA , a case at Mellanox and IBM is open . > > The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where > we want to migrate the data to ess5000 , ( mmdelvdisk +qos) > > Due to the current problems with RDMA we though eventually we could try a > workaround : > > If you are interested there is Maybe you can find the attachment ? > > We build 2 separate fabrics , the ess-IO servers attached to both blue and > green and all other cluster members and all remote clusters only to fabric > blue > > The daemon interfaces (IPoIP) are on fabric blue > > > > It is the aim to setup rdma only on the ess-ioServers in the fabric green > , in the blue we must use IPoIB (tcp) > > Do you think datamigration would work between ess01,ess02,? to ess07,ess08 > via RDMA ? > > Or is it principally not possible to make a rdma network only for a > subset of a cluster (though this subset would be reachable via other > fabric) ? > > > > Thank you very much for any input ! > > Best regards walter > > > > > > > > Mit freundlichen Gr??en > *Walter Sklenka* > *Technical Consultant* > > > > EDV-Design Informationstechnologie GmbH > Giefinggasse 6/1/2, A-1210 Wien > Tel: +43 1 29 22 165-31 > Fax: +43 1 29 22 165-90 > E-Mail: sklenka at edv-design.at > Internet: www.edv-design.at > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Thu Dec 9 10:41:29 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 10:41:29 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: Hi Jan! That great to hear So we will try this Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at Von: Jan-Frode Myklebust Gesendet: Thursday, December 9, 2021 11:25 AM An: Walter Sklenka Cc: gpfsug-discuss at spectrumscale.org Betreff: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration I believe this should be a fully working solution. I see no problem enabling RDMA between a subset of nodes -- just disable verbsRdma on the nodes you want to use plain IP. 
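A rough sketch of that split, assuming the ESS I/O servers keep RDMA on the green fabric and everything else stays on IPoIB/TCP; the node names, node file and verbs port below are placeholders:

    # RDMA only on the ESS I/O servers
    mmchconfig verbsRdma=enable,verbsPorts="mlx5_0/1" -N ess01,ess02,ess07,ess08
    # plain TCP (IPoIB on the blue fabric) for everything else
    mmchconfig verbsRdma=disable -N other_nodes.lst
    # confirm the per-node values
    mmlsconfig verbsRdma

verbsRdma changes typically need a GPFS restart on the affected nodes before they take effect.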
-jf On Thu, Dec 9, 2021 at 11:04 AM Walter Sklenka > wrote: Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Dec 9 12:04:28 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 9 Dec 2021 12:04:28 +0000 Subject: [gpfsug-discuss] =?utf-8?q?alternate_path_between_ESS_Servers_for?= =?utf-8?q?=09Datamigration?= In-Reply-To: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Dec 9 12:36:08 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 9 Dec 2021 12:36:08 +0000 Subject: [gpfsug-discuss] Adding a quorum node Message-ID: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> I am looking to replace the quorum node in our cluster. The RAID card in the server we are currently using is a casualty of the RHEL8 SAS card purge :-( I have a "new" dual core server that is fully supported by RHEL8. After some toing and throwing with IBM they agreed a Pentium G6400 is 70PVU a core and two cores :-) That said it is currently running RHEL7 because that's what the DSS-G nodes are running. The upgrade to RHEL8 is planned for next year. Anyway I have added it into the GPFS cluster all well and good and GPFS is mounted just fine. However when I ran the command to make it a quorum node I got the following error (sanitized to remove actual DNS names and IP addresses initialize (113, '', ('', 1191)) failed (err 79) server initialization failed (err 79) mmchnode: Unexpected error from chnodes -n 1=:1191,2:1191,3=:1191,113=:1191 -f 1 -P 1191 . Return code: 149 mmchnode: Unable to change the CCR quorum node configuration. mmchnode: Command failed. Examine previous error messages to determine cause. 
fqdn-new is the new node and fqdn1/2/3 are the existing quorum nodes. I want to remove fqdn3 in due course. Anyone any idea what is going on? I thought you could change the quorum nodes on the fly? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From douglasof at us.ibm.com Thu Dec 9 16:04:28 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Thu, 9 Dec 2021 16:04:28 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: Message-ID: Walter: Though not directly about your design, our work with NVIDIA on GPUdirect Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both MOFED and Firmware version compatibility can be. I would suggest anyone debugging RDMA issues should look at those closely. Doug by carrier pigeon On Dec 9, 2021, 5:04:36 AM, gpfsug-discuss-request at spectrumscale.org wrote: From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Date: Dec 9, 2021, 5:04:36 AM Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 119, Issue 5 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..." Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..."Today's Topics: 1. alternate path between ESS Servers for Datamigration (Walter Sklenka) Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! 
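As a concrete starting point for the MOFED/firmware check suggested above, the usual stock tools are enough, for example:

    ofed_info -s      # installed MLNX_OFED release
    ibstat            # per-port firmware level, link state and rate
    ibv_devinfo -v    # device details including fw_ver

These are plain Mellanox/OFED commands rather than anything Scale-specific; compare the output against the MOFED/firmware combinations listed as supported for your HCAs.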
Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... URL: From ralf.eberhard at de.ibm.com Thu Dec 9 16:43:26 2021 From: ralf.eberhard at de.ibm.com (Ralf Eberhard) Date: Thu, 9 Dec 2021 16:43:26 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 119, Issue 7 - Adding a quorum node In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Dec 9 19:09:44 2021 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 9 Dec 2021 19:09:44 +0000 Subject: [gpfsug-discuss] Adding a quorum node In-Reply-To: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> References: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> Message-ID: I frequently change quorum on the fly on both our 4.x and 5.0 clusters during upgrades/maintenance. You have sanity in the CCR to start with? (mmccr query, lsnodes, etc,etc) Anything useful in the logs or if you drop debug on it? ('export DEBUG=1'and then re-run command) Ed Wahl OSC -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Jonathan Buzzard Sent: Thursday, December 9, 2021 7:36 AM To: 'gpfsug-discuss at spectrumscale.org' Subject: [gpfsug-discuss] Adding a quorum node I am looking to replace the quorum node in our cluster. The RAID card in the server we are currently using is a casualty of the RHEL8 SAS card purge :-( I have a "new" dual core server that is fully supported by RHEL8. After some toing and throwing with IBM they agreed a Pentium G6400 is 70PVU a core and two cores :-) That said it is currently running RHEL7 because that's what the DSS-G nodes are running. The upgrade to RHEL8 is planned for next year. Anyway I have added it into the GPFS cluster all well and good and GPFS is mounted just fine. However when I ran the command to make it a quorum node I got the following error (sanitized to remove actual DNS names and IP addresses initialize (113, '', ('', 1191)) failed (err 79) server initialization failed (err 79) mmchnode: Unexpected error from chnodes -n 1=:1191,2:1191,3=:1191,113=:1191 -f 1 -P 1191 . Return code: 149 mmchnode: Unable to change the CCR quorum node configuration. mmchnode: Command failed. Examine previous error messages to determine cause. fqdn-new is the new node and fqdn1/2/3 are the existing quorum nodes. I want to remove fqdn3 in due course. Anyone any idea what is going on? I thought you could change the quorum nodes on the fly? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss__;!!KGKeukY!hO7wULtfr6n28eBJ0BB8sYyRMFo6Xl5_XDpsNZz3GiD_3nXlPf6nKHNR-X99$ From Walter.Sklenka at EDV-Design.at Thu Dec 9 19:38:45 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 19:38:45 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: Hi Olaf!! Many thanks OK well we will do mmvdisk vs delete So #mmvdisk vs delete ? -N ess01,ess02?.. 
would be correct , or? Best regards walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Olaf Weiser Sent: Donnerstag, 9. Dezember 2021 13:04 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Hallo Walter, ;-) yes !AND! no .. for sure , you can specifiy a subset of nodes to use RDMA and other nodes just communicating TCPIP But that's only half of the truth . The other half is.. who and how , you are going to migrate/copy the data in case you 'll use mmrestripe .... you will have to make sure , that only nodes, connected(green) and configured for RDMA doing the work otherwise.. if will also work to migrate the data, but then data is send throught the Ethernet as well , (as long all those nodes are in the same cluster) laff ----- Urspr?ngliche Nachricht ----- Von: "Walter Sklenka" > Gesendet von: gpfsug-discuss-bounces at spectrumscale.org An: "'gpfsug-discuss at spectrumscale.org'" > CC: Betreff: [EXTERNAL] [gpfsug-discuss] alternate path between ESS Servers for Datamigration Datum: Do, 9. Dez 2021 11:04 Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Thu Dec 9 19:43:31 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 19:43:31 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: Message-ID: <4f6b41f6a3b44c7a80cb588add2056dd@Mail.EDVDesign.cloudia> Hello Douglas! Many thanks for your advice ! 
Well we are in a horrible situation regarding firmware and MOFED of old equipment Mellanox advised us to use a special version of subnetmanager 5.0-2.1.8.0 from MOFED I hope this helps Let?s see how we can proceed Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Douglas O'flaherty Sent: Donnerstag, 9. Dezember 2021 17:04 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Walter: Though not directly about your design, our work with NVIDIA on GPUdirect Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both MOFED and Firmware version compatibility can be. I would suggest anyone debugging RDMA issues should look at those closely. Doug by carrier pigeon ________________________________ On Dec 9, 2021, 5:04:36 AM, gpfsug-discuss-request at spectrumscale.org wrote: From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Date: Dec 9, 2021, 5:04:36 AM Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 119, Issue 5 ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..." Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..."Today's Topics: 1. alternate path between ESS Servers for Datamigration (Walter Sklenka) Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! 
Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Dec 9 20:19:41 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 9 Dec 2021 20:19:41 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 119, Issue 7 - Adding a quorum node In-Reply-To: References: Message-ID: On 09/12/2021 16:43, Ralf Eberhard wrote: > Jonathan, > > my suspicion is that?the GPFS daemon on fqdn-new is not reachable via > port 1191. > You can double check that by?sending a lightweight CCR RPC to this > daemon from another quorum node by attempting: > > mmccr echo -n fqdn-new;echo $? > > If this echo returns with a non-zero exit code the network settings must > be verified. And even?the other direction must > work: Node fqdn-new must?reach another quorum node, like (attempting on > fqdn-new): > > mmccr echo -n ;echo $? > Duh, that's my Homer Simpson moment for today. I forgotten to move the relevant network interfaces on the new server to the trusted zone in the firewall. So of course my normal testing with ping and ssh was working just fine. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From jonathan.buzzard at strath.ac.uk Fri Dec 10 00:27:23 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 10 Dec 2021 00:27:23 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: Message-ID: On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From abeattie at au1.ibm.com Fri Dec 10 01:09:57 2021 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Fri, 10 Dec 2021 01:09:57 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16390972812300.png Type: image/png Size: 98384 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: Image.16390972812301.png Type: image/png Size: 101267 bytes Desc: not available URL: From douglasof at us.ibm.com Fri Dec 10 04:24:21 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Fri, 10 Dec 2021 00:24:21 -0400 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GUP Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. f you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG ----- Original message ----- From: "Jonathan Buzzard" Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Date: Fri, Dec 10, 2021 10:27 On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Fri Dec 10 10:17:20 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Fri, 10 Dec 2021 10:17:20 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Hello Douglas! May I ask a basic question regarding GPUdirect Storage or all local attached storage like NVME disks. Do you think it outerperforms "classical" shared storagesystems which are attached via FC connected to NSD servers HDR attached? With FC you have also bounce copies and more delay , isn?t it? There are solutions around which work with local NVME disks building some protection level with Raid (or duplication) . I am curious if it would be a better approach than shared storage which has it?s limitation (cost intensive scale out, extra infrstructure, max 64Gb at this time ... ) Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Douglas O'flaherty Sent: Freitag, 10. Dezember 2021 05:24 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GUP Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage node.) However, your question remains. 
Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. f you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard > on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ----- Original message ----- From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Date: Fri, Dec 10, 2021 10:27 On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Renar.Grunenberg at huk-coburg.de Fri Dec 10 10:28:38 2021 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Fri, 10 Dec 2021 10:28:38 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> References: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Message-ID: Hallo Walter, we had many experiences now to change our Storage-Systems in our Backup-Environment to RDMA-IB with HDR and EDR Connections. What we see now (came from a 16Gbit FC Infrastructure) we enhance our throuhput from 7 GB/s to 30 GB/s. The main reason are the elimination of the driver-layers in the client-systems and make a Buffer to Buffer communication because of RDMA. The latency reduction are significant. Regards Renar. We use now ESS3k and ESS5k systems with 6.1.1.2-Code level. Renar Grunenberg Abteilung Informatik - Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder, Thomas Sehn, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org Im Auftrag von Walter Sklenka Gesendet: Freitag, 10. Dezember 2021 11:17 An: gpfsug-discuss at spectrumscale.org Betreff: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Hello Douglas! May I ask a basic question regarding GPUdirect Storage or all local attached storage like NVME disks. Do you think it outerperforms ?classical? shared storagesystems which are attached via FC connected to NSD servers HDR attached? With FC you have also bounce copies and more delay , isn?t it? There are solutions around which work with local NVME disks building some protection level with Raid (or duplication) . I am curious if it would be a better approach than shared storage which has it?s limitation (cost intensive scale out, extra infrstructure, max 64Gb at this time ? ) Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Douglas O'flaherty Sent: Freitag, 10. Dezember 2021 05:24 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. 
With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GUP Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. f you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard > on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ----- Original message ----- From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Date: Fri, Dec 10, 2021 10:27 On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? 
I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Dec 10 10:37:31 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Dec 2021 10:37:31 +0000 Subject: [gpfsug-discuss] Test email format / mail format Message-ID: An HTML attachment was scrubbed... URL: From Ondrej.Kosik at ibm.com Fri Dec 10 10:39:56 2021 From: Ondrej.Kosik at ibm.com (Ondrej Kosik) Date: Fri, 10 Dec 2021 10:39:56 +0000 Subject: [gpfsug-discuss] Test email format / mail format In-Reply-To: References: Message-ID: Hello all, Thank you for the test email, my reply is coming from Outlook-based infrastructure. ________________________________ From: Olaf Weiser Sent: Friday, December 10, 2021 10:37 AM To: gpfsug-discuss at spectrumscale.org Cc: Ondrej Kosik Subject: Test email format / mail format This email is just a test, because we've seen mail format issues from IBM sent emails you can ignore this email , just for internal problem determination -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Dec 10 11:10:07 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Dec 2021 11:10:07 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: , <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16391192376761.png Type: image/png Size: 127072 bytes Desc: not available URL: From anacreo at gmail.com Sun Dec 12 02:19:02 2021 From: anacreo at gmail.com (Alec) Date: Sat, 11 Dec 2021 18:19:02 -0800 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: I feel the need to respond here... I see many responses on this User Group forum that are dismissive of the fringe / extreme use cases and of the "what do you need that for '' mindset. The thing is that Spectrum Scale is for the extreme, just take the word "Parallel" in the old moniker that was already an extreme use case. If you have a standard workload, then sure most of the complex features of the file system are toys, but many of us DO have extreme workloads where shaking out every ounce of performance is a worthwhile and financially sound endeavor. It is also because of the efforts of those of us living on the cusp of technology that these technologies become mainstream and no-longer extreme. I have an AIX LPAR that traverses more than 300TB+ of data a day on a Spectrum Scale file system, it is fully virtualized, and handles a million files. If that performance level drops, regulatory reports will be late, business decisions won't be current. However, the systems of today and the future have to traverse this much data and if they are slow then they can't keep up with real-time data feeds. 
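To put rough numbers on that (order-of-magnitude figures assumed for illustration, not measurements from this thread): on spinning disk an I/O costs several milliseconds, so a roughly 50 microsecond TCP/IP network hop is lost in the noise; on NVMe flash a read costs on the order of 100 microseconds, so the network leg plus the bounce copy through system memory can easily be a third of the total time, and RDMA cuts that leg to a few microseconds. That is the regime where the choice starts to show up in application numbers.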
So the difference between an RDMA disk IO vs a non RDMA disk IO could possibly mean what level of analytics are done to perform real time fraud prevention, or at what cost: today many systems achieve this by keeping everything in memory in HUGE farms. Being able to perform data operations at 30GB/s means you can traverse ALL of the census bureau data for all time from the US Govt in about 2 seconds... that's a pretty substantial capability that moves the bar forward in what we can do from a technology perspective. I just did a technology garage with IBM where we were able to achieve 1.5TB/minute of writes on an encrypted ESS off of a single VMWare Host and 4 VM's over IP... That's over 2PB of data writes a day on a single VM server. Being able to demonstrate that there are production virtualized environments capable of this type of capacity helps to show the point where engineering a proper storage architecture outweighs the benefits of just throwing more GPU compute farms at the problem with ever dithering disk I/O. It also helps to demonstrate how a virtual storage optimized farm could be leveraged to host many in-memory or data analytic heavy workloads in a shared configuration. Douglas's response is the right one: how much IO does the application / environment need? It's nice to see Spectrum Scale have the flexibility to deliver. I'm pretty confident that if I can't deliver the required I/O performance on Spectrum Scale, nobody else can on any other storage platform within reasonable limits. Alec Effrat On Thu, Dec 9, 2021 at 8:24 PM Douglas O'flaherty wrote: > Jonathan: > > You posed a reasonable question, which was "when is RDMA worth the > hassle?" I agree with part of your premises, which is that it only matters > when the bottleneck isn't somewhere else. With a parallel file system, like > Scale/GPFS, the absolute performance bottleneck is not the throughput of a > single drive. In a majority of Scale/GPFS clusters the network data path is > the performance limitation. If they deploy HDR or 100/200/400Gbps > Ethernet... At that point, the buffer copy time inside the server matters. > > When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) > is easily demonstrated because it eliminates the bounce copy through the > system memory. In our NVIDIA DGX A100 server testing we were able > to get around 2x the per system throughput by using RDMA direct to GPU (GPU > Direct Storage). (Tested on 2 DGX systems with 4x HDR links per storage > node.) > > However, your question remains. 
> > Thanks, > > doug > > Douglas O'Flaherty > douglasof at us.ibm.com > > > > > > > ----- Message from Jonathan Buzzard on > Fri, 10 Dec 2021 00:27:23 +0000 ----- > > *To:* > gpfsug-discuss at spectrumscale.org > > *Subject:* > Re: [gpfsug-discuss] > On 09/12/2021 16:04, Douglas O'flaherty wrote: > > > > Though not directly about your design, our work with NVIDIA on GPUdirect > > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > > MOFED and Firmware version compatibility can be. > > > > I would suggest anyone debugging RDMA issues should look at those > closely. > > > May I ask what are the alleged benefits of using RDMA in GPFS? > > I can see there would be lower latency over a plain IP Ethernet or IPoIB > solution but surely disk latency is going to swamp that? > > I guess SSD drives might change that calculation but I have never seen > proper benchmarks comparing the two, or even better yet all four > connection options. > > Just seems a lot of complexity and fragility for very little gain to me. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > > > ----- Original message ----- > From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS > Servers for Datamigration > Date: Fri, Dec 10, 2021 10:27 > > On 09/12/2021 16:04, Douglas O'flaherty wrote: > > > > Though not directly about your design, our work with NVIDIA on GPUdirect > > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > > MOFED and Firmware version compatibility can be. > > > > I would suggest anyone debugging RDMA issues should look at those > closely. > > > May I ask what are the alleged benefits of using RDMA in GPFS? > > I can see there would be lower latency over a plain IP Ethernet or IPoIB > solution but surely disk latency is going to swamp that? > > I guess SSD drives might change that calculation but I have never seen > proper benchmarks comparing the two, or even better yet all four > connection options. > > Just seems a lot of complexity and fragility for very little gain to me. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Sun Dec 12 02:38:26 2021 From: anacreo at gmail.com (Alec) Date: Sat, 11 Dec 2021 18:38:26 -0800 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: You can manipulate the permissions via GPFS policy engine, essentially you'd write a script that the policy engine calls and tell GPFS to farm out the change in at whatever scale you need... run in a single node, how many files per thread, how many threads per node, etc... This can GREATLY accelerate file change permissions over a large quantity of files. 
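If you do want to drive the policy engine directly rather than through mmfind, a minimal sketch of that approach might look something like this -- the rule names, script path, node names and the o-w change are just placeholders, and I'm assuming plain path names (the /usr/lpp/mmfs/samples/ilm/mmxargs sample shows how to handle escaped path names properly):

   # chmod.pol -- hand every matched file to an external script
   RULE EXTERNAL LIST 'fixperms' EXEC '/usr/local/bin/fixperms.sh'
   RULE 'allfiles' LIST 'fixperms'

   #!/bin/ksh
   # /usr/local/bin/fixperms.sh -- invoked by mmapplypolicy with
   # $1 = operation (TEST or LIST) and $2 = a path (the file list for LIST)
   case "$1" in
     TEST) exit 0 ;;                      # "can this node participate?" probe
     LIST) while IFS= read -r rec; do
             # each record is roughly: inode gen snapid -- /full/path/name
             chmod o-w "${rec#* -- }"
           done < "$2" ;;
   esac
   exit 0

   # farm it out over two nodes, 24 threads each, up to 1000 files per script call
   mmapplypolicy /path/to/fix -P chmod.pol -N node1,node2 -m 24 -B 1000 -I yes

This is essentially the machinery mmfind wraps, which is why its policy flags map straight onto mmapplypolicy options.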
However, as stated earlier the mmfind command will do all of this for you and it's worth the effort to get it compiled for your system. I don't have Spectrum Scale in front of me but for the best performance you'll want to setup the mmfind policy engine parameters to parallelize your workload... If mmfind has no action it will silently use GPFS policy engine to produce the requested output, however if mmfind has an action it will expose the policy engine calls. it goes something like this: mmfind -B 1 -N directattachnode1,directattachnode2 -m 24 /path/to/find -perm +o=w ! \( -type d -perm +o=t \) -xargs chmod o-w This will run 48 threads on 2 nodes and bump other write permissions off of any file it finds (excluding temp dirs) until it completes, it should go blistering fast... as this is only a meta operation the -B 1 might not be necessary, you'd probably be better off with a -B 100, but as I deal with a lot of 100GB+ files I don't want a single thread to be stuck with 3 100GB+ files and another thread to have none, so I usually set the max depth to be 1 and take the higher execution count. This has an advantage in that GPFS will break up the inodes in the most efficient way for the chmod to happen in parallel. I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. If you think of the possibilities here you could easily change that chmod to a gzip and add a -mtime +1200 and you have a find command that will gzip compress files over 4 years old in parallel across multiple nodes... mmfind is VERY powerful and flexible, highly worth getting into usage. Alec On Tue, Dec 7, 2021 at 7:43 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 07/12/2021 14:55, Simon Thompson wrote: > > > > Or add: > > UPDATECTIME yes > > SKIPACLUPDATECHECK yes > > > > To you dsm.opt file to skip checking for those updates and don?t back > > them up again. > > Yeah, but then a restore gives you potentially an unusable file system > as the ownership of the files and ACL's are all wrong. Better to bite > the bullet and back them up again IMHO. > > > > > Actually I thought TSM only updated the metadata if the mode/owner > > changed, not re-backed the file? > > That was my understanding but I have seen TSM rebacked up large amounts > of data where the owner of the file changed in the past, so your mileage > may vary. > > Also ACL's are stored in extended attributes which are stored with the > files and changes will definitely cause the file to be backed up again. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Sun Dec 12 11:19:07 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Sun, 12 Dec 2021 11:19:07 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: On 12/12/2021 02:19, Alec wrote: > I feel the need to respond here... 
I see many responses on this > User Group forum that are dismissive of the fringe / extreme use > cases and of the "what do you need that for '' mindset. The thing is > that Spectrum Scale is for the extreme, just take the word "Parallel" > in the old moniker that was already an extreme use case.

I wasn't being dismissive, I was asking what the benefits of using RDMA were. There is very little information about it out there and not a lot of comparative benchmarking on it either. Without the benefits being clearly laid out I am unlikely to consider it and might be missing a trick.

IBM's literature on the topic is underwhelming to say the least.

[SNIP]

> I have an AIX LPAR that traverses more than 300TB+ of data a day on a > Spectrum Scale file system, it is fully virtualized, and handles a > million files. If that performance level drops, regulatory reports > will be late, business decisions won't be current. However, the > systems of today and the future have to traverse this much data and > if they are slow then they can't keep up with real-time data feeds.

I have this nagging suspicion that modern all-flash storage systems could deliver that sort of performance without the overhead of a parallel file system.

[SNIP]

> > Douglas's response is the right one, how much IO does the > application / environment need, it's nice to see Spectrum Scale have > the flexibility to deliver. I'm pretty confident that if I can't > deliver the required I/O performance on Spectrum Scale, nobody else > can on any other storage platform within reasonable limits. >

I would note here that in our *shared HPC* environment I made a very deliberate design decision to attach the compute nodes with 10Gbps Ethernet for storage. Though I would probably pick 25Gbps if we were procuring the system today.

There were many reasons behind that, the main one being that historical file system performance showed that greater than 99% of the time the file system never got above 20% of its benchmarked speed. Using 10Gbps Ethernet was not going to be a problem.

Secondly, limiting the connection to 10Gbps stops one person hogging the file system to the detriment of other users. We have seen individual nodes peg their 10Gbps link from time to time, even several nodes at once (jobs from the same user), and had they had access to a 100Gbps storage link that would have been curtains for everyone else's file system usage.

At this juncture I would note that the GPFS admin traffic is handled on a separate IP address space on a separate VLAN, which we prioritize with QOS on the switches. So even when a node floods its 10Gbps link for extended periods of time it doesn't get ejected from the cluster. A separate physical network for admin traffic is not necessary in my experience.

That said you can do RDMA with Ethernet... Unfortunately the teaching cluster and protocol nodes are on Intel X520's which I don't think do RDMA. Everything else is X710's or Mellanox Connect-X4, which definitely do RDMA. I could upgrade the protocol nodes but the teaching cluster would be a problem.

JAB.

--
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow.
G4 0NG From s.j.thompson at bham.ac.uk Sun Dec 12 17:01:21 2021 From: s.j.thompson at bham.ac.uk (Simon Thompson) Date: Sun, 12 Dec 2021 17:01:21 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command ?allow-permission-change? We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn?t like the updatewithAcl thing ? Simon -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Sun Dec 12 22:03:39 2021 From: anacreo at gmail.com (Alec) Date: Sun, 12 Dec 2021 14:03:39 -0800 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson wrote: > > I'm not sure if this happens on Spectrum Scale but on most FS's if you > do a chmod 770 file you'll lose any ACLs assigned to the > > file, so safest to bump the permissions with a subtractive or additive > o-w or g+w type operation. > > > > This depends entirely on the fileset setting, see: > > > https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command > > > > ?*allow-permission-change*? > > > > We typically have file-sets set to chmodAndUpdateAcl, though not > exclusively, I think it was some quirky software that tested the > permissions after doing something and didn?t like the updatewithAcl thing ? > > > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Sun Dec 12 23:00:21 2021 From: anacreo at gmail.com (Alec) Date: Sun, 12 Dec 2021 15:00:21 -0800 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: So I never said this node wasn't in a HPC Cluster, it has partners... For our use case however some nodes have very expensive per core software licensing, and we have to weigh the human costs of empowering traditional monolithic code to do the job, or bringing in more users to re-write and maintain distributed code (someone is going to spend the money to get this work done!). So to get the most out of those licensed cores we have designed our virtual compute machine(s) with 128Gbps+ of SAN fabric. 
Just to achieve our average business day reads it would take 3 of your cluster nodes maxed out 24 hours, or 9 of them in a business day to achieve the same read speeds... and another 4 nodes to handle the writes. I guess HPC is in the eye of the business... In my experience cables and ports are cheaper than servers. The classic shared HPC design you have is being up-ended by the fact that there is so much compute power (cpu and memory) now in the nodes, you can't simply build a system with two storage connections (Noah's ark) and call it a day. If you look at the spec 25Gbps Ethernet is only delivering ~3GB/s (which is just above USB 3.2, and below USB 4). Spectrum Scale does very well for us when met with a fully saturated workload, we maintain one node for SLA and one node for AdHoc workload, and like clockwork the SLA box always steals exactly half the bandwidth when a job fires, so that 1 SLA job can take half the bandwidth and complete compared to the 40 AdHoc jobs on the other node. In newer releases IBM has introduced fileset throttling.... this is very exciting as we can really just design the biggest fattest pipes from VM to Storage and then software define the storage AND the bandwidth from the standard nobody cares about workloads all the way up to the most critical workloads... I don't buy the smaller bandwidth is better, as I see that as just one band-aid that has more elegant solutions, such as simply doing more resource constraints (you can't push the bandwidth if you can't get the CPU...), or using a workload orchestrator such as LSF with limits set, but I also won't say it never makes sense, as well I only know my problems and my solutions. For years the network team wouldn't let users have more than 10mb then 100mb networking as they were always worried about their backend being overwhelmed... I literally had faster home internet service than my work desktop connection at one point in my life.. it was all a falesy, the workload should drive the technology, the technology shouldn't hinder the workload. You can do a simple exercise, try scaling up... imagine your cluster is asked to start computing 100x more work... and that work must be completed on time. Do you simply say let me buy 100x more of everything? Or do you start to look at where can I gain efficiency and what actual bottlenecks do I need to lift... for some of us it's CPU, for some it's Memory, for some it's disk, depending on the work... I'd say the extremely rare case is where you need 100x more of EVERYTHING, but you have to get past the performance of the basic building blocks baked into the cake before you do need to dig deeper into the bottlenecks and it makes practical and financial sense. If your main bottleneck was storage, you'd be asking far different questions about RDMA. Alec On Sun, Dec 12, 2021 at 3:19 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 12/12/2021 02:19, Alec wrote: > > > I feel the need to respond here... I see many responses on this > > User Group forum that are dismissive of the fringe / extreme use > > cases and of the "what do you need that for '' mindset. The thing is > > that Spectrum Scale is for the extreme, just take the word "Parallel" > > in the old moniker that was already an extreme use case. > > I wasn't been dismissive, I was asking what the benefits of using RDMA > where. There is very little information about it out there and not a lot > of comparative benchmarking on it either. 
Without the benefits being > clearly laid out I am unlikely to consider it and might be missing a trick. > > IBM's literature on the topic is underwhelming to say the least. > > [SNIP] > > > > I have an AIX LPAR that traverses more than 300TB+ of data a day on a > > Spectrum Scale file system, it is fully virtualized, and handles a > > million files. If that performance level drops, regulatory reports > > will be late, business decisions won't be current. However, the > > systems of today and the future have to traverse this much data and > > if they are slow then they can't keep up with real-time data feeds. > > I have this nagging suspicion that modern all flash storage systems > could deliver that sort of performance without the overhead of a > parallel file system. > > [SNIP] > > > > > Douglas's response is the right one, how much IO does the > > application / environment need, it's nice to see Spectrum Scale have > > the flexibility to deliver. I'm pretty confident that if I can't > > deliver the required I/O performance on Spectrum Scale, nobody else > > can on any other storage platform within reasonable limits. > > > > I would note here that in our *shared HPC* environment I made a very > deliberate design decision to attach the compute nodes with 10Gbps > Ethernet for storage. Though I would probably pick 25Gbps if we where > procuring the system today. > > There where many reasons behind that, but the main ones being that > historical file system performance showed that greater than 99% of the > time the file system never got above 20% of it's benchmarked speed. > Using 10Gbps Ethernet was not going to be a problem. > > Secondly by limiting the connection to 10Gbps it stops one person > hogging the file system to the detriment of other users. We have seen > individual nodes peg their 10Gbps link from time to time, even several > nodes at once (jobs from the same user) and had they had access to a > 100Gbps storage link that would have been curtains for everyone else's > file system usage. > > At this juncture I would note that the GPFS admin traffic is handled by > on separate IP address space on a separate VLAN which we prioritize with > QOS on the switches. So even when a node floods it's 10Gbps link for > extended periods of time it doesn't get ejected from the cluster. The > need for a separate physical network for admin traffic is not necessary > in my experience. > > That said you can do RDMA with Ethernet... Unfortunately the teaching > cluster and protocol nodes are on Intel X520's which I don't think do > RDMA. Everything is X710's or Mellanox Connect-X4 which definitely do do > RDMA. I could upgrade the protocol nodes but the teaching cluster would > be a problem. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Dec 13 00:03:42 2021 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 13 Dec 2021 00:03:42 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... 
URL: From alvise.dorigo at psi.ch Mon Dec 13 10:49:37 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 13 Dec 2021 10:49:37 +0000 Subject: [gpfsug-discuss] R: Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: <96a77c75de9b41f089e853120eef870d@psi.ch> I am definitely going to try this solution with mmfind. Thank you also for the command line and several hints? I?ll be back with the outcome soon. Alvise Da: gpfsug-discuss-bounces at spectrumscale.org Per conto di Alec Inviato: domenica 12 dicembre 2021 23:04 A: gpfsug main discussion list Oggetto: Re: [gpfsug-discuss] Question on changing mode on many files How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command ?allow-permission-change? We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn?t like the updatewithAcl thing ? Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Mon Dec 13 11:30:17 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 13 Dec 2021 11:30:17 +0000 Subject: [gpfsug-discuss] R: R: Question on changing mode on many files In-Reply-To: <96a77c75de9b41f089e853120eef870d@psi.ch> References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> <96a77c75de9b41f089e853120eef870d@psi.ch> Message-ID: Hi Alec , mmfind doesn?t have a man page (does it have an online one ? I cannot find it). And according to mmfind -h it doesn?t exposes the ?-N? neither the ?-B? flags. RPM is gpfs.base-5.1.1-2.x86_64. Do I have chance to download a newest version of that script from somewhere ? Thanks, Alvise Da: gpfsug-discuss-bounces at spectrumscale.org Per conto di Dorigo Alvise (PSI) Inviato: luned? 13 dicembre 2021 11:50 A: gpfsug main discussion list Oggetto: [gpfsug-discuss] R: Question on changing mode on many files I am definitely going to try this solution with mmfind. Thank you also for the command line and several hints? I?ll be back with the outcome soon. Alvise Da: gpfsug-discuss-bounces at spectrumscale.org > Per conto di Alec Inviato: domenica 12 dicembre 2021 23:04 A: gpfsug main discussion list > Oggetto: Re: [gpfsug-discuss] Question on changing mode on many files How am I just learning about this right now, thank you! 
Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command ?allow-permission-change? We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn?t like the updatewithAcl thing ? Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Mon Dec 13 18:33:23 2021 From: anacreo at gmail.com (Alec) Date: Mon, 13 Dec 2021 10:33:23 -0800 Subject: [gpfsug-discuss] R: R: Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> <96a77c75de9b41f089e853120eef870d@psi.ch> Message-ID: I checked on my office network.... mmfind --help mmfind -polFlags '-N node1,node2 -B 100 -m 24' /path/to/find -perm +o=w ! \( -type d -perm +o=t \) -xargs chmod o-w I think that the -m 24 is the default (24 threads per node), but it's nice to include on the command line so you remember you can increment/decrement it as your needs require or your nodes can handle. It's IMPORTANT to review in the mmfind --help output that some things are 'mmfind' args and go BEFORE the path... some are CRITERIA args and have no impact on the files... BUT SOME ARE ACTION args, and they will affect files. So -exec -xargs are obvious actions, however, -gpfsCompress doesn't find compressed files, it will actually compress the objects... in our AIX environment our compressed reads feel like they're essentially broken, we only get about 5MB/s, however on Linux compress reads seem to work fairly well. So make sure to read the man page carefully before using some non-obvious GPFS enhancements. Also the nice thing is mmfind -xargs takes care of all the strange file names, so you don't have to do anything complicated, but you also can't pipe the output as it will run the xarg in the policy engine. As a footnote this is my all time favorite find for troubleshooting... find $(pwd) -mtime -1 | sed -e 's/.*/"&"/g' | xargs ls -latr List all the files modified in the last day in reverse chronology... Doesn't work :-( Alec On Mon, Dec 13, 2021 at 3:30 AM Dorigo Alvise (PSI) wrote: > Hi Alec , > > mmfind doesn?t have a man page (does it have an online one ? I cannot find > it). And according to mmfind -h it doesn?t exposes the ?-N? neither the > ?-B? flags. RPM is gpfs.base-5.1.1-2.x86_64. > > > > Do I have chance to download a newest version of that script from > somewhere ? 
> > > > Thanks, > > > > Alvise > > > > *Da:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *Per conto di *Dorigo Alvise > (PSI) > *Inviato:* luned? 13 dicembre 2021 11:50 > *A:* gpfsug main discussion list > *Oggetto:* [gpfsug-discuss] R: Question on changing mode on many files > > > > I am definitely going to try this solution with mmfind. > > Thank you also for the command line and several hints? I?ll be back with > the outcome soon. > > > > Alvise > > > > *Da:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *Per conto di *Alec > *Inviato:* domenica 12 dicembre 2021 23:04 > *A:* gpfsug main discussion list > *Oggetto:* Re: [gpfsug-discuss] Question on changing mode on many files > > > > How am I just learning about this right now, thank you! Makes so much > more sense now the odd behaviors I've seen over the years on GPFS vs POSIX > chmod/ACL. Will definitely go review those settings on my filesets now, > wonder if the default has evolved from 3.x -> 4.x -> 5.x. > > > > IBM needs to find a way to pre-compile mmfind and make it supported, it > really is essential and so beneficial, and so hard to get done in a > production regulated environment. Though a bigger warning that the > compress option is an action not a criteria! > > > > Alec > > > > On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > > > I'm not sure if this happens on Spectrum Scale but on most FS's if you > do a chmod 770 file you'll lose any ACLs assigned to the > > file, so safest to bump the permissions with a subtractive or additive > o-w or g+w type operation. > > > > This depends entirely on the fileset setting, see: > > > https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command > > > > ?*allow-permission-change*? > > > > We typically have file-sets set to chmodAndUpdateAcl, though not > exclusively, I think it was some quirky software that tested the > permissions after doing something and didn?t like the updatewithAcl thing ? > > > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Mon Dec 13 23:55:23 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 13 Dec 2021 23:55:23 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: <19884986-aff8-20aa-f1d1-590f6b81ddd2@strath.ac.uk> On 13/12/2021 00:03, Andrew Beattie wrote: > What is the main outcome or business requirement of the teaching cluster > ( i notice your specific in the use of defining it as a teaching cluster) > It is entirely possible that the use case for this cluster does not > warrant the use of high speed low latency networking, and it simply > needs the benefits of a parallel filesystem. While we call it the "teaching cluster" it would be more appropriate to call them "teaching nodes" that shares resources (storage and login nodes) with the main research cluster. It's mainly used by undergraduates doing final year projects and M.Sc. students. 
It's getting a bit long in the tooth now but not many undergraduates have access to a 16 core machine with 64GB of RAM. Even if they did being able to let something go flat out for 48 hours means there personal laptop is available for other things :-) I was just musing that the cards in the teaching nodes being Intel 82599ES would be a stumbling block for RDMA over Ethernet, but on checking the Intel X710 doesn't do RDMA either so it would all be a bust anyway. I was clearly on the crack pipe when I thought they did. So aside from the DSS-G and GPU nodes with Connect-X4 cards nothing does RDMA. [SNIP] > For some of my research clients this is the ability to run 20-30% more > compute jobs on the same HPC resources in the same 24H period, which > means that they can reduce the amount of time they need on the HPC > cluster to get the data results that they are looking for. Except as I said in our cluster the storage servers have never been maxed out except when running benchmarks. Individual compute nodes have been maxed out (mainly Gaussian writing 800GB temporary files) but as I explained that's a good thing from my perspective because I don't want one or two users to be able to pound the storage into oblivion and cause problems for everyone else. We have enough problems with users tanking the login nodes by running computations on them. That should go away with our upgrade to RHEL8 and the wonders of per user cgroups; me I love systemd. In the end nobody has complained that the storage speed is a problem yet, and putting the metadata on SSD would be my first port of call if they did and funds where available to make things go faster. To be honest I think the users are just happy that GPFS doesn't eat itself and be out of action for a few weeks every couple of years like Lustre did on the previous system. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From olaf.weiser at de.ibm.com Fri Dec 17 15:08:15 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 17 Dec 2021 15:08:15 +0000 Subject: [gpfsug-discuss] email format check again for IBM domain send email Message-ID: An HTML attachment was scrubbed... URL: From juergen.hannappel at desy.de Fri Dec 17 15:57:45 2021 From: juergen.hannappel at desy.de (Hannappel, Juergen) Date: Fri, 17 Dec 2021 16:57:45 +0100 (CET) Subject: [gpfsug-discuss] ESS 6.1.2.1 changes Message-ID: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Hi, I just noticed that tday a new ESS release (6.1.2.1) appeared on fix central. What I can't find is a list of changes to 6.1.2.0, and anyway finding the change list is always a PITA. Does anyone know what changed? -- Dr. J?rgen Hannappel DESY/IT Tel. : +49 40 8998-4616 From luis.bolinches at fi.ibm.com Fri Dec 17 18:50:09 2021 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Fri, 17 Dec 2021 18:50:09 +0000 Subject: [gpfsug-discuss] ESS 6.1.2.1 changes In-Reply-To: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> References: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Message-ID: An HTML attachment was scrubbed... 
URL: From janfrode at tanso.net Mon Dec 20 11:26:29 2021 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 20 Dec 2021 12:26:29 +0100 Subject: [gpfsug-discuss] ESS 6.1.2.1 changes In-Reply-To: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> References: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Message-ID: Just ran an upgrade on an EMS, and the only changes I see are these updated packages on the ems: +gpfs.docs-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:43 AM CET +gpfs.ess.firmware-6.0.0-15.ppc64le Mon 20 Dec 2021 11:56:42 AM CET +gpfs.msg.en_US-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:12 AM CET +gpfs.gss.pmsensors-5.1.2-0.el8.ppc64le Mon 20 Dec 2021 11:56:12 AM CET +gpfs.gpl-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:11 AM CET +gpfs.gnr.base-1.0.0-0.ppc64le Mon 20 Dec 2021 11:56:11 AM CET +gpfs.gnr.support-ess5000-1.0.0-3.noarch Mon 20 Dec 2021 11:56:10 AM CET +gpfs.gnr.support-ess3200-6.1.2-0.noarch Mon 20 Dec 2021 11:56:10 AM CET +gpfs.crypto-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:10 AM CET +gpfs.compression-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:10 AM CET +gpfs.license.dmd-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:09 AM CET +gpfs.gnr.support-ess3000-1.0.0-3.noarch Mon 20 Dec 2021 11:56:09 AM CET +gpfs.gui-5.1.2-0.4.noarch Mon 20 Dec 2021 11:56:05 AM CET +gpfs.gskit-8.0.55-19.ppc64le Mon 20 Dec 2021 11:56:02 AM CET +gpfs.java-5.1.2-0.4.ppc64le Mon 20 Dec 2021 11:56:01 AM CET +gpfs.gss.pmcollector-5.1.2-0.el8.ppc64le Mon 20 Dec 2021 11:55:59 AM CET +gpfs.gnr.support-essbase-6.1.2-0.noarch Mon 20 Dec 2021 11:55:59 AM CET +gpfs.adv-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:59 AM CET +gpfs.gnr-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:58 AM CET +gpfs.base-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:54 AM CET +sdparm-1.10-10.el8.ppc64le Mon 20 Dec 2021 11:55:21 AM CET +gpfs.ess.tools-6.1.2.1-release.noarch Mon 20 Dec 2021 11:50:47 AM CET I will guess it has something to do with log4j, but a changelog would be nice :-) https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=142683 On Fri, Dec 17, 2021 at 5:07 PM Hannappel, Juergen < juergen.hannappel at desy.de> wrote: > Hi, > I just noticed that tday a new ESS release (6.1.2.1) appeared on fix > central. > What I can't find is a list of changes to 6.1.2.0, and anyway finding the > change list is always a PITA. > > Does anyone know what changed? > > -- > Dr. J?rgen Hannappel DESY/IT Tel. : +49 40 8998-4616 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL:
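A quick way to reconstruct a package list like the one above on an RPM-based EMS node, for anyone wanting to compare before and after an upgrade (the scratch file names here are purely illustrative):

   rpm -qa | sort > /root/rpms.before       # snapshot the installed package set
   # ... run the ESS upgrade ...
   rpm -qa | sort > /root/rpms.after        # snapshot it again afterwards
   diff /root/rpms.before /root/rpms.after  # show what was added, removed or updated
   rpm -qa --last | head -30                # or just list the most recently installed packages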
From jonathan.buzzard at strath.ac.uk Tue Dec 7 15:42:58 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 7 Dec 2021 15:42:58 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: On 07/12/2021 14:55, Simon Thompson wrote: > > Or add: > ? UPDATECTIME??????????????
yes > ? SKIPACLUPDATECHECK??????? yes > > To you dsm.opt file to skip checking for those updates and don?t back > them up again. Yeah, but then a restore gives you potentially an unusable file system as the ownership of the files and ACL's are all wrong. Better to bite the bullet and back them up again IMHO. > > Actually I thought TSM only updated the metadata if the mode/owner > changed, not re-backed the file? That was my understanding but I have seen TSM rebacked up large amounts of data where the owner of the file changed in the past, so your mileage may vary. Also ACL's are stored in extended attributes which are stored with the files and changes will definitely cause the file to be backed up again. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Walter.Sklenka at EDV-Design.at Thu Dec 9 09:26:40 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 09:26:40 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Message-ID: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ... connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,... to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Visio-eodc-2-fabs.pdf Type: application/pdf Size: 35768 bytes Desc: Visio-eodc-2-fabs.pdf URL: From janfrode at tanso.net Thu Dec 9 10:25:17 2021 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 9 Dec 2021 11:25:17 +0100 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: I believe this should be a fully working solution. I see no problem enabling RDMA between a subset of nodes -- just disable verbsRdma on the nodes you want to use plain IP. -jf On Thu, Dec 9, 2021 at 11:04 AM Walter Sklenka wrote: > Dear spectrum scale users! 
> > May I ask you a design question? > > We have an IB environment which is very mixed at the moment ( connecX3 ? > connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also > HDR100 and HDR switches. We still have some big troubles in this fabric > when using RDMA , a case at Mellanox and IBM is open . > > The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where > we want to migrate the data to ess5000 , ( mmdelvdisk +qos) > > Due to the current problems with RDMA we though eventually we could try a > workaround : > > If you are interested there is Maybe you can find the attachment ? > > We build 2 separate fabrics , the ess-IO servers attached to both blue and > green and all other cluster members and all remote clusters only to fabric > blue > > The daemon interfaces (IPoIP) are on fabric blue > > > > It is the aim to setup rdma only on the ess-ioServers in the fabric green > , in the blue we must use IPoIB (tcp) > > Do you think datamigration would work between ess01,ess02,? to ess07,ess08 > via RDMA ? > > Or is it principally not possible to make a rdma network only for a > subset of a cluster (though this subset would be reachable via other > fabric) ? > > > > Thank you very much for any input ! > > Best regards walter > > > > > > > > Mit freundlichen Gr??en > *Walter Sklenka* > *Technical Consultant* > > > > EDV-Design Informationstechnologie GmbH > Giefinggasse 6/1/2, A-1210 Wien > Tel: +43 1 29 22 165-31 > Fax: +43 1 29 22 165-90 > E-Mail: sklenka at edv-design.at > Internet: www.edv-design.at > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Thu Dec 9 10:41:29 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 10:41:29 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: Hi Jan! That great to hear So we will try this Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at Von: Jan-Frode Myklebust Gesendet: Thursday, December 9, 2021 11:25 AM An: Walter Sklenka Cc: gpfsug-discuss at spectrumscale.org Betreff: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration I believe this should be a fully working solution. I see no problem enabling RDMA between a subset of nodes -- just disable verbsRdma on the nodes you want to use plain IP. -jf On Thu, Dec 9, 2021 at 11:04 AM Walter Sklenka > wrote: Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? 
We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Dec 9 12:04:28 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 9 Dec 2021 12:04:28 +0000 Subject: [gpfsug-discuss] =?utf-8?q?alternate_path_between_ESS_Servers_for?= =?utf-8?q?=09Datamigration?= In-Reply-To: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Dec 9 12:36:08 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 9 Dec 2021 12:36:08 +0000 Subject: [gpfsug-discuss] Adding a quorum node Message-ID: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> I am looking to replace the quorum node in our cluster. The RAID card in the server we are currently using is a casualty of the RHEL8 SAS card purge :-( I have a "new" dual core server that is fully supported by RHEL8. After some toing and throwing with IBM they agreed a Pentium G6400 is 70PVU a core and two cores :-) That said it is currently running RHEL7 because that's what the DSS-G nodes are running. The upgrade to RHEL8 is planned for next year. Anyway I have added it into the GPFS cluster all well and good and GPFS is mounted just fine. However when I ran the command to make it a quorum node I got the following error (sanitized to remove actual DNS names and IP addresses initialize (113, '', ('', 1191)) failed (err 79) server initialization failed (err 79) mmchnode: Unexpected error from chnodes -n 1=:1191,2:1191,3=:1191,113=:1191 -f 1 -P 1191 . Return code: 149 mmchnode: Unable to change the CCR quorum node configuration. mmchnode: Command failed. Examine previous error messages to determine cause. fqdn-new is the new node and fqdn1/2/3 are the existing quorum nodes. I want to remove fqdn3 in due course. Anyone any idea what is going on? I thought you could change the quorum nodes on the fly? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From douglasof at us.ibm.com Thu Dec 9 16:04:28 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Thu, 9 Dec 2021 16:04:28 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: Message-ID: Walter: Though not directly about your design, our work with NVIDIA on GPUdirect Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both MOFED and Firmware version compatibility can be. I would suggest anyone debugging RDMA issues should look at those closely. Doug by carrier pigeon On Dec 9, 2021, 5:04:36 AM, gpfsug-discuss-request at spectrumscale.org wrote: From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Date: Dec 9, 2021, 5:04:36 AM Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 119, Issue 5 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..." Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..."Today's Topics: 1. alternate path between ESS Servers for Datamigration (Walter Sklenka) Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.eberhard at de.ibm.com Thu Dec 9 16:43:26 2021 From: ralf.eberhard at de.ibm.com (Ralf Eberhard) Date: Thu, 9 Dec 2021 16:43:26 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 119, Issue 7 - Adding a quorum node In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Dec 9 19:09:44 2021 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 9 Dec 2021 19:09:44 +0000 Subject: [gpfsug-discuss] Adding a quorum node In-Reply-To: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> References: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> Message-ID: I frequently change quorum on the fly on both our 4.x and 5.0 clusters during upgrades/maintenance. You have sanity in the CCR to start with? (mmccr query, lsnodes, etc,etc) Anything useful in the logs or if you drop debug on it? ('export DEBUG=1'and then re-run command) Ed Wahl OSC -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Jonathan Buzzard Sent: Thursday, December 9, 2021 7:36 AM To: 'gpfsug-discuss at spectrumscale.org' Subject: [gpfsug-discuss] Adding a quorum node I am looking to replace the quorum node in our cluster. The RAID card in the server we are currently using is a casualty of the RHEL8 SAS card purge :-( I have a "new" dual core server that is fully supported by RHEL8. After some toing and throwing with IBM they agreed a Pentium G6400 is 70PVU a core and two cores :-) That said it is currently running RHEL7 because that's what the DSS-G nodes are running. The upgrade to RHEL8 is planned for next year. Anyway I have added it into the GPFS cluster all well and good and GPFS is mounted just fine. However when I ran the command to make it a quorum node I got the following error (sanitized to remove actual DNS names and IP addresses initialize (113, '', ('', 1191)) failed (err 79) server initialization failed (err 79) mmchnode: Unexpected error from chnodes -n 1=:1191,2:1191,3=:1191,113=:1191 -f 1 -P 1191 . Return code: 149 mmchnode: Unable to change the CCR quorum node configuration. mmchnode: Command failed. Examine previous error messages to determine cause. fqdn-new is the new node and fqdn1/2/3 are the existing quorum nodes. I want to remove fqdn3 in due course. Anyone any idea what is going on? I thought you could change the quorum nodes on the fly? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss__;!!KGKeukY!hO7wULtfr6n28eBJ0BB8sYyRMFo6Xl5_XDpsNZz3GiD_3nXlPf6nKHNR-X99$ From Walter.Sklenka at EDV-Design.at Thu Dec 9 19:38:45 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 19:38:45 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: Hi Olaf!! Many thanks OK well we will do mmvdisk vs delete So #mmvdisk vs delete ? -N ess01,ess02?.. would be correct , or? Best regards walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Olaf Weiser Sent: Donnerstag, 9. 
Dezember 2021 13:04 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Hallo Walter, ;-) yes !AND! no .. for sure , you can specifiy a subset of nodes to use RDMA and other nodes just communicating TCPIP But that's only half of the truth . The other half is.. who and how , you are going to migrate/copy the data in case you 'll use mmrestripe .... you will have to make sure , that only nodes, connected(green) and configured for RDMA doing the work otherwise.. if will also work to migrate the data, but then data is send throught the Ethernet as well , (as long all those nodes are in the same cluster) laff ----- Urspr?ngliche Nachricht ----- Von: "Walter Sklenka" > Gesendet von: gpfsug-discuss-bounces at spectrumscale.org An: "'gpfsug-discuss at spectrumscale.org'" > CC: Betreff: [EXTERNAL] [gpfsug-discuss] alternate path between ESS Servers for Datamigration Datum: Do, 9. Dez 2021 11:04 Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Thu Dec 9 19:43:31 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 19:43:31 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: Message-ID: <4f6b41f6a3b44c7a80cb588add2056dd@Mail.EDVDesign.cloudia> Hello Douglas! Many thanks for your advice ! Well we are in a horrible situation regarding firmware and MOFED of old equipment Mellanox advised us to use a special version of subnetmanager 5.0-2.1.8.0 from MOFED I hope this helps Let?s see how we can proceed Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Douglas O'flaherty Sent: Donnerstag, 9. 
Dezember 2021 17:04 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Walter: Though not directly about your design, our work with NVIDIA on GPUdirect Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both MOFED and Firmware version compatibility can be. I would suggest anyone debugging RDMA issues should look at those closely. Doug by carrier pigeon ________________________________ On Dec 9, 2021, 5:04:36 AM, gpfsug-discuss-request at spectrumscale.org wrote: From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Date: Dec 9, 2021, 5:04:36 AM Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 119, Issue 5 ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..." Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..."Today's Topics: 1. alternate path between ESS Servers for Datamigration (Walter Sklenka) Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonathan.buzzard at strath.ac.uk Thu Dec 9 20:19:41 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 9 Dec 2021 20:19:41 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 119, Issue 7 - Adding a quorum node In-Reply-To: References: Message-ID: On 09/12/2021 16:43, Ralf Eberhard wrote: > Jonathan, > > my suspicion is that?the GPFS daemon on fqdn-new is not reachable via > port 1191. > You can double check that by?sending a lightweight CCR RPC to this > daemon from another quorum node by attempting: > > mmccr echo -n fqdn-new;echo $? > > If this echo returns with a non-zero exit code the network settings must > be verified. And even?the other direction must > work: Node fqdn-new must?reach another quorum node, like (attempting on > fqdn-new): > > mmccr echo -n ;echo $? > Duh, that's my Homer Simpson moment for today. I forgotten to move the relevant network interfaces on the new server to the trusted zone in the firewall. So of course my normal testing with ping and ssh was working just fine. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From jonathan.buzzard at strath.ac.uk Fri Dec 10 00:27:23 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 10 Dec 2021 00:27:23 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: Message-ID: On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From abeattie at au1.ibm.com Fri Dec 10 01:09:57 2021 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Fri, 10 Dec 2021 01:09:57 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16390972812300.png Type: image/png Size: 98384 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16390972812301.png Type: image/png Size: 101267 bytes Desc: not available URL: From douglasof at us.ibm.com Fri Dec 10 04:24:21 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Fri, 10 Dec 2021 00:24:21 -0400 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. 
With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing we were able to get around 2x the per-system throughput by using RDMA direct to GPU (GPU Direct Storage). (Tested on 2 DGX systems with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not cached - think mini-Epics for scaling ML training - or demand the lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. If you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options.
Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Fri Dec 10 10:17:20 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Fri, 10 Dec 2021 10:17:20 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Hello Douglas! May I ask a basic question regarding GPUdirect Storage or all local attached storage like NVME disks. Do you think it outerperforms "classical" shared storagesystems which are attached via FC connected to NSD servers HDR attached? With FC you have also bounce copies and more delay , isn?t it? There are solutions around which work with local NVME disks building some protection level with Raid (or duplication) . I am curious if it would be a better approach than shared storage which has it?s limitation (cost intensive scale out, extra infrstructure, max 64Gb at this time ... ) Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Douglas O'flaherty Sent: Freitag, 10. Dezember 2021 05:24 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GUP Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. f you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. 
Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard > on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ----- Original message ----- From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Date: Fri, Dec 10, 2021 10:27 On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Fri Dec 10 10:28:38 2021 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Fri, 10 Dec 2021 10:28:38 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> References: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Message-ID: Hallo Walter, we had many experiences now to change our Storage-Systems in our Backup-Environment to RDMA-IB with HDR and EDR Connections. What we see now (came from a 16Gbit FC Infrastructure) we enhance our throuhput from 7 GB/s to 30 GB/s. The main reason are the elimination of the driver-layers in the client-systems and make a Buffer to Buffer communication because of RDMA. The latency reduction are significant. Regards Renar. We use now ESS3k and ESS5k systems with 6.1.1.2-Code level. 
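For anyone who wants to check the same thing on their own cluster, the knobs involved are the standard verbs settings. A minimal sketch follows; the node names and the mlx5_0/mlx5_1 port names are placeholders, not taken from Renar's setup:

# show the current RDMA-related settings
mmlsconfig verbsRdma
mmlsconfig verbsPorts

# enable verbs RDMA on a subset of nodes only, e.g. the ESS I/O servers
# (takes effect once mmfsd is restarted on those nodes)
mmchconfig verbsRdma=enable,verbsPorts="mlx5_0/1 mlx5_1/1" -N ess01,ess02

# after the restart, /var/adm/ras/mmfs.log.latest on those nodes should
# report "VERBS RDMA started"

As Olaf noted earlier in the thread, RDMA can be enabled for just a subset of the cluster like this, as long as all nodes can still reach each other over the daemon (TCP/IP) network.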
Renar Grunenberg Abteilung Informatik - Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder, Thomas Sehn, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org Im Auftrag von Walter Sklenka Gesendet: Freitag, 10. Dezember 2021 11:17 An: gpfsug-discuss at spectrumscale.org Betreff: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Hello Douglas! May I ask a basic question regarding GPUdirect Storage or all local attached storage like NVME disks. Do you think it outerperforms ?classical? shared storagesystems which are attached via FC connected to NSD servers HDR attached? With FC you have also bounce copies and more delay , isn?t it? There are solutions around which work with local NVME disks building some protection level with Raid (or duplication) . I am curious if it would be a better approach than shared storage which has it?s limitation (cost intensive scale out, extra infrstructure, max 64Gb at this time ? ) Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Douglas O'flaherty Sent: Freitag, 10. Dezember 2021 05:24 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GUP Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? 
These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. f you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard > on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ----- Original message ----- From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Date: Fri, Dec 10, 2021 10:27 On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Dec 10 10:37:31 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Dec 2021 10:37:31 +0000 Subject: [gpfsug-discuss] Test email format / mail format Message-ID: An HTML attachment was scrubbed... 
URL: From Ondrej.Kosik at ibm.com Fri Dec 10 10:39:56 2021 From: Ondrej.Kosik at ibm.com (Ondrej Kosik) Date: Fri, 10 Dec 2021 10:39:56 +0000 Subject: [gpfsug-discuss] Test email format / mail format In-Reply-To: References: Message-ID: Hello all, Thank you for the test email, my reply is coming from Outlook-based infrastructure. ________________________________ From: Olaf Weiser Sent: Friday, December 10, 2021 10:37 AM To: gpfsug-discuss at spectrumscale.org Cc: Ondrej Kosik Subject: Test email format / mail format This email is just a test, because we've seen mail format issues from IBM sent emails you can ignore this email , just for internal problem determination -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Dec 10 11:10:07 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Dec 2021 11:10:07 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: , <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16391192376761.png Type: image/png Size: 127072 bytes Desc: not available URL: From anacreo at gmail.com Sun Dec 12 02:19:02 2021 From: anacreo at gmail.com (Alec) Date: Sat, 11 Dec 2021 18:19:02 -0800 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: I feel the need to respond here... I see many responses on this User Group forum that are dismissive of the fringe / extreme use cases and of the "what do you need that for '' mindset. The thing is that Spectrum Scale is for the extreme, just take the word "Parallel" in the old moniker that was already an extreme use case. If you have a standard workload, then sure most of the complex features of the file system are toys, but many of us DO have extreme workloads where shaking out every ounce of performance is a worthwhile and financially sound endeavor. It is also because of the efforts of those of us living on the cusp of technology that these technologies become mainstream and no-longer extreme. I have an AIX LPAR that traverses more than 300TB+ of data a day on a Spectrum Scale file system, it is fully virtualized, and handles a million files. If that performance level drops, regulatory reports will be late, business decisions won't be current. However, the systems of today and the future have to traverse this much data and if they are slow then they can't keep up with real-time data feeds. So the difference between an RDMA disk IO vs a non RDMA disk IO could possibly mean what level of analytics are done to perform real time fraud prevention. Or at what cost, today many systems achieve this by keeping everything in memory in HUGE farms.. Being able to perform data operations at 30GB/s means you can traverse ALL of the census bureau data for all time from the US Govt in about 2 seconds... that's a pretty substantial capability that moves the bar forward in what we can do from a technology perspective. I just did a technology garage with IBM where we were able to achieve 1.5TB/writes on an encrypted ESS off of a single VMWare Host and 4 VM's over IP... That's over 2PB of data writes a day on a single VM server. 
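(For scale, a back-of-the-envelope conversion of that figure: 2 PB per day divided by 86,400 seconds is roughly 23 GB/s of sustained writes.)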
Being able to demonstrate that there are production virtualized environments capable of this type of capacity, helps to show where the point of engineering a proper storage architecture outweighs the benefits of just throwing more GPU compute farms at the problem with ever dithering disk I/O. It also helps to demonstrate how a virtual storage optimized farm could be leveraged to host many in-memory or data analytic heavy workloads in a shared configuration. Douglas's response is the right one, how much IO does the application / environment need, it's nice to see Spectrum Scale have the flexibility to deliver. I'm pretty confident that if I can't deliver the required I/O performance on Spectrum Scale, nobody else can on any other storage platform within reasonable limits. Alec Effrat On Thu, Dec 9, 2021 at 8:24 PM Douglas O'flaherty wrote: > Jonathan: > > You posed a reasonable question, which was "when is RDMA worth the > hassle?" I agree with part of your premises, which is that it only matters > when the bottleneck isn't somewhere else. With a parallel file system, like > Scale/GPFS, the absolute performance bottleneck is not the throughput of a > single drive. In a majority of Scale/GPFS clusters the network data path is > the performance limitation. If they deploy HDR or 100/200/400Gbps > Ethernet... At that point, the buffer copy time inside the server matters. > > When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) > is easily demonstrated because it eliminates the bounce copy through the > system memory. In our NVIDIA DGX A100 server testing testing we were able > to get around 2x the per system throughput by using RDMA direct to GPU (GUP > Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage > node.) > > However, your question remains. Synthetic benchmarks are good indicators > of technical benefit, but do your users and applications need that extra > performance? > > These are probably only a handful of codes in organizations that need > this. However, they are high-value use cases. We have client applications > that either read a lot of data semi-randomly and not-cached - think > mini-Epics for scaling ML training. Or, demand lowest response time, like > production inference on voice recognition and NLP. > > If anyone has use cases for GPU accelerated codes with truly demanding > data needs, please reach out directly. We are looking for more use cases to > characterize the benefit for a new paper. f you can provide some code > examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a > benefit. > > Thanks, > > doug > > Douglas O'Flaherty > douglasof at us.ibm.com > > > > > > > ----- Message from Jonathan Buzzard on > Fri, 10 Dec 2021 00:27:23 +0000 ----- > > *To:* > gpfsug-discuss at spectrumscale.org > > *Subject:* > Re: [gpfsug-discuss] > On 09/12/2021 16:04, Douglas O'flaherty wrote: > > > > Though not directly about your design, our work with NVIDIA on GPUdirect > > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > > MOFED and Firmware version compatibility can be. > > > > I would suggest anyone debugging RDMA issues should look at those > closely. > > > May I ask what are the alleged benefits of using RDMA in GPFS? > > I can see there would be lower latency over a plain IP Ethernet or IPoIB > solution but surely disk latency is going to swamp that? 
> > I guess SSD drives might change that calculation but I have never seen > proper benchmarks comparing the two, or even better yet all four > connection options. > > Just seems a lot of complexity and fragility for very little gain to me. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > > > ----- Original message ----- > From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS > Servers for Datamigration > Date: Fri, Dec 10, 2021 10:27 > > On 09/12/2021 16:04, Douglas O'flaherty wrote: > > > > Though not directly about your design, our work with NVIDIA on GPUdirect > > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > > MOFED and Firmware version compatibility can be. > > > > I would suggest anyone debugging RDMA issues should look at those > closely. > > > May I ask what are the alleged benefits of using RDMA in GPFS? > > I can see there would be lower latency over a plain IP Ethernet or IPoIB > solution but surely disk latency is going to swamp that? > > I guess SSD drives might change that calculation but I have never seen > proper benchmarks comparing the two, or even better yet all four > connection options. > > Just seems a lot of complexity and fragility for very little gain to me. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Sun Dec 12 02:38:26 2021 From: anacreo at gmail.com (Alec) Date: Sat, 11 Dec 2021 18:38:26 -0800 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: You can manipulate the permissions via GPFS policy engine, essentially you'd write a script that the policy engine calls and tell GPFS to farm out the change in at whatever scale you need... run in a single node, how many files per thread, how many threads per node, etc... This can GREATLY accelerate file change permissions over a large quantity of files. However, as stated earlier the mmfind command will do all of this for you and it's worth the effort to get it compiled for your system. I don't have Spectrum Scale in front of me but for the best performance you'll want to setup the mmfind policy engine parameters to parallelize your workload... If mmfind has no action it will silently use GPFS policy engine to produce the requested output, however if mmfind has an action it will expose the policy engine calls. it goes something like this: mmfind -B 1 -N directattachnode1,directattachnode2 -m 24 /path/to/find -perm +o=w ! \( -type d -perm +o=t \) -xargs chmod o-w This will run 48 threads on 2 nodes and bump other write permissions off of any file it finds (excluding temp dirs) until it completes, it should go blistering fast... 
as this is only a meta operation the -B 1 might not be necessary, you'd probably be better off with a -B 100, but as I deal with a lot of 100GB+ files I don't want a single thread to be stuck with 3 100GB+ files and another thread to have none, so I usually set the max depth to be 1 and take the higher execution count. This has an advantage in that GPFS will break up the inodes in the most efficient way for the chmod to happen in parallel. I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. If you think of the possibilities here you could easily change that chmod to a gzip and add a -mtime +1200 and you have a find command that will gzip compress files over 4 years old in parallel across multiple nodes... mmfind is VERY powerful and flexible, highly worth getting into usage. Alec On Tue, Dec 7, 2021 at 7:43 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 07/12/2021 14:55, Simon Thompson wrote: > > > > Or add: > > UPDATECTIME yes > > SKIPACLUPDATECHECK yes > > > > To you dsm.opt file to skip checking for those updates and don?t back > > them up again. > > Yeah, but then a restore gives you potentially an unusable file system > as the ownership of the files and ACL's are all wrong. Better to bite > the bullet and back them up again IMHO. > > > > > Actually I thought TSM only updated the metadata if the mode/owner > > changed, not re-backed the file? > > That was my understanding but I have seen TSM rebacked up large amounts > of data where the owner of the file changed in the past, so your mileage > may vary. > > Also ACL's are stored in extended attributes which are stored with the > files and changes will definitely cause the file to be backed up again. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Sun Dec 12 11:19:07 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Sun, 12 Dec 2021 11:19:07 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: On 12/12/2021 02:19, Alec wrote: > I feel the need to respond here... I see many responses on this > User Group forum that are dismissive of the fringe / extreme use > cases and of the "what do you need that for '' mindset. The thing is > that Spectrum Scale is for the extreme, just take the word "Parallel" > in the old moniker that was already an extreme use case. I wasn't been dismissive, I was asking what the benefits of using RDMA where. There is very little information about it out there and not a lot of comparative benchmarking on it either. Without the benefits being clearly laid out I am unlikely to consider it and might be missing a trick. IBM's literature on the topic is underwhelming to say the least. [SNIP] > I have an AIX LPAR that traverses more than 300TB+ of data a day on a > Spectrum Scale file system, it is fully virtualized, and handles a > million files. If that performance level drops, regulatory reports > will be late, business decisions won't be current. 
However, the > systems of today and the future have to traverse this much data and > if they are slow then they can't keep up with real-time data feeds. I have this nagging suspicion that modern all-flash storage systems could deliver that sort of performance without the overhead of a parallel file system. [SNIP] > > Douglas's response is the right one, how much IO does the > application / environment need, it's nice to see Spectrum Scale have > the flexibility to deliver. I'm pretty confident that if I can't > deliver the required I/O performance on Spectrum Scale, nobody else > can on any other storage platform within reasonable limits. > I would note here that in our *shared HPC* environment I made a very deliberate design decision to attach the compute nodes with 10Gbps Ethernet for storage. Though I would probably pick 25Gbps if we were procuring the system today. There were many reasons behind that, but the main ones were that historical file system performance showed that greater than 99% of the time the file system never got above 20% of its benchmarked speed, so using 10Gbps Ethernet was not going to be a problem. Secondly, by limiting the connection to 10Gbps it stops one person hogging the file system to the detriment of other users. We have seen individual nodes peg their 10Gbps link from time to time, even several nodes at once (jobs from the same user), and had they had access to a 100Gbps storage link that would have been curtains for everyone else's file system usage. At this juncture I would note that the GPFS admin traffic is handled by a separate IP address space on a separate VLAN which we prioritize with QOS on the switches. So even when a node floods its 10Gbps link for extended periods of time it doesn't get ejected from the cluster. A separate physical network for admin traffic is not necessary in my experience. That said you can do RDMA with Ethernet... Unfortunately the teaching cluster and protocol nodes are on Intel X520's which I don't think do RDMA. Everything else is X710's or Mellanox Connect-X4 which definitely do do RDMA. I could upgrade the protocol nodes but the teaching cluster would be a problem. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From s.j.thompson at bham.ac.uk Sun Dec 12 17:01:21 2021 From: s.j.thompson at bham.ac.uk (Simon Thompson) Date: Sun, 12 Dec 2021 17:01:21 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command 'allow-permission-change' We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn't like the updatewithAcl thing... Simon -------------- next part -------------- An HTML attachment was scrubbed...
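For reference, checking or changing the behaviour Simon describes is a one-liner per fileset. A hedged sketch, with fs0 and myfileset as placeholders:

# make chmod update both the POSIX mode bits and the ACL on this fileset
mmchfileset fs0 myfileset --allow-permission-change chmodAndUpdateAcl

The other documented values for --allow-permission-change are chmodOnly, setAclOnly and chmodAndSetAcl (chmodAndUpdateAcl is, I believe, the default).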
URL: From anacreo at gmail.com Sun Dec 12 22:03:39 2021 From: anacreo at gmail.com (Alec) Date: Sun, 12 Dec 2021 14:03:39 -0800 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson wrote: > > I'm not sure if this happens on Spectrum Scale but on most FS's if you > do a chmod 770 file you'll lose any ACLs assigned to the > > file, so safest to bump the permissions with a subtractive or additive > o-w or g+w type operation. > > > > This depends entirely on the fileset setting, see: > > > https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command > > > > ?*allow-permission-change*? > > > > We typically have file-sets set to chmodAndUpdateAcl, though not > exclusively, I think it was some quirky software that tested the > permissions after doing something and didn?t like the updatewithAcl thing ? > > > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Sun Dec 12 23:00:21 2021 From: anacreo at gmail.com (Alec) Date: Sun, 12 Dec 2021 15:00:21 -0800 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: So I never said this node wasn't in a HPC Cluster, it has partners... For our use case however some nodes have very expensive per core software licensing, and we have to weigh the human costs of empowering traditional monolithic code to do the job, or bringing in more users to re-write and maintain distributed code (someone is going to spend the money to get this work done!). So to get the most out of those licensed cores we have designed our virtual compute machine(s) with 128Gbps+ of SAN fabric. Just to achieve our average business day reads it would take 3 of your cluster nodes maxed out 24 hours, or 9 of them in a business day to achieve the same read speeds... and another 4 nodes to handle the writes. I guess HPC is in the eye of the business... In my experience cables and ports are cheaper than servers. The classic shared HPC design you have is being up-ended by the fact that there is so much compute power (cpu and memory) now in the nodes, you can't simply build a system with two storage connections (Noah's ark) and call it a day. If you look at the spec 25Gbps Ethernet is only delivering ~3GB/s (which is just above USB 3.2, and below USB 4). Spectrum Scale does very well for us when met with a fully saturated workload, we maintain one node for SLA and one node for AdHoc workload, and like clockwork the SLA box always steals exactly half the bandwidth when a job fires, so that 1 SLA job can take half the bandwidth and complete compared to the 40 AdHoc jobs on the other node. 
In newer releases IBM has introduced fileset throttling.... this is very exciting as we can really just design the biggest fattest pipes from VM to Storage and then software define the storage AND the bandwidth from the standard nobody cares about workloads all the way up to the most critical workloads... I don't buy the smaller bandwidth is better, as I see that as just one band-aid that has more elegant solutions, such as simply doing more resource constraints (you can't push the bandwidth if you can't get the CPU...), or using a workload orchestrator such as LSF with limits set, but I also won't say it never makes sense, as well I only know my problems and my solutions. For years the network team wouldn't let users have more than 10mb then 100mb networking as they were always worried about their backend being overwhelmed... I literally had faster home internet service than my work desktop connection at one point in my life.. it was all a falesy, the workload should drive the technology, the technology shouldn't hinder the workload. You can do a simple exercise, try scaling up... imagine your cluster is asked to start computing 100x more work... and that work must be completed on time. Do you simply say let me buy 100x more of everything? Or do you start to look at where can I gain efficiency and what actual bottlenecks do I need to lift... for some of us it's CPU, for some it's Memory, for some it's disk, depending on the work... I'd say the extremely rare case is where you need 100x more of EVERYTHING, but you have to get past the performance of the basic building blocks baked into the cake before you do need to dig deeper into the bottlenecks and it makes practical and financial sense. If your main bottleneck was storage, you'd be asking far different questions about RDMA. Alec On Sun, Dec 12, 2021 at 3:19 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 12/12/2021 02:19, Alec wrote: > > > I feel the need to respond here... I see many responses on this > > User Group forum that are dismissive of the fringe / extreme use > > cases and of the "what do you need that for '' mindset. The thing is > > that Spectrum Scale is for the extreme, just take the word "Parallel" > > in the old moniker that was already an extreme use case. > > I wasn't been dismissive, I was asking what the benefits of using RDMA > where. There is very little information about it out there and not a lot > of comparative benchmarking on it either. Without the benefits being > clearly laid out I am unlikely to consider it and might be missing a trick. > > IBM's literature on the topic is underwhelming to say the least. > > [SNIP] > > > > I have an AIX LPAR that traverses more than 300TB+ of data a day on a > > Spectrum Scale file system, it is fully virtualized, and handles a > > million files. If that performance level drops, regulatory reports > > will be late, business decisions won't be current. However, the > > systems of today and the future have to traverse this much data and > > if they are slow then they can't keep up with real-time data feeds. > > I have this nagging suspicion that modern all flash storage systems > could deliver that sort of performance without the overhead of a > parallel file system. > > [SNIP] > > > > > Douglas's response is the right one, how much IO does the > > application / environment need, it's nice to see Spectrum Scale have > > the flexibility to deliver. 
I'm pretty confident that if I can't > > deliver the required I/O performance on Spectrum Scale, nobody else > > can on any other storage platform within reasonable limits. > > > > I would note here that in our *shared HPC* environment I made a very > deliberate design decision to attach the compute nodes with 10Gbps > Ethernet for storage. Though I would probably pick 25Gbps if we where > procuring the system today. > > There where many reasons behind that, but the main ones being that > historical file system performance showed that greater than 99% of the > time the file system never got above 20% of it's benchmarked speed. > Using 10Gbps Ethernet was not going to be a problem. > > Secondly by limiting the connection to 10Gbps it stops one person > hogging the file system to the detriment of other users. We have seen > individual nodes peg their 10Gbps link from time to time, even several > nodes at once (jobs from the same user) and had they had access to a > 100Gbps storage link that would have been curtains for everyone else's > file system usage. > > At this juncture I would note that the GPFS admin traffic is handled by > on separate IP address space on a separate VLAN which we prioritize with > QOS on the switches. So even when a node floods it's 10Gbps link for > extended periods of time it doesn't get ejected from the cluster. The > need for a separate physical network for admin traffic is not necessary > in my experience. > > That said you can do RDMA with Ethernet... Unfortunately the teaching > cluster and protocol nodes are on Intel X520's which I don't think do > RDMA. Everything is X710's or Mellanox Connect-X4 which definitely do do > RDMA. I could upgrade the protocol nodes but the teaching cluster would > be a problem. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Dec 13 00:03:42 2021 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 13 Dec 2021 00:03:42 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Mon Dec 13 10:49:37 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 13 Dec 2021 10:49:37 +0000 Subject: [gpfsug-discuss] R: Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: <96a77c75de9b41f089e853120eef870d@psi.ch> I am definitely going to try this solution with mmfind. Thank you also for the command line and several hints? I?ll be back with the outcome soon. Alvise Da: gpfsug-discuss-bounces at spectrumscale.org Per conto di Alec Inviato: domenica 12 dicembre 2021 23:04 A: gpfsug main discussion list Oggetto: Re: [gpfsug-discuss] Question on changing mode on many files How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. 
IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command ?allow-permission-change? We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn?t like the updatewithAcl thing ? Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Mon Dec 13 11:30:17 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 13 Dec 2021 11:30:17 +0000 Subject: [gpfsug-discuss] R: R: Question on changing mode on many files In-Reply-To: <96a77c75de9b41f089e853120eef870d@psi.ch> References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> <96a77c75de9b41f089e853120eef870d@psi.ch> Message-ID: Hi Alec , mmfind doesn?t have a man page (does it have an online one ? I cannot find it). And according to mmfind -h it doesn?t exposes the ?-N? neither the ?-B? flags. RPM is gpfs.base-5.1.1-2.x86_64. Do I have chance to download a newest version of that script from somewhere ? Thanks, Alvise Da: gpfsug-discuss-bounces at spectrumscale.org Per conto di Dorigo Alvise (PSI) Inviato: luned? 13 dicembre 2021 11:50 A: gpfsug main discussion list Oggetto: [gpfsug-discuss] R: Question on changing mode on many files I am definitely going to try this solution with mmfind. Thank you also for the command line and several hints? I?ll be back with the outcome soon. Alvise Da: gpfsug-discuss-bounces at spectrumscale.org > Per conto di Alec Inviato: domenica 12 dicembre 2021 23:04 A: gpfsug main discussion list > Oggetto: Re: [gpfsug-discuss] Question on changing mode on many files How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command ?allow-permission-change? 
We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn?t like the updatewithAcl thing ? Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Mon Dec 13 18:33:23 2021 From: anacreo at gmail.com (Alec) Date: Mon, 13 Dec 2021 10:33:23 -0800 Subject: [gpfsug-discuss] R: R: Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> <96a77c75de9b41f089e853120eef870d@psi.ch> Message-ID: I checked on my office network.... mmfind --help mmfind -polFlags '-N node1,node2 -B 100 -m 24' /path/to/find -perm +o=w ! \( -type d -perm +o=t \) -xargs chmod o-w I think that the -m 24 is the default (24 threads per node), but it's nice to include on the command line so you remember you can increment/decrement it as your needs require or your nodes can handle. It's IMPORTANT to review in the mmfind --help output that some things are 'mmfind' args and go BEFORE the path... some are CRITERIA args and have no impact on the files... BUT SOME ARE ACTION args, and they will affect files. So -exec -xargs are obvious actions, however, -gpfsCompress doesn't find compressed files, it will actually compress the objects... in our AIX environment our compressed reads feel like they're essentially broken, we only get about 5MB/s, however on Linux compress reads seem to work fairly well. So make sure to read the man page carefully before using some non-obvious GPFS enhancements. Also the nice thing is mmfind -xargs takes care of all the strange file names, so you don't have to do anything complicated, but you also can't pipe the output as it will run the xarg in the policy engine. As a footnote this is my all time favorite find for troubleshooting... find $(pwd) -mtime -1 | sed -e 's/.*/"&"/g' | xargs ls -latr List all the files modified in the last day in reverse chronology... Doesn't work :-( Alec On Mon, Dec 13, 2021 at 3:30 AM Dorigo Alvise (PSI) wrote: > Hi Alec , > > mmfind doesn?t have a man page (does it have an online one ? I cannot find > it). And according to mmfind -h it doesn?t exposes the ?-N? neither the > ?-B? flags. RPM is gpfs.base-5.1.1-2.x86_64. > > > > Do I have chance to download a newest version of that script from > somewhere ? > > > > Thanks, > > > > Alvise > > > > *Da:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *Per conto di *Dorigo Alvise > (PSI) > *Inviato:* luned? 13 dicembre 2021 11:50 > *A:* gpfsug main discussion list > *Oggetto:* [gpfsug-discuss] R: Question on changing mode on many files > > > > I am definitely going to try this solution with mmfind. > > Thank you also for the command line and several hints? I?ll be back with > the outcome soon. > > > > Alvise > > > > *Da:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *Per conto di *Alec > *Inviato:* domenica 12 dicembre 2021 23:04 > *A:* gpfsug main discussion list > *Oggetto:* Re: [gpfsug-discuss] Question on changing mode on many files > > > > How am I just learning about this right now, thank you! Makes so much > more sense now the odd behaviors I've seen over the years on GPFS vs POSIX > chmod/ACL. 
Will definitely go review those settings on my filesets now, > wonder if the default has evolved from 3.x -> 4.x -> 5.x. > > > > IBM needs to find a way to pre-compile mmfind and make it supported, it > really is essential and so beneficial, and so hard to get done in a > production regulated environment. Though a bigger warning that the > compress option is an action not a criteria! > > > > Alec > > > > On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > > > I'm not sure if this happens on Spectrum Scale but on most FS's if you > do a chmod 770 file you'll lose any ACLs assigned to the > > file, so safest to bump the permissions with a subtractive or additive > o-w or g+w type operation. > > > > This depends entirely on the fileset setting, see: > > > https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command > > > > ?*allow-permission-change*? > > > > We typically have file-sets set to chmodAndUpdateAcl, though not > exclusively, I think it was some quirky software that tested the > permissions after doing something and didn?t like the updatewithAcl thing ? > > > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Mon Dec 13 23:55:23 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 13 Dec 2021 23:55:23 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: <19884986-aff8-20aa-f1d1-590f6b81ddd2@strath.ac.uk> On 13/12/2021 00:03, Andrew Beattie wrote: > What is the main outcome or business requirement of the teaching cluster > ( i notice your specific in the use of defining it as a teaching cluster) > It is entirely possible that the use case for this cluster does not > warrant the use of high speed low latency networking, and it simply > needs the benefits of a parallel filesystem. While we call it the "teaching cluster" it would be more appropriate to call them "teaching nodes" that shares resources (storage and login nodes) with the main research cluster. It's mainly used by undergraduates doing final year projects and M.Sc. students. It's getting a bit long in the tooth now but not many undergraduates have access to a 16 core machine with 64GB of RAM. Even if they did being able to let something go flat out for 48 hours means there personal laptop is available for other things :-) I was just musing that the cards in the teaching nodes being Intel 82599ES would be a stumbling block for RDMA over Ethernet, but on checking the Intel X710 doesn't do RDMA either so it would all be a bust anyway. I was clearly on the crack pipe when I thought they did. So aside from the DSS-G and GPU nodes with Connect-X4 cards nothing does RDMA. [SNIP] > For some of my research clients this is the ability to run 20-30% more > compute jobs on the same HPC resources in the same 24H period, which > means that they can reduce the amount of time they need on the HPC > cluster to get the data results that they are looking for. Except as I said in our cluster the storage servers have never been maxed out except when running benchmarks. 
Individual compute nodes have been maxed out (mainly Gaussian writing 800GB temporary files) but as I explained that's a good thing from my perspective because I don't want one or two users to be able to pound the storage into oblivion and cause problems for everyone else. We have enough problems with users tanking the login nodes by running computations on them. That should go away with our upgrade to RHEL8 and the wonders of per user cgroups; me I love systemd. In the end nobody has complained that the storage speed is a problem yet, and putting the metadata on SSD would be my first port of call if they did and funds where available to make things go faster. To be honest I think the users are just happy that GPFS doesn't eat itself and be out of action for a few weeks every couple of years like Lustre did on the previous system. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From olaf.weiser at de.ibm.com Fri Dec 17 15:08:15 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 17 Dec 2021 15:08:15 +0000 Subject: [gpfsug-discuss] email format check again for IBM domain send email Message-ID: An HTML attachment was scrubbed... URL: From juergen.hannappel at desy.de Fri Dec 17 15:57:45 2021 From: juergen.hannappel at desy.de (Hannappel, Juergen) Date: Fri, 17 Dec 2021 16:57:45 +0100 (CET) Subject: [gpfsug-discuss] ESS 6.1.2.1 changes Message-ID: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Hi, I just noticed that tday a new ESS release (6.1.2.1) appeared on fix central. What I can't find is a list of changes to 6.1.2.0, and anyway finding the change list is always a PITA. Does anyone know what changed? -- Dr. J?rgen Hannappel DESY/IT Tel. : +49 40 8998-4616 From luis.bolinches at fi.ibm.com Fri Dec 17 18:50:09 2021 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Fri, 17 Dec 2021 18:50:09 +0000 Subject: [gpfsug-discuss] ESS 6.1.2.1 changes In-Reply-To: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> References: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Message-ID: An HTML attachment was scrubbed... 
URL: From janfrode at tanso.net Mon Dec 20 11:26:29 2021 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 20 Dec 2021 12:26:29 +0100 Subject: [gpfsug-discuss] ESS 6.1.2.1 changes In-Reply-To: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> References: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Message-ID: Just ran an upgrade on an EMS, and the only changes I see are these updated packages on the ems: +gpfs.docs-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:43 AM CET +gpfs.ess.firmware-6.0.0-15.ppc64le Mon 20 Dec 2021 11:56:42 AM CET +gpfs.msg.en_US-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:12 AM CET +gpfs.gss.pmsensors-5.1.2-0.el8.ppc64le Mon 20 Dec 2021 11:56:12 AM CET +gpfs.gpl-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:11 AM CET +gpfs.gnr.base-1.0.0-0.ppc64le Mon 20 Dec 2021 11:56:11 AM CET +gpfs.gnr.support-ess5000-1.0.0-3.noarch Mon 20 Dec 2021 11:56:10 AM CET +gpfs.gnr.support-ess3200-6.1.2-0.noarch Mon 20 Dec 2021 11:56:10 AM CET +gpfs.crypto-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:10 AM CET +gpfs.compression-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:10 AM CET +gpfs.license.dmd-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:09 AM CET +gpfs.gnr.support-ess3000-1.0.0-3.noarch Mon 20 Dec 2021 11:56:09 AM CET +gpfs.gui-5.1.2-0.4.noarch Mon 20 Dec 2021 11:56:05 AM CET +gpfs.gskit-8.0.55-19.ppc64le Mon 20 Dec 2021 11:56:02 AM CET +gpfs.java-5.1.2-0.4.ppc64le Mon 20 Dec 2021 11:56:01 AM CET +gpfs.gss.pmcollector-5.1.2-0.el8.ppc64le Mon 20 Dec 2021 11:55:59 AM CET +gpfs.gnr.support-essbase-6.1.2-0.noarch Mon 20 Dec 2021 11:55:59 AM CET +gpfs.adv-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:59 AM CET +gpfs.gnr-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:58 AM CET +gpfs.base-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:54 AM CET +sdparm-1.10-10.el8.ppc64le Mon 20 Dec 2021 11:55:21 AM CET +gpfs.ess.tools-6.1.2.1-release.noarch Mon 20 Dec 2021 11:50:47 AM CET I will guess it has something to do with log4j, but a changelog would be nice :-) https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=142683 On Fri, Dec 17, 2021 at 5:07 PM Hannappel, Juergen < juergen.hannappel at desy.de> wrote: > Hi, > I just noticed that tday a new ESS release (6.1.2.1) appeared on fix > central. > What I can't find is a list of changes to 6.1.2.0, and anyway finding the > change list is always a PITA. > > Does anyone know what changed? > > -- > Dr. J?rgen Hannappel DESY/IT Tel. : +49 40 8998-4616 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Tue Dec 7 13:44:24 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Tue, 7 Dec 2021 13:44:24 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files Message-ID: Dear users/developers/support, I'd like to ask if there is a fast way to manipulate the permission mask of many files (millions). I tried on 900k files and a recursive chmod (chmod 0### -R path) takes about 1000s, with about 50% usage of mmfsd daemon. I tried with the perl's internal function chmod that can operate on an array of files, and it takes about 1/3 of the previous method. Which is already a good result. I've seen the possibility to run a policy to execute commands, but I would avoid to execute external commands through mmxargs, 1M of times; would you ? 
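(One note on the policy route: mmapplypolicy does not have to exec a command once per file. An EXTERNAL LIST rule hands its EXEC script a file list covering a whole batch of matches, so a single chmod invocation can cover thousands of files. A rough, untested sketch -- the paths, mode, script name and node list are made up, DIRECTORIES_PLUS is there so directories get picked up as well, and the exact list-file format and escaping should be checked against the ILM documentation before trusting it with odd file names:

  /* fixperms.pol */
  RULE EXTERNAL LIST 'fixperms' EXEC '/root/fixperms.sh'
  RULE 'pick' LIST 'fixperms' DIRECTORIES_PLUS
       WHERE PATH_NAME LIKE '/gpfs/fs1/data/%'

  /root/fixperms.sh, which mmapplypolicy calls with an operation and a batch file list:

  #!/bin/bash
  # $1 = TEST or LIST, $2 = file of "inode gen snapid -- pathname" records
  case "$1" in
    TEST) exit 0 ;;
    LIST) sed 's/^.* -- //' "$2" | tr '\n' '\0' | xargs -0 chmod 0750 ;;
  esac

  # run it across several nodes, a few threads each
  mmapplypolicy /gpfs/fs1/data -P fixperms.pol -N node1,node2 -m 8

mmfind is essentially a friendlier wrapper around this same machinery.)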
Does anybody have any suggestion to do this operation with minimum disruption on the system ? Thank you, Alvise -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Tue Dec 7 15:42:58 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Tue, 7 Dec 2021 15:42:58 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: On 07/12/2021 14:55, Simon Thompson wrote: > > Or add: > ? UPDATECTIME??????????????
yes > ? SKIPACLUPDATECHECK??????? yes > > To you dsm.opt file to skip checking for those updates and don?t back > them up again. Yeah, but then a restore gives you potentially an unusable file system as the ownership of the files and ACL's are all wrong. Better to bite the bullet and back them up again IMHO. > > Actually I thought TSM only updated the metadata if the mode/owner > changed, not re-backed the file? That was my understanding but I have seen TSM rebacked up large amounts of data where the owner of the file changed in the past, so your mileage may vary. Also ACL's are stored in extended attributes which are stored with the files and changes will definitely cause the file to be backed up again. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From Walter.Sklenka at EDV-Design.at Thu Dec 9 09:26:40 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 09:26:40 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Message-ID: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ... connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,... to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Visio-eodc-2-fabs.pdf Type: application/pdf Size: 35768 bytes Desc: Visio-eodc-2-fabs.pdf URL: From janfrode at tanso.net Thu Dec 9 10:25:17 2021 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Thu, 9 Dec 2021 11:25:17 +0100 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: I believe this should be a fully working solution. I see no problem enabling RDMA between a subset of nodes -- just disable verbsRdma on the nodes you want to use plain IP. -jf On Thu, Dec 9, 2021 at 11:04 AM Walter Sklenka wrote: > Dear spectrum scale users! 
> > May I ask you a design question? > > We have an IB environment which is very mixed at the moment ( connecX3 ? > connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also > HDR100 and HDR switches. We still have some big troubles in this fabric > when using RDMA , a case at Mellanox and IBM is open . > > The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where > we want to migrate the data to ess5000 , ( mmdelvdisk +qos) > > Due to the current problems with RDMA we though eventually we could try a > workaround : > > If you are interested there is Maybe you can find the attachment ? > > We build 2 separate fabrics , the ess-IO servers attached to both blue and > green and all other cluster members and all remote clusters only to fabric > blue > > The daemon interfaces (IPoIP) are on fabric blue > > > > It is the aim to setup rdma only on the ess-ioServers in the fabric green > , in the blue we must use IPoIB (tcp) > > Do you think datamigration would work between ess01,ess02,? to ess07,ess08 > via RDMA ? > > Or is it principally not possible to make a rdma network only for a > subset of a cluster (though this subset would be reachable via other > fabric) ? > > > > Thank you very much for any input ! > > Best regards walter > > > > > > > > Mit freundlichen Gr??en > *Walter Sklenka* > *Technical Consultant* > > > > EDV-Design Informationstechnologie GmbH > Giefinggasse 6/1/2, A-1210 Wien > Tel: +43 1 29 22 165-31 > Fax: +43 1 29 22 165-90 > E-Mail: sklenka at edv-design.at > Internet: www.edv-design.at > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Thu Dec 9 10:41:29 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 10:41:29 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: Hi Jan! That great to hear So we will try this Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at Von: Jan-Frode Myklebust Gesendet: Thursday, December 9, 2021 11:25 AM An: Walter Sklenka Cc: gpfsug-discuss at spectrumscale.org Betreff: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration I believe this should be a fully working solution. I see no problem enabling RDMA between a subset of nodes -- just disable verbsRdma on the nodes you want to use plain IP. -jf On Thu, Dec 9, 2021 at 11:04 AM Walter Sklenka > wrote: Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? 
We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Dec 9 12:04:28 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 9 Dec 2021 12:04:28 +0000 Subject: [gpfsug-discuss] =?utf-8?q?alternate_path_between_ESS_Servers_for?= =?utf-8?q?=09Datamigration?= In-Reply-To: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Dec 9 12:36:08 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 9 Dec 2021 12:36:08 +0000 Subject: [gpfsug-discuss] Adding a quorum node Message-ID: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> I am looking to replace the quorum node in our cluster. The RAID card in the server we are currently using is a casualty of the RHEL8 SAS card purge :-( I have a "new" dual core server that is fully supported by RHEL8. After some toing and throwing with IBM they agreed a Pentium G6400 is 70PVU a core and two cores :-) That said it is currently running RHEL7 because that's what the DSS-G nodes are running. The upgrade to RHEL8 is planned for next year. Anyway I have added it into the GPFS cluster all well and good and GPFS is mounted just fine. However when I ran the command to make it a quorum node I got the following error (sanitized to remove actual DNS names and IP addresses initialize (113, '', ('', 1191)) failed (err 79) server initialization failed (err 79) mmchnode: Unexpected error from chnodes -n 1=:1191,2:1191,3=:1191,113=:1191 -f 1 -P 1191 . Return code: 149 mmchnode: Unable to change the CCR quorum node configuration. mmchnode: Command failed. Examine previous error messages to determine cause. fqdn-new is the new node and fqdn1/2/3 are the existing quorum nodes. I want to remove fqdn3 in due course. Anyone any idea what is going on? I thought you could change the quorum nodes on the fly? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. 
G4 0NG From douglasof at us.ibm.com Thu Dec 9 16:04:28 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Thu, 9 Dec 2021 16:04:28 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: Message-ID: Walter: Though not directly about your design, our work with NVIDIA on GPUdirect Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both MOFED and Firmware version compatibility can be. I would suggest anyone debugging RDMA issues should look at those closely. Doug by carrier pigeon On Dec 9, 2021, 5:04:36 AM, gpfsug-discuss-request at spectrumscale.org wrote: From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Date: Dec 9, 2021, 5:04:36 AM Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 119, Issue 5 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..." Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..."Today's Topics: 1. alternate path between ESS Servers for Datamigration (Walter Sklenka) Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From ralf.eberhard at de.ibm.com Thu Dec 9 16:43:26 2021 From: ralf.eberhard at de.ibm.com (Ralf Eberhard) Date: Thu, 9 Dec 2021 16:43:26 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 119, Issue 7 - Adding a quorum node In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Dec 9 19:09:44 2021 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 9 Dec 2021 19:09:44 +0000 Subject: [gpfsug-discuss] Adding a quorum node In-Reply-To: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> References: <73c81130-c120-d5f3-395f-4695e56905e1@strath.ac.uk> Message-ID: I frequently change quorum on the fly on both our 4.x and 5.0 clusters during upgrades/maintenance. You have sanity in the CCR to start with? (mmccr query, lsnodes, etc,etc) Anything useful in the logs or if you drop debug on it? ('export DEBUG=1'and then re-run command) Ed Wahl OSC -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Jonathan Buzzard Sent: Thursday, December 9, 2021 7:36 AM To: 'gpfsug-discuss at spectrumscale.org' Subject: [gpfsug-discuss] Adding a quorum node I am looking to replace the quorum node in our cluster. The RAID card in the server we are currently using is a casualty of the RHEL8 SAS card purge :-( I have a "new" dual core server that is fully supported by RHEL8. After some toing and throwing with IBM they agreed a Pentium G6400 is 70PVU a core and two cores :-) That said it is currently running RHEL7 because that's what the DSS-G nodes are running. The upgrade to RHEL8 is planned for next year. Anyway I have added it into the GPFS cluster all well and good and GPFS is mounted just fine. However when I ran the command to make it a quorum node I got the following error (sanitized to remove actual DNS names and IP addresses initialize (113, '', ('', 1191)) failed (err 79) server initialization failed (err 79) mmchnode: Unexpected error from chnodes -n 1=:1191,2:1191,3=:1191,113=:1191 -f 1 -P 1191 . Return code: 149 mmchnode: Unable to change the CCR quorum node configuration. mmchnode: Command failed. Examine previous error messages to determine cause. fqdn-new is the new node and fqdn1/2/3 are the existing quorum nodes. I want to remove fqdn3 in due course. Anyone any idea what is going on? I thought you could change the quorum nodes on the fly? JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.com/v3/__http://gpfsug.org/mailman/listinfo/gpfsug-discuss__;!!KGKeukY!hO7wULtfr6n28eBJ0BB8sYyRMFo6Xl5_XDpsNZz3GiD_3nXlPf6nKHNR-X99$ From Walter.Sklenka at EDV-Design.at Thu Dec 9 19:38:45 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 19:38:45 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: <203c51ce5d6c4cb9992ebc26f1b503cf@Mail.EDVDesign.cloudia> Message-ID: Hi Olaf!! Many thanks OK well we will do mmvdisk vs delete So #mmvdisk vs delete ? -N ess01,ess02?.. would be correct , or? Best regards walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Olaf Weiser Sent: Donnerstag, 9. 
Dezember 2021 13:04 To: gpfsug-discuss at spectrumscale.org Cc: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Hallo Walter, ;-) yes !AND! no .. for sure , you can specifiy a subset of nodes to use RDMA and other nodes just communicating TCPIP But that's only half of the truth . The other half is.. who and how , you are going to migrate/copy the data in case you 'll use mmrestripe .... you will have to make sure , that only nodes, connected(green) and configured for RDMA doing the work otherwise.. if will also work to migrate the data, but then data is send throught the Ethernet as well , (as long all those nodes are in the same cluster) laff ----- Urspr?ngliche Nachricht ----- Von: "Walter Sklenka" > Gesendet von: gpfsug-discuss-bounces at spectrumscale.org An: "'gpfsug-discuss at spectrumscale.org'" > CC: Betreff: [EXTERNAL] [gpfsug-discuss] alternate path between ESS Servers for Datamigration Datum: Do, 9. Dez 2021 11:04 Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Thu Dec 9 19:43:31 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Thu, 9 Dec 2021 19:43:31 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: Message-ID: <4f6b41f6a3b44c7a80cb588add2056dd@Mail.EDVDesign.cloudia> Hello Douglas! Many thanks for your advice ! Well we are in a horrible situation regarding firmware and MOFED of old equipment Mellanox advised us to use a special version of subnetmanager 5.0-2.1.8.0 from MOFED I hope this helps Let?s see how we can proceed Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Douglas O'flaherty Sent: Donnerstag, 9. 
Dezember 2021 17:04 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Walter: Though not directly about your design, our work with NVIDIA on GPUdirect Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both MOFED and Firmware version compatibility can be. I would suggest anyone debugging RDMA issues should look at those closely. Doug by carrier pigeon ________________________________ On Dec 9, 2021, 5:04:36 AM, gpfsug-discuss-request at spectrumscale.org wrote: From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Date: Dec 9, 2021, 5:04:36 AM Subject: [EXTERNAL] gpfsug-discuss Digest, Vol 119, Issue 5 ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..." Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.orgTo subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.orgYou can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.orgWhen replying, please edit your Subject line so it is more specificthan "Re: Contents of gpfsug-discuss digest..."Today's Topics: 1. alternate path between ESS Servers for Datamigration (Walter Sklenka) Dear spectrum scale users! May I ask you a design question? We have an IB environment which is very mixed at the moment ( connecX3 ? connect-X6 with FDR , even FDR10 and with arrive of ESS5000SC7 now also HDR100 and HDR switches. We still have some big troubles in this fabric when using RDMA , a case at Mellanox and IBM is open . The environment has 3 old Building blocks 2xESSGL6 and 1x GL4 , from where we want to migrate the data to ess5000 , ( mmdelvdisk +qos) Due to the current problems with RDMA we though eventually we could try a workaround : If you are interested there is Maybe you can find the attachment ? We build 2 separate fabrics , the ess-IO servers attached to both blue and green and all other cluster members and all remote clusters only to fabric blue The daemon interfaces (IPoIP) are on fabric blue It is the aim to setup rdma only on the ess-ioServers in the fabric green , in the blue we must use IPoIB (tcp) Do you think datamigration would work between ess01,ess02,? to ess07,ess08 via RDMA ? Or is it principally not possible to make a rdma network only for a subset of a cluster (though this subset would be reachable via other fabric) ? Thank you very much for any input ! Best regards walter Mit freundlichen Gr??en Walter Sklenka Technical Consultant EDV-Design Informationstechnologie GmbH Giefinggasse 6/1/2, A-1210 Wien Tel: +43 1 29 22 165-31 Fax: +43 1 29 22 165-90 E-Mail: sklenka at edv-design.at Internet: www.edv-design.at -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jonathan.buzzard at strath.ac.uk Thu Dec 9 20:19:41 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 9 Dec 2021 20:19:41 +0000 Subject: [gpfsug-discuss] gpfsug-discuss Digest, Vol 119, Issue 7 - Adding a quorum node In-Reply-To: References: Message-ID: On 09/12/2021 16:43, Ralf Eberhard wrote: > Jonathan, > > my suspicion is that?the GPFS daemon on fqdn-new is not reachable via > port 1191. > You can double check that by?sending a lightweight CCR RPC to this > daemon from another quorum node by attempting: > > mmccr echo -n fqdn-new;echo $? > > If this echo returns with a non-zero exit code the network settings must > be verified. And even?the other direction must > work: Node fqdn-new must?reach another quorum node, like (attempting on > fqdn-new): > > mmccr echo -n ;echo $? > Duh, that's my Homer Simpson moment for today. I forgotten to move the relevant network interfaces on the new server to the trusted zone in the firewall. So of course my normal testing with ping and ssh was working just fine. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From jonathan.buzzard at strath.ac.uk Fri Dec 10 00:27:23 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 10 Dec 2021 00:27:23 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: Message-ID: On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From abeattie at au1.ibm.com Fri Dec 10 01:09:57 2021 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Fri, 10 Dec 2021 01:09:57 +0000 Subject: [gpfsug-discuss] alternate path between ESS Servers for Datamigration In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16390972812300.png Type: image/png Size: 98384 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16390972812301.png Type: image/png Size: 101267 bytes Desc: not available URL: From douglasof at us.ibm.com Fri Dec 10 04:24:21 2021 From: douglasof at us.ibm.com (Douglas O'flaherty) Date: Fri, 10 Dec 2021 00:24:21 -0400 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. 
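(Coming back for a moment to the subset question that started this thread: the configuration side of that really is just per-node settings, roughly along these lines -- untested, and the node names, node class and port names below are placeholders:

  # verbs RDMA only on the ESS I/O servers, everything else stays on TCP/IP
  mmchconfig verbsPorts="mlx5_0/1 mlx5_1/1" -N ess07,ess08
  mmchconfig verbsRdma=enable -N ess07,ess08
  mmchconfig verbsRdma=disable -N computeNodes

  # confirm what each node ended up with
  mmlsconfig verbsRdma
  mmlsconfig verbsPorts

As far as I recall the verbs settings are only picked up when the daemon is recycled on the nodes that changed, so plan for an mmshutdown/mmstartup there.)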
With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GPU Direct Storage). (Tested on 2 DGX systems with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. If you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options.
Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Walter.Sklenka at EDV-Design.at Fri Dec 10 10:17:20 2021 From: Walter.Sklenka at EDV-Design.at (Walter Sklenka) Date: Fri, 10 Dec 2021 10:17:20 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Hello Douglas! May I ask a basic question regarding GPUdirect Storage or all local attached storage like NVME disks. Do you think it outerperforms "classical" shared storagesystems which are attached via FC connected to NSD servers HDR attached? With FC you have also bounce copies and more delay , isn?t it? There are solutions around which work with local NVME disks building some protection level with Raid (or duplication) . I am curious if it would be a better approach than shared storage which has it?s limitation (cost intensive scale out, extra infrstructure, max 64Gb at this time ... ) Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org On Behalf Of Douglas O'flaherty Sent: Freitag, 10. Dezember 2021 05:24 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GUP Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. f you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. 
Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard > on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ----- Original message ----- From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Date: Fri, Dec 10, 2021 10:27 On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Fri Dec 10 10:28:38 2021 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Fri, 10 Dec 2021 10:28:38 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> References: <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Message-ID: Hallo Walter, we had many experiences now to change our Storage-Systems in our Backup-Environment to RDMA-IB with HDR and EDR Connections. What we see now (came from a 16Gbit FC Infrastructure) we enhance our throuhput from 7 GB/s to 30 GB/s. The main reason are the elimination of the driver-layers in the client-systems and make a Buffer to Buffer communication because of RDMA. The latency reduction are significant. Regards Renar. We use now ESS3k and ESS5k systems with 6.1.1.2-Code level. 
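(For anyone wanting to produce that kind of before/after number on their own gear, the gpfsperf tool shipped in the samples tree is a quick way to push sequential I/O through the real GPFS client path. A sketch from memory -- the file name, sizes and thread count are made up, and the exact flags should be checked against the README in the same directory:

  # build it once; the source ships with GPFS
  cd /usr/lpp/mmfs/samples/perf && make

  # big sequential write, then read it back: 16 MiB records, 8 threads
  ./gpfsperf create seq /gpfs/fs1/benchfile -n 100g -r 16m -th 8
  ./gpfsperf read seq /gpfs/fs1/benchfile -n 100g -r 16m -th 8

Running the same pair once over plain TCP/IP and once with verbsRdma enabled on the same client gives directly comparable numbers.)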
Renar Grunenberg Abteilung Informatik - Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder, Thomas Sehn, Daniel Thomas. ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org Im Auftrag von Walter Sklenka Gesendet: Freitag, 10. Dezember 2021 11:17 An: gpfsug-discuss at spectrumscale.org Betreff: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Hello Douglas! May I ask a basic question regarding GPUdirect Storage or all local attached storage like NVME disks. Do you think it outerperforms ?classical? shared storagesystems which are attached via FC connected to NSD servers HDR attached? With FC you have also bounce copies and more delay , isn?t it? There are solutions around which work with local NVME disks building some protection level with Raid (or duplication) . I am curious if it would be a better approach than shared storage which has it?s limitation (cost intensive scale out, extra infrstructure, max 64Gb at this time ? ) Best regards Walter From: gpfsug-discuss-bounces at spectrumscale.org > On Behalf Of Douglas O'flaherty Sent: Freitag, 10. Dezember 2021 05:24 To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] WAS: alternative path; Now: RDMA Jonathan: You posed a reasonable question, which was "when is RDMA worth the hassle?" I agree with part of your premises, which is that it only matters when the bottleneck isn't somewhere else. With a parallel file system, like Scale/GPFS, the absolute performance bottleneck is not the throughput of a single drive. In a majority of Scale/GPFS clusters the network data path is the performance limitation. If they deploy HDR or 100/200/400Gbps Ethernet... At that point, the buffer copy time inside the server matters. When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) is easily demonstrated because it eliminates the bounce copy through the system memory. In our NVIDIA DGX A100 server testing testing we were able to get around 2x the per system throughput by using RDMA direct to GPU (GUP Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage node.) However, your question remains. Synthetic benchmarks are good indicators of technical benefit, but do your users and applications need that extra performance? 
These are probably only a handful of codes in organizations that need this. However, they are high-value use cases. We have client applications that either read a lot of data semi-randomly and not-cached - think mini-Epics for scaling ML training. Or, demand lowest response time, like production inference on voice recognition and NLP. If anyone has use cases for GPU accelerated codes with truly demanding data needs, please reach out directly. We are looking for more use cases to characterize the benefit for a new paper. f you can provide some code examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a benefit. Thanks, doug Douglas O'Flaherty douglasof at us.ibm.com ----- Message from Jonathan Buzzard > on Fri, 10 Dec 2021 00:27:23 +0000 ----- To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG ----- Original message ----- From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS Servers for Datamigration Date: Fri, Dec 10, 2021 10:27 On 09/12/2021 16:04, Douglas O'flaherty wrote: > > Though not directly about your design, our work with NVIDIA on GPUdirect > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > MOFED and Firmware version compatibility can be. > > I would suggest anyone debugging RDMA issues should look at those closely. > May I ask what are the alleged benefits of using RDMA in GPFS? I can see there would be lower latency over a plain IP Ethernet or IPoIB solution but surely disk latency is going to swamp that? I guess SSD drives might change that calculation but I have never seen proper benchmarks comparing the two, or even better yet all four connection options. Just seems a lot of complexity and fragility for very little gain to me. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Dec 10 10:37:31 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Dec 2021 10:37:31 +0000 Subject: [gpfsug-discuss] Test email format / mail format Message-ID: An HTML attachment was scrubbed... 
URL: From Ondrej.Kosik at ibm.com Fri Dec 10 10:39:56 2021 From: Ondrej.Kosik at ibm.com (Ondrej Kosik) Date: Fri, 10 Dec 2021 10:39:56 +0000 Subject: [gpfsug-discuss] Test email format / mail format In-Reply-To: References: Message-ID: Hello all, Thank you for the test email, my reply is coming from Outlook-based infrastructure. ________________________________ From: Olaf Weiser Sent: Friday, December 10, 2021 10:37 AM To: gpfsug-discuss at spectrumscale.org Cc: Ondrej Kosik Subject: Test email format / mail format This email is just a test, because we've seen mail format issues from IBM sent emails you can ignore this email , just for internal problem determination -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Dec 10 11:10:07 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Dec 2021 11:10:07 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: , <7bec39e7fe0d4aac842b59a29239522f@Mail.EDVDesign.cloudia> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: Image.16391192376761.png Type: image/png Size: 127072 bytes Desc: not available URL: From anacreo at gmail.com Sun Dec 12 02:19:02 2021 From: anacreo at gmail.com (Alec) Date: Sat, 11 Dec 2021 18:19:02 -0800 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: I feel the need to respond here... I see many responses on this User Group forum that are dismissive of the fringe / extreme use cases and of the "what do you need that for '' mindset. The thing is that Spectrum Scale is for the extreme, just take the word "Parallel" in the old moniker that was already an extreme use case. If you have a standard workload, then sure most of the complex features of the file system are toys, but many of us DO have extreme workloads where shaking out every ounce of performance is a worthwhile and financially sound endeavor. It is also because of the efforts of those of us living on the cusp of technology that these technologies become mainstream and no-longer extreme. I have an AIX LPAR that traverses more than 300TB+ of data a day on a Spectrum Scale file system, it is fully virtualized, and handles a million files. If that performance level drops, regulatory reports will be late, business decisions won't be current. However, the systems of today and the future have to traverse this much data and if they are slow then they can't keep up with real-time data feeds. So the difference between an RDMA disk IO vs a non RDMA disk IO could possibly mean what level of analytics are done to perform real time fraud prevention. Or at what cost, today many systems achieve this by keeping everything in memory in HUGE farms.. Being able to perform data operations at 30GB/s means you can traverse ALL of the census bureau data for all time from the US Govt in about 2 seconds... that's a pretty substantial capability that moves the bar forward in what we can do from a technology perspective. I just did a technology garage with IBM where we were able to achieve 1.5TB/writes on an encrypted ESS off of a single VMWare Host and 4 VM's over IP... That's over 2PB of data writes a day on a single VM server. 
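Throughput claims like that are easy enough to sanity-check with the gpfsperf sample program that ships with Scale (it lives under /usr/lpp/mmfs/samples/perf and, like mmfind, needs a quick make first). A rough sketch only -- the file system path, sizes and thread count below are invented, not the numbers from the test described above:

cd /usr/lpp/mmfs/samples/perf && make       # builds the gpfsperf binary

# sequential write of 200 GiB in 16 MiB records using 16 threads
./gpfsperf create seq /gpfs/fs0/bench/testfile -n 200g -r 16m -th 16 -fsync

# and the corresponding sequential read back
./gpfsperf read seq /gpfs/fs0/bench/testfile -n 200g -r 16m -th 16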
Being able to demonstrate that there are production virtualized environments capable of this type of capacity, helps to show where the point of engineering a proper storage architecture outweighs the benefits of just throwing more GPU compute farms at the problem with ever dithering disk I/O. It also helps to demonstrate how a virtual storage optimized farm could be leveraged to host many in-memory or data analytic heavy workloads in a shared configuration. Douglas's response is the right one, how much IO does the application / environment need, it's nice to see Spectrum Scale have the flexibility to deliver. I'm pretty confident that if I can't deliver the required I/O performance on Spectrum Scale, nobody else can on any other storage platform within reasonable limits. Alec Effrat On Thu, Dec 9, 2021 at 8:24 PM Douglas O'flaherty wrote: > Jonathan: > > You posed a reasonable question, which was "when is RDMA worth the > hassle?" I agree with part of your premises, which is that it only matters > when the bottleneck isn't somewhere else. With a parallel file system, like > Scale/GPFS, the absolute performance bottleneck is not the throughput of a > single drive. In a majority of Scale/GPFS clusters the network data path is > the performance limitation. If they deploy HDR or 100/200/400Gbps > Ethernet... At that point, the buffer copy time inside the server matters. > > When the device is an accelerator, like a GPU, the benefit of RDMA (GDS) > is easily demonstrated because it eliminates the bounce copy through the > system memory. In our NVIDIA DGX A100 server testing testing we were able > to get around 2x the per system throughput by using RDMA direct to GPU (GUP > Direct Storage). (Tested on 2 DGX system with 4x HDR links per storage > node.) > > However, your question remains. Synthetic benchmarks are good indicators > of technical benefit, but do your users and applications need that extra > performance? > > These are probably only a handful of codes in organizations that need > this. However, they are high-value use cases. We have client applications > that either read a lot of data semi-randomly and not-cached - think > mini-Epics for scaling ML training. Or, demand lowest response time, like > production inference on voice recognition and NLP. > > If anyone has use cases for GPU accelerated codes with truly demanding > data needs, please reach out directly. We are looking for more use cases to > characterize the benefit for a new paper. f you can provide some code > examples, we can help test if RDMA direct to GPU (GPUdirect Storage) is a > benefit. > > Thanks, > > doug > > Douglas O'Flaherty > douglasof at us.ibm.com > > > > > > > ----- Message from Jonathan Buzzard on > Fri, 10 Dec 2021 00:27:23 +0000 ----- > > *To:* > gpfsug-discuss at spectrumscale.org > > *Subject:* > Re: [gpfsug-discuss] > On 09/12/2021 16:04, Douglas O'flaherty wrote: > > > > Though not directly about your design, our work with NVIDIA on GPUdirect > > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > > MOFED and Firmware version compatibility can be. > > > > I would suggest anyone debugging RDMA issues should look at those > closely. > > > May I ask what are the alleged benefits of using RDMA in GPFS? > > I can see there would be lower latency over a plain IP Ethernet or IPoIB > solution but surely disk latency is going to swamp that? 
> > I guess SSD drives might change that calculation but I have never seen > proper benchmarks comparing the two, or even better yet all four > connection options. > > Just seems a lot of complexity and fragility for very little gain to me. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > > > > > ----- Original message ----- > From: "Jonathan Buzzard" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug-discuss at spectrumscale.org > Cc: > Subject: [EXTERNAL] Re: [gpfsug-discuss] alternate path between ESS > Servers for Datamigration > Date: Fri, Dec 10, 2021 10:27 > > On 09/12/2021 16:04, Douglas O'flaherty wrote: > > > > Though not directly about your design, our work with NVIDIA on GPUdirect > > Storage and SuperPOD has shown how sensitive RDMA (IB & RoCE) to both > > MOFED and Firmware version compatibility can be. > > > > I would suggest anyone debugging RDMA issues should look at those > closely. > > > May I ask what are the alleged benefits of using RDMA in GPFS? > > I can see there would be lower latency over a plain IP Ethernet or IPoIB > solution but surely disk latency is going to swamp that? > > I guess SSD drives might change that calculation but I have never seen > proper benchmarks comparing the two, or even better yet all four > connection options. > > Just seems a lot of complexity and fragility for very little gain to me. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Sun Dec 12 02:38:26 2021 From: anacreo at gmail.com (Alec) Date: Sat, 11 Dec 2021 18:38:26 -0800 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: You can manipulate the permissions via GPFS policy engine, essentially you'd write a script that the policy engine calls and tell GPFS to farm out the change in at whatever scale you need... run in a single node, how many files per thread, how many threads per node, etc... This can GREATLY accelerate file change permissions over a large quantity of files. However, as stated earlier the mmfind command will do all of this for you and it's worth the effort to get it compiled for your system. I don't have Spectrum Scale in front of me but for the best performance you'll want to setup the mmfind policy engine parameters to parallelize your workload... If mmfind has no action it will silently use GPFS policy engine to produce the requested output, however if mmfind has an action it will expose the policy engine calls. it goes something like this: mmfind -B 1 -N directattachnode1,directattachnode2 -m 24 /path/to/find -perm +o=w ! \( -type d -perm +o=t \) -xargs chmod o-w This will run 48 threads on 2 nodes and bump other write permissions off of any file it finds (excluding temp dirs) until it completes, it should go blistering fast... 
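For anyone who cannot get mmfind built, the underlying pattern described above (a script that the policy engine calls, fanned out across nodes and threads) looks roughly like the sketch below; mmapplypolicy's -N/-m/-B options correspond to the mmfind flags used here. The script path, node names and numbers are placeholders, and the exact record format of the file lists (and its escaping of odd file names) should be checked against the mmapplypolicy documentation before trusting it on real data.

# policy: hand every selected file to an external script
cat > /tmp/fixperm.pol <<'EOF'
RULE EXTERNAL LIST 'fixperm' EXEC '/usr/local/bin/fixperm.sh'
RULE 'allfiles' LIST 'fixperm'
EOF

# the script mmapplypolicy invokes, first with TEST and then with LIST batches
cat > /usr/local/bin/fixperm.sh <<'EOF'
#!/bin/sh
op="$1"; filelist="$2"
[ "$op" = "TEST" ] && exit 0        # the policy engine probing that we exist
if [ "$op" = "LIST" ]; then
    # each record ends with " -- <pathname>"; strip the leading fields
    sed 's/^.* -- //' "$filelist" | while IFS= read -r f; do
        chmod o-w "$f"
    done
fi
exit 0
EOF
chmod +x /usr/local/bin/fixperm.sh

# fan the work out: two nodes, 24 threads each, 1000 files per script call
mmapplypolicy /path/to/tree -P /tmp/fixperm.pol -N node1,node2 -m 24 -B 1000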
as this is only a meta operation the -B 1 might not be necessary, you'd probably be better off with a -B 100, but as I deal with a lot of 100GB+ files I don't want a single thread to be stuck with 3 100GB+ files and another thread to have none, so I usually set the max depth to be 1 and take the higher execution count. This has an advantage in that GPFS will break up the inodes in the most efficient way for the chmod to happen in parallel. I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. If you think of the possibilities here you could easily change that chmod to a gzip and add a -mtime +1200 and you have a find command that will gzip compress files over 4 years old in parallel across multiple nodes... mmfind is VERY powerful and flexible, highly worth getting into usage. Alec On Tue, Dec 7, 2021 at 7:43 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 07/12/2021 14:55, Simon Thompson wrote: > > > > Or add: > > UPDATECTIME yes > > SKIPACLUPDATECHECK yes > > > > To you dsm.opt file to skip checking for those updates and don?t back > > them up again. > > Yeah, but then a restore gives you potentially an unusable file system > as the ownership of the files and ACL's are all wrong. Better to bite > the bullet and back them up again IMHO. > > > > > Actually I thought TSM only updated the metadata if the mode/owner > > changed, not re-backed the file? > > That was my understanding but I have seen TSM rebacked up large amounts > of data where the owner of the file changed in the past, so your mileage > may vary. > > Also ACL's are stored in extended attributes which are stored with the > files and changes will definitely cause the file to be backed up again. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Sun Dec 12 11:19:07 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Sun, 12 Dec 2021 11:19:07 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: On 12/12/2021 02:19, Alec wrote: > I feel the need to respond here... I see many responses on this > User Group forum that are dismissive of the fringe / extreme use > cases and of the "what do you need that for '' mindset. The thing is > that Spectrum Scale is for the extreme, just take the word "Parallel" > in the old moniker that was already an extreme use case. I wasn't been dismissive, I was asking what the benefits of using RDMA where. There is very little information about it out there and not a lot of comparative benchmarking on it either. Without the benefits being clearly laid out I am unlikely to consider it and might be missing a trick. IBM's literature on the topic is underwhelming to say the least. [SNIP] > I have an AIX LPAR that traverses more than 300TB+ of data a day on a > Spectrum Scale file system, it is fully virtualized, and handles a > million files. If that performance level drops, regulatory reports > will be late, business decisions won't be current. 
However, the > systems of today and the future have to traverse this much data and > if they are slow then they can't keep up with real-time data feeds. I have this nagging suspicion that modern all-flash storage systems could deliver that sort of performance without the overhead of a parallel file system. [SNIP] > > Douglas's response is the right one, how much IO does the > application / environment need, it's nice to see Spectrum Scale have > the flexibility to deliver. I'm pretty confident that if I can't > deliver the required I/O performance on Spectrum Scale, nobody else > can on any other storage platform within reasonable limits. > I would note here that in our *shared HPC* environment I made a very deliberate design decision to attach the compute nodes with 10Gbps Ethernet for storage. Though I would probably pick 25Gbps if we were procuring the system today. There were many reasons behind that, but the main one being that historical file system performance showed that greater than 99% of the time the file system never got above 20% of its benchmarked speed. Using 10Gbps Ethernet was not going to be a problem. Secondly, by limiting the connection to 10Gbps it stops one person hogging the file system to the detriment of other users. We have seen individual nodes peg their 10Gbps link from time to time, even several nodes at once (jobs from the same user) and had they had access to a 100Gbps storage link that would have been curtains for everyone else's file system usage. At this juncture I would note that the GPFS admin traffic is handled on a separate IP address space on a separate VLAN which we prioritize with QoS on the switches. So even when a node floods its 10Gbps link for extended periods of time it doesn't get ejected from the cluster. A separate physical network for admin traffic is not necessary in my experience. That said you can do RDMA with Ethernet... Unfortunately the teaching cluster and protocol nodes are on Intel X520's which I don't think do RDMA. Everything else is X710's or Mellanox Connect-X4 which definitely do RDMA. I could upgrade the protocol nodes but the teaching cluster would be a problem. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From s.j.thompson at bham.ac.uk Sun Dec 12 17:01:21 2021 From: s.j.thompson at bham.ac.uk (Simon Thompson) Date: Sun, 12 Dec 2021 17:01:21 +0000 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command 'allow-permission-change' We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn't like the updatewithAcl thing ... Simon -------------- next part -------------- An HTML attachment was scrubbed...
URL: From anacreo at gmail.com Sun Dec 12 22:03:39 2021 From: anacreo at gmail.com (Alec) Date: Sun, 12 Dec 2021 14:03:39 -0800 Subject: [gpfsug-discuss] Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson wrote: > > I'm not sure if this happens on Spectrum Scale but on most FS's if you > do a chmod 770 file you'll lose any ACLs assigned to the > > file, so safest to bump the permissions with a subtractive or additive > o-w or g+w type operation. > > > > This depends entirely on the fileset setting, see: > > > https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command > > > > ?*allow-permission-change*? > > > > We typically have file-sets set to chmodAndUpdateAcl, though not > exclusively, I think it was some quirky software that tested the > permissions after doing something and didn?t like the updatewithAcl thing ? > > > > Simon > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Sun Dec 12 23:00:21 2021 From: anacreo at gmail.com (Alec) Date: Sun, 12 Dec 2021 15:00:21 -0800 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: So I never said this node wasn't in a HPC Cluster, it has partners... For our use case however some nodes have very expensive per core software licensing, and we have to weigh the human costs of empowering traditional monolithic code to do the job, or bringing in more users to re-write and maintain distributed code (someone is going to spend the money to get this work done!). So to get the most out of those licensed cores we have designed our virtual compute machine(s) with 128Gbps+ of SAN fabric. Just to achieve our average business day reads it would take 3 of your cluster nodes maxed out 24 hours, or 9 of them in a business day to achieve the same read speeds... and another 4 nodes to handle the writes. I guess HPC is in the eye of the business... In my experience cables and ports are cheaper than servers. The classic shared HPC design you have is being up-ended by the fact that there is so much compute power (cpu and memory) now in the nodes, you can't simply build a system with two storage connections (Noah's ark) and call it a day. If you look at the spec 25Gbps Ethernet is only delivering ~3GB/s (which is just above USB 3.2, and below USB 4). Spectrum Scale does very well for us when met with a fully saturated workload, we maintain one node for SLA and one node for AdHoc workload, and like clockwork the SLA box always steals exactly half the bandwidth when a job fires, so that 1 SLA job can take half the bandwidth and complete compared to the 40 AdHoc jobs on the other node. 
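On the software side, the usual way in Scale to keep one class of I/O from starving another is the QoS machinery (mmchqos/mmlsqos), which the fileset-level throttling mentioned in the next paragraph builds on. A rough sketch of the long-standing form of it -- the device name and IOPS figure are invented, and the newer per-fileset/user-class variants have their own syntax that is worth checking in the docs:

# cap maintenance traffic (restripes, policy scans, backups) on the system
# pool while leaving normal workload unthrottled
mmchqos gpfs0 --enable pool=system,maintenance=5000IOPS,other=unlimited

# watch what each QoS class is actually consuming
mmlsqos gpfs0 --seconds 60

# switch it off again
mmchqos gpfs0 --disable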
In newer releases IBM has introduced fileset throttling... this is very exciting as we can really just design the biggest, fattest pipes from VM to storage and then software-define the storage AND the bandwidth, from the standard nobody-cares-about workloads all the way up to the most critical workloads... I don't buy that smaller bandwidth is better; I see that as just one band-aid that has more elegant solutions, such as simply applying more resource constraints (you can't push the bandwidth if you can't get the CPU...), or using a workload orchestrator such as LSF with limits set. But I also won't say it never makes sense, as I only know my problems and my solutions. For years the network team wouldn't let users have more than 10Mb then 100Mb networking as they were always worried about their backend being overwhelmed... I literally had faster home internet service than my work desktop connection at one point in my life... it was all a fallacy; the workload should drive the technology, the technology shouldn't hinder the workload. You can do a simple exercise, try scaling up... imagine your cluster is asked to start computing 100x more work... and that work must be completed on time. Do you simply say let me buy 100x more of everything? Or do you start to look at where you can gain efficiency and what actual bottlenecks you need to lift... for some of us it's CPU, for some it's memory, for some it's disk, depending on the work... I'd say the extremely rare case is where you need 100x more of EVERYTHING, but you have to get past the performance of the basic building blocks baked into the cake before you need to dig deeper into the bottlenecks and it makes practical and financial sense. If your main bottleneck was storage, you'd be asking far different questions about RDMA. Alec On Sun, Dec 12, 2021 at 3:19 AM Jonathan Buzzard < jonathan.buzzard at strath.ac.uk> wrote: > On 12/12/2021 02:19, Alec wrote: > > > I feel the need to respond here... I see many responses on this > > User Group forum that are dismissive of the fringe / extreme use > > cases and of the "what do you need that for '' mindset. The thing is > > that Spectrum Scale is for the extreme, just take the word "Parallel" > > in the old moniker that was already an extreme use case. > > I wasn't been dismissive, I was asking what the benefits of using RDMA > where. There is very little information about it out there and not a lot > of comparative benchmarking on it either. Without the benefits being > clearly laid out I am unlikely to consider it and might be missing a trick. > > IBM's literature on the topic is underwhelming to say the least. > > [SNIP] > > > > I have an AIX LPAR that traverses more than 300TB+ of data a day on a > > Spectrum Scale file system, it is fully virtualized, and handles a > > million files. If that performance level drops, regulatory reports > > will be late, business decisions won't be current. However, the > > systems of today and the future have to traverse this much data and > > if they are slow then they can't keep up with real-time data feeds. > > I have this nagging suspicion that modern all flash storage systems > could deliver that sort of performance without the overhead of a > parallel file system. > > [SNIP] > > > > > Douglas's response is the right one, how much IO does the > > application / environment need, it's nice to see Spectrum Scale have > > the flexibility to deliver.
I'm pretty confident that if I can't > > deliver the required I/O performance on Spectrum Scale, nobody else > > can on any other storage platform within reasonable limits. > > > > I would note here that in our *shared HPC* environment I made a very > deliberate design decision to attach the compute nodes with 10Gbps > Ethernet for storage. Though I would probably pick 25Gbps if we where > procuring the system today. > > There where many reasons behind that, but the main ones being that > historical file system performance showed that greater than 99% of the > time the file system never got above 20% of it's benchmarked speed. > Using 10Gbps Ethernet was not going to be a problem. > > Secondly by limiting the connection to 10Gbps it stops one person > hogging the file system to the detriment of other users. We have seen > individual nodes peg their 10Gbps link from time to time, even several > nodes at once (jobs from the same user) and had they had access to a > 100Gbps storage link that would have been curtains for everyone else's > file system usage. > > At this juncture I would note that the GPFS admin traffic is handled by > on separate IP address space on a separate VLAN which we prioritize with > QOS on the switches. So even when a node floods it's 10Gbps link for > extended periods of time it doesn't get ejected from the cluster. The > need for a separate physical network for admin traffic is not necessary > in my experience. > > That said you can do RDMA with Ethernet... Unfortunately the teaching > cluster and protocol nodes are on Intel X520's which I don't think do > RDMA. Everything is X710's or Mellanox Connect-X4 which definitely do do > RDMA. I could upgrade the protocol nodes but the teaching cluster would > be a problem. > > > JAB. > > -- > Jonathan A. Buzzard Tel: +44141-5483420 > HPC System Administrator, ARCHIE-WeSt. > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From abeattie at au1.ibm.com Mon Dec 13 00:03:42 2021 From: abeattie at au1.ibm.com (Andrew Beattie) Date: Mon, 13 Dec 2021 00:03:42 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Mon Dec 13 10:49:37 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 13 Dec 2021 10:49:37 +0000 Subject: [gpfsug-discuss] R: Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> Message-ID: <96a77c75de9b41f089e853120eef870d@psi.ch> I am definitely going to try this solution with mmfind. Thank you also for the command line and several hints? I?ll be back with the outcome soon. Alvise Da: gpfsug-discuss-bounces at spectrumscale.org Per conto di Alec Inviato: domenica 12 dicembre 2021 23:04 A: gpfsug main discussion list Oggetto: Re: [gpfsug-discuss] Question on changing mode on many files How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. 
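Since this behaviour is set per fileset, it is worth checking and, where wanted, setting it explicitly. A short sketch -- the file system and fileset names are placeholders, and how the current flag is displayed varies by release, so verify against your own output:

# inspect the fileset; recent releases list the permission-change flag here
mmlsfileset gpfs0 projects -L

# let chmod work and have the existing NFSv4 ACL updated to match,
# rather than thrown away
mmchfileset gpfs0 projects --allow-permission-change chmodAndUpdateAcl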
IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command ?allow-permission-change? We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn?t like the updatewithAcl thing ? Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From alvise.dorigo at psi.ch Mon Dec 13 11:30:17 2021 From: alvise.dorigo at psi.ch (Dorigo Alvise (PSI)) Date: Mon, 13 Dec 2021 11:30:17 +0000 Subject: [gpfsug-discuss] R: R: Question on changing mode on many files In-Reply-To: <96a77c75de9b41f089e853120eef870d@psi.ch> References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> <96a77c75de9b41f089e853120eef870d@psi.ch> Message-ID: Hi Alec , mmfind doesn?t have a man page (does it have an online one ? I cannot find it). And according to mmfind -h it doesn?t exposes the ?-N? neither the ?-B? flags. RPM is gpfs.base-5.1.1-2.x86_64. Do I have chance to download a newest version of that script from somewhere ? Thanks, Alvise Da: gpfsug-discuss-bounces at spectrumscale.org Per conto di Dorigo Alvise (PSI) Inviato: luned? 13 dicembre 2021 11:50 A: gpfsug main discussion list Oggetto: [gpfsug-discuss] R: Question on changing mode on many files I am definitely going to try this solution with mmfind. Thank you also for the command line and several hints? I?ll be back with the outcome soon. Alvise Da: gpfsug-discuss-bounces at spectrumscale.org > Per conto di Alec Inviato: domenica 12 dicembre 2021 23:04 A: gpfsug main discussion list > Oggetto: Re: [gpfsug-discuss] Question on changing mode on many files How am I just learning about this right now, thank you! Makes so much more sense now the odd behaviors I've seen over the years on GPFS vs POSIX chmod/ACL. Will definitely go review those settings on my filesets now, wonder if the default has evolved from 3.x -> 4.x -> 5.x. IBM needs to find a way to pre-compile mmfind and make it supported, it really is essential and so beneficial, and so hard to get done in a production regulated environment. Though a bigger warning that the compress option is an action not a criteria! Alec On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > I'm not sure if this happens on Spectrum Scale but on most FS's if you do a chmod 770 file you'll lose any ACLs assigned to the > file, so safest to bump the permissions with a subtractive or additive o-w or g+w type operation. This depends entirely on the fileset setting, see: https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command ?allow-permission-change? 
We typically have file-sets set to chmodAndUpdateAcl, though not exclusively, I think it was some quirky software that tested the permissions after doing something and didn?t like the updatewithAcl thing ? Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From anacreo at gmail.com Mon Dec 13 18:33:23 2021 From: anacreo at gmail.com (Alec) Date: Mon, 13 Dec 2021 10:33:23 -0800 Subject: [gpfsug-discuss] R: R: Question on changing mode on many files In-Reply-To: References: <15a0cd66-7a61-15ff-15eb-2613979b48b6@strath.ac.uk> <96a77c75de9b41f089e853120eef870d@psi.ch> Message-ID: I checked on my office network.... mmfind --help mmfind -polFlags '-N node1,node2 -B 100 -m 24' /path/to/find -perm +o=w ! \( -type d -perm +o=t \) -xargs chmod o-w I think that the -m 24 is the default (24 threads per node), but it's nice to include on the command line so you remember you can increment/decrement it as your needs require or your nodes can handle. It's IMPORTANT to review in the mmfind --help output that some things are 'mmfind' args and go BEFORE the path... some are CRITERIA args and have no impact on the files... BUT SOME ARE ACTION args, and they will affect files. So -exec -xargs are obvious actions, however, -gpfsCompress doesn't find compressed files, it will actually compress the objects... in our AIX environment our compressed reads feel like they're essentially broken, we only get about 5MB/s, however on Linux compress reads seem to work fairly well. So make sure to read the man page carefully before using some non-obvious GPFS enhancements. Also the nice thing is mmfind -xargs takes care of all the strange file names, so you don't have to do anything complicated, but you also can't pipe the output as it will run the xarg in the policy engine. As a footnote this is my all time favorite find for troubleshooting... find $(pwd) -mtime -1 | sed -e 's/.*/"&"/g' | xargs ls -latr List all the files modified in the last day in reverse chronology... Doesn't work :-( Alec On Mon, Dec 13, 2021 at 3:30 AM Dorigo Alvise (PSI) wrote: > Hi Alec , > > mmfind doesn?t have a man page (does it have an online one ? I cannot find > it). And according to mmfind -h it doesn?t exposes the ?-N? neither the > ?-B? flags. RPM is gpfs.base-5.1.1-2.x86_64. > > > > Do I have chance to download a newest version of that script from > somewhere ? > > > > Thanks, > > > > Alvise > > > > *Da:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *Per conto di *Dorigo Alvise > (PSI) > *Inviato:* luned? 13 dicembre 2021 11:50 > *A:* gpfsug main discussion list > *Oggetto:* [gpfsug-discuss] R: Question on changing mode on many files > > > > I am definitely going to try this solution with mmfind. > > Thank you also for the command line and several hints? I?ll be back with > the outcome soon. > > > > Alvise > > > > *Da:* gpfsug-discuss-bounces at spectrumscale.org < > gpfsug-discuss-bounces at spectrumscale.org> *Per conto di *Alec > *Inviato:* domenica 12 dicembre 2021 23:04 > *A:* gpfsug main discussion list > *Oggetto:* Re: [gpfsug-discuss] Question on changing mode on many files > > > > How am I just learning about this right now, thank you! Makes so much > more sense now the odd behaviors I've seen over the years on GPFS vs POSIX > chmod/ACL. 
Will definitely go review those settings on my filesets now, > wonder if the default has evolved from 3.x -> 4.x -> 5.x. > > > > IBM needs to find a way to pre-compile mmfind and make it supported, it > really is essential and so beneficial, and so hard to get done in a > production regulated environment. Though a bigger warning that the > compress option is an action not a criteria! > > > > Alec > > > > On Sun, Dec 12, 2021 at 9:01 AM Simon Thompson > wrote: > > > I'm not sure if this happens on Spectrum Scale but on most FS's if you > do a chmod 770 file you'll lose any ACLs assigned to the > > file, so safest to bump the permissions with a subtractive or additive > o-w or g+w type operation. > > > > This depends entirely on the fileset setting, see: > > > https://www.ibm.com/docs/en/spectrum-scale/5.1.2?topic=reference-mmchfileset-command > > > > ?*allow-permission-change*? > > > > We typically have file-sets set to chmodAndUpdateAcl, though not > exclusively, I think it was some quirky software that tested the > permissions after doing something and didn?t like the updatewithAcl thing ? > > > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Mon Dec 13 23:55:23 2021 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Mon, 13 Dec 2021 23:55:23 +0000 Subject: [gpfsug-discuss] WAS: alternative path; Now: RDMA In-Reply-To: References: Message-ID: <19884986-aff8-20aa-f1d1-590f6b81ddd2@strath.ac.uk> On 13/12/2021 00:03, Andrew Beattie wrote: > What is the main outcome or business requirement of the teaching cluster > ( i notice your specific in the use of defining it as a teaching cluster) > It is entirely possible that the use case for this cluster does not > warrant the use of high speed low latency networking, and it simply > needs the benefits of a parallel filesystem. While we call it the "teaching cluster" it would be more appropriate to call them "teaching nodes" that shares resources (storage and login nodes) with the main research cluster. It's mainly used by undergraduates doing final year projects and M.Sc. students. It's getting a bit long in the tooth now but not many undergraduates have access to a 16 core machine with 64GB of RAM. Even if they did being able to let something go flat out for 48 hours means there personal laptop is available for other things :-) I was just musing that the cards in the teaching nodes being Intel 82599ES would be a stumbling block for RDMA over Ethernet, but on checking the Intel X710 doesn't do RDMA either so it would all be a bust anyway. I was clearly on the crack pipe when I thought they did. So aside from the DSS-G and GPU nodes with Connect-X4 cards nothing does RDMA. [SNIP] > For some of my research clients this is the ability to run 20-30% more > compute jobs on the same HPC resources in the same 24H period, which > means that they can reduce the amount of time they need on the HPC > cluster to get the data results that they are looking for. Except as I said in our cluster the storage servers have never been maxed out except when running benchmarks. 
Individual compute nodes have been maxed out (mainly Gaussian writing 800GB temporary files), but as I explained that's a good thing from my perspective because I don't want one or two users to be able to pound the storage into oblivion and cause problems for everyone else. We have enough problems with users tanking the login nodes by running computations on them. That should go away with our upgrade to RHEL8 and the wonders of per-user cgroups; me I love systemd. In the end nobody has complained that the storage speed is a problem yet, and putting the metadata on SSD would be my first port of call if they did and funds were available to make things go faster. To be honest I think the users are just happy that GPFS doesn't eat itself and end up out of action for a few weeks every couple of years like Lustre did on the previous system. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From olaf.weiser at de.ibm.com Fri Dec 17 15:08:15 2021 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 17 Dec 2021 15:08:15 +0000 Subject: [gpfsug-discuss] email format check again for IBM domain send email Message-ID: An HTML attachment was scrubbed... URL: From juergen.hannappel at desy.de Fri Dec 17 15:57:45 2021 From: juergen.hannappel at desy.de (Hannappel, Juergen) Date: Fri, 17 Dec 2021 16:57:45 +0100 (CET) Subject: [gpfsug-discuss] ESS 6.1.2.1 changes Message-ID: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Hi, I just noticed that today a new ESS release (6.1.2.1) appeared on Fix Central. What I can't find is a list of changes compared to 6.1.2.0, and anyway finding the change list is always a PITA. Does anyone know what changed? -- Dr. Jürgen Hannappel DESY/IT Tel. : +49 40 8998-4616 From luis.bolinches at fi.ibm.com Fri Dec 17 18:50:09 2021 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Fri, 17 Dec 2021 18:50:09 +0000 Subject: [gpfsug-discuss] ESS 6.1.2.1 changes In-Reply-To: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> References: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Message-ID: An HTML attachment was scrubbed...
URL: From janfrode at tanso.net Mon Dec 20 11:26:29 2021 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 20 Dec 2021 12:26:29 +0100 Subject: [gpfsug-discuss] ESS 6.1.2.1 changes In-Reply-To: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> References: <1740905192.10339973.1639756665210.JavaMail.zimbra@desy.de> Message-ID: Just ran an upgrade on an EMS, and the only changes I see are these updated packages on the ems: +gpfs.docs-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:43 AM CET +gpfs.ess.firmware-6.0.0-15.ppc64le Mon 20 Dec 2021 11:56:42 AM CET +gpfs.msg.en_US-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:12 AM CET +gpfs.gss.pmsensors-5.1.2-0.el8.ppc64le Mon 20 Dec 2021 11:56:12 AM CET +gpfs.gpl-5.1.2-0.9.noarch Mon 20 Dec 2021 11:56:11 AM CET +gpfs.gnr.base-1.0.0-0.ppc64le Mon 20 Dec 2021 11:56:11 AM CET +gpfs.gnr.support-ess5000-1.0.0-3.noarch Mon 20 Dec 2021 11:56:10 AM CET +gpfs.gnr.support-ess3200-6.1.2-0.noarch Mon 20 Dec 2021 11:56:10 AM CET +gpfs.crypto-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:10 AM CET +gpfs.compression-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:10 AM CET +gpfs.license.dmd-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:56:09 AM CET +gpfs.gnr.support-ess3000-1.0.0-3.noarch Mon 20 Dec 2021 11:56:09 AM CET +gpfs.gui-5.1.2-0.4.noarch Mon 20 Dec 2021 11:56:05 AM CET +gpfs.gskit-8.0.55-19.ppc64le Mon 20 Dec 2021 11:56:02 AM CET +gpfs.java-5.1.2-0.4.ppc64le Mon 20 Dec 2021 11:56:01 AM CET +gpfs.gss.pmcollector-5.1.2-0.el8.ppc64le Mon 20 Dec 2021 11:55:59 AM CET +gpfs.gnr.support-essbase-6.1.2-0.noarch Mon 20 Dec 2021 11:55:59 AM CET +gpfs.adv-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:59 AM CET +gpfs.gnr-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:58 AM CET +gpfs.base-5.1.2-0.9.ppc64le Mon 20 Dec 2021 11:55:54 AM CET +sdparm-1.10-10.el8.ppc64le Mon 20 Dec 2021 11:55:21 AM CET +gpfs.ess.tools-6.1.2.1-release.noarch Mon 20 Dec 2021 11:50:47 AM CET I will guess it has something to do with log4j, but a changelog would be nice :-) https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=142683 On Fri, Dec 17, 2021 at 5:07 PM Hannappel, Juergen < juergen.hannappel at desy.de> wrote: > Hi, > I just noticed that tday a new ESS release (6.1.2.1) appeared on fix > central. > What I can't find is a list of changes to 6.1.2.0, and anyway finding the > change list is always a PITA. > > Does anyone know what changed? > > -- > Dr. J?rgen Hannappel DESY/IT Tel. : +49 40 8998-4616 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL:
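Until an official changelog appears, one workable habit is to capture the package list before and after each ESS upgrade and diff it yourself, which produces exactly the kind of list above without the manual copying -- a trivial sketch (file names are placeholders):

# on the EMS (and/or I/O nodes), before the upgrade
rpm -qa | sort > /root/rpms-pre-6.1.2.1.txt

# ... run the upgrade ...

# afterwards: exactly which packages changed
rpm -qa | sort > /root/rpms-post-6.1.2.1.txt
diff /root/rpms-pre-6.1.2.1.txt /root/rpms-post-6.1.2.1.txt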