From ilan84 at gmail.com Tue Aug 1 07:16:19 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Tue, 1 Aug 2017 09:16:19 +0300 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports Message-ID: Hi, I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum scale cluster (CentOS). I dont need NFSv4 ACLS enabled, but i dont mind them to be if its mandatory for the NFSv4 to work. I have created the domain user "fwuser" in the Active Directory (domain=LH20), it is in group Domain users, Domain Admins, Backup Operators and administrators. In the linux machine im with user ilanwalk (sudoer) [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) groups=12000513(LH20\domain users),12001603(LH20\fwuser),12000572(LH20\denied rodc password replication group),12000512(LH20\domain admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) and when trying to add smb export: [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share /fs_gpfs01 --option "admin users=LH20\fwuser" mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file system that does not enforce NFSv4 ACLs. [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) Also, when trying to enable NFS i get: [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS Failed to enable NFS service. Ensure file authentication is removed prior enabling service. What am I missing ? From jonathan at buzzard.me.uk Tue Aug 1 09:50:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Tue, 01 Aug 2017 09:50:05 +0100 Subject: [gpfsug-discuss] Quota and hardlimit enforcement In-Reply-To: <3789811E-523F-47AE-93F3-E2985DD84D60@vanderbilt.edu> References: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de> <20170731160353.54412s4i1r957eax@support.scinet.utoronto.ca> <3789811E-523F-47AE-93F3-E2985DD84D60@vanderbilt.edu> Message-ID: <1501577405.17548.11.camel@buzzard.me.uk> On Mon, 2017-07-31 at 20:11 +0000, Buterbaugh, Kevin L wrote: > Jaime, > > > That?s heavily workload dependent. We run a traditional HPC cluster > and have a 7 day grace on home and 14 days on scratch. By setting the > soft and hard limits appropriately we?ve slammed the door on many a > runaway user / group / fileset. YMMV? > I would concur that it is heavily workload dependant. I have never had a problem with a 7 day period. Besides which if they can significantly blow through the hard limit due to heavy writing and the "in doubt" value then it matters not one jot that grace is 7 days or two hours. My preference however is to set the grace period to as long as possible (which from memory is about 10 years on GPFS) then set the soft at 90% of the hard and use over quota callbacks to signal that there is a problem. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From r.sobey at imperial.ac.uk Tue Aug 1 15:23:58 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 1 Aug 2017 14:23:58 +0000 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports Message-ID: You must have nfs4 Acl semantics only to create smb exports. Mmchfs -k parameter as I recall. On 1 Aug 2017 7:16 am, Ilan Schwarts wrote: Hi, I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum scale cluster (CentOS). I dont need NFSv4 ACLS enabled, but i dont mind them to be if its mandatory for the NFSv4 to work. 
I have created the domain user "fwuser" in the Active Directory (domain=LH20), it is in group Domain users, Domain Admins, Backup Operators and administrators. In the linux machine im with user ilanwalk (sudoer) [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) groups=12000513(LH20\domain users),12001603(LH20\fwuser),12000572(LH20\denied rodc password replication group),12000512(LH20\domain admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) and when trying to add smb export: [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share /fs_gpfs01 --option "admin users=LH20\fwuser" mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file system that does not enforce NFSv4 ACLs. [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) Also, when trying to enable NFS i get: [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS Failed to enable NFS service. Ensure file authentication is removed prior enabling service. What am I missing ? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Tue Aug 1 16:34:29 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Tue, 1 Aug 2017 18:34:29 +0300 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: References: Message-ID: Yes I succeeded to make smb share. But only the user i put in the command can write files to it. Others can read only. How can i enable write it to all domain users? The group. And what about the error when enabling nfs? On Aug 1, 2017 17:24, "Sobey, Richard A" wrote: > You must have nfs4 Acl semantics only to create smb exports. > > Mmchfs -k parameter as I recall. > > On 1 Aug 2017 7:16 am, Ilan Schwarts wrote: > > Hi, > I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum > scale cluster (CentOS). > I dont need NFSv4 ACLS enabled, but i dont mind them to be if its > mandatory for the NFSv4 to work. > > I have created the domain user "fwuser" in the Active Directory > (domain=LH20), it is in group Domain users, Domain Admins, Backup > Operators and administrators. > > In the linux machine im with user ilanwalk (sudoer) > > [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser > uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) > groups=12000513(LH20\domain > users),12001603(LH20\fwuser),12000572(LH20\denied rodc password > replication group),12000512(LH20\domain > admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) > > > and when trying to add smb export: > [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share > /fs_gpfs01 --option "admin users=LH20\fwuser" > mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file > system that does not enforce NFSv4 ACLs. > > > [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs > fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) > > > > Also, when trying to enable NFS i get: > [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS > Failed to enable NFS service. Ensure file authentication is removed > prior enabling service. > > > What am I missing ? 
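As far as I understand it, the "admin users" option only elevates that single account; whether the rest of the domain users can write is decided by the permissions and the NFSv4 ACL on the exported path itself. A rough sketch of what that could look like for the names used above (group and path taken from the id/mmsmb output earlier, not verified on this setup):

mmgetacl /fs_gpfs01                      # inspect the current ACL on the export path
chgrp "LH20\domain users" /fs_gpfs01     # give the AD group ownership of the directory
chmod 2770 /fs_gpfs01                    # group write plus setgid so new files inherit the group
mmsmb export list                        # confirm the export options afterwards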
> _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From laurence at qsplace.co.uk Tue Aug 1 23:40:38 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Tue, 1 Aug 2017 23:40:38 +0100 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: References: Message-ID: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> Could you please give a break down of the commands that you have used to configure/setup the CES services? Which guide did you follow? and what version of GPFS/SS are you currently running -- Lauz On 01/08/2017 16:34, Ilan Schwarts wrote: > > Yes I succeeded to make smb share. But only the user i put in the > command can write files to it. Others can read only. > > How can i enable write it to all domain users? The group. > And what about the error when enabling nfs? > > On Aug 1, 2017 17:24, "Sobey, Richard A" > wrote: > > You must have nfs4 Acl semantics only to create smb exports. > > Mmchfs -k parameter as I recall. > > On 1 Aug 2017 7:16 am, Ilan Schwarts > wrote: > > Hi, > I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes > spectrum > scale cluster (CentOS). > I dont need NFSv4 ACLS enabled, but i dont mind them to be if its > mandatory for the NFSv4 to work. > > I have created the domain user "fwuser" in the Active Directory > (domain=LH20), it is in group Domain users, Domain Admins, Backup > Operators and administrators. > > In the linux machine im with user ilanwalk (sudoer) > > [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser > uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) > groups=12000513(LH20\domain > users),12001603(LH20\fwuser),12000572(LH20\denied rodc password > replication group),12000512(LH20\domain > admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) > > > and when trying to add smb export: > [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share > /fs_gpfs01 --option "admin users=LH20\fwuser" > mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file > system that does not enforce NFSv4 ACLs. > > > [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs > fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) > > > > Also, when trying to enable NFS i get: > [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS > Failed to enable NFS service. Ensure file authentication is > removed > prior enabling service. > > > What am I missing ? > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
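For comparison, a minimal CES protocol setup normally runs along these lines; the CES IPs, the AD server and the account names below are only placeholders, and the exact mmuserauth options depend on the environment, so treat this as a sketch rather than a recipe:

mmchconfig cesSharedRoot=/fs_gpfs01
mmchnode --ces-enable -N LH20-GPFS1,LH20-GPFS2
mmces address add --ces-ip 10.10.158.100,10.10.158.101   # floating protocol IPs, not the node IPs
mmces service enable SMB
mmces service enable NFS        # enable the protocols before configuring file authentication
mmuserauth service create --data-access-method file --type ad \
    --servers dc1.lh20.example --user-name administrator --netbios-name lh20ces --idmap-role master
mmces service list -a

The "Ensure file authentication is removed prior enabling service" message usually means authentication was already configured for SMB only; in that case removing file authentication with mmuserauth service remove --data-access-method file, enabling NFS, and then re-creating the authentication is the order I would expect to need.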
URL: From ilan84 at gmail.com Wed Aug 2 05:33:02 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Wed, 2 Aug 2017 07:33:02 +0300 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> References: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> Message-ID: Hi, I use SpectrumScale 4.2.2 I have configured the CES as in documentation: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_setcessharedroot.htm This means i did the following: mmchconfig cesSharedRoot=/fs_gpfs01 mmchnode -?ces-enable ?N LH20-GPFS1,LH20-GPFS2 Thank you Some output: [root at LH20-GPFS1 ~]# mmces state show -a NODE AUTH BLOCK NETWORK AUTH_OBJ NFS OBJ SMB CES LH20-GPFS1 HEALTHY DISABLED DEGRADED DISABLED DISABLED DISABLED HEALTHY DEGRADED LH20-GPFS2 HEALTHY DISABLED DEGRADED DISABLED DISABLED DISABLED HEALTHY DEGRADED [root at LH20-GPFS1 ~]# [root at LH20-GPFS1 ~]# mmces node list Node Name Node Flags Node Groups ----------------------------------------------------------------- 1 LH20-GPFS1 none 3 LH20-GPFS2 none [root at LH20-GPFS1 ~]# mmlscluster --ces GPFS cluster information ======================== GPFS cluster name: LH20-GPFS1 GPFS cluster id: 10777108240438931454 Cluster Export Services global parameters ----------------------------------------- Shared root directory: /fs_gpfs01 Enabled Services: SMB Log level: 0 Address distribution policy: even-coverage Node Daemon node name IP address CES IP address list ----------------------------------------------------------------------- 1 LH20-GPFS1 10.10.158.61 None 3 LH20-GPFS2 10.10.158.62 None On Wed, Aug 2, 2017 at 1:40 AM, Laurence Horrocks-Barlow wrote: > Could you please give a break down of the commands that you have used to > configure/setup the CES services? > > Which guide did you follow? and what version of GPFS/SS are you currently > running > > -- Lauz > > > On 01/08/2017 16:34, Ilan Schwarts wrote: > > Yes I succeeded to make smb share. But only the user i put in the command > can write files to it. Others can read only. > > How can i enable write it to all domain users? The group. > And what about the error when enabling nfs? > > On Aug 1, 2017 17:24, "Sobey, Richard A" wrote: >> >> You must have nfs4 Acl semantics only to create smb exports. >> >> Mmchfs -k parameter as I recall. >> >> On 1 Aug 2017 7:16 am, Ilan Schwarts wrote: >> >> Hi, >> I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum >> scale cluster (CentOS). >> I dont need NFSv4 ACLS enabled, but i dont mind them to be if its >> mandatory for the NFSv4 to work. >> >> I have created the domain user "fwuser" in the Active Directory >> (domain=LH20), it is in group Domain users, Domain Admins, Backup >> Operators and administrators. >> >> In the linux machine im with user ilanwalk (sudoer) >> >> [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser >> uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) >> groups=12000513(LH20\domain >> users),12001603(LH20\fwuser),12000572(LH20\denied rodc password >> replication group),12000512(LH20\domain >> admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) >> >> >> and when trying to add smb export: >> [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share >> /fs_gpfs01 --option "admin users=LH20\fwuser" >> mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file >> system that does not enforce NFSv4 ACLs. 
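That mmsmb error is the same ACL-semantics problem Richard mentioned: the CES SMB stack insists on a file system that enforces NFSv4 ACLs. Assuming the device name really is fs_gpfs01, checking and changing it would look roughly like this (note that it changes ACL behaviour for everything in the file system, so it is worth double-checking before running it):

mmlsfs fs_gpfs01 -k        # current ACL semantics: posix, nfs4 or all
mmchfs fs_gpfs01 -k nfs4   # enforce NFSv4 ACLs, which is what mmsmb expects
mmlsfs fs_gpfs01 -k        # confirm the change before retrying the export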
>> >> >> [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs >> fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) >> >> >> >> Also, when trying to enable NFS i get: >> [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS >> Failed to enable NFS service. Ensure file authentication is removed >> prior enabling service. >> >> >> What am I missing ? >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- - Ilan Schwarts From john.hearns at asml.com Wed Aug 2 10:49:36 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 09:49:36 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn't work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Renar.Grunenberg at huk-coburg.de Wed Aug 2 11:01:20 2017 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Wed, 2 Aug 2017 10:01:20 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. 
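Independent of the package levels, it is worth looking at what systemd itself records for the units named in those log messages; something along these lines (unit names taken from the messages above, output will obviously differ per system):

systemctl --version
systemctl status hpc-gpfstest.mount
systemctl show -p BindsTo,Requires,After hpc-gpfstest.mount
systemctl status dev-gpfstest.device
systemctl daemon-reload      # re-generate units after the device/mount state has changed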
-- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Wed Aug 2 11:50:29 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 10:50:29 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> References: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> Message-ID: Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. 
August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From cgirda at wustl.edu Wed Aug 2 13:07:05 2017 From: cgirda at wustl.edu (Chakravarthy Girda) Date: Wed, 2 Aug 2017 17:37:05 +0530 Subject: [gpfsug-discuss] Modify template variables on pre-built grafana dashboard. 
In-Reply-To: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de>
References: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de>
Message-ID: <978835cd-4e29-4207-9936-6c95159356a3@wustl.edu>

Hi,

Successfully created the bridge port and imported the pre-built grafana dashboards from
https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/a180eb7e-9161-4e07-a6e4-35a0a076f7b3/attachment/5e9a5886-5bd9-4a6f-919e-bc66d16760cf/media/default%20dashboards%20set.zip

I am getting updates on some graphs but not all. It looks like I need to update the template
variables, and I need some help/instructions on how to evaluate those default variables on the
CLI so I can fix them.

Eg:- I get into the "File Systems View"
Variable ( gpfsMetrics_fs1 ) --> Query ( gpfsMetrics_fs1 )
Regex ( /.*[^gpfs_fs_inode_used|gpfs_fs_inode_alloc|gpfs_fs_inode_free|gpfs_fs_inode_max]/ )

Questions:
* How can I execute the above query and regex to fix the issues?
* Is there any document on the CLI options?

Thank you
Chakri

From truongv at us.ibm.com Wed Aug 2 15:35:12 2017
From: truongv at us.ibm.com (Truong Vu)
Date: Wed, 2 Aug 2017 10:35:12 -0400
Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem
In-Reply-To: References: Message-ID:

This sounds like a known problem that was fixed. If you don't have the fix, have you checked
out the workaround in FAQ 2.4?

Tru.

From john.hearns at asml.com Wed Aug 2 15:49:15 2017
From: john.hearns at asml.com (John Hearns)
Date: Wed, 2 Aug 2017 14:49:15 +0000
Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem
In-Reply-To: References: Message-ID:

Truong, thank you for responding. The discussion which Renar referred to covered systemd
version 208 and suggested upgrading it. The system I am working on at the moment has systemd
version 219, and only a slightly newer minor version is available.
I should say that the temporary fix suggested in that discussion did work for me.

From john.hearns at asml.com Wed Aug 2 16:19:27 2017
From: john.hearns at asml.com (John Hearns)
Date: Wed, 2 Aug 2017 15:19:27 +0000
Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem
In-Reply-To: References: Message-ID:

Truong, thanks again for the response. I shall implement what is suggested in the FAQ.
As we are in polite company I shall maintain a smiley face when mentioning systemd
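For the record, this is roughly how I plan to verify that the workaround holds, using the test file system from my earlier mails (nothing clever, just watching whether the unmount reappears):

mmmount gpfstest
mount -t gpfs                                   # the file system should stay listed here
tail -f /var/adm/ras/mmfs.log.latest            # watch for another "Command: unmount gpfstest"
journalctl -b | grep -i 'hpc-gpfstest.mount'    # and for the "bound to inactive unit" message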
From stijn.deweirdt at ugent.be Wed Aug 2 16:57:55 2017
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Wed, 2 Aug 2017 17:57:55 +0200
Subject: [gpfsug-discuss] data integrity documentation
Message-ID: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be>

hi all,

is there any documentation wrt data integrity in spectrum scale: assuming a crappy network,
does gpfs guarantee somehow that data written by a client ends up safe in the nsd gpfs daemon,
and similarly from the nsd gpfs daemon to disk?

and wrt crappy network, what about rdma on a crappy network? is it the same?

(we are hunting down a crappy infiniband issue; ibm support says it's a network issue, and we
see no errors anywhere...)

thanks a lot,

stijn

From eric.wonderley at vt.edu Wed Aug 2 17:15:12 2017
From: eric.wonderley at vt.edu (J. Eric Wonderley)
Date: Wed, 2 Aug 2017 12:15:12 -0400
Subject: [gpfsug-discuss] data integrity documentation
In-Reply-To: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be>
References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be>
Message-ID:

No guarantee...unless you are using ess/gss solution.

Crappy network will get you loads of expels and occasional fscks. Which I
guess beats data loss and recovery from backup.

You probably have a network issue...they can be subtle. GPFS is a very
thorough network tester.

Eric
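A few places worth checking first, since the GPFS daemon and the IB fabric each keep their own error counters (the last two are OFED tools, not part of GPFS):

mmdiag --network      # per-peer connection state, retries and errors as the daemon sees them
mmdiag --waiters      # long waiters usually point at the node or link that is struggling
ibqueryerrors         # symbol/link/retransmit error counters per IB port
ibdiagnet             # fabric-wide sweep for bad links or ports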
URL: From oehmes at gmail.com Wed Aug 2 17:26:29 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 16:26:29 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: the very first thing you should check is if you have this setting set : mmlsconfig envVar envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 MLX5_USE_MUTEX 1 if that doesn't come back the way above you need to set it : mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" there was a problem in the Mellanox FW in various versions that was never completely addressed (bugs where found and fixed, but it was never fully proven to be addressed) the above environment variables turn code on in the mellanox driver that prevents this potential code path from being used to begin with. in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale that even you don't set this variables the problem can't happen anymore until then the only choice you have is the envVar above (which btw ships as default on all ESS systems). you also should be on the latest available Mellanox FW & Drivers as not all versions even have the code that is activated by the environment variables above, i think at a minimum you need to be at 3.4 but i don't remember the exact version. There had been multiple defects opened around this area, the last one i remember was : 00154843 : ESS ConnectX-3 performance issue - spinning on pthread_spin_lock you may ask your mellanox representative if they can get you access to this defect. while it was found on ESS , means on PPC64 and with ConnectX-3 cards its a general issue that affects all cards and on intel as well as Power. On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt wrote: > hi all, > > is there any documentation wrt data integrity in spectrum scale: > assuming a crappy network, does gpfs garantee somehow that data written > by client ends up safe in the nsd gpfs daemon; and similarly from the > nsd gpfs daemon to disk. > > and wrt crappy network, what about rdma on crappy network? is it the same? > > (we are hunting down a crappy infiniband issue; ibm support says it's > network issue; and we see no errors anywhere...) > > thanks a lot, > > stijn > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
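
As a minimal sketch of the check-and-set sequence Sven describes (the mmlsconfig and mmchconfig lines are taken from his message; the daemon restart at the end is an assumption -- mmshutdown/mmstartup are the standard commands, and Sven's follow-up question about restarting the daemons suggests a changed envVar only takes effect on restart):

   # check whether the workaround variables are already set
   mmlsconfig envVar

   # if the output does not match the values above, set them cluster-wide
   mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"

   # restart the daemons so mmfsd picks up the new environment
   mmshutdown -a && mmstartup -a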
URL: 
From stijn.deweirdt at ugent.be Wed Aug 2 19:38:13 2017
From: stijn.deweirdt at ugent.be (Stijn De Weirdt)
Date: Wed, 2 Aug 2017 20:38:13 +0200
Subject: [gpfsug-discuss] data integrity documentation
In-Reply-To: 
References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be>
Message-ID: <2518112e-0311-09c6-4f24-daa2f18bd80c@ugent.be>

> No guarantee...unless you are using ess/gss solution.
ok, so crappy network == corrupt data? hmmm, that is really a pity on 2017...

>
> Crappy network will get you loads of expels and occasional fscks. Which I
> guess beats data loss and recovery from backup.
if only we had errors like that. with the current issue mmfsck is the
only tool that seems to trigger them (and setting some of the nsdChksum
config flags reports checksum errors in the log files). but nsdperf with
verify=on reports nothing.

>
> YOu probably have a network issue...they can be subtle.
Gpfs is a very > extremely thorough network tester. we know ;) stijn > > > Eric > > On Wed, Aug 2, 2017 at 11:57 AM, Stijn De Weirdt > wrote: > >> hi all, >> >> is there any documentation wrt data integrity in spectrum scale: >> assuming a crappy network, does gpfs garantee somehow that data written >> by client ends up safe in the nsd gpfs daemon; and similarly from the >> nsd gpfs daemon to disk. >> >> and wrt crappy network, what about rdma on crappy network? is it the same? >> >> (we are hunting down a crappy infiniband issue; ibm support says it's >> network issue; and we see no errors anywhere...) >> >> thanks a lot, >> >> stijn >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From stijn.deweirdt at ugent.be Wed Aug 2 19:43:51 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:43:51 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> hi sven, > the very first thing you should check is if you have this setting set : maybe the very first thing to check should be the faq/wiki that has this documented? > > mmlsconfig envVar > > envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > MLX5_USE_MUTEX 1 > > if that doesn't come back the way above you need to set it : > > mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" i just set this (wasn't set before), but problem is still present. > > there was a problem in the Mellanox FW in various versions that was never > completely addressed (bugs where found and fixed, but it was never fully > proven to be addressed) the above environment variables turn code on in the > mellanox driver that prevents this potential code path from being used to > begin with. > > in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale > that even you don't set this variables the problem can't happen anymore > until then the only choice you have is the envVar above (which btw ships as > default on all ESS systems). > > you also should be on the latest available Mellanox FW & Drivers as not all > versions even have the code that is activated by the environment variables > above, i think at a minimum you need to be at 3.4 but i don't remember the > exact version. There had been multiple defects opened around this area, the > last one i remember was : we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from dell, and the fw is a bit behind. i'm trying to convince dell to make new one. mellanox used to allow to make your own, but they don't anymore. > > 00154843 : ESS ConnectX-3 performance issue - spinning on pthread_spin_lock > > you may ask your mellanox representative if they can get you access to this > defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > cards its a general issue that affects all cards and on intel as well as > Power. ok, thanks for this. maybe such a reference is enough for dell to update their firmware. 
stijn > > On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt > wrote: > >> hi all, >> >> is there any documentation wrt data integrity in spectrum scale: >> assuming a crappy network, does gpfs garantee somehow that data written >> by client ends up safe in the nsd gpfs daemon; and similarly from the >> nsd gpfs daemon to disk. >> >> and wrt crappy network, what about rdma on crappy network? is it the same? >> >> (we are hunting down a crappy infiniband issue; ibm support says it's >> network issue; and we see no errors anywhere...) >> >> thanks a lot, >> >> stijn >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 19:47:52 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 18:47:52 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> Message-ID: How can you reproduce this so quick ? Did you restart all daemons after that ? On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt wrote: > hi sven, > > > > the very first thing you should check is if you have this setting set : > maybe the very first thing to check should be the faq/wiki that has this > documented? > > > > > mmlsconfig envVar > > > > envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > > MLX5_USE_MUTEX 1 > > > > if that doesn't come back the way above you need to set it : > > > > mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > > MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > i just set this (wasn't set before), but problem is still present. > > > > > there was a problem in the Mellanox FW in various versions that was never > > completely addressed (bugs where found and fixed, but it was never fully > > proven to be addressed) the above environment variables turn code on in > the > > mellanox driver that prevents this potential code path from being used to > > begin with. > > > > in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale > > that even you don't set this variables the problem can't happen anymore > > until then the only choice you have is the envVar above (which btw ships > as > > default on all ESS systems). > > > > you also should be on the latest available Mellanox FW & Drivers as not > all > > versions even have the code that is activated by the environment > variables > > above, i think at a minimum you need to be at 3.4 but i don't remember > the > > exact version. There had been multiple defects opened around this area, > the > > last one i remember was : > we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > dell, and the fw is a bit behind. i'm trying to convince dell to make > new one. mellanox used to allow to make your own, but they don't anymore. > > > > > 00154843 : ESS ConnectX-3 performance issue - spinning on > pthread_spin_lock > > > > you may ask your mellanox representative if they can get you access to > this > > defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > > cards its a general issue that affects all cards and on intel as well as > > Power. > ok, thanks for this. 
maybe such a reference is enough for dell to update > their firmware. > > stijn > > > > > On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt > > wrote: > > > >> hi all, > >> > >> is there any documentation wrt data integrity in spectrum scale: > >> assuming a crappy network, does gpfs garantee somehow that data written > >> by client ends up safe in the nsd gpfs daemon; and similarly from the > >> nsd gpfs daemon to disk. > >> > >> and wrt crappy network, what about rdma on crappy network? is it the > same? > >> > >> (we are hunting down a crappy infiniband issue; ibm support says it's > >> network issue; and we see no errors anywhere...) > >> > >> thanks a lot, > >> > >> stijn > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 19:53:09 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:53:09 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> Message-ID: <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> yes ;) the system is in preproduction, so nothing that can't stopped/started in a few minutes (current setup has only 4 nsds, and no clients). mmfsck triggers the errors very early during inode replica compare. stijn On 08/02/2017 08:47 PM, Sven Oehme wrote: > How can you reproduce this so quick ? > Did you restart all daemons after that ? > > On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > wrote: > >> hi sven, >> >> >>> the very first thing you should check is if you have this setting set : >> maybe the very first thing to check should be the faq/wiki that has this >> documented? >> >>> >>> mmlsconfig envVar >>> >>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>> MLX5_USE_MUTEX 1 >>> >>> if that doesn't come back the way above you need to set it : >>> >>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >> i just set this (wasn't set before), but problem is still present. >> >>> >>> there was a problem in the Mellanox FW in various versions that was never >>> completely addressed (bugs where found and fixed, but it was never fully >>> proven to be addressed) the above environment variables turn code on in >> the >>> mellanox driver that prevents this potential code path from being used to >>> begin with. >>> >>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale >>> that even you don't set this variables the problem can't happen anymore >>> until then the only choice you have is the envVar above (which btw ships >> as >>> default on all ESS systems). >>> >>> you also should be on the latest available Mellanox FW & Drivers as not >> all >>> versions even have the code that is activated by the environment >> variables >>> above, i think at a minimum you need to be at 3.4 but i don't remember >> the >>> exact version. 
There had been multiple defects opened around this area, >> the >>> last one i remember was : >> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >> dell, and the fw is a bit behind. i'm trying to convince dell to make >> new one. mellanox used to allow to make your own, but they don't anymore. >> >>> >>> 00154843 : ESS ConnectX-3 performance issue - spinning on >> pthread_spin_lock >>> >>> you may ask your mellanox representative if they can get you access to >> this >>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>> cards its a general issue that affects all cards and on intel as well as >>> Power. >> ok, thanks for this. maybe such a reference is enough for dell to update >> their firmware. >> >> stijn >> >>> >>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt >>> wrote: >>> >>>> hi all, >>>> >>>> is there any documentation wrt data integrity in spectrum scale: >>>> assuming a crappy network, does gpfs garantee somehow that data written >>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>> nsd gpfs daemon to disk. >>>> >>>> and wrt crappy network, what about rdma on crappy network? is it the >> same? >>>> >>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>> network issue; and we see no errors anywhere...) >>>> >>>> thanks a lot, >>>> >>>> stijn >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 20:10:07 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 19:10:07 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> Message-ID: ok, i think i understand now, the data was already corrupted. the config change i proposed only prevents a potentially known future on the wire corruption, this will not fix something that made it to the disk already. Sven On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt wrote: > yes ;) > > the system is in preproduction, so nothing that can't stopped/started in > a few minutes (current setup has only 4 nsds, and no clients). > mmfsck triggers the errors very early during inode replica compare. > > > stijn > > On 08/02/2017 08:47 PM, Sven Oehme wrote: > > How can you reproduce this so quick ? > > Did you restart all daemons after that ? > > > > On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > > wrote: > > > >> hi sven, > >> > >> > >>> the very first thing you should check is if you have this setting set : > >> maybe the very first thing to check should be the faq/wiki that has this > >> documented? 
> >> > >>> > >>> mmlsconfig envVar > >>> > >>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>> MLX5_USE_MUTEX 1 > >>> > >>> if that doesn't come back the way above you need to set it : > >>> > >>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >> i just set this (wasn't set before), but problem is still present. > >> > >>> > >>> there was a problem in the Mellanox FW in various versions that was > never > >>> completely addressed (bugs where found and fixed, but it was never > fully > >>> proven to be addressed) the above environment variables turn code on in > >> the > >>> mellanox driver that prevents this potential code path from being used > to > >>> begin with. > >>> > >>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > Scale > >>> that even you don't set this variables the problem can't happen anymore > >>> until then the only choice you have is the envVar above (which btw > ships > >> as > >>> default on all ESS systems). > >>> > >>> you also should be on the latest available Mellanox FW & Drivers as not > >> all > >>> versions even have the code that is activated by the environment > >> variables > >>> above, i think at a minimum you need to be at 3.4 but i don't remember > >> the > >>> exact version. There had been multiple defects opened around this area, > >> the > >>> last one i remember was : > >> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >> dell, and the fw is a bit behind. i'm trying to convince dell to make > >> new one. mellanox used to allow to make your own, but they don't > anymore. > >> > >>> > >>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >> pthread_spin_lock > >>> > >>> you may ask your mellanox representative if they can get you access to > >> this > >>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > >>> cards its a general issue that affects all cards and on intel as well > as > >>> Power. > >> ok, thanks for this. maybe such a reference is enough for dell to update > >> their firmware. > >> > >> stijn > >> > >>> > >>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be> > >>> wrote: > >>> > >>>> hi all, > >>>> > >>>> is there any documentation wrt data integrity in spectrum scale: > >>>> assuming a crappy network, does gpfs garantee somehow that data > written > >>>> by client ends up safe in the nsd gpfs daemon; and similarly from the > >>>> nsd gpfs daemon to disk. > >>>> > >>>> and wrt crappy network, what about rdma on crappy network? is it the > >> same? > >>>> > >>>> (we are hunting down a crappy infiniband issue; ibm support says it's > >>>> network issue; and we see no errors anywhere...) 
> >>>> > >>>> thanks a lot, > >>>> > >>>> stijn > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 20:20:14 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 21:20:14 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> Message-ID: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> hi sven, the data is not corrupted. mmfsck compares 2 inodes, says they don't match, but checking the data with tbdbfs reveals they are equal. (one replica has to be fetched over the network; the nsds cannot access all disks) with some nsdChksum... settings we get during this mmfsck a lot of "Encountered XYZ checksum errors on network I/O to NSD Client disk" ibm support says these are hardware issues, but wrt to mmfsck false positives. anyway, our current question is: if these are hardware issues, is there anything in gpfs client->nsd (on the network side) that would detect such errors. ie can we trust the data (and metadata). i was under the impression that client to disk is not covered, but i assumed that at least client to nsd (the network part) was checksummed. stijn On 08/02/2017 09:10 PM, Sven Oehme wrote: > ok, i think i understand now, the data was already corrupted. the config > change i proposed only prevents a potentially known future on the wire > corruption, this will not fix something that made it to the disk already. > > Sven > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > wrote: > >> yes ;) >> >> the system is in preproduction, so nothing that can't stopped/started in >> a few minutes (current setup has only 4 nsds, and no clients). >> mmfsck triggers the errors very early during inode replica compare. >> >> >> stijn >> >> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>> How can you reproduce this so quick ? >>> Did you restart all daemons after that ? >>> >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >>> wrote: >>> >>>> hi sven, >>>> >>>> >>>>> the very first thing you should check is if you have this setting set : >>>> maybe the very first thing to check should be the faq/wiki that has this >>>> documented? 
>>>> >>>>> >>>>> mmlsconfig envVar >>>>> >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>> MLX5_USE_MUTEX 1 >>>>> >>>>> if that doesn't come back the way above you need to set it : >>>>> >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>> i just set this (wasn't set before), but problem is still present. >>>> >>>>> >>>>> there was a problem in the Mellanox FW in various versions that was >> never >>>>> completely addressed (bugs where found and fixed, but it was never >> fully >>>>> proven to be addressed) the above environment variables turn code on in >>>> the >>>>> mellanox driver that prevents this potential code path from being used >> to >>>>> begin with. >>>>> >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >> Scale >>>>> that even you don't set this variables the problem can't happen anymore >>>>> until then the only choice you have is the envVar above (which btw >> ships >>>> as >>>>> default on all ESS systems). >>>>> >>>>> you also should be on the latest available Mellanox FW & Drivers as not >>>> all >>>>> versions even have the code that is activated by the environment >>>> variables >>>>> above, i think at a minimum you need to be at 3.4 but i don't remember >>>> the >>>>> exact version. There had been multiple defects opened around this area, >>>> the >>>>> last one i remember was : >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>> new one. mellanox used to allow to make your own, but they don't >> anymore. >>>> >>>>> >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>> pthread_spin_lock >>>>> >>>>> you may ask your mellanox representative if they can get you access to >>>> this >>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>>>> cards its a general issue that affects all cards and on intel as well >> as >>>>> Power. >>>> ok, thanks for this. maybe such a reference is enough for dell to update >>>> their firmware. >>>> >>>> stijn >>>> >>>>> >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be> >>>>> wrote: >>>>> >>>>>> hi all, >>>>>> >>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>> assuming a crappy network, does gpfs garantee somehow that data >> written >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>>>> nsd gpfs daemon to disk. >>>>>> >>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>> same? >>>>>> >>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>>>> network issue; and we see no errors anywhere...) 
>>>>>> >>>>>> thanks a lot, >>>>>> >>>>>> stijn >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From ewahl at osc.edu Wed Aug 2 21:11:53 2017 From: ewahl at osc.edu (Edward Wahl) Date: Wed, 2 Aug 2017 16:11:53 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: <20170802161153.4eea6f61@osc.edu> What version of GPFS? Are you generating a patch file? Try using this before your mmfsck: mmdsh -N mmfsadm test fsck usePatchQueue 0 my notes say all, but I would have only had NSD nodes up at the time. Supposedly the mmfsck mess in 4.1 and 4.2.x was fixed in 4.2.2.3. I won't know for sure until late August. Ed On Wed, 2 Aug 2017 21:20:14 +0200 Stijn De Weirdt wrote: > hi sven, > > the data is not corrupted. mmfsck compares 2 inodes, says they don't > match, but checking the data with tbdbfs reveals they are equal. > (one replica has to be fetched over the network; the nsds cannot access > all disks) > > with some nsdChksum... settings we get during this mmfsck a lot of > "Encountered XYZ checksum errors on network I/O to NSD Client disk" > > ibm support says these are hardware issues, but wrt to mmfsck false > positives. > > anyway, our current question is: if these are hardware issues, is there > anything in gpfs client->nsd (on the network side) that would detect > such errors. ie can we trust the data (and metadata). > i was under the impression that client to disk is not covered, but i > assumed that at least client to nsd (the network part) was checksummed. > > stijn > > > On 08/02/2017 09:10 PM, Sven Oehme wrote: > > ok, i think i understand now, the data was already corrupted. the config > > change i proposed only prevents a potentially known future on the wire > > corruption, this will not fix something that made it to the disk already. > > > > Sven > > > > > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > > wrote: > > > >> yes ;) > >> > >> the system is in preproduction, so nothing that can't stopped/started in > >> a few minutes (current setup has only 4 nsds, and no clients). > >> mmfsck triggers the errors very early during inode replica compare. 
> >> > >> > >> stijn > >> > >> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>> How can you reproduce this so quick ? > >>> Did you restart all daemons after that ? > >>> > >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > >>> wrote: > >>> > >>>> hi sven, > >>>> > >>>> > >>>>> the very first thing you should check is if you have this setting > >>>>> set : > >>>> maybe the very first thing to check should be the faq/wiki that has this > >>>> documented? > >>>> > >>>>> > >>>>> mmlsconfig envVar > >>>>> > >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>>>> MLX5_USE_MUTEX 1 > >>>>> > >>>>> if that doesn't come back the way above you need to set it : > >>>>> > >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>> i just set this (wasn't set before), but problem is still present. > >>>> > >>>>> > >>>>> there was a problem in the Mellanox FW in various versions that was > >> never > >>>>> completely addressed (bugs where found and fixed, but it was never > >> fully > >>>>> proven to be addressed) the above environment variables turn code on > >>>>> in > >>>> the > >>>>> mellanox driver that prevents this potential code path from being used > >> to > >>>>> begin with. > >>>>> > >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >> Scale > >>>>> that even you don't set this variables the problem can't happen anymore > >>>>> until then the only choice you have is the envVar above (which btw > >> ships > >>>> as > >>>>> default on all ESS systems). > >>>>> > >>>>> you also should be on the latest available Mellanox FW & Drivers as > >>>>> not > >>>> all > >>>>> versions even have the code that is activated by the environment > >>>> variables > >>>>> above, i think at a minimum you need to be at 3.4 but i don't remember > >>>> the > >>>>> exact version. There had been multiple defects opened around this > >>>>> area, > >>>> the > >>>>> last one i remember was : > >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make > >>>> new one. mellanox used to allow to make your own, but they don't > >> anymore. > >>>> > >>>>> > >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>> pthread_spin_lock > >>>>> > >>>>> you may ask your mellanox representative if they can get you access to > >>>> this > >>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > >>>>> cards its a general issue that affects all cards and on intel as well > >> as > >>>>> Power. > >>>> ok, thanks for this. maybe such a reference is enough for dell to update > >>>> their firmware. > >>>> > >>>> stijn > >>>> > >>>>> > >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >> stijn.deweirdt at ugent.be> > >>>>> wrote: > >>>>> > >>>>>> hi all, > >>>>>> > >>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>> assuming a crappy network, does gpfs garantee somehow that data > >> written > >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the > >>>>>> nsd gpfs daemon to disk. > >>>>>> > >>>>>> and wrt crappy network, what about rdma on crappy network? is it the > >>>> same? > >>>>>> > >>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's > >>>>>> network issue; and we see no errors anywhere...) 
> >>>>>> > >>>>>> thanks a lot, > >>>>>> > >>>>>> stijn > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From stijn.deweirdt at ugent.be Wed Aug 2 21:38:29 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 22:38:29 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <20170802161153.4eea6f61@osc.edu> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> <20170802161153.4eea6f61@osc.edu> Message-ID: <393b54ec-ec6a-040b-ef04-6076632db60c@ugent.be> hi ed, On 08/02/2017 10:11 PM, Edward Wahl wrote: > What version of GPFS? Are you generating a patch file? 4.2.3 series, now we run 4.2.3.3 to be clear, right now we use mmfsck to trigger the chksum issue hoping we can find the actual "hardware" issue. we know by elimination which HCAs to avoid, so we do not get the checksum errors. but to consider that a fix, we need to know if the data written by the client can be trusted due to these silent hw errors. > > Try using this before your mmfsck: > > mmdsh -N mmfsadm test fsck usePatchQueue 0 mmchmgr somefs nsdXYZ mmfsck somefs -Vn -m -N nsdXYZ -t /var/tmp/ the idea is to force everything as much as possible on one node, accessing the other failure group is forced over network > > my notes say all, but I would have only had NSD nodes up at the time. > Supposedly the mmfsck mess in 4.1 and 4.2.x was fixed in 4.2.2.3. we had the "pleasure" last to have mmfsck segfaulting while we were trying to recover a filesystem, at least that was certainly fixed ;) stijn > I won't know for sure until late August. > > Ed > > > On Wed, 2 Aug 2017 21:20:14 +0200 > Stijn De Weirdt wrote: > >> hi sven, >> >> the data is not corrupted. mmfsck compares 2 inodes, says they don't >> match, but checking the data with tbdbfs reveals they are equal. >> (one replica has to be fetched over the network; the nsds cannot access >> all disks) >> >> with some nsdChksum... 
settings we get during this mmfsck a lot of >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >> >> ibm support says these are hardware issues, but wrt to mmfsck false >> positives. >> >> anyway, our current question is: if these are hardware issues, is there >> anything in gpfs client->nsd (on the network side) that would detect >> such errors. ie can we trust the data (and metadata). >> i was under the impression that client to disk is not covered, but i >> assumed that at least client to nsd (the network part) was checksummed. >> >> stijn >> >> >> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>> ok, i think i understand now, the data was already corrupted. the config >>> change i proposed only prevents a potentially known future on the wire >>> corruption, this will not fix something that made it to the disk already. >>> >>> Sven >>> >>> >>> >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt >>> wrote: >>> >>>> yes ;) >>>> >>>> the system is in preproduction, so nothing that can't stopped/started in >>>> a few minutes (current setup has only 4 nsds, and no clients). >>>> mmfsck triggers the errors very early during inode replica compare. >>>> >>>> >>>> stijn >>>> >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>> How can you reproduce this so quick ? >>>>> Did you restart all daemons after that ? >>>>> >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >>>>> wrote: >>>>> >>>>>> hi sven, >>>>>> >>>>>> >>>>>>> the very first thing you should check is if you have this setting >>>>>>> set : >>>>>> maybe the very first thing to check should be the faq/wiki that has this >>>>>> documented? >>>>>> >>>>>>> >>>>>>> mmlsconfig envVar >>>>>>> >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>>>> MLX5_USE_MUTEX 1 >>>>>>> >>>>>>> if that doesn't come back the way above you need to set it : >>>>>>> >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>> i just set this (wasn't set before), but problem is still present. >>>>>> >>>>>>> >>>>>>> there was a problem in the Mellanox FW in various versions that was >>>> never >>>>>>> completely addressed (bugs where found and fixed, but it was never >>>> fully >>>>>>> proven to be addressed) the above environment variables turn code on >>>>>>> in >>>>>> the >>>>>>> mellanox driver that prevents this potential code path from being used >>>> to >>>>>>> begin with. >>>>>>> >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>> Scale >>>>>>> that even you don't set this variables the problem can't happen anymore >>>>>>> until then the only choice you have is the envVar above (which btw >>>> ships >>>>>> as >>>>>>> default on all ESS systems). >>>>>>> >>>>>>> you also should be on the latest available Mellanox FW & Drivers as >>>>>>> not >>>>>> all >>>>>>> versions even have the code that is activated by the environment >>>>>> variables >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't remember >>>>>> the >>>>>>> exact version. There had been multiple defects opened around this >>>>>>> area, >>>>>> the >>>>>>> last one i remember was : >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>>>> new one. mellanox used to allow to make your own, but they don't >>>> anymore. 
>>>>>> >>>>>>> >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>> pthread_spin_lock >>>>>>> >>>>>>> you may ask your mellanox representative if they can get you access to >>>>>> this >>>>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>>>>>> cards its a general issue that affects all cards and on intel as well >>>> as >>>>>>> Power. >>>>>> ok, thanks for this. maybe such a reference is enough for dell to update >>>>>> their firmware. >>>>>> >>>>>> stijn >>>>>> >>>>>>> >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>> stijn.deweirdt at ugent.be> >>>>>>> wrote: >>>>>>> >>>>>>>> hi all, >>>>>>>> >>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>> written >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>>>>>> nsd gpfs daemon to disk. >>>>>>>> >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>>>> same? >>>>>>>> >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>> >>>>>>>> thanks a lot, >>>>>>>> >>>>>>>> stijn >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > From eric.wonderley at vt.edu Wed Aug 2 22:02:20 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Wed, 2 Aug 2017 17:02:20 -0400 Subject: [gpfsug-discuss] mmsetquota produces error Message-ID: for one of our home filesystem we get: mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'nathanfootest' error (22): 'Invalid argument'. mmedquota -j home:nathanfootest does work however -------------- next part -------------- An HTML attachment was scrubbed... 
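
A hedged first check for the tssetquota error above, assuming the fileset name is spelled exactly as it was created: confirm that the fileset has an id and is linked, since the fileset id is what the command says it cannot resolve. mmlsfileset is the standard command for this; whether an unlinked or not-yet-created fileset actually explains the failure here is only a guess:

   # show the fileset id, link status and junction path
   mmlsfileset home nathanfootest -L

   # if the fileset looks sane, retry the original command
   mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M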
URL: From oehmes at gmail.com Wed Aug 2 22:05:18 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 21:05:18 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: before i answer the rest of your questions, can you share what version of GPFS exactly you are on mmfsadm dump version would be best source for that. if you have 2 inodes and you know the exact address of where they are stored on disk one could 'dd' them of the disk and compare if they are really equal. we only support checksums when you use GNR based systems, they cover network as well as Disk side for that. the nsdchecksum code you refer to is the one i mentioned above thats only supported with GNR at least i am not aware that we ever claimed it to be supported outside of it, but i can check that. sven On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt wrote: > hi sven, > > the data is not corrupted. mmfsck compares 2 inodes, says they don't > match, but checking the data with tbdbfs reveals they are equal. > (one replica has to be fetched over the network; the nsds cannot access > all disks) > > with some nsdChksum... settings we get during this mmfsck a lot of > "Encountered XYZ checksum errors on network I/O to NSD Client disk" > > ibm support says these are hardware issues, but wrt to mmfsck false > positives. > > anyway, our current question is: if these are hardware issues, is there > anything in gpfs client->nsd (on the network side) that would detect > such errors. ie can we trust the data (and metadata). > i was under the impression that client to disk is not covered, but i > assumed that at least client to nsd (the network part) was checksummed. > > stijn > > > On 08/02/2017 09:10 PM, Sven Oehme wrote: > > ok, i think i understand now, the data was already corrupted. the config > > change i proposed only prevents a potentially known future on the wire > > corruption, this will not fix something that made it to the disk already. > > > > Sven > > > > > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > > > wrote: > > > >> yes ;) > >> > >> the system is in preproduction, so nothing that can't stopped/started in > >> a few minutes (current setup has only 4 nsds, and no clients). > >> mmfsck triggers the errors very early during inode replica compare. > >> > >> > >> stijn > >> > >> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>> How can you reproduce this so quick ? > >>> Did you restart all daemons after that ? > >>> > >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > > >>> wrote: > >>> > >>>> hi sven, > >>>> > >>>> > >>>>> the very first thing you should check is if you have this setting > set : > >>>> maybe the very first thing to check should be the faq/wiki that has > this > >>>> documented? > >>>> > >>>>> > >>>>> mmlsconfig envVar > >>>>> > >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>>>> MLX5_USE_MUTEX 1 > >>>>> > >>>>> if that doesn't come back the way above you need to set it : > >>>>> > >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>> i just set this (wasn't set before), but problem is still present. 
> >>>> > >>>>> > >>>>> there was a problem in the Mellanox FW in various versions that was > >> never > >>>>> completely addressed (bugs where found and fixed, but it was never > >> fully > >>>>> proven to be addressed) the above environment variables turn code on > in > >>>> the > >>>>> mellanox driver that prevents this potential code path from being > used > >> to > >>>>> begin with. > >>>>> > >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >> Scale > >>>>> that even you don't set this variables the problem can't happen > anymore > >>>>> until then the only choice you have is the envVar above (which btw > >> ships > >>>> as > >>>>> default on all ESS systems). > >>>>> > >>>>> you also should be on the latest available Mellanox FW & Drivers as > not > >>>> all > >>>>> versions even have the code that is activated by the environment > >>>> variables > >>>>> above, i think at a minimum you need to be at 3.4 but i don't > remember > >>>> the > >>>>> exact version. There had been multiple defects opened around this > area, > >>>> the > >>>>> last one i remember was : > >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make > >>>> new one. mellanox used to allow to make your own, but they don't > >> anymore. > >>>> > >>>>> > >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>> pthread_spin_lock > >>>>> > >>>>> you may ask your mellanox representative if they can get you access > to > >>>> this > >>>>> defect. while it was found on ESS , means on PPC64 and with > ConnectX-3 > >>>>> cards its a general issue that affects all cards and on intel as well > >> as > >>>>> Power. > >>>> ok, thanks for this. maybe such a reference is enough for dell to > update > >>>> their firmware. > >>>> > >>>> stijn > >>>> > >>>>> > >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >> stijn.deweirdt at ugent.be> > >>>>> wrote: > >>>>> > >>>>>> hi all, > >>>>>> > >>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>> assuming a crappy network, does gpfs garantee somehow that data > >> written > >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from > the > >>>>>> nsd gpfs daemon to disk. > >>>>>> > >>>>>> and wrt crappy network, what about rdma on crappy network? is it the > >>>> same? > >>>>>> > >>>>>> (we are hunting down a crappy infiniband issue; ibm support says > it's > >>>>>> network issue; and we see no errors anywhere...) 
> >>>>>> > >>>>>> thanks a lot, > >>>>>> > >>>>>> stijn > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 22:14:45 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:14:45 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: hi sven, > before i answer the rest of your questions, can you share what version of > GPFS exactly you are on mmfsadm dump version would be best source for that. it returns Build branch "4.2.3.3 ". > if you have 2 inodes and you know the exact address of where they are > stored on disk one could 'dd' them of the disk and compare if they are > really equal. ok, i can try that later. are you suggesting that the "tsdbfs comp" might gave wrong results? because we ran that and got eg > # tsdbfs somefs comp 7:5137408 25:221785088 1024 > Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = 0x19:D382C00: > All sectors identical > we only support checksums when you use GNR based systems, they cover > network as well as Disk side for that. > the nsdchecksum code you refer to is the one i mentioned above thats only > supported with GNR at least i am not aware that we ever claimed it to be > supported outside of it, but i can check that. ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, and they are not in the same gpfs cluster. i thought the GNR extended the checksumming to disk, and that it was already there for the network part. thanks for clearing this up. but that is worse then i thought... stijn > > sven > > On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt > wrote: > >> hi sven, >> >> the data is not corrupted. mmfsck compares 2 inodes, says they don't >> match, but checking the data with tbdbfs reveals they are equal. >> (one replica has to be fetched over the network; the nsds cannot access >> all disks) >> >> with some nsdChksum... 
settings we get during this mmfsck a lot of >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >> >> ibm support says these are hardware issues, but wrt to mmfsck false >> positives. >> >> anyway, our current question is: if these are hardware issues, is there >> anything in gpfs client->nsd (on the network side) that would detect >> such errors. ie can we trust the data (and metadata). >> i was under the impression that client to disk is not covered, but i >> assumed that at least client to nsd (the network part) was checksummed. >> >> stijn >> >> >> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>> ok, i think i understand now, the data was already corrupted. the config >>> change i proposed only prevents a potentially known future on the wire >>> corruption, this will not fix something that made it to the disk already. >>> >>> Sven >>> >>> >>> >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt >> >>> wrote: >>> >>>> yes ;) >>>> >>>> the system is in preproduction, so nothing that can't stopped/started in >>>> a few minutes (current setup has only 4 nsds, and no clients). >>>> mmfsck triggers the errors very early during inode replica compare. >>>> >>>> >>>> stijn >>>> >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>> How can you reproduce this so quick ? >>>>> Did you restart all daemons after that ? >>>>> >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >> >>>>> wrote: >>>>> >>>>>> hi sven, >>>>>> >>>>>> >>>>>>> the very first thing you should check is if you have this setting >> set : >>>>>> maybe the very first thing to check should be the faq/wiki that has >> this >>>>>> documented? >>>>>> >>>>>>> >>>>>>> mmlsconfig envVar >>>>>>> >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>>>> MLX5_USE_MUTEX 1 >>>>>>> >>>>>>> if that doesn't come back the way above you need to set it : >>>>>>> >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>> i just set this (wasn't set before), but problem is still present. >>>>>> >>>>>>> >>>>>>> there was a problem in the Mellanox FW in various versions that was >>>> never >>>>>>> completely addressed (bugs where found and fixed, but it was never >>>> fully >>>>>>> proven to be addressed) the above environment variables turn code on >> in >>>>>> the >>>>>>> mellanox driver that prevents this potential code path from being >> used >>>> to >>>>>>> begin with. >>>>>>> >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>> Scale >>>>>>> that even you don't set this variables the problem can't happen >> anymore >>>>>>> until then the only choice you have is the envVar above (which btw >>>> ships >>>>>> as >>>>>>> default on all ESS systems). >>>>>>> >>>>>>> you also should be on the latest available Mellanox FW & Drivers as >> not >>>>>> all >>>>>>> versions even have the code that is activated by the environment >>>>>> variables >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't >> remember >>>>>> the >>>>>>> exact version. There had been multiple defects opened around this >> area, >>>>>> the >>>>>>> last one i remember was : >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>>>> new one. mellanox used to allow to make your own, but they don't >>>> anymore. 
>>>>>> >>>>>>> >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>> pthread_spin_lock >>>>>>> >>>>>>> you may ask your mellanox representative if they can get you access >> to >>>>>> this >>>>>>> defect. while it was found on ESS , means on PPC64 and with >> ConnectX-3 >>>>>>> cards its a general issue that affects all cards and on intel as well >>>> as >>>>>>> Power. >>>>>> ok, thanks for this. maybe such a reference is enough for dell to >> update >>>>>> their firmware. >>>>>> >>>>>> stijn >>>>>> >>>>>>> >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>> stijn.deweirdt at ugent.be> >>>>>>> wrote: >>>>>>> >>>>>>>> hi all, >>>>>>>> >>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>> written >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from >> the >>>>>>>> nsd gpfs daemon to disk. >>>>>>>> >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>>>> same? >>>>>>>> >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says >> it's >>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>> >>>>>>>> thanks a lot, >>>>>>>> >>>>>>>> stijn >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 22:23:44 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 21:23:44 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: ok, you can't be any newer that that. i just wonder why you have 512b inodes if this is a new system ? are this raw disks in this setup or raid controllers ? 
whats the disk sector size and how was the filesystem created (mmlsfs FSNAME would show answer to the last question) on the tsdbfs i am not sure if it gave wrong results, but it would be worth a test to see whats actually on the disk . you are correct that GNR extends this to the disk, but the network part is covered by the nsdchecksums you turned on when you enable the not to be named checksum parameter do you actually still get an error from fsck ? sven On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt wrote: > hi sven, > > > before i answer the rest of your questions, can you share what version of > > GPFS exactly you are on mmfsadm dump version would be best source for > that. > it returns > Build branch "4.2.3.3 ". > > > if you have 2 inodes and you know the exact address of where they are > > stored on disk one could 'dd' them of the disk and compare if they are > > really equal. > ok, i can try that later. are you suggesting that the "tsdbfs comp" > might gave wrong results? because we ran that and got eg > > > # tsdbfs somefs comp 7:5137408 25:221785088 1024 > > Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = > 0x19:D382C00: > > All sectors identical > > > > we only support checksums when you use GNR based systems, they cover > > network as well as Disk side for that. > > the nsdchecksum code you refer to is the one i mentioned above thats only > > supported with GNR at least i am not aware that we ever claimed it to be > > supported outside of it, but i can check that. > ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, > and they are not in the same gpfs cluster. > > i thought the GNR extended the checksumming to disk, and that it was > already there for the network part. thanks for clearing this up. but > that is worse then i thought... > > stijn > > > > > sven > > > > On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt > > > wrote: > > > >> hi sven, > >> > >> the data is not corrupted. mmfsck compares 2 inodes, says they don't > >> match, but checking the data with tbdbfs reveals they are equal. > >> (one replica has to be fetched over the network; the nsds cannot access > >> all disks) > >> > >> with some nsdChksum... settings we get during this mmfsck a lot of > >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" > >> > >> ibm support says these are hardware issues, but wrt to mmfsck false > >> positives. > >> > >> anyway, our current question is: if these are hardware issues, is there > >> anything in gpfs client->nsd (on the network side) that would detect > >> such errors. ie can we trust the data (and metadata). > >> i was under the impression that client to disk is not covered, but i > >> assumed that at least client to nsd (the network part) was checksummed. > >> > >> stijn > >> > >> > >> On 08/02/2017 09:10 PM, Sven Oehme wrote: > >>> ok, i think i understand now, the data was already corrupted. the > config > >>> change i proposed only prevents a potentially known future on the wire > >>> corruption, this will not fix something that made it to the disk > already. > >>> > >>> Sven > >>> > >>> > >>> > >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be > >>> > >>> wrote: > >>> > >>>> yes ;) > >>>> > >>>> the system is in preproduction, so nothing that can't stopped/started > in > >>>> a few minutes (current setup has only 4 nsds, and no clients). > >>>> mmfsck triggers the errors very early during inode replica compare. 
> >>>> > >>>> > >>>> stijn > >>>> > >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>>>> How can you reproduce this so quick ? > >>>>> Did you restart all daemons after that ? > >>>>> > >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be > >>> > >>>>> wrote: > >>>>> > >>>>>> hi sven, > >>>>>> > >>>>>> > >>>>>>> the very first thing you should check is if you have this setting > >> set : > >>>>>> maybe the very first thing to check should be the faq/wiki that has > >> this > >>>>>> documented? > >>>>>> > >>>>>>> > >>>>>>> mmlsconfig envVar > >>>>>>> > >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF > 1 > >>>>>>> MLX5_USE_MUTEX 1 > >>>>>>> > >>>>>>> if that doesn't come back the way above you need to set it : > >>>>>>> > >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>>>> i just set this (wasn't set before), but problem is still present. > >>>>>> > >>>>>>> > >>>>>>> there was a problem in the Mellanox FW in various versions that was > >>>> never > >>>>>>> completely addressed (bugs where found and fixed, but it was never > >>>> fully > >>>>>>> proven to be addressed) the above environment variables turn code > on > >> in > >>>>>> the > >>>>>>> mellanox driver that prevents this potential code path from being > >> used > >>>> to > >>>>>>> begin with. > >>>>>>> > >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >>>> Scale > >>>>>>> that even you don't set this variables the problem can't happen > >> anymore > >>>>>>> until then the only choice you have is the envVar above (which btw > >>>> ships > >>>>>> as > >>>>>>> default on all ESS systems). > >>>>>>> > >>>>>>> you also should be on the latest available Mellanox FW & Drivers as > >> not > >>>>>> all > >>>>>>> versions even have the code that is activated by the environment > >>>>>> variables > >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't > >> remember > >>>>>> the > >>>>>>> exact version. There had been multiple defects opened around this > >> area, > >>>>>> the > >>>>>>> last one i remember was : > >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards > from > >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to > make > >>>>>> new one. mellanox used to allow to make your own, but they don't > >>>> anymore. > >>>>>> > >>>>>>> > >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>>>> pthread_spin_lock > >>>>>>> > >>>>>>> you may ask your mellanox representative if they can get you access > >> to > >>>>>> this > >>>>>>> defect. while it was found on ESS , means on PPC64 and with > >> ConnectX-3 > >>>>>>> cards its a general issue that affects all cards and on intel as > well > >>>> as > >>>>>>> Power. > >>>>>> ok, thanks for this. maybe such a reference is enough for dell to > >> update > >>>>>> their firmware. > >>>>>> > >>>>>> stijn > >>>>>> > >>>>>>> > >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >>>> stijn.deweirdt at ugent.be> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> hi all, > >>>>>>>> > >>>>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data > >>>> written > >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from > >> the > >>>>>>>> nsd gpfs daemon to disk. > >>>>>>>> > >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it > the > >>>>>> same? 
> >>>>>>>> > >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says > >> it's > >>>>>>>> network issue; and we see no errors anywhere...) > >>>>>>>> > >>>>>>>> thanks a lot, > >>>>>>>> > >>>>>>>> stijn > >>>>>>>> _______________________________________________ > >>>>>>>> gpfsug-discuss mailing list > >>>>>>>> gpfsug-discuss at spectrumscale.org > >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> gpfsug-discuss mailing list > >>>>>>> gpfsug-discuss at spectrumscale.org > >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sxiao at us.ibm.com Wed Aug 2 22:36:06 2017 From: sxiao at us.ibm.com (Steve Xiao) Date: Wed, 2 Aug 2017 17:36:06 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: Message-ID: The nsdChksum settings for none GNR/ESS based system is not officially supported. It will perform checksum on data transfer over the network only and can be used to help debug data corruption when network is a suspect. Did any of those "Encountered XYZ checksum errors on network I/O to NSD Client disk" warning messages resulted in disk been changed to "down" state due to IO error? If no disk IO error was reported in GPFS log, that means data was retransmitted successfully on retry. As sven said, only GNR/ESS provids the full end to end data integrity. Steve Y. Xiao -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 22:47:36 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:47:36 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: hi sven, > ok, you can't be any newer that that. 
i just wonder why you have 512b > inodes if this is a new system ? because we rsynced 100M files to it ;) it's supposed to replace another system. > are this raw disks in this setup or raid controllers ? raid (DDP on MD3460) > whats the disk sector size euhm, you mean the luns? for metadata disks (SSD in raid 1): > # parted /dev/mapper/f1v01e0g0_Dm01o0 > GNU Parted 3.1 > Using /dev/mapper/f1v01e0g0_Dm01o0 > Welcome to GNU Parted! Type 'help' to view a list of commands. > (parted) p > Model: Linux device-mapper (multipath) (dm) > Disk /dev/mapper/f1v01e0g0_Dm01o0: 219GB > Sector size (logical/physical): 512B/512B > Partition Table: gpt > Disk Flags: > > Number Start End Size File system Name Flags > 1 24.6kB 219GB 219GB GPFS: hidden for data disks (DDP) > [root at nsd01 ~]# parted /dev/mapper/f1v01e0p0_S17o0 > GNU Parted 3.1 > Using /dev/mapper/f1v01e0p0_S17o0 > Welcome to GNU Parted! Type 'help' to view a list of commands. > (parted) p > Model: Linux device-mapper (multipath) (dm) > Disk /dev/mapper/f1v01e0p0_S17o0: 35.2TB > Sector size (logical/physical): 512B/4096B > Partition Table: gpt > Disk Flags: > > Number Start End Size File system Name Flags > 1 24.6kB 35.2TB 35.2TB GPFS: hidden > > (parted) q and how was the filesystem created (mmlsfs FSNAME would show > answer to the last question) > # mmlsfs somefilesystem > flag value description > ------------------- ------------------------ ----------------------------------- > -f 16384 Minimum fragment size in bytes (system pool) > 262144 Minimum fragment size in bytes (other pools) > -i 4096 Inode size in bytes > -I 32768 Indirect block size in bytes > -m 2 Default number of metadata replicas > -M 2 Maximum number of metadata replicas > -r 1 Default number of data replicas > -R 2 Maximum number of data replicas > -j scatter Block allocation type > -D nfs4 File locking semantics in effect > -k all ACL semantics in effect > -n 850 Estimated number of nodes that will mount file system > -B 524288 Block size (system pool) > 8388608 Block size (other pools) > -Q user;group;fileset Quotas accounting enabled > user;group;fileset Quotas enforced > none Default quotas enabled > --perfileset-quota Yes Per-fileset quota enforcement > --filesetdf Yes Fileset df enabled? > -V 17.00 (4.2.3.0) File system version > --create-time Wed May 31 12:54:00 2017 File system creation time > -z No Is DMAPI enabled? > -L 4194304 Logfile size > -E No Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation option > --fastea Yes Fast external attributes enabled? > --encryption No Encryption enabled? > --inode-limit 313524224 Maximum number of inodes in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? 
> --write-cache-threshold 0 HAWC Threshold (max 65536) > --subblocks-per-full-block 32 Number of subblocks per full block > -P system;MD3260 Disk storage pools in file system > -d f0v00e0g0_Sm00o0;f0v00e0p0_S00o0;f1v01e0g0_Sm01o0;f1v01e0p0_S01o0;f0v02e0g0_Sm02o0;f0v02e0p0_S02o0;f1v03e0g0_Sm03o0;f1v03e0p0_S03o0;f0v04e0g0_Sm04o0;f0v04e0p0_S04o0; > -d f1v05e0g0_Sm05o0;f1v05e0p0_S05o0;f0v06e0g0_Sm06o0;f0v06e0p0_S06o0;f1v07e0g0_Sm07o0;f1v07e0p0_S07o0;f0v00e0g0_Sm08o1;f0v00e0p0_S08o1;f1v01e0g0_Sm09o1;f1v01e0p0_S09o1; > -d f0v02e0g0_Sm10o1;f0v02e0p0_S10o1;f1v03e0g0_Sm11o1;f1v03e0p0_S11o1;f0v04e0g0_Sm12o1;f0v04e0p0_S12o1;f1v05e0g0_Sm13o1;f1v05e0p0_S13o1;f0v06e0g0_Sm14o1;f0v06e0p0_S14o1; > -d f1v07e0g0_Sm15o1;f1v07e0p0_S15o1;f0v00e0p0_S16o0;f1v01e0p0_S17o0;f0v02e0p0_S18o0;f1v03e0p0_S19o0;f0v04e0p0_S20o0;f1v05e0p0_S21o0;f0v06e0p0_S22o0;f1v07e0p0_S23o0; > -d f0v00e0p0_S24o1;f1v01e0p0_S25o1;f0v02e0p0_S26o1;f1v03e0p0_S27o1;f0v04e0p0_S28o1;f1v05e0p0_S29o1;f0v06e0p0_S30o1;f1v07e0p0_S31o1 Disks in file system > -A no Automatic mount option > -o none Additional mount options > -T /scratch Default mount point > --mount-priority 0 > > on the tsdbfs i am not sure if it gave wrong results, but it would be worth > a test to see whats actually on the disk . ok. i'll try this tomorrow. > > you are correct that GNR extends this to the disk, but the network part is > covered by the nsdchecksums you turned on > when you enable the not to be named checksum parameter do you actually > still get an error from fsck ? hah, no, we don't. mmfsck says the filesystem is clean. we found this odd, so we already asked ibm support about this but no answer yet. stijn > > sven > > > On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt > wrote: > >> hi sven, >> >>> before i answer the rest of your questions, can you share what version of >>> GPFS exactly you are on mmfsadm dump version would be best source for >> that. >> it returns >> Build branch "4.2.3.3 ". >> >>> if you have 2 inodes and you know the exact address of where they are >>> stored on disk one could 'dd' them of the disk and compare if they are >>> really equal. >> ok, i can try that later. are you suggesting that the "tsdbfs comp" >> might gave wrong results? because we ran that and got eg >> >>> # tsdbfs somefs comp 7:5137408 25:221785088 1024 >>> Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = >> 0x19:D382C00: >>> All sectors identical >> >> >>> we only support checksums when you use GNR based systems, they cover >>> network as well as Disk side for that. >>> the nsdchecksum code you refer to is the one i mentioned above thats only >>> supported with GNR at least i am not aware that we ever claimed it to be >>> supported outside of it, but i can check that. >> ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, >> and they are not in the same gpfs cluster. >> >> i thought the GNR extended the checksumming to disk, and that it was >> already there for the network part. thanks for clearing this up. but >> that is worse then i thought... >> >> stijn >> >>> >>> sven >>> >>> On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt >> >>> wrote: >>> >>>> hi sven, >>>> >>>> the data is not corrupted. mmfsck compares 2 inodes, says they don't >>>> match, but checking the data with tbdbfs reveals they are equal. >>>> (one replica has to be fetched over the network; the nsds cannot access >>>> all disks) >>>> >>>> with some nsdChksum... 
settings we get during this mmfsck a lot of >>>> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >>>> >>>> ibm support says these are hardware issues, but wrt to mmfsck false >>>> positives. >>>> >>>> anyway, our current question is: if these are hardware issues, is there >>>> anything in gpfs client->nsd (on the network side) that would detect >>>> such errors. ie can we trust the data (and metadata). >>>> i was under the impression that client to disk is not covered, but i >>>> assumed that at least client to nsd (the network part) was checksummed. >>>> >>>> stijn >>>> >>>> >>>> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>>>> ok, i think i understand now, the data was already corrupted. the >> config >>>>> change i proposed only prevents a potentially known future on the wire >>>>> corruption, this will not fix something that made it to the disk >> already. >>>>> >>>>> Sven >>>>> >>>>> >>>>> >>>>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be >>>>> >>>>> wrote: >>>>> >>>>>> yes ;) >>>>>> >>>>>> the system is in preproduction, so nothing that can't stopped/started >> in >>>>>> a few minutes (current setup has only 4 nsds, and no clients). >>>>>> mmfsck triggers the errors very early during inode replica compare. >>>>>> >>>>>> >>>>>> stijn >>>>>> >>>>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>>>> How can you reproduce this so quick ? >>>>>>> Did you restart all daemons after that ? >>>>>>> >>>>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be >>>>> >>>>>>> wrote: >>>>>>> >>>>>>>> hi sven, >>>>>>>> >>>>>>>> >>>>>>>>> the very first thing you should check is if you have this setting >>>> set : >>>>>>>> maybe the very first thing to check should be the faq/wiki that has >>>> this >>>>>>>> documented? >>>>>>>> >>>>>>>>> >>>>>>>>> mmlsconfig envVar >>>>>>>>> >>>>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF >> 1 >>>>>>>>> MLX5_USE_MUTEX 1 >>>>>>>>> >>>>>>>>> if that doesn't come back the way above you need to set it : >>>>>>>>> >>>>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>>>> i just set this (wasn't set before), but problem is still present. >>>>>>>> >>>>>>>>> >>>>>>>>> there was a problem in the Mellanox FW in various versions that was >>>>>> never >>>>>>>>> completely addressed (bugs where found and fixed, but it was never >>>>>> fully >>>>>>>>> proven to be addressed) the above environment variables turn code >> on >>>> in >>>>>>>> the >>>>>>>>> mellanox driver that prevents this potential code path from being >>>> used >>>>>> to >>>>>>>>> begin with. >>>>>>>>> >>>>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>>>> Scale >>>>>>>>> that even you don't set this variables the problem can't happen >>>> anymore >>>>>>>>> until then the only choice you have is the envVar above (which btw >>>>>> ships >>>>>>>> as >>>>>>>>> default on all ESS systems). >>>>>>>>> >>>>>>>>> you also should be on the latest available Mellanox FW & Drivers as >>>> not >>>>>>>> all >>>>>>>>> versions even have the code that is activated by the environment >>>>>>>> variables >>>>>>>>> above, i think at a minimum you need to be at 3.4 but i don't >>>> remember >>>>>>>> the >>>>>>>>> exact version. 
There had been multiple defects opened around this >>>> area, >>>>>>>> the >>>>>>>>> last one i remember was : >>>>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards >> from >>>>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to >> make >>>>>>>> new one. mellanox used to allow to make your own, but they don't >>>>>> anymore. >>>>>>>> >>>>>>>>> >>>>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>>>> pthread_spin_lock >>>>>>>>> >>>>>>>>> you may ask your mellanox representative if they can get you access >>>> to >>>>>>>> this >>>>>>>>> defect. while it was found on ESS , means on PPC64 and with >>>> ConnectX-3 >>>>>>>>> cards its a general issue that affects all cards and on intel as >> well >>>>>> as >>>>>>>>> Power. >>>>>>>> ok, thanks for this. maybe such a reference is enough for dell to >>>> update >>>>>>>> their firmware. >>>>>>>> >>>>>>>> stijn >>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>>>> stijn.deweirdt at ugent.be> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> hi all, >>>>>>>>>> >>>>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>>>> written >>>>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from >>>> the >>>>>>>>>> nsd gpfs daemon to disk. >>>>>>>>>> >>>>>>>>>> and wrt crappy network, what about rdma on crappy network? is it >> the >>>>>>>> same? >>>>>>>>>> >>>>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says >>>> it's >>>>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>>>> >>>>>>>>>> thanks a lot, >>>>>>>>>> >>>>>>>>>> stijn >>>>>>>>>> _______________________________________________ >>>>>>>>>> gpfsug-discuss mailing list >>>>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> gpfsug-discuss mailing list >>>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> 
http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From stijn.deweirdt at ugent.be Wed Aug 2 22:53:50 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:53:50 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: Message-ID: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> hi steve, > The nsdChksum settings for none GNR/ESS based system is not officially > supported. It will perform checksum on data transfer over the network > only and can be used to help debug data corruption when network is a > suspect. i'll take not officially supported over silent bitrot any day. > > Did any of those "Encountered XYZ checksum errors on network I/O to NSD > Client disk" warning messages resulted in disk been changed to "down" > state due to IO error? no. If no disk IO error was reported in GPFS log, > that means data was retransmitted successfully on retry. we suspected as much. as sven already asked, mmfsck now reports clean filesystem. i have an ibdump of 2 involved nsds during the reported checksums, i'll have a closer look if i can spot these retries. > > As sven said, only GNR/ESS provids the full end to end data integrity. so with the silent network error, we have high probabilty that the data is corrupted. we are now looking for a test to find out what adapters are affected. we hoped that nsdperf with verify=on would tell us, but it doesn't. > > Steve Y. Xiao > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From aaron.s.knister at nasa.gov Thu Aug 3 01:48:07 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 20:48:07 -0400 Subject: [gpfsug-discuss] documentation about version compatibility Message-ID: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> Hey All, I swear that some time recently someone posted a link to some IBM documentation that outlined the recommended versions of GPFS to upgrade to/from (e.g. if you're at 3.5 get to 4.1 before going to 4.2.3). I can't for the life of me find it. Does anyone know what I'm talking about? Thanks, Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Aug 3 02:00:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 21:00:00 -0400 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: I'm a little late to the party here but I thought I'd share our recent experiences. We recently completed a mass UID number migration (half a billion inodes) and developed two tools ("luke filewalker" and the "mmilleniumfacl") to get the job done. Both luke filewalker and the mmilleniumfacl are based heavily on the code in /usr/lpp/mmfs/samples/util/tsreaddir.c and /usr/lpp/mmfs/samples/util/tsinode.c. luke filewalker targets traditional POSIX permissions whereas mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem in parallel and both but particularly the 2nd, are extremely I/O intensive on your metadata disks. 
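For anyone with a smaller tree to re-own, a much simpler (and much slower) stand-in for the same idea can be built from standard tools alone. This is only a hedged sketch, not the API-based tools described in this thread; the path, UID numbers and parallelism level below are made-up examples:

    # re-own files from an old UID to a new one, several chown processes in
    # parallel; -h makes chown affect symlinks themselves (lchown-like)
    OLDUID=20123          # assumed example value
    NEWUID=12000123       # assumed example value
    find /gpfs/fs0/somefileset -xdev -uid "$OLDUID" -print0 \
      | xargs -0 -r -P 8 -n 1000 chown -h "$NEWUID"

Note that this stats every file on every pass and does nothing about ACL entries, which is exactly why a parallel inode-scan approach is worth the effort at the half-billion-inode scale discussed here.
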
The gist of luke filewalker is to scan the inode structures using the gpfs APIs and populate a mapping of inode number to gid and uid number. It then walks the filesystem in parallel using the APIs, looks up the inode number in an in-memory hash, and if appropriate changes ownership using the chown() API. The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs using the GPFS inode API so it walks the filesystem and reads the ACL of any and every file, updating the ACL entries as appropriate. I'm going to see if I can share the source code for both tools, although I don't know if I can post it here since it modified existing IBM source code. Could someone from IBM chime in here? If I were to send the code to IBM could they publish it perhaps on the wiki? -Aaron On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: > Hello, > > We're trying to change most of our users uids, is there a clean way to > migrate all of one users files with say `mmapplypolicy`? We have to change the > owner of around 273539588 files, and my estimates for runtime are around 6 days. > > What we've been doing is indexing all of the files and splitting them up by > owner which takes around an hour, and then we were locking the user out while we > chown their files. I made it multi threaded as it weirdly gave a 10% speedup > despite my expectation that multi threading access from a single node would not > give any speedup. > > Generally I'm looking for advice on how to make the chowning faster. Would > spreading the chowning processes over multiple nodes improve performance? Should > I not stat the files before running lchown on them, since lchown checks the file > before changing it? I saw mention of inodescan(), in an old gpfsug email, which > speeds up disk read access, by not guaranteeing that the data is up to date. We > have a maintenance day coming up where all users will be locked out, so the file > handles(?) from GPFS's perspective will not be able to go stale. Is there a > function with similar constraints to inodescan that I can use to speed up this > process? > > Thank you for your time, > > Luke > Storrs-HPC > University of Connecticut > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Aug 3 02:03:23 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 21:03:23 -0400 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: Oh, the one *huge* gotcha I thought I'd share-- we wrote a perl script to drive the migration and part of the perl script's process was to clone quotas from old uid numbers to the new number. I upset our GPFS cluster during a particular migration in which the user was over the grace period of the quota so after a certain point every chown() put the destination UID even further over its quota. The problem with this being that at this point every chown() operation would cause GPFS to do some cluster-wide quota accounting-related RPCs. That hurt. It's worth making sure there are no quotas defined for the destination UID numbers and if they are that the data coming from the source UID number will fit. 
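A quick pre-flight check along those lines might look like the following; the device name and user are illustrative only, and the mmsetquota line is shown commented out because clearing a limit is not always what you want:

    # confirm the destination user has no (or a big enough) quota before
    # re-owning data onto it
    NEWUSER=jdoe                          # assumed example value
    mmlsquota -u "$NEWUSER" fs0           # per-user limits on this file system
    mmrepquota -u fs0 | grep "$NEWUSER"   # same information in report form
    # to remove an unwanted limit (0 means no limit):
    # mmsetquota fs0 --user "$NEWUSER" --block 0:0 --files 0:0
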
-Aaron On 8/2/17 9:00 PM, Aaron Knister wrote: > I'm a little late to the party here but I thought I'd share our recent > experiences. > > We recently completed a mass UID number migration (half a billion > inodes) and developed two tools ("luke filewalker" and the > "mmilleniumfacl") to get the job done. Both luke filewalker and the > mmilleniumfacl are based heavily on the code in > /usr/lpp/mmfs/samples/util/tsreaddir.c and > /usr/lpp/mmfs/samples/util/tsinode.c. > > luke filewalker targets traditional POSIX permissions whereas > mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem in > parallel and both but particularly the 2nd, are extremely I/O intensive > on your metadata disks. > > The gist of luke filewalker is to scan the inode structures using the > gpfs APIs and populate a mapping of inode number to gid and uid number. > It then walks the filesystem in parallel using the APIs, looks up the > inode number in an in-memory hash, and if appropriate changes ownership > using the chown() API. > > The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs > using the GPFS inode API so it walks the filesystem and reads the ACL of > any and every file, updating the ACL entries as appropriate. > > I'm going to see if I can share the source code for both tools, although > I don't know if I can post it here since it modified existing IBM source > code. Could someone from IBM chime in here? If I were to send the code > to IBM could they publish it perhaps on the wiki? > > -Aaron > > On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: >> Hello, >> >> We're trying to change most of our users uids, is there a clean >> way to >> migrate all of one users files with say `mmapplypolicy`? We have to >> change the >> owner of around 273539588 files, and my estimates for runtime are >> around 6 days. >> >> What we've been doing is indexing all of the files and splitting >> them up by >> owner which takes around an hour, and then we were locking the user >> out while we >> chown their files. I made it multi threaded as it weirdly gave a 10% >> speedup >> despite my expectation that multi threading access from a single node >> would not >> give any speedup. >> >> Generally I'm looking for advice on how to make the chowning >> faster. Would >> spreading the chowning processes over multiple nodes improve >> performance? Should >> I not stat the files before running lchown on them, since lchown >> checks the file >> before changing it? I saw mention of inodescan(), in an old gpfsug >> email, which >> speeds up disk read access, by not guaranteeing that the data is up to >> date. We >> have a maintenance day coming up where all users will be locked out, >> so the file >> handles(?) from GPFS's perspective will not be able to go stale. Is >> there a >> function with similar constraints to inodescan that I can use to speed >> up this >> process? >> >> Thank you for your time, >> >> Luke >> Storrs-HPC >> University of Connecticut >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From scale at us.ibm.com Thu Aug 3 06:18:46 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Thu, 3 Aug 2017 13:18:46 +0800 Subject: [gpfsug-discuss] Gpfs Memory Usaage Keeps going up and we don't know why. 
In-Reply-To: <1500906086.571.9.camel@qmul.ac.uk> References: <261384244.3866909.1500901872347.ref@mail.yahoo.com><261384244.3866909.1500901872347@mail.yahoo.com><1500903047.571.7.camel@qmul.ac.uk><1770436429.3911327.1500905445052@mail.yahoo.com> <1500906086.571.9.camel@qmul.ac.uk> Message-ID: Can you provide the output of "pmap 4444"? If there's no "pmap" command on your system, then get the memory maps of mmfsd from file of /proc/4444/maps. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Peter Childs To: "gpfsug-discuss at spectrumscale.org" Date: 07/24/2017 10:22 PM Subject: Re: [gpfsug-discuss] Gpfs Memory Usaage Keeps going up and we don't know why. Sent by: gpfsug-discuss-bounces at spectrumscale.org top but ps gives the same value. [root at dn29 ~]# ps auww -q 4444 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 4444 2.7 22.3 10537600 5472580 ? S wrote: I've had a look at mmfsadm dump malloc and it looks to agree with the output from mmdiag --memory. and does not seam to account for the excessive memory usage. The new machines do have idleSocketTimout set to 0 from what your saying it could be related to keeping that many connections between nodes working. Thanks in advance Peter. [root at dn29 ~]# mmdiag --memory === mmdiag: memory === mmfsd heap size: 2039808 bytes Statistics for MemoryPool id 1 ("Shared Segment (EPHEMERAL)") 128 bytes in use 17500049370 hard limit on memory usage 1048576 bytes committed to regions 1 number of regions 555 allocations 555 frees 0 allocation failures Statistics for MemoryPool id 2 ("Shared Segment") 42179592 bytes in use 17500049370 hard limit on memory usage 56623104 bytes committed to regions 9 number of regions 100027 allocations 79624 frees 0 allocation failures Statistics for MemoryPool id 3 ("Token Manager") 2099520 bytes in use 17500049370 hard limit on memory usage 16778240 bytes committed to regions 1 number of regions 4 allocations 0 frees 0 allocation failures On Mon, 2017-07-24 at 13:11 +0000, Jim Doherty wrote: There are 3 places that the GPFS mmfsd uses memory the pagepool plus 2 shared memory segments. To see the memory utilization of the shared memory segments run the command mmfsadm dump malloc . The statistics for memory pool id 2 is where maxFilesToCache/maxStatCache objects are and the manager nodes use memory pool id 3 to track the MFTC/MSC objects. You might want to upgrade to later PTF as there was a PTF to fix a memory leak that occurred in tscomm associated with network connection drops. On Monday, July 24, 2017 5:29 AM, Peter Childs wrote: We have two GPFS clusters. One is fairly old and running 4.2.1-2 and non CCR and the nodes run fine using up about 1.5G of memory and is consistent (GPFS pagepool is set to 1G, so that looks about right.) 
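As a rough sanity check of that arithmetic on any one node, something like the following (standard procps tools plus the mmlsconfig/mmdiag commands already mentioned in this thread; adjust to taste) puts the configured pagepool, the shared-segment usage and the daemon's actual resident size side by side:

    mmlsconfig pagepool                 # configured pagepool size
    mmdiag --memory                     # heap and shared segment usage
    ps -C mmfsd -o pid,rss,vsz,comm     # resident/virtual size of mmfsd
    pmap -x $(pidof mmfsd) | tail -n 5  # per-mapping totals, as requested above
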
The other one is "newer" running 4.2.1-3 with CCR and the nodes keep increasing in there memory usage, starting at about 1.1G and are find for a few days however after a while they grow to 4.2G which when the node need to run real work, means the work can't be done. I'm losing track of what maybe different other than CCR, and I'm trying to find some more ideas of where to look. I'm checked all the standard things like pagepool and maxFilesToCache (set to the default of 4000), workerThreads is set to 128 on the new gpfs cluster (against default 48 on the old) I'm not sure what else to look at on this one hence why I'm asking the community. Thanks in advance Peter Childs ITS Research Storage Queen Mary University of London. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Peter Childs ITS Research Storage Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Peter Childs ITS Research Storage Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtucker at pixitmedia.com Thu Aug 3 07:42:37 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Thu, 3 Aug 2017 07:42:37 +0100 Subject: [gpfsug-discuss] documentation about version compatibility In-Reply-To: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> References: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> Message-ID: <0a283eb9-a458-bd2c-4e7b-1f46bb22e385@pixitmedia.com> Hi This is the Installation Guide of each target version under the section 'Migrating from to '. Jez On 03/08/17 01:48, Aaron Knister wrote: > Hey All, > > I swear that some time recently someone posted a link to some IBM > documentation that outlined the recommended versions of GPFS to > upgrade to/from (e.g. if you're at 3.5 get to 4.1 before going to > 4.2.3). I can't for the life of me find it. Does anyone know what I'm > talking about? > > Thanks, > Aaron > -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtucker at pixitmedia.com Thu Aug 3 07:46:36 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Thu, 3 Aug 2017 07:46:36 +0100 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: Perhaps IBM might consider letting you commit it to https://github.com/gpfsug/gpfsug-tools he says, asking out loud... It'll require a friendly IBMer to take the reins up for you. Scott? :-) Jez On 03/08/17 02:00, Aaron Knister wrote: > I'm a little late to the party here but I thought I'd share our recent > experiences. > > We recently completed a mass UID number migration (half a billion > inodes) and developed two tools ("luke filewalker" and the > "mmilleniumfacl") to get the job done. Both luke filewalker and the > mmilleniumfacl are based heavily on the code in > /usr/lpp/mmfs/samples/util/tsreaddir.c and > /usr/lpp/mmfs/samples/util/tsinode.c. > > luke filewalker targets traditional POSIX permissions whereas > mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem > in parallel and both but particularly the 2nd, are extremely I/O > intensive on your metadata disks. > > The gist of luke filewalker is to scan the inode structures using the > gpfs APIs and populate a mapping of inode number to gid and uid > number. It then walks the filesystem in parallel using the APIs, looks > up the inode number in an in-memory hash, and if appropriate changes > ownership using the chown() API. > > The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs > using the GPFS inode API so it walks the filesystem and reads the ACL > of any and every file, updating the ACL entries as appropriate. > > I'm going to see if I can share the source code for both tools, > although I don't know if I can post it here since it modified existing > IBM source code. Could someone from IBM chime in here? If I were to > send the code to IBM could they publish it perhaps on the wiki? > > -Aaron > > On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: >> Hello, >> >> We're trying to change most of our users uids, is there a clean >> way to >> migrate all of one users files with say `mmapplypolicy`? We have to >> change the >> owner of around 273539588 files, and my estimates for runtime are >> around 6 days. >> >> What we've been doing is indexing all of the files and splitting >> them up by >> owner which takes around an hour, and then we were locking the user >> out while we >> chown their files. I made it multi threaded as it weirdly gave a 10% >> speedup >> despite my expectation that multi threading access from a single node >> would not >> give any speedup. >> >> Generally I'm looking for advice on how to make the chowning >> faster. Would >> spreading the chowning processes over multiple nodes improve >> performance? Should >> I not stat the files before running lchown on them, since lchown >> checks the file >> before changing it? I saw mention of inodescan(), in an old gpfsug >> email, which >> speeds up disk read access, by not guaranteeing that the data is up >> to date. We >> have a maintenance day coming up where all users will be locked out, >> so the file >> handles(?) from GPFS's perspective will not be able to go stale. Is >> there a >> function with similar constraints to inodescan that I can use to >> speed up this >> process? 
>> >> Thank you for your time, >> >> Luke >> Storrs-HPC >> University of Connecticut >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Aug 3 09:49:26 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 03 Aug 2017 09:49:26 +0100 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: <1501750166.17548.43.camel@strath.ac.uk> On Wed, 2017-08-02 at 21:03 -0400, Aaron Knister wrote: > Oh, the one *huge* gotcha I thought I'd share-- we wrote a perl script > to drive the migration and part of the perl script's process was to > clone quotas from old uid numbers to the new number. I upset our GPFS > cluster during a particular migration in which the user was over the > grace period of the quota so after a certain point every chown() put the > destination UID even further over its quota. The problem with this being > that at this point every chown() operation would cause GPFS to do some > cluster-wide quota accounting-related RPCs. That hurt. It's worth making > sure there are no quotas defined for the destination UID numbers and if > they are that the data coming from the source UID number will fit. For similar reasons if you are doing a restore of a file system (any file system for that matter not just GPFS) for whatever reason, don't turn quotas back on till *after* the restore is complete. Well unless you can be sure a user is not going to go over quota during the restore. However as this is generally not possible to determine you end up with no quota's either set/enforced till the restore is complete. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From oehmes at gmail.com Thu Aug 3 14:06:49 2017 From: oehmes at gmail.com (Sven Oehme) Date: Thu, 03 Aug 2017 13:06:49 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: a trace during a mmfsck with the checksum parameters turned on would reveal it. the support team should be able to give you specific triggers to cut a trace during checksum errors , this way the trace is cut when the issue happens and then from the trace on server and client side one can extract which card was used on each side. 
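In practice that usually boils down to something like the following wrapped around whatever reproduces the checksum messages. This is only a generic sketch; the exact trace classes, levels and any event triggers should come from the support ticket, and the node names are placeholders:

    # cut a daemon trace on the NSD server and client involved while the
    # problem is reproduced, then hand the recycled trace files to support
    mmtracectl --set --trace=def --trace-recycle=global -N nsd01,client01
    mmtracectl --start -N nsd01,client01
    mmfsck fs0 -n -v          # or whatever reproduces the checksum errors
    mmtracectl --stop -N nsd01,client01
    mmtracectl --off -N nsd01,client01
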
sven On Wed, Aug 2, 2017 at 2:53 PM Stijn De Weirdt wrote: > hi steve, > > > The nsdChksum settings for none GNR/ESS based system is not officially > > supported. It will perform checksum on data transfer over the network > > only and can be used to help debug data corruption when network is a > > suspect. > i'll take not officially supported over silent bitrot any day. > > > > > Did any of those "Encountered XYZ checksum errors on network I/O to NSD > > Client disk" warning messages resulted in disk been changed to "down" > > state due to IO error? > no. > > If no disk IO error was reported in GPFS log, > > that means data was retransmitted successfully on retry. > we suspected as much. as sven already asked, mmfsck now reports clean > filesystem. > i have an ibdump of 2 involved nsds during the reported checksums, i'll > have a closer look if i can spot these retries. > > > > > As sven said, only GNR/ESS provids the full end to end data integrity. > so with the silent network error, we have high probabilty that the data > is corrupted. > > we are now looking for a test to find out what adapters are affected. we > hoped that nsdperf with verify=on would tell us, but it doesn't. > > > > > Steve Y. Xiao > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamiedavis at us.ibm.com Thu Aug 3 14:11:23 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Thu, 3 Aug 2017 13:11:23 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Fri Aug 4 06:02:22 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 4 Aug 2017 01:02:22 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: 4.2.2.3 I want to think maybe this started after expanding inode space On Thu, Aug 3, 2017 at 9:11 AM, James Davis wrote: > Hey, > > Hmm, your invocation looks valid to me. What's your GPFS level? > > Cheers, > > Jamie > > > ----- Original message ----- > From: "J. Eric Wonderley" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] mmsetquota produces error > Date: Wed, Aug 2, 2017 5:03 PM > > for one of our home filesystem we get: > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > 'Invalid argument'. > > > mmedquota -j home:nathanfootest > does work however > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.s.knister at nasa.gov Fri Aug 4 09:00:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 4 Aug 2017 04:00:35 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 Message-ID: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Hey All, Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some rather disconcerting behavior. Specifically on some of the upgraded nodes GPFS will seemingly deadlock on the entire node rendering it unusable. I can't even get a session on the node (but I can trigger a crash dump via a sysrq trigger). Most blocked tasks are blocked are in cxiWaitEventWait at the top of their call trace. That's probably not very helpful in of itself but I'm curious if anyone else out there has run into this issue or if this is a known bug. (I'll open a PMR later today once I've gathered more diagnostic information). -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From eric.wonderley at vt.edu Fri Aug 4 13:58:12 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 4 Aug 2017 08:58:12 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: i actually hit this assert and turned it in to support on this version: Build branch "4.2.2.3 efix6 (987197)". i was told do to exactly what sven mentioned. i thought it strange that i did NOT hit the assert in a no pass but hit it in a yes pass. On Thu, Aug 3, 2017 at 9:06 AM, Sven Oehme wrote: > a trace during a mmfsck with the checksum parameters turned on would > reveal it. > the support team should be able to give you specific triggers to cut a > trace during checksum errors , this way the trace is cut when the issue > happens and then from the trace on server and client side one can extract > which card was used on each side. > > sven > > On Wed, Aug 2, 2017 at 2:53 PM Stijn De Weirdt > wrote: > >> hi steve, >> >> > The nsdChksum settings for none GNR/ESS based system is not officially >> > supported. It will perform checksum on data transfer over the network >> > only and can be used to help debug data corruption when network is a >> > suspect. >> i'll take not officially supported over silent bitrot any day. >> >> > >> > Did any of those "Encountered XYZ checksum errors on network I/O to NSD >> > Client disk" warning messages resulted in disk been changed to "down" >> > state due to IO error? >> no. >> >> If no disk IO error was reported in GPFS log, >> > that means data was retransmitted successfully on retry. >> we suspected as much. as sven already asked, mmfsck now reports clean >> filesystem. >> i have an ibdump of 2 involved nsds during the reported checksums, i'll >> have a closer look if i can spot these retries. >> >> > >> > As sven said, only GNR/ESS provids the full end to end data integrity. >> so with the silent network error, we have high probabilty that the data >> is corrupted. >> >> we are now looking for a test to find out what adapters are affected. we >> hoped that nsdperf with verify=on would tell us, but it doesn't. >> >> > >> > Steve Y. 
Xiao >> > >> > >> > >> > >> > >> > >> > _______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at spectrumscale.org >> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Aug 4 15:45:49 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 4 Aug 2017 16:45:49 +0200 Subject: [gpfsug-discuss] restrict user quota on specific filesets Message-ID: Hi, Is it possible to let users only write data in filesets where some quota is explicitly set ? We have independent filesets with quota defined for users that should have access in a specific fileset. The problem is when users using another fileset give eg global write access on their directories, the former users can write without limits, because it is by default 0 == no limits. Setting the quota on the file system will only restrict users quota in the root fileset, and setting quota for each user - fileset combination would be a huge mess. Setting default quotas does not work for existing users. Thank you !! Kenneth From aaron.s.knister at nasa.gov Fri Aug 4 16:02:04 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 4 Aug 2017 11:02:04 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 In-Reply-To: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> References: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Message-ID: I've narrowed the problem down to 4.1.1.16. We'll most likely be downgrading to 4.1.1.15. -Aaron On 8/4/17 4:00 AM, Aaron Knister wrote: > Hey All, > > Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? > > We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some rather > disconcerting behavior. Specifically on some of the upgraded nodes GPFS > will seemingly deadlock on the entire node rendering it unusable. I > can't even get a session on the node (but I can trigger a crash dump via > a sysrq trigger). > > Most blocked tasks are blocked are in cxiWaitEventWait at the top of > their call trace. That's probably not very helpful in of itself but I'm > curious if anyone else out there has run into this issue or if this is a > known bug. > > (I'll open a PMR later today once I've gathered more diagnostic > information). > > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jonathan.buzzard at strath.ac.uk Fri Aug 4 16:15:44 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 04 Aug 2017 16:15:44 +0100 Subject: [gpfsug-discuss] restrict user quota on specific filesets In-Reply-To: References: Message-ID: <1501859744.17548.69.camel@strath.ac.uk> On Fri, 2017-08-04 at 16:45 +0200, Kenneth Waegeman wrote: > Hi, > > Is it possible to let users only write data in filesets where some quota > is explicitly set ? > > We have independent filesets with quota defined for users that should > have access in a specific fileset. 
The problem is when users using > another fileset give eg global write access on their directories, the > former users can write without limits, because it is by default 0 == no > limits. Setting appropriate ACL's on the junction point of the fileset so that they can only write to file sets that they have permissions to is how you achieve this. I would say create groups and do it that way, but *nasty* things happen when you are a member of more than 16 supplemental groups and are using NFSv3 (NFSv4 and up is fine). So as long as that is not an issue go nuts with groups as it is much easier to manage. > Setting the quota on the file system will only restrict users quota in > the root fileset, and setting quota for each user - fileset combination > would be a huge mess. Setting default quotas does not work for existing > users. Not sure abusing the quota system for permissions a sensible approach. Put another way it was not designed with that purpose in mind so don't be surprised when you can't use it to do that. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From ilan84 at gmail.com Sun Aug 6 09:26:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 11:26:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood Message-ID: Hi guys, I see IBM spectrumscale configure the NFS via command: mmnfs Is the command mmnfs is a wrapper on top of the normal kernel NFS (Kernel VFS) ? Is it a wrapper on top of ganesha NFS ? Or it is NFS implemented by SpectrumScale team ? Thanks -- - Ilan Schwarts From S.J.Thompson at bham.ac.uk Sun Aug 6 10:10:45 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Sun, 6 Aug 2017 09:10:45 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] Sent: 06 August 2017 09:26 To: gpfsug main discussion list Subject: [gpfsug-discuss] what is mmnfs under the hood Hi guys, I see IBM spectrumscale configure the NFS via command: mmnfs Is the command mmnfs is a wrapper on top of the normal kernel NFS (Kernel VFS) ? Is it a wrapper on top of ganesha NFS ? Or it is NFS implemented by SpectrumScale team ? Thanks -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Sun Aug 6 10:42:30 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 12:42:30 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, I cannot use ganesha NFS. 
How do I make NFS exports ? just editing all nodes /etc/exports is enough ? I should i use the CNFS as described here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) wrote: > Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... > > Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. > > Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > Sent: 06 August 2017 09:26 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] what is mmnfs under the hood > > Hi guys, > > I see IBM spectrumscale configure the NFS via command: mmnfs > > Is the command mmnfs is a wrapper on top of the normal kernel NFS > (Kernel VFS) ? > Is it a wrapper on top of ganesha NFS ? > Or it is NFS implemented by SpectrumScale team ? > > > Thanks > > -- > > > - > Ilan Schwarts > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- - Ilan Schwarts From ilan84 at gmail.com Sun Aug 6 10:49:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 12:49:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. 
>> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts From S.J.Thompson at bham.ac.uk Sun Aug 6 11:54:17 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Sun, 6 Aug 2017 10:54:17 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: , Message-ID: What do you mean by cannot use mmsmb and cannot use Ganesha? Do you functionally you are not allowed to or they are not working for you? If it's the latter, then this should be resolvable. If you are under active maintenance you could try raising a ticket with IBM, though basic implementation is not really a support issue and so you may be better engaging a business partner or integrator to help you out. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] Sent: 06 August 2017 10:49 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] what is mmnfs under the hood I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. 
>> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Sun Aug 6 12:39:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 14:39:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: In my case, I cannot use nfs ganesha, this means I cannot use mmsnb since its part of "ces", if i want to use cnfs i cannot combine it with ces.. so the system architecture need to solve this issue. On Aug 6, 2017 13:54, "Simon Thompson (IT Research Support)" < S.J.Thompson at bham.ac.uk> wrote: > What do you mean by cannot use mmsmb and cannot use Ganesha? Do you > functionally you are not allowed to or they are not working for you? > > If it's the latter, then this should be resolvable. If you are under > active maintenance you could try raising a ticket with IBM, though basic > implementation is not really a support issue and so you may be better > engaging a business partner or integrator to help you out. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces@ > spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > Sent: 06 August 2017 10:49 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] what is mmnfs under the hood > > I have read this atricle: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2. > 0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm > > So, in a shortcut, CNFS cannot be used when sharing via CES. > I cannot use ganesha NFS. > > Is it possible to share a cluster via SMB and NFS without using CES ? > the nfs will be expored via CNFS but what about SMB ? i cannot use > mmsmb.. > > > On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > > I cannot use ganesha NFS. > > How do I make NFS exports ? just editing all nodes /etc/exports is > enough ? > > I should i use the CNFS as described here: > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2. 
> 2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > > wrote: > >> Under the hood, the NFS services are provided by IBM supplied Ganesha > rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle > locking, ACLs, quota etc... > >> > >> Note it's different from using the cnfs support in Spectrum Scale which > uses Kernel NFS AFAIK. Using user space Ganesha means they have control of > the NFS stack, so if something needs patching/fixing, then can roll out new > Ganesha rpms rather than having to get (e.g.) RedHat to incorporate > something into kernel NFS. > >> > >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute > the config to the nodes. > >> > >> Simon > >> ________________________________________ > >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces@ > spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > >> Sent: 06 August 2017 09:26 > >> To: gpfsug main discussion list > >> Subject: [gpfsug-discuss] what is mmnfs under the hood > >> > >> Hi guys, > >> > >> I see IBM spectrumscale configure the NFS via command: mmnfs > >> > >> Is the command mmnfs is a wrapper on top of the normal kernel NFS > >> (Kernel VFS) ? > >> Is it a wrapper on top of ganesha NFS ? > >> Or it is NFS implemented by SpectrumScale team ? > >> > >> > >> Thanks > >> > >> -- > >> > >> > >> - > >> Ilan Schwarts > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > -- > > > > > > - > > Ilan Schwarts > > > > -- > > > - > Ilan Schwarts > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Mon Aug 7 05:58:13 2017 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Mon, 7 Aug 2017 04:58:13 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: It would be nice to know why you cannot use ganesha or mmsmb. You don't have to use protocols or CES. We are migrating to CES from doing our own thing with NFS and samba on Debian. Debian does not have support for CES, so we had to roll our own. We did not use CNFS either. To get to CES we had to change OS. We did this because we valued the support. I'd say the failover works better with CES than with our solution, particularly with regards failing over and Infiniband IP address. 
-----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ilan Schwarts Sent: Sunday, 6 August 2017 7:50 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] what is mmnfs under the hood I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.sp > ectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. >> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of >> ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Mon Aug 7 14:27:07 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Mon, 7 Aug 2017 16:27:07 +0300 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Hi all, My setup is 2 nodes GPFS and 1 machine as NFS Client. All machines (3 total) run CentOS 7.2 The 3rd CentOS machine (not part of the cluster) used as NFS Client. 
I mount the NFS Client machine to one of the nodes: mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 This gives me the following: [root at CentOS7286-64 ~]# mount -v | grep gpfs 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) Now, From the Client NFS Machine, I go to the mount directory ("cd /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I use nfs4_getfacl: [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 Operation to request attribute not supported. [root at CentOS7286-64 nfs4]# >From the NODE machine i see the status: [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 flag value description ------------------- ------------------------ ----------------------------------- -f 8192 Minimum fragment size in bytes -i 4096 Inode size in bytes -I 16384 Indirect block size in bytes -m 1 Default number of metadata replicas -M 2 Maximum number of metadata replicas -r 1 Default number of data replicas -R 2 Maximum number of data replicas -j cluster Block allocation type -D nfs4 File locking semantics in effect -k nfs4 ACL semantics in effect -n 32 Estimated number of nodes that will mount file system -B 262144 Block size -Q none Quotas accounting enabled none Quotas enforced none Default quotas enabled --perfileset-quota No Per-fileset quota enforcement --filesetdf No Fileset df enabled? -V 16.00 (4.2.2.0) File system version --create-time Wed Jul 5 12:28:39 2017 File system creation time -z No Is DMAPI enabled? -L 4194304 Logfile size -E Yes Exact mtime mount option -S No Suppress atime mount option -K whenpossible Strict replica allocation option --fastea Yes Fast external attributes enabled? --encryption No Encryption enabled? --inode-limit 171840 Maximum number of inodes in all inode spaces --log-replicas 0 Number of log replicas --is4KAligned Yes is4KAligned? --rapid-repair Yes rapidRepair enabled? --write-cache-threshold 0 HAWC Threshold (max 65536) -P system Disk storage pools in file system -d nynsd1;nynsd2 Disks in file system -A yes Automatic mount option -o none Additional mount options -T /fs_gpfs01 Default mount point --mount-priority 0 Mount priority I saw this thread: https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 Is it still relevant ? Since 2014.. Thanks ! From makaplan at us.ibm.com Mon Aug 7 17:48:39 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 7 Aug 2017 12:48:39 -0400 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Indeed. You can consider and use GPFS/Spectrum Scale as "just another" file system type that can be loaded into/onto a Linux system. But you should consider the pluses and minuses of using other software subsystems that may or may not be designed to work better or inter-operate with Spectrum Scale specific features and APIs. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Mon Aug 7 18:14:41 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Mon, 7 Aug 2017 20:14:41 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Thanks for response. I am not a system engineer / storage architect. I maintain kernel module that interact with file system drivers.. so I need to configure gpfs and perform tests.. 
for example I noticed that gpfs set extended attribute does not go via VFS On Aug 7, 2017 19:48, "Marc A Kaplan" wrote: > Indeed. You can consider and use GPFS/Spectrum Scale as "just another" > file system type that can be loaded into/onto a Linux system. > > But you should consider the pluses and minuses of using other software > subsystems that may or may not be designed to work better or inter-operate > with Spectrum Scale specific features and APIs. > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Mon Aug 7 21:27:03 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 07 Aug 2017 16:27:03 -0400 Subject: [gpfsug-discuss] 'ltfsee info tapes' - Unusable tapes... Message-ID: <8652.1502137623@turing-police.cc.vt.edu> The LTFSEE docs say: https://www.ibm.com/support/knowledgecenter/en/ST9MBR_1.2.3/ltfs_ee_ltfsee_info_tapes.html "Unusable The Unusable status indicates that the tape can't be used. To change the status, remove the tape from the pool by using the ltfsee pool remove command with the -r option. Then, add the tape back into the pool by using the ltfsee pool add command." Do they really mean that? What happens to data that was on the tape? Does the 'pool add' command re-import LTFS's knowledge of what files were on that tape? It's one thing to remove/add tapes with no files on them - but I'm leery of doing it for tapes that contain migrated data, given a lack of clear statement that file index recovery is done at 'pool add' time. (We had a tape get stuck in a drive, and LTFS/EE tried to use the drive, wasn't able to load the tape because the drive was occupied, marked the tape as Unusable. Lather rinse repeat until there's no usable tapes left in the pool... but that's a different issue...) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From jamiedavis at us.ibm.com Mon Aug 7 22:10:06 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Mon, 7 Aug 2017 21:10:06 +0000 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Tue Aug 8 05:28:20 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Tue, 8 Aug 2017 07:28:20 +0300 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: Hi, The command should work from server side i know.. but isnt the scenario of: Root user, that is mounted via nfsv4 to a gpfs filesystem, cannot edit any of the mounted files/dirs acls? The acls are editable only from server side? Thanks! On Aug 8, 2017 00:10, "James Davis" wrote: > Hi Ilan, > > 1. Your command might work from the server side; you said you tried it > from the client side. Could you find anything in the docs about this? I > could not. > > 2. I can share this NFSv4-themed wrapper around mmputacl if it would be > useful to you. You would have to run it from the GPFS side, not the NFS > client side. > > Regards, > > Jamie > > # ./updateNFSv4ACL -h > Update the NFSv4 ACL governing a file's access permissions. > Appends to the existing ACL, overwriting conflicting permissions. 
> Usage: ./updateNFSv4ACL -file /path/to/file { ADD_PERM_SPEC | > DEL_PERM_SPEC }+ > ADD_PERM_SPEC: { -owningUser PERM | -owningGroup PERM | -other PERM | > -ace nameType:name:PERM:aceType } > DEL_PERM_SPEC: { -noACEFor nameType:name } > PERM: Specify a string composed of one or more of the following letters > in no particular order: > r (ead) > w (rite) > a (ppend) Must agree with write > x (execute) > d (elete) > D (elete child) Dirs only > t (read attrs) > T (write attrs) > c (read ACL) > C (write ACL) > o (change owner) > You can also provide these, but they will have no effect in GPFS: > n (read named attrs) > N (write named attrs) > y (support synchronous I/O) > > To indicate no permissions, give a - > nameType: 'user' or 'group'. > aceType: 'allow' or 'deny'. > Examples: ./updateNFSv4ACL -file /fs1/f -owningUser rtc -owningGroup > rwaxdtc -other '-' > Assign these permissions to 'owner', 'group', 'other'. > ./updateNFSv4ACL -file /fs1/f -ace 'user:pfs001:rtc:allow' > -noACEFor 'group:fvt001' > Allow user pfs001 read/read attrs/read ACL permission > Remove all ACEs (allow and deny) for group fvt001. > Notes: > Permissions you do not allow are denied by default. > See the GPFS docs for some other restrictions. > ace is short for Access Control Entry > > > ----- Original message ----- > From: Ilan Schwarts > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster > Date: Mon, Aug 7, 2017 9:27 AM > > Hi all, > My setup is 2 nodes GPFS and 1 machine as NFS Client. > All machines (3 total) run CentOS 7.2 > > The 3rd CentOS machine (not part of the cluster) used as NFS Client. > > I mount the NFS Client machine to one of the nodes: mount -t nfs > 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 > > This gives me the following: > > [root at CentOS7286-64 ~]# mount -v | grep gpfs > 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 > (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen= > 255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys, > clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) > > Now, From the Client NFS Machine, I go to the mount directory ("cd > /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I > use nfs4_getfacl: > [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 > Operation to request attribute not supported. > [root at CentOS7286-64 nfs4]# > > From the NODE machine i see the status: > [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 > flag value description > ------------------- ------------------------ ------------------------------ > ----- > -f 8192 Minimum fragment size in bytes > -i 4096 Inode size in bytes > -I 16384 Indirect block size in bytes > -m 1 Default number of metadata > replicas > -M 2 Maximum number of metadata > replicas > -r 1 Default number of data > replicas > -R 2 Maximum number of data > replicas > -j cluster Block allocation type > -D nfs4 File locking semantics in > effect > -k nfs4 ACL semantics in effect > -n 32 Estimated number of nodes > that will mount file system > -B 262144 Block size > -Q none Quotas accounting enabled > none Quotas enforced > none Default quotas enabled > --perfileset-quota No Per-fileset quota enforcement > --filesetdf No Fileset df enabled? > -V 16.00 (4.2.2.0) File system version > --create-time Wed Jul 5 12:28:39 2017 File system creation time > -z No Is DMAPI enabled? 
> -L 4194304 Logfile size > -E Yes Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation > option > --fastea Yes Fast external attributes > enabled? > --encryption No Encryption enabled? > --inode-limit 171840 Maximum number of inodes > in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? > --write-cache-threshold 0 HAWC Threshold (max 65536) > -P system Disk storage pools in file > system > -d nynsd1;nynsd2 Disks in file system > -A yes Automatic mount option > -o none Additional mount options > -T /fs_gpfs01 Default mount point > --mount-priority 0 Mount priority > > > > I saw this thread: > https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 > > Is it still relevant ? Since 2014.. > > Thanks ! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Tue Aug 8 05:50:10 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Tue, 8 Aug 2017 10:20:10 +0530 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. => /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Tue Aug 8 17:30:13 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Tue, 8 Aug 2017 22:00:13 +0530 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: (seems my earlier reply created a new topic; hence trying to reply back original thread started by Ilan Schwarts...) >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. 
=> /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/08/2017 04:30 PM Subject: gpfsug-discuss Digest, Vol 67, Issue 21 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: How to use nfs4_getfacl (or set) on GPFS cluster (Ilan Schwarts) 2. How to use nfs4_getfacl (or set) on GPFS cluster (Chetan R Kulkarni) ---------------------------------------------------------------------- Message: 1 Date: Tue, 8 Aug 2017 07:28:20 +0300 From: Ilan Schwarts To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Content-Type: text/plain; charset="utf-8" Hi, The command should work from server side i know.. but isnt the scenario of: Root user, that is mounted via nfsv4 to a gpfs filesystem, cannot edit any of the mounted files/dirs acls? The acls are editable only from server side? Thanks! On Aug 8, 2017 00:10, "James Davis" wrote: > Hi Ilan, > > 1. Your command might work from the server side; you said you tried it > from the client side. Could you find anything in the docs about this? I > could not. > > 2. I can share this NFSv4-themed wrapper around mmputacl if it would be > useful to you. You would have to run it from the GPFS side, not the NFS > client side. > > Regards, > > Jamie > > # ./updateNFSv4ACL -h > Update the NFSv4 ACL governing a file's access permissions. > Appends to the existing ACL, overwriting conflicting permissions. > Usage: ./updateNFSv4ACL -file /path/to/file { ADD_PERM_SPEC | > DEL_PERM_SPEC }+ > ADD_PERM_SPEC: { -owningUser PERM | -owningGroup PERM | -other PERM | > -ace nameType:name:PERM:aceType } > DEL_PERM_SPEC: { -noACEFor nameType:name } > PERM: Specify a string composed of one or more of the following letters > in no particular order: > r (ead) > w (rite) > a (ppend) Must agree with write > x (execute) > d (elete) > D (elete child) Dirs only > t (read attrs) > T (write attrs) > c (read ACL) > C (write ACL) > o (change owner) > You can also provide these, but they will have no effect in GPFS: > n (read named attrs) > N (write named attrs) > y (support synchronous I/O) > > To indicate no permissions, give a - > nameType: 'user' or 'group'. > aceType: 'allow' or 'deny'. > Examples: ./updateNFSv4ACL -file /fs1/f -owningUser rtc -owningGroup > rwaxdtc -other '-' > Assign these permissions to 'owner', 'group', 'other'. > ./updateNFSv4ACL -file /fs1/f -ace 'user:pfs001:rtc:allow' > -noACEFor 'group:fvt001' > Allow user pfs001 read/read attrs/read ACL permission > Remove all ACEs (allow and deny) for group fvt001. > Notes: > Permissions you do not allow are denied by default. 
> See the GPFS docs for some other restrictions. > ace is short for Access Control Entry > > > ----- Original message ----- > From: Ilan Schwarts > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster > Date: Mon, Aug 7, 2017 9:27 AM > > Hi all, > My setup is 2 nodes GPFS and 1 machine as NFS Client. > All machines (3 total) run CentOS 7.2 > > The 3rd CentOS machine (not part of the cluster) used as NFS Client. > > I mount the NFS Client machine to one of the nodes: mount -t nfs > 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 > > This gives me the following: > > [root at CentOS7286-64 ~]# mount -v | grep gpfs > 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 > (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen= > 255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys, > clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) > > Now, From the Client NFS Machine, I go to the mount directory ("cd > /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I > use nfs4_getfacl: > [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 > Operation to request attribute not supported. > [root at CentOS7286-64 nfs4]# > > From the NODE machine i see the status: > [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 > flag value description > ------------------- ------------------------ ------------------------------ > ----- > -f 8192 Minimum fragment size in bytes > -i 4096 Inode size in bytes > -I 16384 Indirect block size in bytes > -m 1 Default number of metadata > replicas > -M 2 Maximum number of metadata > replicas > -r 1 Default number of data > replicas > -R 2 Maximum number of data > replicas > -j cluster Block allocation type > -D nfs4 File locking semantics in > effect > -k nfs4 ACL semantics in effect > -n 32 Estimated number of nodes > that will mount file system > -B 262144 Block size > -Q none Quotas accounting enabled > none Quotas enforced > none Default quotas enabled > --perfileset-quota No Per-fileset quota enforcement > --filesetdf No Fileset df enabled? > -V 16.00 (4.2.2.0) File system version > --create-time Wed Jul 5 12:28:39 2017 File system creation time > -z No Is DMAPI enabled? > -L 4194304 Logfile size > -E Yes Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation > option > --fastea Yes Fast external attributes > enabled? > --encryption No Encryption enabled? > --inode-limit 171840 Maximum number of inodes > in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? > --write-cache-threshold 0 HAWC Threshold (max 65536) > -P system Disk storage pools in file > system > -d nynsd1;nynsd2 Disks in file system > -A yes Automatic mount option > -o none Additional mount options > -T /fs_gpfs01 Default mount point > --mount-priority 0 Mount priority > > > > I saw this thread: > https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 > > Is it still relevant ? Since 2014.. > > Thanks ! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170808/0e20196d/attachment-0001.html > ------------------------------ Message: 2 Date: Tue, 8 Aug 2017 10:20:10 +0530 From: "Chetan R Kulkarni" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Content-Type: text/plain; charset="us-ascii" >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. => /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. -------------- next part -------------- An HTML attachment was scrubbed... URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170808/42fbe6c2/attachment-0001.html > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 21 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stefan.dietrich at desy.de Tue Aug 8 18:16:33 2017 From: stefan.dietrich at desy.de (Dietrich, Stefan) Date: Tue, 8 Aug 2017 19:16:33 +0200 (CEST) Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS Message-ID: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Hello, I am currently trying to understand an issue with ACLs and how GPFS handles the umask. The filesystem is configured for NFS4 ACLs only (-k nfs4), filesets have been configured for chmodAndUpdateACL and the access is through a native GPFS client (v4.2.3). If I create a new file in a directory, which has an ACE with inheritance, the configured umask on the shell is completely ignored. The new file only contains ACEs from the inherited ACL. As soon as the ACE with inheritance is removed, newly created files receive the correct configured umask. Obvious downside, no ACLs anymore :( Additionally, it looks like that the specified mode bits for an open call are ignored as well. E.g. with an strace I see, that the open call includes the correct mode bits. However, the new file only has inherited ACEs. According to the NFSv4 RFC, the behavior is more or less undefined, only with NFSv4.2 umask will be added to the protocol. For GPFS, I found a section in the traditional ACL administration section, but nothing in the NFS4 ACL section of the docs. Is my current observation the intended behavior of GPFS? 
Regards, Stefan -- ------------------------------------------------------------------------ Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany ------------------------------------------------------------------------ From kkr at lbl.gov Tue Aug 8 19:33:22 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Tue, 8 Aug 2017 11:33:22 -0700 Subject: [gpfsug-discuss] Hold the Date - Spectrum Scale Day @ HPCXXL (Sept 2017, NYC) In-Reply-To: References: Message-ID: Hello, The GPFS Day of the HPCXXL conference is confirmed for Thursday, September 28th. Here is an updated URL, the agenda and registration are still being put together http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ . The GPFS Day will require registration, so we can make sure there is enough room (and coffee/food) for all attendees ?however, there will be no registration fee if you attend the GPFS Day only. I?ll send another update when the agenda is closer to settled. Cheers, Kristy > On Jul 7, 2017, at 3:32 PM, Kristy Kallback-Rose wrote: > > Hello, > > More details will be provided as they become available, but just so you can make a placeholder on your calendar, there will be a Spectrum Scale Day the week of September 25th - 29th, likely Thursday, September 28th. > > This will be a part of the larger HPCXXL meeting (https://www.spxxl.org/?q=New-York-City-2017 ). You may recall this group was formerly called SPXXL and the website is in the process of transitioning to the new name (and getting a new certificate). You will be able to attend *just* the Spectrum Scale day if that is the only portion of the event you would like to attend. > > The NYC location is great for Spectrum Scale events because many IBMers, including developers, can come in from Poughkeepsie. > > More as we get closer to the date and details are settled. > > Cheers, > Kristy -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Tue Aug 8 20:28:31 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 8 Aug 2017 14:28:31 -0500 Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS In-Reply-To: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> References: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Message-ID: Yes, that is the intended behavior. As in the section on traditional ACLs that you found, the intent is that if there is a default/inherited ACL, the object is created with that (and if there is no default/inherited ACL, then the mode and umask are the basis for the initial set of permissions). Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. 
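A quick way to see the behaviour described above on a live system (the paths below are placeholders):

# the parent's ACL: entries carrying FileInherit/DirInherit flags are what new objects receive
mmgetacl -k nfs4 /fs_gpfs01/parentdir

# a file created underneath gets only the inherited entries, regardless of the shell umask
# or the mode passed on create
touch /fs_gpfs01/parentdir/newfile
mmgetacl -k nfs4 /fs_gpfs01/parentdir/newfile

# removing the FileInherit/DirInherit flags from the parent ACL (mmeditacl opens it in $EDITOR)
# restores the usual mode-bits-plus-umask behaviour for files created afterwards
mmeditacl -k nfs4 /fs_gpfs01/parentdir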
From: "Dietrich, Stefan" To: gpfsug-discuss at spectrumscale.org Date: 08/08/2017 12:17 PM Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, I am currently trying to understand an issue with ACLs and how GPFS handles the umask. The filesystem is configured for NFS4 ACLs only (-k nfs4), filesets have been configured for chmodAndUpdateACL and the access is through a native GPFS client (v4.2.3). If I create a new file in a directory, which has an ACE with inheritance, the configured umask on the shell is completely ignored. The new file only contains ACEs from the inherited ACL. As soon as the ACE with inheritance is removed, newly created files receive the correct configured umask. Obvious downside, no ACLs anymore :( Additionally, it looks like that the specified mode bits for an open call are ignored as well. E.g. with an strace I see, that the open call includes the correct mode bits. However, the new file only has inherited ACEs. According to the NFSv4 RFC, the behavior is more or less undefined, only with NFSv4.2 umask will be added to the protocol. For GPFS, I found a section in the traditional ACL administration section, but nothing in the NFS4 ACL section of the docs. Is my current observation the intended behavior of GPFS? Regards, Stefan -- ------------------------------------------------------------------------ Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany ------------------------------------------------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Aug 8 22:27:20 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 8 Aug 2017 17:27:20 -0400 Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS In-Reply-To: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> References: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Message-ID: (IMO) NFSv4 ACLs are complicated. Confusing. Difficult. Befuddling. PIA. Before questioning the GPFS implementation, see how they work in other file systems. If GPFS does it differently, perhaps there is a rationale, or perhaps you've found a bug. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Wed Aug 9 11:32:32 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 9 Aug 2017 10:32:32 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: <09520659d6cb44a1bbbed066106b39a2@R01UKEXCASM223.r01.fujitsu.local> Hello Experts, Does GPFS start "down" disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in "down" state by GPFS. Will GPFS in such case start the disk? With best regards / Mit freundlichen Gr??en / Pozdrawiam Tomasz Wolski Development Engineer NDC Eternus CS HE (ET2) [cid:image002.gif at 01CE62B9.8ACFA960] FUJITSU Fujitsu Technology Solutions Sp. z o.o. Textorial Park Bldg C, ul. 
Fabryczna 17 90-344 Lodz, Poland E-mail: Tomasz.Wolski at ts.fujitsu.com Web: ts.fujitsu.com Company details: ts.fujitsu.com/imprint This communication contains information that is confidential, proprietary in nature and/or privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) or the person responsible for delivering it to the intended recipient(s), please note that any form of dissemination, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender and delete the original communication. Thank you for your cooperation. Please be advised that neither Fujitsu, its affiliates, its employees or agents accept liability for any errors, omissions or damages caused by delays of receipt or by any virus infection in this message or its attachments, or which may otherwise arise as a result of this e-mail transmission. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 2774 bytes Desc: image001.gif URL: From chris.schlipalius at pawsey.org.au Wed Aug 9 11:50:22 2017 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Wed, 9 Aug 2017 18:50:22 +0800 Subject: [gpfsug-discuss] Announcement of the next Australian SpectrumScale User Group - Half day August 2017 (Melbourne) References: Message-ID: <0190993E-B870-4A37-9671-115A1201A59D@pawsey.org.au> Hello we have a half day (afternoon) usergroup next week. Please check out the event registration link below for tickets, speakers and topics. https://goo.gl/za8g3r Regards, Chris Schlipalius Lead Organiser Spectrum Scale Usergroups Australia Senior Storage Infrastucture Specialist, The Pawsey Supercomputing Centre From Robert.Oesterlin at nuance.com Wed Aug 9 13:14:46 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 9 Aug 2017 12:14:46 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: By default, GPFS does not automatically start down disks. You could add a callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? if you wanted. If a disk is marked down, it?s better to determine why before trying to start it as it may involve other issues that need investigation. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of "Tomasz.Wolski at ts.fujitsu.com" Reply-To: gpfsug main discussion list Date: Wednesday, August 9, 2017 at 6:33 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs automatically? Does GPFS start ?down? disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in ?down? state by GPFS. Will GPFS in such case start the disk? -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Wed Aug 9 13:22:57 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 9 Aug 2017 14:22:57 +0200 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? In-Reply-To: References: Message-ID: If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. That can be quite useful for stretched clusters, where you want to replicate all blocks to both locations, and this way recover automatically. 
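As a sketch of the alert-only variant of that callback idea (notify first, start disks by hand once the cause is known), with the script path and disk names as placeholders; the diskFailure event and the %fsName/%diskName parameters should be double-checked against the mmaddcallback man page for your release:

# the script would just log or mail the event, e.g. logger -t gpfs "disk $3 in $2 is down ($1)"
mmaddcallback diskDownAlert --command /usr/local/sbin/gpfs_disk_down.sh --event diskFailure --parms "%eventName %fsName %diskName"

# once the root cause is understood, bring the disk(s) back manually
mmlsdisk fs_gpfs01 -e                 # shows disks that are not in up/ready state
mmchdisk fs_gpfs01 start -d "nynsd1"  # or "mmchdisk fs_gpfs01 start -a" to start all down disks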
-jf On Wed, Aug 9, 2017 at 2:14 PM, Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: > By default, GPFS does not automatically start down disks. You could add a > callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? > if you wanted. If a disk is marked down, it?s better to determine why > before trying to start it as it may involve other issues that need > investigation. > > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > *From: * on behalf of " > Tomasz.Wolski at ts.fujitsu.com" > *Reply-To: *gpfsug main discussion list > *Date: *Wednesday, August 9, 2017 at 6:33 AM > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs > automatically? > > > > > > Does GPFS start ?down? disks in a filesystem automatically? For instance, > when connection to NSD is recovered, but it the meantime disk was put in > ?down? state by GPFS. Will GPFS in such case start the disk? > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Wed Aug 9 13:48:00 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 9 Aug 2017 12:48:00 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: <3F6EFDF7-B96B-4E89-ABFE-4EEBEE0C0878@nuance.com> Be careful here, as this does: ?When a disk experiences a failure and becomes unavailable, the recovery procedure will first attempt to restart the disk and if this fails, the disk is suspended and its data moved to other disks. ? Which may not be what you want to happen. :-) If you have disks marked down due to a transient failure, kicking of restripes to move the data off might not be the best choice. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Jan-Frode Myklebust Reply-To: gpfsug main discussion list Date: Wednesday, August 9, 2017 at 8:23 AM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] Is GPFS starting NSDs automatically? If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 9 16:04:35 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 9 Aug 2017 15:04:35 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? In-Reply-To: References: Message-ID: <44e67260b5104860adaaf8222e11e995@jumptrading.com> For non-stretch clusters, I think best practice would be to have an administrator analyze the situation and understand why the NSD was considered unavailable before attempting to start the disks back in the file system. Down NSDs are usually indicative of a serious issue. However I have seen a transient network communication problems or NSD server recovery cause a NSD Client to report a NSD as failed. I would prefer that the FS manager check first that the NSDs are actually not accessible and that there isn?t a recovery operation within the NSD Servers supporting an NSD before marking NSDs as down. Recovery should be allowed to complete and a NSD client should just wait for that to happen. NSDs being marked down can cause serious file system outages!! 
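For reference, a minimal sketch of the callback approach Bob describes earlier in this thread, written to notify an administrator rather than restart disks automatically, in line with the caution above. Treat it as an illustration only: the script path is made up for the example, and the diskFailure event name and the %eventName/%fsName/%diskName parameter variables should be checked against the mmaddcallback documentation for the release in use.

  #!/bin/bash
  # /usr/local/sbin/gpfs_disk_down_notify.sh  (hypothetical path)
  # GPFS invokes this with: <eventName> <fsName> <diskName(s)>
  event="$1"; fs="$2"; disks="$3"
  # Record the event so an administrator can investigate before anything is restarted
  logger -t gpfs-callback "GPFS ${event}: file system ${fs}, disk(s) ${disks} reported down"
  # Deliberately NOT running: mmchdisk "${fs}" start -d "${disks}"
  # Investigate the cause first, then start the disks manually.
  exit 0

Register it once from a node with administrative authority:

  mmaddcallback diskDownNotify --command /usr/local/sbin/gpfs_disk_down_notify.sh \
      --event diskFailure --parms "%eventName %fsName %diskName"

mmlscallback shows what is registered, and mmdelcallback diskDownNotify removes it again.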
We?ve also requested that a settable retry configuration setting be provided to have NSD Clients retry access to the NSD before reporting the NSD as failed (https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=104474 if you want to add a vote!). Cheers, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jan-Frode Myklebust Sent: Wednesday, August 09, 2017 7:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Is GPFS starting NSDs automatically? Note: External Email ________________________________ If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. That can be quite useful for stretched clusters, where you want to replicate all blocks to both locations, and this way recover automatically. -jf On Wed, Aug 9, 2017 at 2:14 PM, Oesterlin, Robert > wrote: By default, GPFS does not automatically start down disks. You could add a callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? if you wanted. If a disk is marked down, it?s better to determine why before trying to start it as it may involve other issues that need investigation. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: > on behalf of "Tomasz.Wolski at ts.fujitsu.com" > Reply-To: gpfsug main discussion list > Date: Wednesday, August 9, 2017 at 6:33 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs automatically? Does GPFS start ?down? disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in ?down? state by GPFS. Will GPFS in such case start the disk? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Aug 14 22:53:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 14 Aug 2017 17:53:35 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 In-Reply-To: References: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Message-ID: <5e25d9b1-13de-20b7-567d-e14601fd4bd0@nasa.gov> I was remiss in not following up with this sooner and thank you to the kind individual that shot me a direct message to ask the question. It turns out that when I asked for the fix for APAR IV96776 I got an early release of 4.1.1.16 that had a fix for the APAR but also introduced the lockup bug. IBM kindly delayed the release of 4.1.1.16 proper until they had addressed the lockup bug (APAR IV98888). 
As I understand it the version of 4.1.1.16 that was released via fix central should have a fix for this bug although I haven't tested it I have no reason to believe it's not fixed. -Aaron On 08/04/2017 11:02 AM, Aaron Knister wrote: > I've narrowed the problem down to 4.1.1.16. We'll most likely be > downgrading to 4.1.1.15. > > -Aaron > > On 8/4/17 4:00 AM, Aaron Knister wrote: >> Hey All, >> >> Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? >> >> We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some >> rather disconcerting behavior. Specifically on some of the upgraded >> nodes GPFS will seemingly deadlock on the entire node rendering it >> unusable. I can't even get a session on the node (but I can trigger a >> crash dump via a sysrq trigger). >> >> Most blocked tasks are blocked are in cxiWaitEventWait at the top of >> their call trace. That's probably not very helpful in of itself but >> I'm curious if anyone else out there has run into this issue or if >> this is a known bug. >> >> (I'll open a PMR later today once I've gathered more diagnostic >> information). >> >> -Aaron >> > From aaron.s.knister at nasa.gov Thu Aug 17 14:12:28 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Thu, 17 Aug 2017 13:12:28 +0000 Subject: [gpfsug-discuss] NSD Server/FS Manager Memory Requirements Message-ID: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> Hi Everyone, In the world of GPFS 4.2 is there a particular advantage to having a large amount of memory (e.g. > 64G) allocated to the pagepool on combination NSD Server/FS manager nodes? We currently have half of physical memory allocated to pagepool on these nodes. For some historical context-- we had two indicidents that drove us to increase our NSD server/FS manager pagepools. One was a weird behavior in GPFS 3.5 that was causing bouncing FS managers until we bumped the page pool from a few gigs to about half of the physical memory on the node. The other was a mass round of parallel mmfsck's of all 20 something of our filesystems. It came highly recommended to us to increase the pagepool to something very large for that. I'm curious to hear what other folks do and what the recommendations from IBM folks are. Thanks, Aaron -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Aug 17 14:43:48 2017 From: ewahl at osc.edu (Edward Wahl) Date: Thu, 17 Aug 2017 09:43:48 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: <20170817094348.37d2f51b@osc.edu> On Fri, 4 Aug 2017 01:02:22 -0400 "J. Eric Wonderley" wrote: > 4.2.2.3 > > I want to think maybe this started after expanding inode space What does 'mmlsfileset home nathanfootest -L' say? Ed > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis wrote: > > > Hey, > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > Cheers, > > > > Jamie > > > > > > ----- Original message ----- > > From: "J. Eric Wonderley" > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list > > Cc: > > Subject: [gpfsug-discuss] mmsetquota produces error > > Date: Wed, Aug 2, 2017 5:03 PM > > > > for one of our home filesystem we get: > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > 'Invalid argument'. 
> > > > > > mmedquota -j home:nathanfootest > > does work however > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From eric.wonderley at vt.edu Thu Aug 17 15:13:57 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 17 Aug 2017 10:13:57 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: <20170817094348.37d2f51b@osc.edu> References: <20170817094348.37d2f51b@osc.edu> Message-ID: The error is very repeatable... [root at cl001 ~]# mmcrfileset home setquotafoo Fileset setquotafoo created with id 61 root inode 3670407. [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo Fileset setquotafoo linked at /gpfs/home/setquotafoo [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid argument'. mmsetquota: Command failed. Examine previous error messages to determine cause. [root at cl001 ~]# mmlsfileset home setquotafoo -L Filesets in file system 'home': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 61 3670407 0 Thu Aug 17 10:10:54 2017 0 0 0 On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > On Fri, 4 Aug 2017 01:02:22 -0400 > "J. Eric Wonderley" wrote: > > > 4.2.2.3 > > > > I want to think maybe this started after expanding inode space > > What does 'mmlsfileset home nathanfootest -L' say? > > Ed > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > > > Hey, > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > Cheers, > > > > > > Jamie > > > > > > > > > ----- Original message ----- > > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > To: gpfsug main discussion list > > > Cc: > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > for one of our home filesystem we get: > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > 'Invalid argument'. > > > > > > > > > mmedquota -j home:nathanfootest > > > does work however > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Aug 17 15:20:06 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 17 Aug 2017 14:20:06 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: I?ve just done exactly that and can?t reproduce it in my prod environment. Running 4.2.3-2 though. 
[root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L Filesets in file system 'gpfs': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 251 8408295 0 Thu Aug 17 15:17:18 2017 0 0 0 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of J. Eric Wonderley Sent: 17 August 2017 15:14 To: Edward Wahl Cc: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmsetquota produces error The error is very repeatable... [root at cl001 ~]# mmcrfileset home setquotafoo Fileset setquotafoo created with id 61 root inode 3670407. [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo Fileset setquotafoo linked at /gpfs/home/setquotafoo [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid argument'. mmsetquota: Command failed. Examine previous error messages to determine cause. [root at cl001 ~]# mmlsfileset home setquotafoo -L Filesets in file system 'home': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 61 3670407 0 Thu Aug 17 10:10:54 2017 0 0 0 On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl > wrote: On Fri, 4 Aug 2017 01:02:22 -0400 "J. Eric Wonderley" > wrote: > 4.2.2.3 > > I want to think maybe this started after expanding inode space What does 'mmlsfileset home nathanfootest -L' say? Ed > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > Hey, > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > Cheers, > > > > Jamie > > > > > > ----- Original message ----- > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list > > > Cc: > > Subject: [gpfsug-discuss] mmsetquota produces error > > Date: Wed, Aug 2, 2017 5:03 PM > > > > for one of our home filesystem we get: > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > 'Invalid argument'. > > > > > > mmedquota -j home:nathanfootest > > does work however > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamiedavis at us.ibm.com Thu Aug 17 15:30:19 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Thu, 17 Aug 2017 14:30:19 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: , <20170817094348.37d2f51b@osc.edu> Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Thu Aug 17 15:34:26 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 17 Aug 2017 10:34:26 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: I recently opened a pmr on this issue(24603,442,000)...I'll keep this thread posted on results. On Thu, Aug 17, 2017 at 10:30 AM, James Davis wrote: > I've also tried on our in-house latest release and cannot recreate it. > > I'll ask around to see who's running a 4.2.2 cluster I can look at. 
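In the meantime, a small sketch of the workaround already reported earlier in this thread (mmedquota accepts the fileset that mmsetquota rejects), together with a quick sanity check of the fileset. The fileset and file system names match the reproduction above; the EDITOR handling is an assumption to verify against the mmedquota documentation for your level.

  # Confirm the fileset is linked and has a valid id
  mmlsfileset home setquotafoo -L

  # Set block and inode limits interactively; mmedquota opens an editor session
  EDITOR=/usr/bin/vi mmedquota -j home:setquotafoo

  # Confirm the limits took effect
  mmlsquota -j setquotafoo home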
> > > ----- Original message ----- > From: "Sobey, Richard A" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list , > "Edward Wahl" > Cc: > Subject: Re: [gpfsug-discuss] mmsetquota produces error > Date: Thu, Aug 17, 2017 10:20 AM > > > I?ve just done exactly that and can?t reproduce it in my prod environment. > Running 4.2.3-2 though. > > > > [root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L > > Filesets in file system 'gpfs': > > Name Id RootInode ParentId > Created InodeSpace MaxInodes AllocInodes > Comment > > setquotafoo 251 8408295 0 Thu Aug 17 > 15:17:18 2017 0 0 0 > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] *On Behalf Of *J. Eric Wonderley > *Sent:* 17 August 2017 15:14 > *To:* Edward Wahl > *Cc:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] mmsetquota produces error > > > > The error is very repeatable... > [root at cl001 ~]# mmcrfileset home setquotafoo > Fileset setquotafoo created with id 61 root inode 3670407. > [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo > Fileset setquotafoo linked at /gpfs/home/setquotafoo > [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files > 10M:10M > tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid > argument'. > mmsetquota: Command failed. Examine previous error messages to determine > cause. > [root at cl001 ~]# mmlsfileset home setquotafoo -L > Filesets in file system 'home': > Name Id RootInode ParentId > Created InodeSpace MaxInodes AllocInodes > Comment > setquotafoo 61 3670407 0 Thu Aug 17 > 10:10:54 2017 0 0 0 > > > > On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > > On Fri, 4 Aug 2017 01:02:22 -0400 > "J. Eric Wonderley" wrote: > > > 4.2.2.3 > > > > I want to think maybe this started after expanding inode space > > What does 'mmlsfileset home nathanfootest -L' say? > > Ed > > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > > > Hey, > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > Cheers, > > > > > > Jamie > > > > > > > > > ----- Original message ----- > > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > To: gpfsug main discussion list > > > Cc: > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > for one of our home filesystem we get: > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > 'Invalid argument'. 
> > > > > > > > > mmedquota -j home:nathanfootest > > > does work however > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Aug 17 15:50:27 2017 From: ewahl at osc.edu (Edward Wahl) Date: Thu, 17 Aug 2017 10:50:27 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: <20170817105027.316ce609@osc.edu> We're running 4.2.2.3 (well technically 4.2.2.3 efix21 (1028007) since yesterday" and we use filesets extensively for everything and I cannot reproduce this. I would guess this is somehow an inode issue, but... ?? checked the logs for the FS creation and looked for odd errors? So this fileset is not a stand-alone, Is there anything odd about the mmlsfileset for the root fileset? mmlsfileset gpfs root -L can you create files in the Junction directory? Does the increase in inodes show up? nothing weird from 'mmdf gpfs -m' ? none of your metadata NSDs are offline? Ed On Thu, 17 Aug 2017 10:34:26 -0400 "J. Eric Wonderley" wrote: > I recently opened a pmr on this issue(24603,442,000)...I'll keep this > thread posted on results. > > On Thu, Aug 17, 2017 at 10:30 AM, James Davis wrote: > > > I've also tried on our in-house latest release and cannot recreate it. > > > > I'll ask around to see who's running a 4.2.2 cluster I can look at. > > > > > > ----- Original message ----- > > From: "Sobey, Richard A" > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list , > > "Edward Wahl" > > Cc: > > Subject: Re: [gpfsug-discuss] mmsetquota produces error > > Date: Thu, Aug 17, 2017 10:20 AM > > > > > > I?ve just done exactly that and can?t reproduce it in my prod environment. > > Running 4.2.3-2 though. > > > > > > > > [root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L > > > > Filesets in file system 'gpfs': > > > > Name Id RootInode ParentId > > Created InodeSpace MaxInodes AllocInodes > > Comment > > > > setquotafoo 251 8408295 0 Thu Aug 17 > > 15:17:18 2017 0 0 0 > > > > > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] *On Behalf Of *J. Eric Wonderley > > *Sent:* 17 August 2017 15:14 > > *To:* Edward Wahl > > *Cc:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] mmsetquota produces error > > > > > > > > The error is very repeatable... > > [root at cl001 ~]# mmcrfileset home setquotafoo > > Fileset setquotafoo created with id 61 root inode 3670407. 
> > [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo > > Fileset setquotafoo linked at /gpfs/home/setquotafoo > > [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files > > 10M:10M > > tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid > > argument'. > > mmsetquota: Command failed. Examine previous error messages to determine > > cause. > > [root at cl001 ~]# mmlsfileset home setquotafoo -L > > Filesets in file system 'home': > > Name Id RootInode ParentId > > Created InodeSpace MaxInodes AllocInodes > > Comment > > setquotafoo 61 3670407 0 Thu Aug 17 > > 10:10:54 2017 0 0 0 > > > > > > > > On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > > > > On Fri, 4 Aug 2017 01:02:22 -0400 > > "J. Eric Wonderley" wrote: > > > > > 4.2.2.3 > > > > > > I want to think maybe this started after expanding inode space > > > > What does 'mmlsfileset home nathanfootest -L' say? > > > > Ed > > > > > > > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > > wrote: > > > > > > > Hey, > > > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > > > Cheers, > > > > > > > > Jamie > > > > > > > > > > > > ----- Original message ----- > > > > From: "J. Eric Wonderley" > > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > To: gpfsug main discussion list > > > > Cc: > > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > > > for one of our home filesystem we get: > > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > > 'Invalid argument'. > > > > > > > > > > > > mmedquota -j home:nathanfootest > > > > does work however > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > -- > > > > Ed Wahl > > Ohio Supercomputer Center > > 614-292-9302 > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From alex.chekholko at gmail.com Thu Aug 17 19:11:39 2017 From: alex.chekholko at gmail.com (Alex Chekholko) Date: Thu, 17 Aug 2017 18:11:39 +0000 Subject: [gpfsug-discuss] NSD Server/FS Manager Memory Requirements In-Reply-To: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> References: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> Message-ID: Hi Aaron, What would be the advantage of decreasing the pagepool size? Regards, Alex On Thu, Aug 17, 2017 at 6:12 AM Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > In the world of GPFS 4.2 is there a particular advantage to having a large > amount of memory (e.g. > 64G) allocated to the pagepool on combination NSD > Server/FS manager nodes? We currently have half of physical memory > allocated to pagepool on these nodes. 
> > For some historical context-- we had two indicidents that drove us to > increase our NSD server/FS manager pagepools. One was a weird behavior in > GPFS 3.5 that was causing bouncing FS managers until we bumped the page > pool from a few gigs to about half of the physical memory on the node. The > other was a mass round of parallel mmfsck's of all 20 something of our > filesystems. It came highly recommended to us to increase the pagepool to > something very large for that. > > I'm curious to hear what other folks do and what the recommendations from > IBM folks are. > > Thanks, > Aaron > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Aug 19 02:07:29 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Fri, 18 Aug 2017 21:07:29 -0400 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. Message-ID: <35574.1503104849@turing-police.cc.vt.edu> So for a variety of reasons, we had accumulated some 45 tapes that had found ways to get out of Valid status. I've cleaned up most of them, but I'm stuck on a few corner cases. Case 1: l% tfsee info tapes | sort | grep -C 1 'Not Sup' AV0186JD Valid TS1150(J5) 9022 0 56 vbi_tapes VTC 1148 - AV0187JD Not Supported TS1150(J5) 9022 2179 37 vbi_tapes VTC 1149 - AV0188JD Valid TS1150(J5) 9022 1559 67 vbi_tapes VTC 1150 - -- AV0540JD Valid TS1150(J5) 9022 9022 0 vtti_tapes VTC 1607 - AV0541JD Not Supported TS1150(J5) 9022 1797 6 vtti_tapes VTC 1606 - AV0542JD Valid TS1150(J5) 9022 9022 0 vtti_tapes VTC 1605 - How the heck does *that* happen? And how do you fix it? Case 2: The docs say that for 'Invalid', you need to add it to the pool with -c. % ltfsee pool remove -p arc_tapes -l ISB -t AI0084JD; ltfsee pool add -c -p arc_tapes -l ISB -t AI0084JD GLESL043I(01052): Removing tape AI0084JD from storage pool arc_tapes. GLESL041E(01129): Tape AI0084JD does not exist in storage pool arc_tapes or is in an invalid state. Specify a valid tape ID. GLESL042I(00809): Adding tape AI0084JD to storage pool arc_tapes. (Not sure why the last 2 messages got out of order..) % ltfsee info tapes | grep AI0084JD AI0084JD Invalid LTFS TS1150 0 0 0 - ISB 1262 - What do you do if adding it with -c doesn't work? Time to reformat the tape? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From makaplan at us.ibm.com Sat Aug 19 16:45:48 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 19 Aug 2017 11:45:48 -0400 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. In-Reply-To: <35574.1503104849@turing-police.cc.vt.edu> References: <35574.1503104849@turing-police.cc.vt.edu> Message-ID: I'm kinda curious... I've noticed a few message on this subject -- so I went to the doc.... The doc seems to indicate there are some circumstances where removing the tape with the appropriate command and options and then adding it back will result in the files on the tape becoming available again... But, of course, tapes are not 100% (nothing is), so no guarantee. Perhaps the rigamarole of removing and adding back is compensating for software glitch (bug!) 
-- Logically seems it shouldn't be necessary -- either the tape is readable or not -- the system should be able to do retries and error correction without removing -- but worth a shot. (I'm a gpfs guy, but not an LTFS/EE/tape guy) -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Sat Aug 19 20:05:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Sat, 19 Aug 2017 20:05:05 +0100 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. In-Reply-To: References: <35574.1503104849@turing-police.cc.vt.edu> Message-ID: On 19/08/17 16:45, Marc A Kaplan wrote: > I'm kinda curious... I've noticed a few message on this subject -- so I > went to the doc.... > > The doc seems to indicate there are some circumstances where removing > the tape with the appropriate command and options and then adding it > back will result in the files on the tape becoming available again... > But, of course, tapes are not 100% (nothing is), so no guarantee. > Perhaps the rigamarole of removing and adding back is compensating for > software glitch (bug!) -- Logically seems it shouldn't be necessary -- > either the tape is readable or not -- the system should be able to do > retries and error correction without removing -- but worth a shot. > > (I'm a gpfs guy, but not an LTFS/EE/tape guy) > Well with a TSM based HSM there are all sorts of reasons for a tape being marked "offline". Usually it's because there has been some sort of problem with the tape library in my experience. Say there is a problem with the gripper and the library is unable to get the tape, it will mark it as unavailable. Of course issues with reading data from the tape would be another reasons. Typically beyond a number of errors TSM would mark the tape as bad, which is why you always have a copy pool. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From aaron.s.knister at nasa.gov Sun Aug 20 21:02:36 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sun, 20 Aug 2017 16:02:36 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: <30bfb6ca-3d86-08ab-0eec-06def4a2f6db@nasa.gov> I think it would be a huge advantage to support these mysterious nsdChecksum settings for us non GNR folks. Even if the checksums aren't being stored on disk I would think the ability to protect against network-level corruption would be valuable enough to warrant its support. I've created RFE 109269 to request this. We'll see what IBM says. If this is valuable to other folks then please vote for the RFE. -Aaron On 8/2/17 5:53 PM, Stijn De Weirdt wrote: > hi steve, > >> The nsdChksum settings for none GNR/ESS based system is not officially >> supported. It will perform checksum on data transfer over the network >> only and can be used to help debug data corruption when network is a >> suspect. > i'll take not officially supported over silent bitrot any day. > >> Did any of those "Encountered XYZ checksum errors on network I/O to NSD >> Client disk" warning messages resulted in disk been changed to "down" >> state due to IO error? > no. > > If no disk IO error was reported in GPFS log, >> that means data was retransmitted successfully on retry. > we suspected as much. as sven already asked, mmfsck now reports clean > filesystem. 
> i have an ibdump of 2 involved nsds during the reported checksums, i'll > have a closer look if i can spot these retries. > >> As sven said, only GNR/ESS provids the full end to end data integrity. > so with the silent network error, we have high probabilty that the data > is corrupted. > > we are now looking for a test to find out what adapters are affected. we > hoped that nsdperf with verify=on would tell us, but it doesn't. > >> Steve Y. Xiao >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From evan.koutsandreou at adventone.com Mon Aug 21 04:05:40 2017 From: evan.koutsandreou at adventone.com (Evan Koutsandreou) Date: Mon, 21 Aug 2017 03:05:40 +0000 Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing Message-ID: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> Hi - I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. Thank you From mweil at wustl.edu Mon Aug 21 20:54:27 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 21 Aug 2017 14:54:27 -0500 Subject: [gpfsug-discuss] pmcollector node In-Reply-To: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> References: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> Message-ID: <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> any input on this Thanks On 7/5/17 10:51 AM, Matt Weil wrote: > Hello all, > > Question on the requirements on pmcollector node/s for a 500+ node > cluster. Is there a sizing guide? What specifics should we scale? > CPU Disks memory? > > Thanks > > Matt > From kkr at lbl.gov Mon Aug 21 23:33:36 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 21 Aug 2017 15:33:36 -0700 Subject: [gpfsug-discuss] Hold the Date - Spectrum Scale Day @ HPCXXL (Sept 2017, NYC) In-Reply-To: References: Message-ID: <6EF4187F-D8A1-4927-9E4F-4DF703DA04F5@lbl.gov> If you plan on attending the GPFS Day, please use the HPCXXL registration form (link to Eventbrite registration at the link below). The GPFS day is a free event, but you *must* register so we can make sure there are enough seats and food available. If you would like to speak or suggest a topic, please let me know. http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ The agenda is still being worked on, here are some likely topics: --RoadMap/Updates --"New features - New Bugs? (Julich) --GPFS + Openstack (CSCS) --ORNL Update on Spider3-related GPFS work --ANL Site Update --File Corruption Session Best, Kristy > On Aug 8, 2017, at 11:33 AM, Kristy Kallback-Rose wrote: > > Hello, > > The GPFS Day of the HPCXXL conference is confirmed for Thursday, September 28th. Here is an updated URL, the agenda and registration are still being put together http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ . 
The GPFS Day will require registration, so we can make sure there is enough room (and coffee/food) for all attendees ?however, there will be no registration fee if you attend the GPFS Day only. > > I?ll send another update when the agenda is closer to settled. > > Cheers, > Kristy > >> On Jul 7, 2017, at 3:32 PM, Kristy Kallback-Rose > wrote: >> >> Hello, >> >> More details will be provided as they become available, but just so you can make a placeholder on your calendar, there will be a Spectrum Scale Day the week of September 25th - 29th, likely Thursday, September 28th. >> >> This will be a part of the larger HPCXXL meeting (https://www.spxxl.org/?q=New-York-City-2017 ). You may recall this group was formerly called SPXXL and the website is in the process of transitioning to the new name (and getting a new certificate). You will be able to attend *just* the Spectrum Scale day if that is the only portion of the event you would like to attend. >> >> The NYC location is great for Spectrum Scale events because many IBMers, including developers, can come in from Poughkeepsie. >> >> More as we get closer to the date and details are settled. >> >> Cheers, >> Kristy > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Aug 22 04:03:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 21 Aug 2017 23:03:35 -0400 Subject: [gpfsug-discuss] multicluster security Message-ID: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> Hi Everyone, I have a theoretical question about GPFS multiclusters and security. Let's say I have clusters A and B. Cluster A is exporting a filesystem as read-only to cluster B. Where does the authorization burden lay? Meaning, does the security rely on mmfsd in cluster B to behave itself and enforce the conditions of the multi-cluster export? Could someone using the credentials on a compromised node in cluster B just start sending arbitrary nsd read/write commands to the nsds from cluster A (or something along those lines)? Do the NSD servers in cluster A do any sort of sanity or security checking on the I/O requests coming from cluster B to the NSDs they're serving to exported filesystems? I imagine any enforcement would go out the window with shared disks in a multi-cluster environment since a compromised node could just "dd" over the LUNs. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From kkr at lbl.gov Tue Aug 22 05:52:58 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 21 Aug 2017 21:52:58 -0700 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Message-ID: Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From NSCHULD at de.ibm.com Tue Aug 22 08:44:28 2017 From: NSCHULD at de.ibm.com (Norbert Schuld) Date: Tue, 22 Aug 2017 09:44:28 +0200 Subject: [gpfsug-discuss] pmcollector node In-Reply-To: <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> References: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> Message-ID: Above ~100 nodes the answer is "it depends" but memory is certainly the main factor. Important parts for the estimation are the number of nodes, filesystems, NSDs, NFS & SMB shares and the frequency (aka period) with which measurements are made. For a lot of sensors today the default is 1/sec which is quite high. Depending on your needs 1/ 10 sec might do or even 1/min. With just guessing on some numbers I end up with ~24-32 GB RAM needed in total and about the same number for disk space. If you want HA double the number, then divide by the number of collector nodes used in the federation setup. Place the collectors on nodes which do not play an additional important part in your cluster, then CPU should not be an issue. Mit freundlichen Gr??en / Kind regards Norbert Schuld From: Matt Weil To: gpfsug-discuss at spectrumscale.org Date: 21/08/2017 21:54 Subject: Re: [gpfsug-discuss] pmcollector node Sent by: gpfsug-discuss-bounces at spectrumscale.org any input on this Thanks On 7/5/17 10:51 AM, Matt Weil wrote: > Hello all, > > Question on the requirements on pmcollector node/s for a 500+ node > cluster. Is there a sizing guide? What specifics should we scale? > CPU Disks memory? > > Thanks > > Matt > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Jochen.Zeller at sva.de Tue Aug 22 12:09:31 2017 From: Jochen.Zeller at sva.de (Zeller, Jochen) Date: Tue, 22 Aug 2017 11:09:31 +0000 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss Message-ID: Dear community, this morning I started in a good mood, until I've checked my mailbox. Again a reported bug in Spectrum Scale that could lead to data loss. During the last year I was looking for a stable Scale version, and each time I've thought: "Yes, this one is stable and without serious data loss bugs" - a few day later, IBM announced a new APAR with possible data loss for this version. I am supporting many clients in central Europe. They store databases, backup data, life science data, video data, results of technical computing, do HPC on the file systems, etc. Some of them had to change their Scale version nearly monthly during the last year to prevent running in one of the serious data loss bugs in Scale. From my perspective, it was and is a shame to inform clients about new reported bugs right after the last update. From client perspective, it was and is a lot of work and planning to do to get a new downtime for updates. And their internal customers are not satisfied with those many downtimes of the clusters and applications. For me, it seems that Scale development is working on features for a specific project or client, to achieve special requirements. But they forgot the existing clients, using Scale for storing important data or running important workloads on it. 
To make us more visible, I've used the IBM recommended way to notify about mandatory enhancements, the less favored RFE:

http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334

If you like, vote for more reliability in Scale. I hope this is a good way to show development and the responsible persons that we have trouble and are not satisfied with the quality of the releases.

Regards,

Jochen

From stockf at us.ibm.com Tue Aug 22 13:31:52 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 22 Aug 2017 08:31:52 -0400 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug In-Reply-To: References: Message-ID:

My understanding is that the problem is not with the policy engine scanning but with the commands that move data, for example mmrestripefs. So if you are using the policy engine for other purposes, you are not impacted by the problem.

Fred
__________________________________________________
Fred Stock | IBM Pittsburgh Lab | 720-430-8821
stockf at us.ibm.com

From: Kristy Kallback-Rose To: gpfsug main discussion list Date: 08/22/2017 12:53 AM Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Sent by: gpfsug-discuss-bounces at spectrumscale.org

Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch.

Thanks,
Kristy

https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From sxiao at us.ibm.com Tue Aug 22 14:51:25 2017 From: sxiao at us.ibm.com (Steve Xiao) Date: Tue, 22 Aug 2017 09:51:25 -0400 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" In-Reply-To: References: Message-ID:

ILM policy engine scans of metadata are safe and will not trigger the problem.

Steve Y. Xiao

From bbanister at jumptrading.com Tue Aug 22 15:06:00 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 22 Aug 2017 14:06:00 +0000 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug In-Reply-To: References: Message-ID: <6c371b5ac22242c5844eda9b195810e3 at jumptrading.com>

Can anyone tell us when a normal PTF release (4.2.3-4 ??) will be made available that will fix this issue? Trying to decide if I should roll an e-fix or just wait for a normal release, thanks!
-Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Monday, August 21, 2017 11:53 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Note: External Email ________________________________ Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Tue Aug 22 15:27:36 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Tue, 22 Aug 2017 16:27:36 +0200 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata"bug In-Reply-To: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> References: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: From aaron.knister at gmail.com Tue Aug 22 15:37:06 2017 From: aaron.knister at gmail.com (Aaron Knister) Date: Tue, 22 Aug 2017 10:37:06 -0400 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: Hi Jochen, I share your concern about data loss bugs and I too have found it troubling especially since the 4.2 stream is in my immediate future (although I would have rather stayed on 4.1 due to my perception of stability/integrity issues in 4.2). By and large 4.1 has been *extremely* stable for me. While not directly related to the stability concerns, I'm curious as to why your customer sites are requiring downtime to do the upgrades? While, of course, individual servers need to be taken offline to update GPFS the collective should be able to stay up. Perhaps your customer environments just don't lend themselves to that. It occurs to me that some of these bugs sound serious (and indeed I believe this one is) I recently found myself jumping prematurely into an update for the metanode filesize corruption bug that as it turns out that while very scary sounding is not necessarily a particularly common bug (if I understand correctly). 
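To illustrate the rolling-upgrade point above, and assuming quorum can be maintained with one node down at a time, a per-node sequence along these lines keeps the rest of the cluster serving data. Package names and steps vary by distribution and release, so treat this as a sketch rather than a procedure:

  # Repeat on one node at a time, waiting for a healthy state before the next:
  mmumount all        # unmount GPFS file systems on this node only
  mmshutdown          # stop GPFS on this node only
  yum update "gpfs*"  # or rpm -Uvh the new Spectrum Scale packages
  mmbuildgpl          # rebuild the portability layer for the running kernel
  mmstartup
  mmgetstate          # wait for "active", remount, then move to the next node

  # Only after every node is on the new level:
  # mmchconfig release=LATEST
  # mmchfs <device> -V full   (or -V compat, per the migration documentation)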
Perhaps it would be helpful if IBM could clarify the believed risk of these updates or give us some indication if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild". I could imagine IBM legal wanting to avoid a situation where IBM indicates something is low risk but someone hits it and it eats data. Although many companies do this with security patches so perhaps it's a non-issue. >From my perspective I don't think existing customers are being "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt to an ever-changing world and I think these features are necessary and useful. Perhaps Scale would benefit from more resources being dedicated to QA/Testing which isn't a particularly sexy thing-- it doesn't result in any new shiny features for customers (although "not eating your data" is a feature I find really attractive). Anyway, I hope IBM can find a way to minimize the frequency of these bugs. Personally speaking, I'm pretty convinced, it's not for lack of capability or dedication on the part of the great folks actually writing the code. -Aaron On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen wrote: > Dear community, > > this morning I started in a good mood, until I?ve checked my mailbox. > Again a reported bug in Spectrum Scale that could lead to data loss. During > the last year I was looking for a stable Scale version, and each time I?ve > thought: ?Yes, this one is stable and without serious data loss bugs? - a > few day later, IBM announced a new APAR with possible data loss for this > version. > > I am supporting many clients in central Europe. They store databases, > backup data, life science data, video data, results of technical computing, > do HPC on the file systems, etc. Some of them had to change their Scale > version nearly monthly during the last year to prevent running in one of > the serious data loss bugs in Scale. From my perspective, it was and is a > shame to inform clients about new reported bugs right after the last > update. From client perspective, it was and is a lot of work and planning > to do to get a new downtime for updates. And their internal customers are > not satisfied with those many downtimes of the clusters and applications. > > For me, it seems that Scale development is working on features for a > specific project or client, to achieve special requirements. But they > forgot the existing clients, using Scale for storing important data or > running important workloads on it. > > To make us more visible, I?ve used the IBM recommended way to notify about > mandatory enhancements, the less favored RFE: > > > *http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334* > > > If you like, vote for more reliability in Scale. > > I hope this a good way to show development and responsible persons that we > have trouble and are not satisfied with the quality of the releases. > > > Regards, > > Jochen > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kkr at lbl.gov Tue Aug 22 16:24:46 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Tue, 22 Aug 2017 08:24:46 -0700 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" In-Reply-To: References: Message-ID: Thanks we just wanted to confirm given the use of the word "scanning" in describing the trigger. On Aug 22, 2017 6:51 AM, "Steve Xiao" wrote: > ILM policy engine scans of metadatais safe and will not trigger the > problem. > > > Steve Y. Xiao > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Aug 22 17:45:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 22 Aug 2017 12:45:00 -0400 Subject: [gpfsug-discuss] Fwd: FLASH: IBM Spectrum Scale (GPFS): RDMA-enabled network adapter failure on the NSD server may result in file IO error (2017.06.30) In-Reply-To: References: <487469581.449569.1498832342497.JavaMail.webinst@w30112> <2689cf86-eca2-dab6-c6aa-7fc54d923e55@nasa.gov> Message-ID: <3b16ad01-4d83-8106-f2e2-110364f31566@nasa.gov> (I'm slowly catching up on a backlog of e-mail, sorry for the delayed reply). Thanks, Sven. I recognize the complexity and appreciate your explanation. In my mind I had envisioned either the block integrity information being stored as a new metadata structure or stored leveraging T10-DIX/DIF (perhaps configurable on a per-pool basis) to pass the checksums down to the RAID controller. I would quite like to run GNR as software on generic hardware and in fact voted, along with 26 other customers, on an RFE (https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090) requesting this but the request was declined. I think customers spoke pretty loudly there and IBM gave it the kibosh. -Aaron On 06/30/2017 02:25 PM, Sven Oehme wrote: > > end-to-end data integrity is very important and the reason it hasn't > been done in Scale is not because its not important, its because its > very hard to do without impacting performance in a very dramatic way. > > imagine your raid controller blocksize is 1mb and your filesystem > blocksize is 1MB . if your application does a 1 MB write this ends up > being a perfect full block , full track de-stage to your raid layer > and everything works fine and fast. as soon as you add checksum > support you need to add data somehow into this, means your 1MB is no > longer 1 MB but 1 MB+checksum. > > to store this additional data you have multiple options, inline , > outside the data block or some combination ,the net is either you need > to do more physical i/o's to different places to get both the data and > the corresponding checksum or your per block on disc structure becomes > bigger than than what your application reads/or writes, both put > massive burden on the Storage layer as e.g. a 1 MB write will now, > even the blocks are all aligned from the application down to the raid > layer, cause a read/modify/write on the raid layer as the data is > bigger than the physical track size. > > so to get end-to-end checksum in Scale outside of ESS the best way is > to get GNR as SW to run on generic HW, this is what people should vote > for as RFE if they need that functionality. beside end-to-end > checksums you get read/write cache and acceleration , fast rebuild and > many other goodies as a added bonus. 
> > Sven > > > On Fri, Jun 30, 2017 at 10:53 AM Aaron Knister > > wrote: > > In fact the answer was quite literally "no": > > https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=84523 > (the RFE was declined and the answer was that the "function is already > available in GNR environments"). > > Regarding GNR, see this RFE request > https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090 > requesting the use of GNR outside of an ESS/GSS environment. It's > interesting to note this is the highest voted Public RFE for GPFS > that I > can see, at least. It too was declined. > > -Aaron > > On 6/30/17 1:41 PM, Aaron Knister wrote: > > Thanks Olaf, that's good to know (and is kind of what I > suspected). I've > > requested a number of times this capability for those of us who > can't > > use or aren't using GNR and the answer is effectively "no". This > > response is curious to me because I'm sure IBM doesn't believe > that data > > integrity is only important and of value to customers who > purchase their > > hardware *and* software. > > > > -Aaron > > > > On Fri, Jun 30, 2017 at 1:37 PM, Olaf Weiser > > > >> > wrote: > > > > yes.. in case of GNR (GPFS native raid) .. we do end-to-end > > check-summing ... client --> server --> downToDisk > > GNR writes down a chksum to disk (to all pdisks /all "raid" > segments > > ) so that dropped writes can be detected as well as miss-done > > writes (bit flips..) > > > > > > > > From: Aaron Knister > > >> > > To: gpfsug main discussion list > > > >> > > Date: 06/30/2017 07:15 PM > > Subject: [gpfsug-discuss] Fwd: FLASH: IBM Spectrum Scale (GPFS): > > RDMA-enabled network adapter failure on the NSD server may > result in > > file IO error (2017.06.30) > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > ------------------------------------------------------------------------ > > > > > > > > I'm curious to know why this doesn't affect GSS/ESS? Is it a > feature of > > the additional check-summing done on those platforms? > > > > > > -------- Forwarded Message -------- > > Subject: FLASH: IBM Spectrum Scale (GPFS): RDMA-enabled network > > adapter > > failure on the NSD server may result in file IO error > (2017.06.30) > > Date: Fri, 30 Jun 2017 14:19:02 +0000 > > From: IBM My Notifications > > > >> > > To: aaron.s.knister at nasa.gov > > > > > > > > > > > > My Notifications for Storage - 30 Jun 2017 > > > > Dear Subscriber (aaron.s.knister at nasa.gov > > > >), > > > > Here are your updates from IBM My Notifications. > > > > Your support Notifications display in English by default. > Machine > > translation based on your IBM profile > > language setting is added if you specify this option in My > defaults > > within My Notifications. > > (Note: Not all languages are available at this time, and the > English > > version always takes precedence > > over the machine translated version.) > > > > > ------------------------------------------------------------------------------ > > 1. 
IBM Spectrum Scale > > > > - TITLE: IBM Spectrum Scale (GPFS): RDMA-enabled network adapter > > failure > > on the NSD server may result in file IO error > > - URL: > > > http://www.ibm.com/support/docview.wss?uid=ssg1S1010233&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E > > > > > - ABSTRACT: IBM has identified an issue with all IBM GPFS > and IBM > > Spectrum Scale versions where the NSD server is enabled to > use RDMA for > > file IO and the storage used in your GPFS cluster accessed > via NSD > > servers (not fully SAN accessible) includes anything other > than IBM > > Elastic Storage Server (ESS) or GPFS Storage Server (GSS); > under these > > conditions, when the RDMA-enabled network adapter fails, the > issue may > > result in undetected data corruption for file write or read > operations. > > > > > ------------------------------------------------------------------------------ > > Manage your My Notifications subscriptions, or send > questions and > > comments. > > - Subscribe or Unsubscribe - > > https://www.ibm.com/support/mynotifications > > > > - Feedback - > > > https://www-01.ibm.com/support/feedback/techFeedbackCardContentMyNotifications.html > > > > > > > - Follow us on Twitter - https://twitter.com/IBMStorageSupt > > > > > > > > > > To ensure proper delivery please add > mynotify at stg.events.ihost.com > > > to > > your address book. > > You received this email because you are subscribed to IBM My > > Notifications as: > > aaron.s.knister at nasa.gov > > > > > > Please do not reply to this message as it is generated by an > automated > > service machine. > > > > (C) International Business Machines Corporation 2017. All rights > > reserved. > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Aug 23 05:40:19 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 23 Aug 2017 00:40:19 -0400 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: Aaron, IBM's policy is to issue a flash when such data corruption/loss problem has been identified, even if the problem has never been encountered by any customer. In fact, most of the flashes have been the result of internal test activity, even though the discovery took place after the affected versions/PTFs have already been released. 
This is the case of two of the recent flashes: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293 http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 The flashes normally do not indicate the risk level that a given problem has of being hit, since there are just too many variables at play, given that clusters and workloads vary significantly. The first issue above appears to be uncommon (and potentially rare). The second issue seems to have a higher probability of occurring -- and as described in the flash, the problem is triggered by failures being encountered while running one of the commands listed in the "Users Affected" section of the writeup. I don't think precise recommendations could be given on if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild" since different clusters, configuration, or workload may drastically affect the the likelihood of hitting the problem. On the other hand, when coming up with the text for the flash, the team attempts to provide as much information as possible/available on the known triggers and mitigation circumstances. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Aaron Knister To: gpfsug main discussion list Date: 08/22/2017 10:37 AM Subject: Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Jochen, I share your concern about data loss bugs and I too have found it troubling especially since the 4.2 stream is in my immediate future (although I would have rather stayed on 4.1 due to my perception of stability/integrity issues in 4.2). By and large 4.1 has been *extremely* stable for me. While not directly related to the stability concerns, I'm curious as to why your customer sites are requiring downtime to do the upgrades? While, of course, individual servers need to be taken offline to update GPFS the collective should be able to stay up. Perhaps your customer environments just don't lend themselves to that. It occurs to me that some of these bugs sound serious (and indeed I believe this one is) I recently found myself jumping prematurely into an update for the metanode filesize corruption bug that as it turns out that while very scary sounding is not necessarily a particularly common bug (if I understand correctly). Perhaps it would be helpful if IBM could clarify the believed risk of these updates or give us some indication if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild". I could imagine IBM legal wanting to avoid a situation where IBM indicates something is low risk but someone hits it and it eats data. Although many companies do this with security patches so perhaps it's a non-issue. From my perspective I don't think existing customers are being "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt to an ever-changing world and I think these features are necessary and useful. Perhaps Scale would benefit from more resources being dedicated to QA/Testing which isn't a particularly sexy thing-- it doesn't result in any new shiny features for customers (although "not eating your data" is a feature I find really attractive). Anyway, I hope IBM can find a way to minimize the frequency of these bugs. 
Personally speaking, I'm pretty convinced, it's not for lack of capability or dedication on the part of the great folks actually writing the code. -Aaron On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen wrote: Dear community, this morning I started in a good mood, until I?ve checked my mailbox. Again a reported bug in Spectrum Scale that could lead to data loss. During the last year I was looking for a stable Scale version, and each time I?ve thought: ?Yes, this one is stable and without serious data loss bugs? - a few day later, IBM announced a new APAR with possible data loss for this version. I am supporting many clients in central Europe. They store databases, backup data, life science data, video data, results of technical computing, do HPC on the file systems, etc. Some of them had to change their Scale version nearly monthly during the last year to prevent running in one of the serious data loss bugs in Scale. From my perspective, it was and is a shame to inform clients about new reported bugs right after the last update. From client perspective, it was and is a lot of work and planning to do to get a new downtime for updates. And their internal customers are not satisfied with those many downtimes of the clusters and applications. For me, it seems that Scale development is working on features for a specific project or client, to achieve special requirements. But they forgot the existing clients, using Scale for storing important data or running important workloads on it. To make us more visible, I?ve used the IBM recommended way to notify about mandatory enhancements, the less favored RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334 If you like, vote for more reliability in Scale. I hope this a good way to show development and responsible persons that we have trouble and are not satisfied with the quality of the releases. Regards, Jochen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=Nh-z-CGPni6b-k9jTdJfWNw6-jtvc8OJgjogfIyp498&s=Vsf2AaMf7b7F6Qv3lGZ9-xBciF9gdfuqnb206aVG-Go&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 23 11:11:37 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 10:11:37 +0000 Subject: [gpfsug-discuss] AFM weirdness Message-ID: We're using an AFM cache from our HPC nodes to access data in another GPFS cluster, mostly this seems to be working fine, but we've just come across an interesting problem with a user using gfortran from the GCC 5.2.0 toolset. When linking their code, they get a "no space left on device" error back from the linker. If we do this on a node that mounts the file-system directly (I.e. Not via AFM cache), then it works fine. We tried with GCC 4.5 based tools and it works OK, but the difference there is that 4.x uses ld and 5x uses ld.gold. 
If we strike the ld.gold when using AFM, we see: stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 unlink("program") = 0 open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on device) Vs when running directly on the file-system: stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 unlink("program") = 0 open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 fallocate(30, 0, 0, 248480) = 0 Anyone seen anything like this before? ... Actually I'm about to go off and see if its a function of AFM, or maybe something to do with the FS in use (I.e. Make a local directory on the filesystem on the "AFM" FS and see if that works ...) Thanks Simon From S.J.Thompson at bham.ac.uk Wed Aug 23 11:17:58 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 10:17:58 +0000 Subject: [gpfsug-discuss] AFM weirdness Message-ID: OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From vpuvvada at in.ibm.com Wed Aug 23 13:36:33 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Wed, 23 Aug 2017 18:06:33 +0530 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. 
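If you want to confirm the same failure mode without rebuilding anything, the preallocation path can be driven directly from the shell. This is only a sketch: the directory below is a placeholder for a path inside the AFM cache fileset, and fallocate(1) is the util-linux tool, which issues an fallocate(2) call much like the one ld.gold triggers through posix_fallocate:

# inside the AFM cache fileset: the preallocation request is expected to
# be refused with ENOSPC even though the fileset has plenty of free space
fallocate -l 1M /path/to/afm/cache/dir/prealloc.test

# an ordinary buffered write of the same size should succeed, showing
# this is not a genuine out-of-space condition
dd if=/dev/zero of=/path/to/afm/cache/dir/dd.test bs=1M count=1

Repeating both commands in a non-AFM directory on the same file system should succeed in both cases, matching the ld versus ld.gold behaviour reported above.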
~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 23 14:01:55 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 13:01:55 +0000 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: I've got a PMR open about this ... Will email you the number directly. Looking at the man page for ld.gold, it looks to set '--posix-fallocate' by default. In fact, testing with '-Xlinker -no-posix-fallocate' does indeed make the code compile. 
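For anyone else who hits this, the same option can be pushed through the compiler driver so individual users do not have to remember it. A sketch only, using the single-dash spelling that worked here and assuming your build system passes LDFLAGS through to the link step:

# per link, via the gfortran/gcc driver
gfortran -Wl,-no-posix-fallocate -o program program.o

# or centrally, for builds that honour LDFLAGS
export LDFLAGS="-Wl,-no-posix-fallocate"

-Wl,-no-posix-fallocate and '-Xlinker -no-posix-fallocate' are equivalent ways of handing the flag to ld.gold, and the classic ld does not appear to make this preallocation call at all, which would fit the GCC 4.x toolchain being unaffected.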
Simon From: "vpuvvada at in.ibm.com" > Date: Wednesday, 23 August 2017 at 13:36 To: "gpfsug-discuss at spectrumscale.org" >, Simon Thompson > Subject: Re: [gpfsug-discuss] AFM weirdness I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: gpfsug main discussion list > Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" on behalf of S.J.Thompson at bham.ac.uk> wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Aug 24 13:56:49 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 24 Aug 2017 08:56:49 -0400 Subject: [gpfsug-discuss] Again! 
Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: <12c154d2-8095-408e-ac7e-e654b1448a25@nasa.gov> Thanks Felipe, and everything you said makes sense and I think holds true to my experiences concerning different workloads affecting likelihood of hitting various problems (especially being one of only a handful of sites that hit that 301 SGpanic error from several years back). Perhaps language as subtle as "internal testing revealed" vs "based on reports from customer sites" could be used? But then again I imagine you could encounter a case where you discover something in testing that a customer site subsequently experiences which might limit the usefulness of the wording. I still think it's useful to know if an issue has been exacerbated or triggered by in the wild workloads vs what I imagine to be quite rigorous lab testing perhaps deigned to shake out certain bugs. -Aaron On 8/23/17 12:40 AM, Felipe Knop wrote: > Aaron, > > IBM's policy is to issue a flash when such data corruption/loss > problem has been identified, even if the problem has never been > encountered by any customer. In fact, most of the flashes have been > the result of internal test activity, even though the discovery took > place after the affected versions/PTFs have already been released. > ?This is the case of two of the recent flashes: > > http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293 > > http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 > > The flashes normally do not indicate the risk level that a given > problem has of being hit, since there are just too many variables at > play, given that clusters and workloads vary significantly. > > The first issue above appears to be uncommon (and potentially rare). > ?The second issue seems to have a higher probability of occurring -- > and as described in the flash, the problem is triggered by failures > being encountered while running one of the commands listed in the > "Users Affected" section of the writeup. > > I don't think precise recommendations could be given on > > ?if the bugs fall in the category of "drop everything and patch *now*" > or "this is a theoretically nasty bug but we've yet to see it in the wild" > > since different clusters, configuration, or workload may drastically > affect the the likelihood of hitting the problem. ?On the other hand, > when coming up with the text for the flash, the team attempts to > provide as much information as possible/available on the known > triggers and mitigation circumstances. > > ? Felipe > > ---- > Felipe Knop ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 ?T/L 293-9314 > > > > > > From: ? ? ? ?Aaron Knister > To: ? ? ? ?gpfsug main discussion list > Date: ? ? ? ?08/22/2017 10:37 AM > Subject: ? ? ? ?Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale > could lead to data loss > Sent by: ? ? ? ?gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi Jochen, > > I share your concern about data loss bugs and I too have found it > troubling especially since the 4.2 stream is in my immediate future > (although I would have rather stayed on 4.1 due to my perception of > stability/integrity issues in 4.2). By and large 4.1 has been > *extremely* stable for me. 
> > While not directly related to the stability concerns, I'm curious as > to why your customer sites are requiring downtime to do the upgrades? > While, of course, individual servers need to be taken offline to > update GPFS the collective should be able to stay up. Perhaps your > customer environments just don't lend themselves to that.? > > It occurs to me that some of these bugs sound serious (and indeed I > believe this one is) I recently found myself jumping prematurely into > an update for the metanode filesize corruption bug that as it turns > out that while very scary sounding is not necessarily a particularly > common bug (if I understand correctly). Perhaps it would be helpful if > IBM could clarify the believed risk of these updates or give us some > indication if the bugs fall in the category of "drop everything and > patch *now*" or "this is a theoretically nasty bug but we've yet to > see it in the wild". I could imagine IBM legal wanting to avoid a > situation where IBM indicates something is low risk but someone hits > it and it eats data. Although many companies do this with security > patches so perhaps it's a non-issue. > > From my perspective I don't think existing customers are being > "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt > to an ever-changing world and I think these features are necessary and > useful. Perhaps Scale would benefit from more resources being > dedicated to QA/Testing which isn't a particularly sexy thing-- it > doesn't result in any new shiny features for customers (although "not > eating your data" is a feature I find really attractive). > > Anyway, I hope IBM can find a way to minimize the frequency of these > bugs. Personally speaking, I'm pretty convinced, it's not for lack of > capability or dedication on the part of the great folks actually > writing the code. > > -Aaron > > On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen > <_Jochen.Zeller at sva.de_ > wrote: > Dear community, > ? > this morning I started in a good mood, until I?ve checked my mailbox. > Again a reported bug in Spectrum Scale that could lead to data loss. > During the last year I was looking for a stable Scale version, and > each time I?ve thought: ?Yes, this one is stable and without serious > data loss bugs? - a few day later, IBM announced a new APAR with > possible data loss for this version. > ? > I am supporting many clients in central Europe. They store databases, > backup data, life science data, video data, results of technical > computing, do HPC on the file systems, etc. Some of them had to change > their Scale version nearly monthly during the last year to prevent > running in one of the serious data loss bugs in Scale. From my > perspective, it was and is a shame to inform clients about new > reported bugs right after the last update. From client perspective, it > was and is a lot of work and planning to do to get a new downtime for > updates. And their internal customers are not satisfied with those > many downtimes of the clusters and applications. > ? > For me, it seems that Scale development is working on features for a > specific project or client, to achieve special requirements. But they > forgot the existing clients, using Scale for storing important data or > running important workloads on it. > ? > To make us more visible, I?ve used the IBM recommended way to notify > about mandatory enhancements, the less favored RFE: > ? > _http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334_ > ? 
> If you like, vote for more reliability in Scale. > ? > I hope this a good way to show development and responsible persons > that we have trouble and are not satisfied with the quality of the > releases. > ? > ? > Regards, > ? > Jochen > ? > ? > ? > ? > ? > ? > ? > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at _spectrumscale.org_ > _ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=Nh-z-CGPni6b-k9jTdJfWNw6-jtvc8OJgjogfIyp498&s=Vsf2AaMf7b7F6Qv3lGZ9-xBciF9gdfuqnb206aVG-Go&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Aug 25 08:44:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Fri, 25 Aug 2017 07:44:35 +0000 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: So as Venkat says, AFM doesn't support using fallocate() to preallocate space. So why aren't other people seeing this ... Well ... We use EasyBuild to build our HPC cluster software including the compiler tool chains. This enables the new linker ld.gold by default rather than the "old" ld. Interestingly we don't seem to have seen this with C code being compiled, only fortran. We can work around it by using the options to gfortran I mention below. There is a mention to this limitation at: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_afmlimitations.htm We aren;t directly calling gpfs_prealloc, but I guess the linker is indirectly calling it by making a call to posix_fallocate. I do have a new problem with AFM where the data written to the cache differs from that replicated back to home... I'm beginning to think I don't like the decision to use AFM! Given the data written back to HOME is corrupt, I think this is definitely PMR time. But ... If you have Abaqus on you system and are using AFM, I'd be interested to see if someone else sees the same issue as us! Simon From: > on behalf of Simon Thompson > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 23 August 2017 at 14:01 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] AFM weirdness I've got a PMR open about this ... Will email you the number directly. Looking at the man page for ld.gold, it looks to set '--posix-fallocate' by default. In fact, testing with '-Xlinker -no-posix-fallocate' does indeed make the code compile. Simon From: "vpuvvada at in.ibm.com" > Date: Wednesday, 23 August 2017 at 13:36 To: "gpfsug-discuss at spectrumscale.org" >, Simon Thompson > Subject: Re: [gpfsug-discuss] AFM weirdness I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. 
~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: gpfsug main discussion list > Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" on behalf of S.J.Thompson at bham.ac.uk> wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From kums at us.ibm.com Fri Aug 25 22:36:39 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 25 Aug 2017 17:36:39 -0400 Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing In-Reply-To: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> References: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> Message-ID: Hi, >>I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? >> I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. 
Please ensure that all the recommended FPO settings (e.g. allowWriteAffinity=yes in the FPO storage pool, readReplicaPolicy=local, restripeOnDiskFailure=yes) are set properly. Please find the FPO Best practices/tunings, in the links below: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Big%20Data%20Best%20practices https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ab5c2792-feef-4a3a-a21b-d22c6f5d728a/attachment/80d5c300-7b39-4d6e-9596-84934fcc4638/media/Deploying_a_big_data_solution_using_IBM_Spectrum_Scale_v1.7.5.pdf >> For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). >> Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. With FPO, GPFS metadata (-m) and data replication (-r) needs to be enabled. The Write-affinity-Depth (WAD) setting defines the policy for directing writes. It indicates that the node writing the data directs the write to disks on its own node for the first copy and to the disks on other nodes for the second and third copies (if specified). readReplicaPolicy=local will enable the policy to read replicas from local disks. At the minimum, ensure that the networking used for GPFS is sized properly and has bandwidth 2X or 3X that of the local disk speeds to ensure FPO write bandwidth is not being constrained by GPFS replication over the network. For example, if 24 x Drives in RAID-0 results in ~4.8 GB/s (assuming ~200MB/s per drive) and GPFS metadata/data replication is set to 3 (-m 3 -r 3) then for optimal FPO write bandwidth, we need to ensure the network-interconnect between the FPO nodes is non-blocking/high-speed and can sustain ~14.4 GB/s ( data_replication_factor * local_storage_bandwidth). One possibility, is minimum of 2 x EDR Infiniband (configure GPFS verbsRdma/verbsPorts) or bonded 40GigE between the FPO nodes (for GPFS daemon-to-daemon communication). Application reads requiring FPO reads from remote GPFS node would as well benefit from high-speed network-interconnect between the FPO nodes. Regards, -Kums From: Evan Koutsandreou To: "gpfsug-discuss at spectrumscale.org" Date: 08/20/2017 11:06 PM Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi - I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. Thank you _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From scale at us.ibm.com Fri Aug 25 23:41:53 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 25 Aug 2017 18:41:53 -0400 Subject: [gpfsug-discuss] multicluster security In-Reply-To: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> References: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> Message-ID: Hi Aaron, If cluster A uses the mmauth command to grant a file system read-only access to a remote cluster B, nodes on cluster B can only mount that file system with read-only access. But the only checking being done at the RPC level is the TLS authentication. This should prevent non-root users from initiating RPCs, since TLS authentication requires access to the local cluster's private key. However, a root user on cluster B, having access to cluster B's private key, might be able to craft RPCs that may allow one to work around the checks which are implemented at the file system level. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 08/21/2017 11:04 PM Subject: [gpfsug-discuss] multicluster security Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Everyone, I have a theoretical question about GPFS multiclusters and security. Let's say I have clusters A and B. Cluster A is exporting a filesystem as read-only to cluster B. Where does the authorization burden lay? Meaning, does the security rely on mmfsd in cluster B to behave itself and enforce the conditions of the multi-cluster export? Could someone using the credentials on a compromised node in cluster B just start sending arbitrary nsd read/write commands to the nsds from cluster A (or something along those lines)? Do the NSD servers in cluster A do any sort of sanity or security checking on the I/O requests coming from cluster B to the NSDs they're serving to exported filesystems? I imagine any enforcement would go out the window with shared disks in a multi-cluster environment since a compromised node could just "dd" over the LUNs. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=oK_bEPbjuD7j6qLTHbe7HM4ujUlpcNYtX3tMW2QC7_w&s=BliMQ0pToLIIiO1jfyUp2Q3icewcONrcmHpsIj_hMtY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sat Aug 26 20:39:58 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sat, 26 Aug 2017 19:39:58 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Message-ID: Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Sun Aug 27 01:35:06 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Sat, 26 Aug 2017 20:35:06 -0400 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sun Aug 27 14:32:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sun, 27 Aug 2017 13:32:20 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> Fred / All, Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? Kevin On Aug 26, 2017, at 7:35 PM, Frederick Stock > wrote: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kums at us.ibm.com Sun Aug 27 23:07:17 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Sun, 27 Aug 2017 18:07:17 -0400 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> References: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> Message-ID: Hi Kevin, >> Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? I presume, by "mmrestripefs data loss bug" you are referring to APAR IV98609 (link below)? If yes, 4.2.3.4 contains the fix for APAR IV98609. http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 Problems fixed in GPFS 4.2.3.4 (details in link below): https://www.ibm.com/developerworks/community/forums/html/topic?id=f3705faa-b6aa-415c-a3e6-1fe9d8293db1&ps=25 * This update addresses the following APARs: IV98545 IV98609 IV98640 IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 IV99059 IV99060 IV99062 IV99063. Regards, -Kums From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/27/2017 09:32 AM Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Fred / All, Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? Kevin On Aug 26, 2017, at 7:35 PM, Frederick Stock wrote: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=McIf98wfiVqHU8ZygezLrQ&m=0rUCqrbJ4Ny44Rmr8x8HvX5q4yqS-4tkN02fiIm9ttg&s=FYfr0P3sVBhnGGsj33W-A9JoDj7X300yTt5D4y5rpJY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Aug 28 13:26:35 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 28 Aug 2017 08:26:35 -0400 Subject: [gpfsug-discuss] sas avago/lsi hba reseller recommendation Message-ID: We have several avago/lsi 9305-16e that I believe came from Advanced HPC. Can someone recommend a another reseller of these hbas or a contact with Advance HPC? -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Aug 28 13:36:16 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Mon, 28 Aug 2017 12:36:16 +0000 Subject: [gpfsug-discuss] sas avago/lsi hba reseller recommendation In-Reply-To: References: Message-ID: <28676C04-60E6-4AB6-8FEF-24EA719E8786@nasa.gov> Hi Eric, I shot you an email directly with contact info. -Aaron On August 28, 2017 at 08:26:56 EDT, J. Eric Wonderley wrote: We have several avago/lsi 9305-16e that I believe came from Advanced HPC. Can someone recommend a another reseller of these hbas or a contact with Advance HPC? -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Aug 29 15:30:25 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 29 Aug 2017 14:30:25 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? 
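(While sorting that out it is worth double-checking what each node is actually running before and after the change. A quick sketch, nothing exotic:

# running daemon build, which should also reflect any applied efix
mmdiag --version

# installed packages, as a cross-check on RHEL/CentOS nodes
rpm -qa | grep -i gpfs

The exact strings vary by release, so treat the output as something to eyeball rather than parse.)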
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Aug 29 16:53:51 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 29 Aug 2017 15:53:51 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. 
Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Aug 29 18:52:41 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 29 Aug 2017 17:52:41 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: , Message-ID: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Buterbaugh, Kevin L Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 14:54:41 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 13:54:41 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! 
-- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Wed Aug 30 14:56:29 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 30 Aug 2017 13:56:29 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: No worries, I've got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help!
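(For what it is worth, rolling a PTF or efix onto one node at a time usually looks roughly like the sketch below. The package names are placeholders rather than the exact file names in the fix package, and the README inside the efix tarball is the authoritative procedure:

  mmumount all -N node1                   # unmount GPFS file systems on the node being patched
  mmshutdown -N node1                     # stop the GPFS daemon there
  rpm -Uvh gpfs.base-*.rpm gpfs.gpl-*.rpm gpfs.docs-*.rpm   # install the updated packages from the fix
  /usr/lpp/mmfs/bin/mmbuildgpl            # rebuild the portability layer for the running kernel
  mmstartup -N node1
  mmgetstate -N node1                     # wait for the node to report active before moving on

Repeat node by node so the cluster stays up throughout.)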
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 30 15:06:00 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 30 Aug 2017 14:06:00 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Oh, the first one looks like the AFM issue I mentioned a couple of days back with Abaqus ... (if you use Abaqus on your AFM cache, then this is for you!) Simon From: > on behalf of "Buterbaugh, Kevin L" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 30 August 2017 at 14:54 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From:gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. 
I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Wed Aug 30 15:12:30 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 30 Aug 2017 14:12:30 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! 
Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 30 15:21:09 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 30 Aug 2017 14:21:09 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? 
Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. 
The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 15:28:07 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 14:28:07 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> References: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> Message-ID: <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> Hi Bryan, NO - it has the fix for the mmrestripefs data loss bug, but you need the efix on top of 4.2.3-4 for the mmadddisk / mmdeldisk issue. Let me take this opportunity to also explain a workaround that has worked for us so far for that issue ? the basic problem is two-fold (on our cluster, at least). First, the /var/mmfs/gen/mmsdrfs file isn?t making it out to all nodes all the time. That is simple enough to fix (mmrefresh -fa) and verify that it?s fixed (md5sum /var/mmfs/gen/mmsdrfs). Second, however - and this is the real problem ? some nodes are never actually rereading that file and therefore have incorrect information *in memory*. This has been especially problematic for us as we are replacing a batch of 80 8 TB drives with bad firmware. I am therefore deleting and subsequently recreating NSDs *with the same name*. If a client node still has the ?old? information in memory then it unmounts the filesystem when I try to mmadddisk the new NSD. The workaround is to identify those nodes (mmfsadm dump nsd and grep for the identifier of the NSD(s) in question) and force them to reread the info (tsctl rereadnsd). HTH? Kevin On Aug 30, 2017, at 9:21 AM, Bryan Banister > wrote: Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. 
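(Pulling the workaround Kevin describes above into one place, as a rough sketch only -- it leans on the same service-level tools he names (mmdsh, mmrefresh, mmfsadm dump, tsctl), so treat it with care and ideally run it with IBM support involved. The <nsd-name> below is a placeholder:

  mmdsh -N all md5sum /var/mmfs/gen/mmsdrfs          # checksums should be identical on every node
  mmrefresh -fa                                      # push the current mmsdrfs back out if they are not
  mmdsh -N all "mmfsadm dump nsd | grep <nsd-name>"  # spot nodes still holding the old NSD definition in memory
  tsctl rereadnsd                                    # run on those nodes to force the NSD info to be reread

Only then attempt the mmadddisk of the recreated NSD.)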
In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C493f1f9e41e343324f1508d4efb25f4f%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636396996783027614&sdata=qYxCMMg9O31LzFg%2FQkCdQg8vV%2FgL2AuRk%2B6V2j76c7Y%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 30 15:30:07 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 30 Aug 2017 14:30:07 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> References: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> Message-ID: <48dac1a1fc6945fdb0d8e94cb7269e3a@jumptrading.com> Thanks for the excellent description? I have my PMR open for the e-fix, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, August 30, 2017 9:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Hi Bryan, NO - it has the fix for the mmrestripefs data loss bug, but you need the efix on top of 4.2.3-4 for the mmadddisk / mmdeldisk issue. Let me take this opportunity to also explain a workaround that has worked for us so far for that issue ? the basic problem is two-fold (on our cluster, at least). First, the /var/mmfs/gen/mmsdrfs file isn?t making it out to all nodes all the time. That is simple enough to fix (mmrefresh -fa) and verify that it?s fixed (md5sum /var/mmfs/gen/mmsdrfs). Second, however - and this is the real problem ? some nodes are never actually rereading that file and therefore have incorrect information *in memory*. This has been especially problematic for us as we are replacing a batch of 80 8 TB drives with bad firmware. I am therefore deleting and subsequently recreating NSDs *with the same name*. If a client node still has the ?old? information in memory then it unmounts the filesystem when I try to mmadddisk the new NSD. 
The workaround is to identify those nodes (mmfsadm dump nsd and grep for the identifier of the NSD(s) in question) and force them to reread the info (tsctl rereadnsd). HTH? Kevin On Aug 30, 2017, at 9:21 AM, Bryan Banister > wrote: Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
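(A couple of read-only sanity checks that can be worth running before and after recreating an NSD under the same name, just to confirm what the cluster and the individual nodes think that NSD maps to. <fsname> is a placeholder for the file system in question:

  mmlsnsd                        # NSD to file system and NSD-server mapping
  mmlsnsd -X                     # adds the local device each node resolves the NSD to
  mmlsdisk <fsname> -L           # per-disk status and availability within the file system
)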
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C493f1f9e41e343324f1508d4efb25f4f%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636396996783027614&sdata=qYxCMMg9O31LzFg%2FQkCdQg8vV%2FgL2AuRk%2B6V2j76c7Y%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 20:26:41 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 19:26:41 +0000 Subject: [gpfsug-discuss] Permissions issue in GPFS 4.2.3-4? Message-ID: Hi All, We have a script that takes the output of mmlsfs and mmlsquota and formats a users? GPFS quota usage into something a little ?nicer? than what mmlsquota displays (and doesn?t display 50 irrelevant lines of output for filesets they don?t have access to). After upgrading to 4.2.3-4 over the weekend it started throwing errors it hadn?t before: awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) mmlsfs: Unexpected error from awk. Return code: 2 awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) mmlsfs: Unexpected error from awk. Return code: 2 Home (user): 11.82G 30G 40G 10807 200000 300000 awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) mmlsquota: Unexpected error from awk. Return code: 2 It didn?t take long to track down that the mmfs.cfg.show file had permissions of 600 and a chmod 644 of it (on our login gateways only, which is the only place users run that script anyway) fixed the problem. So I just wanted to see if this was a known issue in 4.2.3-4? Notice that the error appears to be coming from the GPFS commands my script runs, not my script itself ? I sure don?t call awk! ;-) Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Wed Aug 30 20:34:46 2017 From: david_johnson at brown.edu (David Johnson) Date: Wed, 30 Aug 2017 15:34:46 -0400 Subject: [gpfsug-discuss] Permissions issue in GPFS 4.2.3-4? In-Reply-To: References: Message-ID: <13019F3B-AF64-4D92-AAB1-4CF3A635383C@brown.edu> We ran into this back in mid February. Never really got a satisfactory answer how it got this way, the thought was that a bunch of nodes were expelled during an mmchconfig, and the files ended up with the wrong permissions. ? ddj > On Aug 30, 2017, at 3:26 PM, Buterbaugh, Kevin L wrote: > > Hi All, > > We have a script that takes the output of mmlsfs and mmlsquota and formats a users? GPFS quota usage into something a little ?nicer? than what mmlsquota displays (and doesn?t display 50 irrelevant lines of output for filesets they don?t have access to). After upgrading to 4.2.3-4 over the weekend it started throwing errors it hadn?t before: > > awk: cmd. 
line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) > mmlsfs: Unexpected error from awk. Return code: 2 > awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) > mmlsfs: Unexpected error from awk. Return code: 2 > Home (user): 11.82G 30G 40G 10807 200000 300000 > awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) > mmlsquota: Unexpected error from awk. Return code: 2 > > It didn?t take long to track down that the mmfs.cfg.show file had permissions of 600 and a chmod 644 of it (on our login gateways only, which is the only place users run that script anyway) fixed the problem. > > So I just wanted to see if this was a known issue in 4.2.3-4? Notice that the error appears to be coming from the GPFS commands my script runs, not my script itself ? I sure don?t call awk! ;-) > > Thanks? > > Kevin > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Tue Aug 1 07:16:19 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Tue, 1 Aug 2017 09:16:19 +0300 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports Message-ID: Hi, I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum scale cluster (CentOS). I dont need NFSv4 ACLS enabled, but i dont mind them to be if its mandatory for the NFSv4 to work. I have created the domain user "fwuser" in the Active Directory (domain=LH20), it is in group Domain users, Domain Admins, Backup Operators and administrators. In the linux machine im with user ilanwalk (sudoer) [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) groups=12000513(LH20\domain users),12001603(LH20\fwuser),12000572(LH20\denied rodc password replication group),12000512(LH20\domain admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) and when trying to add smb export: [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share /fs_gpfs01 --option "admin users=LH20\fwuser" mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file system that does not enforce NFSv4 ACLs. [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) Also, when trying to enable NFS i get: [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS Failed to enable NFS service. Ensure file authentication is removed prior enabling service. What am I missing ? From jonathan at buzzard.me.uk Tue Aug 1 09:50:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Tue, 01 Aug 2017 09:50:05 +0100 Subject: [gpfsug-discuss] Quota and hardlimit enforcement In-Reply-To: <3789811E-523F-47AE-93F3-E2985DD84D60@vanderbilt.edu> References: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de> <20170731160353.54412s4i1r957eax@support.scinet.utoronto.ca> <3789811E-523F-47AE-93F3-E2985DD84D60@vanderbilt.edu> Message-ID: <1501577405.17548.11.camel@buzzard.me.uk> On Mon, 2017-07-31 at 20:11 +0000, Buterbaugh, Kevin L wrote: > Jaime, > > > That?s heavily workload dependent. 
We run a traditional HPC cluster > and have a 7 day grace on home and 14 days on scratch. By setting the > soft and hard limits appropriately we?ve slammed the door on many a > runaway user / group / fileset. YMMV? > I would concur that it is heavily workload dependant. I have never had a problem with a 7 day period. Besides which if they can significantly blow through the hard limit due to heavy writing and the "in doubt" value then it matters not one jot that grace is 7 days or two hours. My preference however is to set the grace period to as long as possible (which from memory is about 10 years on GPFS) then set the soft at 90% of the hard and use over quota callbacks to signal that there is a problem. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From r.sobey at imperial.ac.uk Tue Aug 1 15:23:58 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 1 Aug 2017 14:23:58 +0000 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports Message-ID: You must have nfs4 Acl semantics only to create smb exports. Mmchfs -k parameter as I recall. On 1 Aug 2017 7:16 am, Ilan Schwarts wrote: Hi, I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum scale cluster (CentOS). I dont need NFSv4 ACLS enabled, but i dont mind them to be if its mandatory for the NFSv4 to work. I have created the domain user "fwuser" in the Active Directory (domain=LH20), it is in group Domain users, Domain Admins, Backup Operators and administrators. In the linux machine im with user ilanwalk (sudoer) [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) groups=12000513(LH20\domain users),12001603(LH20\fwuser),12000572(LH20\denied rodc password replication group),12000512(LH20\domain admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) and when trying to add smb export: [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share /fs_gpfs01 --option "admin users=LH20\fwuser" mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file system that does not enforce NFSv4 ACLs. [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) Also, when trying to enable NFS i get: [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS Failed to enable NFS service. Ensure file authentication is removed prior enabling service. What am I missing ? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Tue Aug 1 16:34:29 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Tue, 1 Aug 2017 18:34:29 +0300 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: References: Message-ID: Yes I succeeded to make smb share. But only the user i put in the command can write files to it. Others can read only. How can i enable write it to all domain users? The group. And what about the error when enabling nfs? On Aug 1, 2017 17:24, "Sobey, Richard A" wrote: > You must have nfs4 Acl semantics only to create smb exports. > > Mmchfs -k parameter as I recall. > > On 1 Aug 2017 7:16 am, Ilan Schwarts wrote: > > Hi, > I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum > scale cluster (CentOS). > I dont need NFSv4 ACLS enabled, but i dont mind them to be if its > mandatory for the NFSv4 to work. 
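(For the two errors that come up in this export thread -- mmsmb refusing the export because the file system "does not enforce NFSv4 ACLs", and "Ensure file authentication is removed prior enabling service" when enabling NFS -- the usual sequence is sketched below. Option details vary by release and by the AD setup, so treat this as an outline rather than a recipe:

  mmlsfs fs_gpfs01 -k                                  # must report nfs4 before mmsmb will export a path
  mmchfs fs_gpfs01 -k nfs4                             # switch ACL enforcement to NFSv4 only
  mmuserauth service list                              # check whether file authentication is already configured
  mmuserauth service remove --data-access-method file  # NFS can only be enabled with no file auth in place
  mmces service enable NFS
  mmuserauth service create --data-access-method file --type ad ...   # then re-create the AD auth (options omitted here)

Making the share writable for the whole "domain users" group is then a matter of the ACL on the exported directory itself (chgrp/chmod, or mmputacl / mmeditacl for NFSv4 ACLs) rather than of the "admin users" SMB option.)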
> > I have created the domain user "fwuser" in the Active Directory > (domain=LH20), it is in group Domain users, Domain Admins, Backup > Operators and administrators. > > In the linux machine im with user ilanwalk (sudoer) > > [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser > uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) > groups=12000513(LH20\domain > users),12001603(LH20\fwuser),12000572(LH20\denied rodc password > replication group),12000512(LH20\domain > admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) > > > and when trying to add smb export: > [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share > /fs_gpfs01 --option "admin users=LH20\fwuser" > mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file > system that does not enforce NFSv4 ACLs. > > > [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs > fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) > > > > Also, when trying to enable NFS i get: > [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS > Failed to enable NFS service. Ensure file authentication is removed > prior enabling service. > > > What am I missing ? > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From laurence at qsplace.co.uk Tue Aug 1 23:40:38 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Tue, 1 Aug 2017 23:40:38 +0100 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: References: Message-ID: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> Could you please give a break down of the commands that you have used to configure/setup the CES services? Which guide did you follow? and what version of GPFS/SS are you currently running -- Lauz On 01/08/2017 16:34, Ilan Schwarts wrote: > > Yes I succeeded to make smb share. But only the user i put in the > command can write files to it. Others can read only. > > How can i enable write it to all domain users? The group. > And what about the error when enabling nfs? > > On Aug 1, 2017 17:24, "Sobey, Richard A" > wrote: > > You must have nfs4 Acl semantics only to create smb exports. > > Mmchfs -k parameter as I recall. > > On 1 Aug 2017 7:16 am, Ilan Schwarts > wrote: > > Hi, > I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes > spectrum > scale cluster (CentOS). > I dont need NFSv4 ACLS enabled, but i dont mind them to be if its > mandatory for the NFSv4 to work. > > I have created the domain user "fwuser" in the Active Directory > (domain=LH20), it is in group Domain users, Domain Admins, Backup > Operators and administrators. 
> > In the linux machine im with user ilanwalk (sudoer) > > [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser > uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) > groups=12000513(LH20\domain > users),12001603(LH20\fwuser),12000572(LH20\denied rodc password > replication group),12000512(LH20\domain > admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) > > > and when trying to add smb export: > [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share > /fs_gpfs01 --option "admin users=LH20\fwuser" > mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file > system that does not enforce NFSv4 ACLs. > > > [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs > fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) > > > > Also, when trying to enable NFS i get: > [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS > Failed to enable NFS service. Ensure file authentication is > removed > prior enabling service. > > > What am I missing ? > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Wed Aug 2 05:33:02 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Wed, 2 Aug 2017 07:33:02 +0300 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> References: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> Message-ID: Hi, I use SpectrumScale 4.2.2 I have configured the CES as in documentation: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_setcessharedroot.htm This means i did the following: mmchconfig cesSharedRoot=/fs_gpfs01 mmchnode -?ces-enable ?N LH20-GPFS1,LH20-GPFS2 Thank you Some output: [root at LH20-GPFS1 ~]# mmces state show -a NODE AUTH BLOCK NETWORK AUTH_OBJ NFS OBJ SMB CES LH20-GPFS1 HEALTHY DISABLED DEGRADED DISABLED DISABLED DISABLED HEALTHY DEGRADED LH20-GPFS2 HEALTHY DISABLED DEGRADED DISABLED DISABLED DISABLED HEALTHY DEGRADED [root at LH20-GPFS1 ~]# [root at LH20-GPFS1 ~]# mmces node list Node Name Node Flags Node Groups ----------------------------------------------------------------- 1 LH20-GPFS1 none 3 LH20-GPFS2 none [root at LH20-GPFS1 ~]# mmlscluster --ces GPFS cluster information ======================== GPFS cluster name: LH20-GPFS1 GPFS cluster id: 10777108240438931454 Cluster Export Services global parameters ----------------------------------------- Shared root directory: /fs_gpfs01 Enabled Services: SMB Log level: 0 Address distribution policy: even-coverage Node Daemon node name IP address CES IP address list ----------------------------------------------------------------------- 1 LH20-GPFS1 10.10.158.61 None 3 LH20-GPFS2 10.10.158.62 None On Wed, Aug 2, 2017 at 1:40 AM, Laurence Horrocks-Barlow wrote: > Could you please give a break down of the commands that you have used to > configure/setup the CES services? > > Which guide did you follow? 
and what version of GPFS/SS are you currently > running > > -- Lauz > > > On 01/08/2017 16:34, Ilan Schwarts wrote: > > Yes I succeeded to make smb share. But only the user i put in the command > can write files to it. Others can read only. > > How can i enable write it to all domain users? The group. > And what about the error when enabling nfs? > > On Aug 1, 2017 17:24, "Sobey, Richard A" wrote: >> >> You must have nfs4 Acl semantics only to create smb exports. >> >> Mmchfs -k parameter as I recall. >> >> On 1 Aug 2017 7:16 am, Ilan Schwarts wrote: >> >> Hi, >> I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum >> scale cluster (CentOS). >> I dont need NFSv4 ACLS enabled, but i dont mind them to be if its >> mandatory for the NFSv4 to work. >> >> I have created the domain user "fwuser" in the Active Directory >> (domain=LH20), it is in group Domain users, Domain Admins, Backup >> Operators and administrators. >> >> In the linux machine im with user ilanwalk (sudoer) >> >> [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser >> uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) >> groups=12000513(LH20\domain >> users),12001603(LH20\fwuser),12000572(LH20\denied rodc password >> replication group),12000512(LH20\domain >> admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) >> >> >> and when trying to add smb export: >> [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share >> /fs_gpfs01 --option "admin users=LH20\fwuser" >> mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file >> system that does not enforce NFSv4 ACLs. >> >> >> [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs >> fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) >> >> >> >> Also, when trying to enable NFS i get: >> [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS >> Failed to enable NFS service. Ensure file authentication is removed >> prior enabling service. >> >> >> What am I missing ? >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- - Ilan Schwarts From john.hearns at asml.com Wed Aug 2 10:49:36 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 09:49:36 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. 
Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn't work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Wed Aug 2 11:01:20 2017 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Wed, 2 Aug 2017 10:01:20 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. 
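Renar's point about backlevel packages is easy to confirm before digging any further into the mount behaviour. A quick, read-only check on a CentOS/RHEL node could look like the following (nothing here changes state; these are standard systemd and Scale administration commands):

    rpm -q systemd                   # installed systemd package level
    systemctl --version | head -1    # what systemd itself reports
    mmdiag --version                 # build level of the running GPFS daemon
    mmlsconfig minReleaseLevel       # effective cluster release level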
________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Wed Aug 2 11:50:29 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 10:50:29 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> References: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> Message-ID: Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. 
Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. 
To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From cgirda at wustl.edu Wed Aug 2 13:07:05 2017 From: cgirda at wustl.edu (Chakravarthy Girda) Date: Wed, 2 Aug 2017 17:37:05 +0530 Subject: [gpfsug-discuss] Modify template variables on pre-built grafana dashboard. In-Reply-To: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de> References: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de> Message-ID: <978835cd-4e29-4207-9936-6c95159356a3@wustl.edu> Hi, Successfully created bridge port and imported the pre-built grafana dashboard. https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/a180eb7e-9161-4e07-a6e4-35a0a076f7b3/attachment/5e9a5886-5bd9-4a6f-919e-bc66d16760cf/media/default%20dashboards%20set.zip Getting updates on some graphs but not all. Looks like I need to update the template variables. Need some help/instructions on how to evaluate those default variables on CLI, so I can fix them. Eg:- I get into the "File Systems View" Variable ( gpfsMetrics_fs1 ) --> Query ( gpfsMetrics_fs1 ) Regex ( /.*[^gpfs_fs_inode_used|gpfs_fs_inode_alloc|gpfs_fs_inode_free|gpfs_fs_inode_max]/ ) Question: * How can I execute the above Query and regex to fix the issues. * Is there any document on CLI options? Thank you Chakri -------------- next part -------------- An HTML attachment was scrubbed... URL: From truongv at us.ibm.com Wed Aug 2 15:35:12 2017 From: truongv at us.ibm.com (Truong Vu) Date: Wed, 2 Aug 2017 10:35:12 -0400 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: This sounds like a known problem that was fixed. If you don't have the fix, have you checkout the around in the FAQ 2.4? Tru. 
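Truong's pointer is presumably to the workaround described in FAQ 2.4. Until that (or a fixed code level) is in place, the state systemd has cached for the mount point can at least be inspected with plain systemctl calls. A diagnostic sketch, reusing the unit names from John's log, not the official fix:

    systemctl list-units --all | grep -i gpfstest     # any stale units left behind?
    systemctl status hpc-gpfstest.mount dev-gpfstest.device
    systemctl daemon-reload                           # re-read generated units
    mmmount gpfstest                                  # then retry the mount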
From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/02/2017 06:51 AM Subject: gpfsug-discuss Digest, Vol 67, Issue 4 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Systemd will not allow the mount of a filesystem (John Hearns) ---------------------------------------------------------------------- Message: 1 Date: Wed, 2 Aug 2017 10:50:29 +0000 From: John Hearns To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: Content-Type: text/plain; charset="utf-8" Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 < https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fcommunity%2Fforums%2Fhtml%2Ftopic%3Fid%3D00104bb5-acf5-4036-93ba-29ea7b1d43b7%26ps%3D25&data=01%7C01%7Cjohn.hearns%40asml.com%7Caf48038c0f334674b53208d4d98d739e%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=XuRlV4%2BRTilLfWD5NTK7n08m6IzjAmZ5mZOwUTNplSQ%3D&reserved=0 > Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. 
________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org< mailto:gpfsug-discuss-bounces at spectrumscale.org> [ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741< https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsystemd%2Fsystemd%2Fissues%2F1741&data=01%7C01%7Cjohn.hearns%40asml.com%7Caf48038c0f334674b53208d4d98d739e%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=MNPDZ4bKsQBtYiz0j6SMI%2FCsKmnMbrc7kD6LMh0FQBw%3D&reserved=0 > However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. 
Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170802/c0c43ae8/attachment.html > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 4 ********************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From john.hearns at asml.com Wed Aug 2 15:49:15 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 14:49:15 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: Truong, thankyou for responding. The discussion which Renar referred to discussed system version 208, and suggested upgrading this. The system I am working on at the moment has systemd version 219, and there is only a slight minor number upgrade available. I should say that the temporary fix suggested in that discussion did work for me. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Truong Vu Sent: Wednesday, August 02, 2017 4:35 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem This sounds like a known problem that was fixed. If you don't have the fix, have you checkout the around in the FAQ 2.4? Tru. [Inactive hide details for gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gp]gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/02/2017 06:51 AM Subject: gpfsug-discuss Digest, Vol 67, Issue 4 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Systemd will not allow the mount of a filesystem (John Hearns) ---------------------------------------------------------------------- Message: 1 Date: Wed, 2 Aug 2017 10:50:29 +0000 From: John Hearns > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: > Content-Type: text/plain; charset="utf-8" Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. 
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de> ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. 
-- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 4 ********************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From john.hearns at asml.com Wed Aug 2 16:19:27 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 15:19:27 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: Truong, thanks again for the response. I shall implement what is suggested in the FAQ. As we are in polite company I shall maintain a smiley face when mentioning systemd From: John Hearns Sent: Wednesday, August 02, 2017 4:49 PM To: 'gpfsug main discussion list' Subject: RE: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Truong, thankyou for responding. The discussion which Renar referred to discussed system version 208, and suggested upgrading this. The system I am working on at the moment has systemd version 219, and there is only a slight minor number upgrade available. I should say that the temporary fix suggested in that discussion did work for me. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Truong Vu Sent: Wednesday, August 02, 2017 4:35 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem This sounds like a known problem that was fixed. If you don't have the fix, have you checkout the around in the FAQ 2.4? Tru. [Inactive hide details for gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gp]gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/02/2017 06:51 AM Subject: gpfsug-discuss Digest, Vol 67, Issue 4 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Systemd will not allow the mount of a filesystem (John Hearns) ---------------------------------------------------------------------- Message: 1 Date: Wed, 2 Aug 2017 10:50:29 +0000 From: John Hearns > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: > Content-Type: text/plain; charset="utf-8" Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? 
Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de> ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. 
If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 4 ********************************************* -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From stijn.deweirdt at ugent.be Wed Aug 2 16:57:55 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 17:57:55 +0200 Subject: [gpfsug-discuss] data integrity documentation Message-ID: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> hi all, is there any documentation wrt data integrity in spectrum scale: assuming a crappy network, does gpfs garantee somehow that data written by client ends up safe in the nsd gpfs daemon; and similarly from the nsd gpfs daemon to disk. and wrt crappy network, what about rdma on crappy network? is it the same? (we are hunting down a crappy infiniband issue; ibm support says it's network issue; and we see no errors anywhere...) 
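Before concluding anything about what the file system can or cannot guarantee, it is worth seeing what the fabric itself reports. A read-only sketch, assuming the standard infiniband-diags tools are installed on the NSD servers (adjust to the fabric in question):

    mmlsconfig verbsRdma      # is RDMA enabled for the daemon at all
    mmlsconfig verbsPorts     # and on which HCA ports
    mmdiag --network          # connection state as mmfsd sees it
    ibstat                    # local link state and error counters
    ibqueryerrors             # port error counters across the fabric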
thanks a lot, stijn From eric.wonderley at vt.edu Wed Aug 2 17:15:12 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Wed, 2 Aug 2017 12:15:12 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: No guarantee...unless you are using ess/gss solution. Crappy network will get you loads of expels and occasional fscks. Which I guess beats data loss and recovery from backup. YOu probably have a network issue...they can be subtle. Gpfs is a very extremely thorough network tester. Eric On Wed, Aug 2, 2017 at 11:57 AM, Stijn De Weirdt wrote: > hi all, > > is there any documentation wrt data integrity in spectrum scale: > assuming a crappy network, does gpfs garantee somehow that data written > by client ends up safe in the nsd gpfs daemon; and similarly from the > nsd gpfs daemon to disk. > > and wrt crappy network, what about rdma on crappy network? is it the same? > > (we are hunting down a crappy infiniband issue; ibm support says it's > network issue; and we see no errors anywhere...) > > thanks a lot, > > stijn > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Aug 2 17:26:29 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 16:26:29 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: the very first thing you should check is if you have this setting set : mmlsconfig envVar envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 MLX5_USE_MUTEX 1 if that doesn't come back the way above you need to set it : mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" there was a problem in the Mellanox FW in various versions that was never completely addressed (bugs where found and fixed, but it was never fully proven to be addressed) the above environment variables turn code on in the mellanox driver that prevents this potential code path from being used to begin with. in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale that even you don't set this variables the problem can't happen anymore until then the only choice you have is the envVar above (which btw ships as default on all ESS systems). you also should be on the latest available Mellanox FW & Drivers as not all versions even have the code that is activated by the environment variables above, i think at a minimum you need to be at 3.4 but i don't remember the exact version. There had been multiple defects opened around this area, the last one i remember was : 00154843 : ESS ConnectX-3 performance issue - spinning on pthread_spin_lock you may ask your mellanox representative if they can get you access to this defect. while it was found on ESS , means on PPC64 and with ConnectX-3 cards its a general issue that affects all cards and on intel as well as Power. 
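A minimal sketch of applying Sven's suggestion on a test cluster: the envVar values are environment settings for mmfsd, so (as Sven's follow-up question further down implies) the daemons need to be restarted after the mmchconfig for them to take effect; whether to do that node by node or cluster-wide with -a is a local choice.

    mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"
    mmlsconfig envVar      # confirm the four variables are recorded
    mmshutdown -a          # restart GPFS so mmfsd picks them up
    mmstartup -a
    mmgetstate -a          # wait for all nodes to report active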
On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt wrote: > hi all, > > is there any documentation wrt data integrity in spectrum scale: > assuming a crappy network, does gpfs garantee somehow that data written > by client ends up safe in the nsd gpfs daemon; and similarly from the > nsd gpfs daemon to disk. > > and wrt crappy network, what about rdma on crappy network? is it the same? > > (we are hunting down a crappy infiniband issue; ibm support says it's > network issue; and we see no errors anywhere...) > > thanks a lot, > > stijn > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed...
URL: From stijn.deweirdt at ugent.be Wed Aug 2 19:38:13 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:38:13 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: <2518112e-0311-09c6-4f24-daa2f18bd80c@ugent.be> > No guarantee...unless you are using ess/gss solution. ok, so crappy network == corrupt data? hmmm, that is really a pity on 2017... > > Crappy network will get you loads of expels and occasional fscks. Which I > guess beats data loss and recovery from backup. if only we had errors like that. with the current issue mmfsck is the only tool that seems to trigger them (and setting some of the nsdChksum config flags reports checksum errors in the log files). but nsdperf with verify=on reports nothing. > > YOu probably have a network issue...they can be subtle. Gpfs is a very > extremely thorough network tester. we know ;) stijn > > > Eric > > On Wed, Aug 2, 2017 at 11:57 AM, Stijn De Weirdt > wrote: > >> hi all, >> >> is there any documentation wrt data integrity in spectrum scale: >> assuming a crappy network, does gpfs garantee somehow that data written >> by client ends up safe in the nsd gpfs daemon; and similarly from the >> nsd gpfs daemon to disk. >> >> and wrt crappy network, what about rdma on crappy network? is it the same? >> >> (we are hunting down a crappy infiniband issue; ibm support says it's >> network issue; and we see no errors anywhere...) >> >> thanks a lot, >> >> stijn >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From stijn.deweirdt at ugent.be Wed Aug 2 19:43:51 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:43:51 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> hi sven, > the very first thing you should check is if you have this setting set : maybe the very first thing to check should be the faq/wiki that has this documented? > > mmlsconfig envVar > > envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > MLX5_USE_MUTEX 1 > > if that doesn't come back the way above you need to set it : > > mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" i just set this (wasn't set before), but problem is still present. > > there was a problem in the Mellanox FW in various versions that was never > completely addressed (bugs where found and fixed, but it was never fully > proven to be addressed) the above environment variables turn code on in the > mellanox driver that prevents this potential code path from being used to > begin with. > > in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale > that even you don't set this variables the problem can't happen anymore > until then the only choice you have is the envVar above (which btw ships as > default on all ESS systems). 
> > you also should be on the latest available Mellanox FW & Drivers as not all > versions even have the code that is activated by the environment variables > above, i think at a minimum you need to be at 3.4 but i don't remember the > exact version. There had been multiple defects opened around this area, the > last one i remember was : we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from dell, and the fw is a bit behind. i'm trying to convince dell to make new one. mellanox used to allow to make your own, but they don't anymore. > > 00154843 : ESS ConnectX-3 performance issue - spinning on pthread_spin_lock > > you may ask your mellanox representative if they can get you access to this > defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > cards its a general issue that affects all cards and on intel as well as > Power. ok, thanks for this. maybe such a reference is enough for dell to update their firmware. stijn > > On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt > wrote: > >> hi all, >> >> is there any documentation wrt data integrity in spectrum scale: >> assuming a crappy network, does gpfs garantee somehow that data written >> by client ends up safe in the nsd gpfs daemon; and similarly from the >> nsd gpfs daemon to disk. >> >> and wrt crappy network, what about rdma on crappy network? is it the same? >> >> (we are hunting down a crappy infiniband issue; ibm support says it's >> network issue; and we see no errors anywhere...) >> >> thanks a lot, >> >> stijn >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 19:47:52 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 18:47:52 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> Message-ID: How can you reproduce this so quick ? Did you restart all daemons after that ? On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt wrote: > hi sven, > > > > the very first thing you should check is if you have this setting set : > maybe the very first thing to check should be the faq/wiki that has this > documented? > > > > > mmlsconfig envVar > > > > envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > > MLX5_USE_MUTEX 1 > > > > if that doesn't come back the way above you need to set it : > > > > mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > > MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > i just set this (wasn't set before), but problem is still present. > > > > > there was a problem in the Mellanox FW in various versions that was never > > completely addressed (bugs where found and fixed, but it was never fully > > proven to be addressed) the above environment variables turn code on in > the > > mellanox driver that prevents this potential code path from being used to > > begin with. > > > > in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale > > that even you don't set this variables the problem can't happen anymore > > until then the only choice you have is the envVar above (which btw ships > as > > default on all ESS systems). 
> > > > you also should be on the latest available Mellanox FW & Drivers as not > all > > versions even have the code that is activated by the environment > variables > > above, i think at a minimum you need to be at 3.4 but i don't remember > the > > exact version. There had been multiple defects opened around this area, > the > > last one i remember was : > we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > dell, and the fw is a bit behind. i'm trying to convince dell to make > new one. mellanox used to allow to make your own, but they don't anymore. > > > > > 00154843 : ESS ConnectX-3 performance issue - spinning on > pthread_spin_lock > > > > you may ask your mellanox representative if they can get you access to > this > > defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > > cards its a general issue that affects all cards and on intel as well as > > Power. > ok, thanks for this. maybe such a reference is enough for dell to update > their firmware. > > stijn > > > > > On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt > > wrote: > > > >> hi all, > >> > >> is there any documentation wrt data integrity in spectrum scale: > >> assuming a crappy network, does gpfs garantee somehow that data written > >> by client ends up safe in the nsd gpfs daemon; and similarly from the > >> nsd gpfs daemon to disk. > >> > >> and wrt crappy network, what about rdma on crappy network? is it the > same? > >> > >> (we are hunting down a crappy infiniband issue; ibm support says it's > >> network issue; and we see no errors anywhere...) > >> > >> thanks a lot, > >> > >> stijn > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 19:53:09 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:53:09 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> Message-ID: <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> yes ;) the system is in preproduction, so nothing that can't stopped/started in a few minutes (current setup has only 4 nsds, and no clients). mmfsck triggers the errors very early during inode replica compare. stijn On 08/02/2017 08:47 PM, Sven Oehme wrote: > How can you reproduce this so quick ? > Did you restart all daemons after that ? > > On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > wrote: > >> hi sven, >> >> >>> the very first thing you should check is if you have this setting set : >> maybe the very first thing to check should be the faq/wiki that has this >> documented? 
>> >>> >>> mmlsconfig envVar >>> >>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>> MLX5_USE_MUTEX 1 >>> >>> if that doesn't come back the way above you need to set it : >>> >>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >> i just set this (wasn't set before), but problem is still present. >> >>> >>> there was a problem in the Mellanox FW in various versions that was never >>> completely addressed (bugs where found and fixed, but it was never fully >>> proven to be addressed) the above environment variables turn code on in >> the >>> mellanox driver that prevents this potential code path from being used to >>> begin with. >>> >>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale >>> that even you don't set this variables the problem can't happen anymore >>> until then the only choice you have is the envVar above (which btw ships >> as >>> default on all ESS systems). >>> >>> you also should be on the latest available Mellanox FW & Drivers as not >> all >>> versions even have the code that is activated by the environment >> variables >>> above, i think at a minimum you need to be at 3.4 but i don't remember >> the >>> exact version. There had been multiple defects opened around this area, >> the >>> last one i remember was : >> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >> dell, and the fw is a bit behind. i'm trying to convince dell to make >> new one. mellanox used to allow to make your own, but they don't anymore. >> >>> >>> 00154843 : ESS ConnectX-3 performance issue - spinning on >> pthread_spin_lock >>> >>> you may ask your mellanox representative if they can get you access to >> this >>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>> cards its a general issue that affects all cards and on intel as well as >>> Power. >> ok, thanks for this. maybe such a reference is enough for dell to update >> their firmware. >> >> stijn >> >>> >>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt >>> wrote: >>> >>>> hi all, >>>> >>>> is there any documentation wrt data integrity in spectrum scale: >>>> assuming a crappy network, does gpfs garantee somehow that data written >>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>> nsd gpfs daemon to disk. >>>> >>>> and wrt crappy network, what about rdma on crappy network? is it the >> same? >>>> >>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>> network issue; and we see no errors anywhere...) 
>>>> >>>> thanks a lot, >>>> >>>> stijn >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 20:10:07 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 19:10:07 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> Message-ID: ok, i think i understand now, the data was already corrupted. the config change i proposed only prevents a potentially known future on the wire corruption, this will not fix something that made it to the disk already. Sven On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt wrote: > yes ;) > > the system is in preproduction, so nothing that can't stopped/started in > a few minutes (current setup has only 4 nsds, and no clients). > mmfsck triggers the errors very early during inode replica compare. > > > stijn > > On 08/02/2017 08:47 PM, Sven Oehme wrote: > > How can you reproduce this so quick ? > > Did you restart all daemons after that ? > > > > On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > > wrote: > > > >> hi sven, > >> > >> > >>> the very first thing you should check is if you have this setting set : > >> maybe the very first thing to check should be the faq/wiki that has this > >> documented? > >> > >>> > >>> mmlsconfig envVar > >>> > >>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>> MLX5_USE_MUTEX 1 > >>> > >>> if that doesn't come back the way above you need to set it : > >>> > >>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >> i just set this (wasn't set before), but problem is still present. > >> > >>> > >>> there was a problem in the Mellanox FW in various versions that was > never > >>> completely addressed (bugs where found and fixed, but it was never > fully > >>> proven to be addressed) the above environment variables turn code on in > >> the > >>> mellanox driver that prevents this potential code path from being used > to > >>> begin with. > >>> > >>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > Scale > >>> that even you don't set this variables the problem can't happen anymore > >>> until then the only choice you have is the envVar above (which btw > ships > >> as > >>> default on all ESS systems). > >>> > >>> you also should be on the latest available Mellanox FW & Drivers as not > >> all > >>> versions even have the code that is activated by the environment > >> variables > >>> above, i think at a minimum you need to be at 3.4 but i don't remember > >> the > >>> exact version. 
There had been multiple defects opened around this area, > >> the > >>> last one i remember was : > >> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >> dell, and the fw is a bit behind. i'm trying to convince dell to make > >> new one. mellanox used to allow to make your own, but they don't > anymore. > >> > >>> > >>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >> pthread_spin_lock > >>> > >>> you may ask your mellanox representative if they can get you access to > >> this > >>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > >>> cards its a general issue that affects all cards and on intel as well > as > >>> Power. > >> ok, thanks for this. maybe such a reference is enough for dell to update > >> their firmware. > >> > >> stijn > >> > >>> > >>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be> > >>> wrote: > >>> > >>>> hi all, > >>>> > >>>> is there any documentation wrt data integrity in spectrum scale: > >>>> assuming a crappy network, does gpfs garantee somehow that data > written > >>>> by client ends up safe in the nsd gpfs daemon; and similarly from the > >>>> nsd gpfs daemon to disk. > >>>> > >>>> and wrt crappy network, what about rdma on crappy network? is it the > >> same? > >>>> > >>>> (we are hunting down a crappy infiniband issue; ibm support says it's > >>>> network issue; and we see no errors anywhere...) > >>>> > >>>> thanks a lot, > >>>> > >>>> stijn > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 20:20:14 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 21:20:14 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> Message-ID: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> hi sven, the data is not corrupted. mmfsck compares 2 inodes, says they don't match, but checking the data with tbdbfs reveals they are equal. (one replica has to be fetched over the network; the nsds cannot access all disks) with some nsdChksum... settings we get during this mmfsck a lot of "Encountered XYZ checksum errors on network I/O to NSD Client disk" ibm support says these are hardware issues, but wrt to mmfsck false positives. anyway, our current question is: if these are hardware issues, is there anything in gpfs client->nsd (on the network side) that would detect such errors. 
ie can we trust the data (and metadata). i was under the impression that client to disk is not covered, but i assumed that at least client to nsd (the network part) was checksummed. stijn On 08/02/2017 09:10 PM, Sven Oehme wrote: > ok, i think i understand now, the data was already corrupted. the config > change i proposed only prevents a potentially known future on the wire > corruption, this will not fix something that made it to the disk already. > > Sven > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > wrote: > >> yes ;) >> >> the system is in preproduction, so nothing that can't stopped/started in >> a few minutes (current setup has only 4 nsds, and no clients). >> mmfsck triggers the errors very early during inode replica compare. >> >> >> stijn >> >> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>> How can you reproduce this so quick ? >>> Did you restart all daemons after that ? >>> >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >>> wrote: >>> >>>> hi sven, >>>> >>>> >>>>> the very first thing you should check is if you have this setting set : >>>> maybe the very first thing to check should be the faq/wiki that has this >>>> documented? >>>> >>>>> >>>>> mmlsconfig envVar >>>>> >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>> MLX5_USE_MUTEX 1 >>>>> >>>>> if that doesn't come back the way above you need to set it : >>>>> >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>> i just set this (wasn't set before), but problem is still present. >>>> >>>>> >>>>> there was a problem in the Mellanox FW in various versions that was >> never >>>>> completely addressed (bugs where found and fixed, but it was never >> fully >>>>> proven to be addressed) the above environment variables turn code on in >>>> the >>>>> mellanox driver that prevents this potential code path from being used >> to >>>>> begin with. >>>>> >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >> Scale >>>>> that even you don't set this variables the problem can't happen anymore >>>>> until then the only choice you have is the envVar above (which btw >> ships >>>> as >>>>> default on all ESS systems). >>>>> >>>>> you also should be on the latest available Mellanox FW & Drivers as not >>>> all >>>>> versions even have the code that is activated by the environment >>>> variables >>>>> above, i think at a minimum you need to be at 3.4 but i don't remember >>>> the >>>>> exact version. There had been multiple defects opened around this area, >>>> the >>>>> last one i remember was : >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>> new one. mellanox used to allow to make your own, but they don't >> anymore. >>>> >>>>> >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>> pthread_spin_lock >>>>> >>>>> you may ask your mellanox representative if they can get you access to >>>> this >>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>>>> cards its a general issue that affects all cards and on intel as well >> as >>>>> Power. >>>> ok, thanks for this. maybe such a reference is enough for dell to update >>>> their firmware. 
>>>> >>>> stijn >>>> >>>>> >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be> >>>>> wrote: >>>>> >>>>>> hi all, >>>>>> >>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>> assuming a crappy network, does gpfs garantee somehow that data >> written >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>>>> nsd gpfs daemon to disk. >>>>>> >>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>> same? >>>>>> >>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>>>> network issue; and we see no errors anywhere...) >>>>>> >>>>>> thanks a lot, >>>>>> >>>>>> stijn >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From ewahl at osc.edu Wed Aug 2 21:11:53 2017 From: ewahl at osc.edu (Edward Wahl) Date: Wed, 2 Aug 2017 16:11:53 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: <20170802161153.4eea6f61@osc.edu> What version of GPFS? Are you generating a patch file? Try using this before your mmfsck: mmdsh -N mmfsadm test fsck usePatchQueue 0 my notes say all, but I would have only had NSD nodes up at the time. Supposedly the mmfsck mess in 4.1 and 4.2.x was fixed in 4.2.2.3. I won't know for sure until late August. Ed On Wed, 2 Aug 2017 21:20:14 +0200 Stijn De Weirdt wrote: > hi sven, > > the data is not corrupted. mmfsck compares 2 inodes, says they don't > match, but checking the data with tbdbfs reveals they are equal. > (one replica has to be fetched over the network; the nsds cannot access > all disks) > > with some nsdChksum... settings we get during this mmfsck a lot of > "Encountered XYZ checksum errors on network I/O to NSD Client disk" > > ibm support says these are hardware issues, but wrt to mmfsck false > positives. > > anyway, our current question is: if these are hardware issues, is there > anything in gpfs client->nsd (on the network side) that would detect > such errors. ie can we trust the data (and metadata). > i was under the impression that client to disk is not covered, but i > assumed that at least client to nsd (the network part) was checksummed. 
> > stijn > > > On 08/02/2017 09:10 PM, Sven Oehme wrote: > > ok, i think i understand now, the data was already corrupted. the config > > change i proposed only prevents a potentially known future on the wire > > corruption, this will not fix something that made it to the disk already. > > > > Sven > > > > > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > > wrote: > > > >> yes ;) > >> > >> the system is in preproduction, so nothing that can't stopped/started in > >> a few minutes (current setup has only 4 nsds, and no clients). > >> mmfsck triggers the errors very early during inode replica compare. > >> > >> > >> stijn > >> > >> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>> How can you reproduce this so quick ? > >>> Did you restart all daemons after that ? > >>> > >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > >>> wrote: > >>> > >>>> hi sven, > >>>> > >>>> > >>>>> the very first thing you should check is if you have this setting > >>>>> set : > >>>> maybe the very first thing to check should be the faq/wiki that has this > >>>> documented? > >>>> > >>>>> > >>>>> mmlsconfig envVar > >>>>> > >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>>>> MLX5_USE_MUTEX 1 > >>>>> > >>>>> if that doesn't come back the way above you need to set it : > >>>>> > >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>> i just set this (wasn't set before), but problem is still present. > >>>> > >>>>> > >>>>> there was a problem in the Mellanox FW in various versions that was > >> never > >>>>> completely addressed (bugs where found and fixed, but it was never > >> fully > >>>>> proven to be addressed) the above environment variables turn code on > >>>>> in > >>>> the > >>>>> mellanox driver that prevents this potential code path from being used > >> to > >>>>> begin with. > >>>>> > >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >> Scale > >>>>> that even you don't set this variables the problem can't happen anymore > >>>>> until then the only choice you have is the envVar above (which btw > >> ships > >>>> as > >>>>> default on all ESS systems). > >>>>> > >>>>> you also should be on the latest available Mellanox FW & Drivers as > >>>>> not > >>>> all > >>>>> versions even have the code that is activated by the environment > >>>> variables > >>>>> above, i think at a minimum you need to be at 3.4 but i don't remember > >>>> the > >>>>> exact version. There had been multiple defects opened around this > >>>>> area, > >>>> the > >>>>> last one i remember was : > >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make > >>>> new one. mellanox used to allow to make your own, but they don't > >> anymore. > >>>> > >>>>> > >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>> pthread_spin_lock > >>>>> > >>>>> you may ask your mellanox representative if they can get you access to > >>>> this > >>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > >>>>> cards its a general issue that affects all cards and on intel as well > >> as > >>>>> Power. > >>>> ok, thanks for this. maybe such a reference is enough for dell to update > >>>> their firmware. 
> >>>> > >>>> stijn > >>>> > >>>>> > >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >> stijn.deweirdt at ugent.be> > >>>>> wrote: > >>>>> > >>>>>> hi all, > >>>>>> > >>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>> assuming a crappy network, does gpfs garantee somehow that data > >> written > >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the > >>>>>> nsd gpfs daemon to disk. > >>>>>> > >>>>>> and wrt crappy network, what about rdma on crappy network? is it the > >>>> same? > >>>>>> > >>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's > >>>>>> network issue; and we see no errors anywhere...) > >>>>>> > >>>>>> thanks a lot, > >>>>>> > >>>>>> stijn > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From stijn.deweirdt at ugent.be Wed Aug 2 21:38:29 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 22:38:29 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <20170802161153.4eea6f61@osc.edu> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> <20170802161153.4eea6f61@osc.edu> Message-ID: <393b54ec-ec6a-040b-ef04-6076632db60c@ugent.be> hi ed, On 08/02/2017 10:11 PM, Edward Wahl wrote: > What version of GPFS? Are you generating a patch file? 4.2.3 series, now we run 4.2.3.3 to be clear, right now we use mmfsck to trigger the chksum issue hoping we can find the actual "hardware" issue. we know by elimination which HCAs to avoid, so we do not get the checksum errors. but to consider that a fix, we need to know if the data written by the client can be trusted due to these silent hw errors. > > Try using this before your mmfsck: > > mmdsh -N mmfsadm test fsck usePatchQueue 0 mmchmgr somefs nsdXYZ mmfsck somefs -Vn -m -N nsdXYZ -t /var/tmp/ the idea is to force everything as much as possible on one node, accessing the other failure group is forced over network > > my notes say all, but I would have only had NSD nodes up at the time. 
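Spelled out one command per line, the combined procedure being discussed might look like the sketch below (the filesystem name, the NSD server node name and the scratch directory are placeholders; the "all" argument to mmdsh is taken from Ed's remark that his notes say all):

# Ed's tip: switch off the fsck patch queue on all nodes before the run
mmdsh -N all mmfsadm test fsck usePatchQueue 0

# move the filesystem manager onto a single NSD server ...
mmchmgr somefs nsdXYZ

# ... then run mmfsck in no-repair mode restricted to that node (flags as
# quoted above), so the second replica has to be fetched over the network
mmfsck somefs -Vn -m -N nsdXYZ -t /var/tmp/
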
> Supposedly the mmfsck mess in 4.1 and 4.2.x was fixed in 4.2.2.3. we had the "pleasure" last to have mmfsck segfaulting while we were trying to recover a filesystem, at least that was certainly fixed ;) stijn > I won't know for sure until late August. > > Ed > > > On Wed, 2 Aug 2017 21:20:14 +0200 > Stijn De Weirdt wrote: > >> hi sven, >> >> the data is not corrupted. mmfsck compares 2 inodes, says they don't >> match, but checking the data with tbdbfs reveals they are equal. >> (one replica has to be fetched over the network; the nsds cannot access >> all disks) >> >> with some nsdChksum... settings we get during this mmfsck a lot of >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >> >> ibm support says these are hardware issues, but wrt to mmfsck false >> positives. >> >> anyway, our current question is: if these are hardware issues, is there >> anything in gpfs client->nsd (on the network side) that would detect >> such errors. ie can we trust the data (and metadata). >> i was under the impression that client to disk is not covered, but i >> assumed that at least client to nsd (the network part) was checksummed. >> >> stijn >> >> >> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>> ok, i think i understand now, the data was already corrupted. the config >>> change i proposed only prevents a potentially known future on the wire >>> corruption, this will not fix something that made it to the disk already. >>> >>> Sven >>> >>> >>> >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt >>> wrote: >>> >>>> yes ;) >>>> >>>> the system is in preproduction, so nothing that can't stopped/started in >>>> a few minutes (current setup has only 4 nsds, and no clients). >>>> mmfsck triggers the errors very early during inode replica compare. >>>> >>>> >>>> stijn >>>> >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>> How can you reproduce this so quick ? >>>>> Did you restart all daemons after that ? >>>>> >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >>>>> wrote: >>>>> >>>>>> hi sven, >>>>>> >>>>>> >>>>>>> the very first thing you should check is if you have this setting >>>>>>> set : >>>>>> maybe the very first thing to check should be the faq/wiki that has this >>>>>> documented? >>>>>> >>>>>>> >>>>>>> mmlsconfig envVar >>>>>>> >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>>>> MLX5_USE_MUTEX 1 >>>>>>> >>>>>>> if that doesn't come back the way above you need to set it : >>>>>>> >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>> i just set this (wasn't set before), but problem is still present. >>>>>> >>>>>>> >>>>>>> there was a problem in the Mellanox FW in various versions that was >>>> never >>>>>>> completely addressed (bugs where found and fixed, but it was never >>>> fully >>>>>>> proven to be addressed) the above environment variables turn code on >>>>>>> in >>>>>> the >>>>>>> mellanox driver that prevents this potential code path from being used >>>> to >>>>>>> begin with. >>>>>>> >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>> Scale >>>>>>> that even you don't set this variables the problem can't happen anymore >>>>>>> until then the only choice you have is the envVar above (which btw >>>> ships >>>>>> as >>>>>>> default on all ESS systems). 
>>>>>>> >>>>>>> you also should be on the latest available Mellanox FW & Drivers as >>>>>>> not >>>>>> all >>>>>>> versions even have the code that is activated by the environment >>>>>> variables >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't remember >>>>>> the >>>>>>> exact version. There had been multiple defects opened around this >>>>>>> area, >>>>>> the >>>>>>> last one i remember was : >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>>>> new one. mellanox used to allow to make your own, but they don't >>>> anymore. >>>>>> >>>>>>> >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>> pthread_spin_lock >>>>>>> >>>>>>> you may ask your mellanox representative if they can get you access to >>>>>> this >>>>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>>>>>> cards its a general issue that affects all cards and on intel as well >>>> as >>>>>>> Power. >>>>>> ok, thanks for this. maybe such a reference is enough for dell to update >>>>>> their firmware. >>>>>> >>>>>> stijn >>>>>> >>>>>>> >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>> stijn.deweirdt at ugent.be> >>>>>>> wrote: >>>>>>> >>>>>>>> hi all, >>>>>>>> >>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>> written >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>>>>>> nsd gpfs daemon to disk. >>>>>>>> >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>>>> same? >>>>>>>> >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>> >>>>>>>> thanks a lot, >>>>>>>> >>>>>>>> stijn >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > From eric.wonderley at vt.edu Wed Aug 2 22:02:20 2017 From: eric.wonderley at vt.edu (J. 
Eric Wonderley) Date: Wed, 2 Aug 2017 17:02:20 -0400 Subject: [gpfsug-discuss] mmsetquota produces error Message-ID: for one of our home filesystem we get: mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'nathanfootest' error (22): 'Invalid argument'. mmedquota -j home:nathanfootest does work however -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Aug 2 22:05:18 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 21:05:18 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: before i answer the rest of your questions, can you share what version of GPFS exactly you are on mmfsadm dump version would be best source for that. if you have 2 inodes and you know the exact address of where they are stored on disk one could 'dd' them of the disk and compare if they are really equal. we only support checksums when you use GNR based systems, they cover network as well as Disk side for that. the nsdchecksum code you refer to is the one i mentioned above thats only supported with GNR at least i am not aware that we ever claimed it to be supported outside of it, but i can check that. sven On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt wrote: > hi sven, > > the data is not corrupted. mmfsck compares 2 inodes, says they don't > match, but checking the data with tbdbfs reveals they are equal. > (one replica has to be fetched over the network; the nsds cannot access > all disks) > > with some nsdChksum... settings we get during this mmfsck a lot of > "Encountered XYZ checksum errors on network I/O to NSD Client disk" > > ibm support says these are hardware issues, but wrt to mmfsck false > positives. > > anyway, our current question is: if these are hardware issues, is there > anything in gpfs client->nsd (on the network side) that would detect > such errors. ie can we trust the data (and metadata). > i was under the impression that client to disk is not covered, but i > assumed that at least client to nsd (the network part) was checksummed. > > stijn > > > On 08/02/2017 09:10 PM, Sven Oehme wrote: > > ok, i think i understand now, the data was already corrupted. the config > > change i proposed only prevents a potentially known future on the wire > > corruption, this will not fix something that made it to the disk already. > > > > Sven > > > > > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > > > wrote: > > > >> yes ;) > >> > >> the system is in preproduction, so nothing that can't stopped/started in > >> a few minutes (current setup has only 4 nsds, and no clients). > >> mmfsck triggers the errors very early during inode replica compare. > >> > >> > >> stijn > >> > >> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>> How can you reproduce this so quick ? > >>> Did you restart all daemons after that ? > >>> > >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > > >>> wrote: > >>> > >>>> hi sven, > >>>> > >>>> > >>>>> the very first thing you should check is if you have this setting > set : > >>>> maybe the very first thing to check should be the faq/wiki that has > this > >>>> documented? 
> >>>> > >>>>> > >>>>> mmlsconfig envVar > >>>>> > >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>>>> MLX5_USE_MUTEX 1 > >>>>> > >>>>> if that doesn't come back the way above you need to set it : > >>>>> > >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>> i just set this (wasn't set before), but problem is still present. > >>>> > >>>>> > >>>>> there was a problem in the Mellanox FW in various versions that was > >> never > >>>>> completely addressed (bugs where found and fixed, but it was never > >> fully > >>>>> proven to be addressed) the above environment variables turn code on > in > >>>> the > >>>>> mellanox driver that prevents this potential code path from being > used > >> to > >>>>> begin with. > >>>>> > >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >> Scale > >>>>> that even you don't set this variables the problem can't happen > anymore > >>>>> until then the only choice you have is the envVar above (which btw > >> ships > >>>> as > >>>>> default on all ESS systems). > >>>>> > >>>>> you also should be on the latest available Mellanox FW & Drivers as > not > >>>> all > >>>>> versions even have the code that is activated by the environment > >>>> variables > >>>>> above, i think at a minimum you need to be at 3.4 but i don't > remember > >>>> the > >>>>> exact version. There had been multiple defects opened around this > area, > >>>> the > >>>>> last one i remember was : > >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make > >>>> new one. mellanox used to allow to make your own, but they don't > >> anymore. > >>>> > >>>>> > >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>> pthread_spin_lock > >>>>> > >>>>> you may ask your mellanox representative if they can get you access > to > >>>> this > >>>>> defect. while it was found on ESS , means on PPC64 and with > ConnectX-3 > >>>>> cards its a general issue that affects all cards and on intel as well > >> as > >>>>> Power. > >>>> ok, thanks for this. maybe such a reference is enough for dell to > update > >>>> their firmware. > >>>> > >>>> stijn > >>>> > >>>>> > >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >> stijn.deweirdt at ugent.be> > >>>>> wrote: > >>>>> > >>>>>> hi all, > >>>>>> > >>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>> assuming a crappy network, does gpfs garantee somehow that data > >> written > >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from > the > >>>>>> nsd gpfs daemon to disk. > >>>>>> > >>>>>> and wrt crappy network, what about rdma on crappy network? is it the > >>>> same? > >>>>>> > >>>>>> (we are hunting down a crappy infiniband issue; ibm support says > it's > >>>>>> network issue; and we see no errors anywhere...) 
> >>>>>> > >>>>>> thanks a lot, > >>>>>> > >>>>>> stijn > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 22:14:45 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:14:45 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: hi sven, > before i answer the rest of your questions, can you share what version of > GPFS exactly you are on mmfsadm dump version would be best source for that. it returns Build branch "4.2.3.3 ". > if you have 2 inodes and you know the exact address of where they are > stored on disk one could 'dd' them of the disk and compare if they are > really equal. ok, i can try that later. are you suggesting that the "tsdbfs comp" might gave wrong results? because we ran that and got eg > # tsdbfs somefs comp 7:5137408 25:221785088 1024 > Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = 0x19:D382C00: > All sectors identical > we only support checksums when you use GNR based systems, they cover > network as well as Disk side for that. > the nsdchecksum code you refer to is the one i mentioned above thats only > supported with GNR at least i am not aware that we ever claimed it to be > supported outside of it, but i can check that. ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, and they are not in the same gpfs cluster. i thought the GNR extended the checksumming to disk, and that it was already there for the network part. thanks for clearing this up. but that is worse then i thought... stijn > > sven > > On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt > wrote: > >> hi sven, >> >> the data is not corrupted. mmfsck compares 2 inodes, says they don't >> match, but checking the data with tbdbfs reveals they are equal. >> (one replica has to be fetched over the network; the nsds cannot access >> all disks) >> >> with some nsdChksum... 
settings we get during this mmfsck a lot of >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >> >> ibm support says these are hardware issues, but wrt to mmfsck false >> positives. >> >> anyway, our current question is: if these are hardware issues, is there >> anything in gpfs client->nsd (on the network side) that would detect >> such errors. ie can we trust the data (and metadata). >> i was under the impression that client to disk is not covered, but i >> assumed that at least client to nsd (the network part) was checksummed. >> >> stijn >> >> >> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>> ok, i think i understand now, the data was already corrupted. the config >>> change i proposed only prevents a potentially known future on the wire >>> corruption, this will not fix something that made it to the disk already. >>> >>> Sven >>> >>> >>> >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt >> >>> wrote: >>> >>>> yes ;) >>>> >>>> the system is in preproduction, so nothing that can't stopped/started in >>>> a few minutes (current setup has only 4 nsds, and no clients). >>>> mmfsck triggers the errors very early during inode replica compare. >>>> >>>> >>>> stijn >>>> >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>> How can you reproduce this so quick ? >>>>> Did you restart all daemons after that ? >>>>> >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >> >>>>> wrote: >>>>> >>>>>> hi sven, >>>>>> >>>>>> >>>>>>> the very first thing you should check is if you have this setting >> set : >>>>>> maybe the very first thing to check should be the faq/wiki that has >> this >>>>>> documented? >>>>>> >>>>>>> >>>>>>> mmlsconfig envVar >>>>>>> >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>>>> MLX5_USE_MUTEX 1 >>>>>>> >>>>>>> if that doesn't come back the way above you need to set it : >>>>>>> >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>> i just set this (wasn't set before), but problem is still present. >>>>>> >>>>>>> >>>>>>> there was a problem in the Mellanox FW in various versions that was >>>> never >>>>>>> completely addressed (bugs where found and fixed, but it was never >>>> fully >>>>>>> proven to be addressed) the above environment variables turn code on >> in >>>>>> the >>>>>>> mellanox driver that prevents this potential code path from being >> used >>>> to >>>>>>> begin with. >>>>>>> >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>> Scale >>>>>>> that even you don't set this variables the problem can't happen >> anymore >>>>>>> until then the only choice you have is the envVar above (which btw >>>> ships >>>>>> as >>>>>>> default on all ESS systems). >>>>>>> >>>>>>> you also should be on the latest available Mellanox FW & Drivers as >> not >>>>>> all >>>>>>> versions even have the code that is activated by the environment >>>>>> variables >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't >> remember >>>>>> the >>>>>>> exact version. There had been multiple defects opened around this >> area, >>>>>> the >>>>>>> last one i remember was : >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>>>> new one. mellanox used to allow to make your own, but they don't >>>> anymore. 
>>>>>> >>>>>>> >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>> pthread_spin_lock >>>>>>> >>>>>>> you may ask your mellanox representative if they can get you access >> to >>>>>> this >>>>>>> defect. while it was found on ESS , means on PPC64 and with >> ConnectX-3 >>>>>>> cards its a general issue that affects all cards and on intel as well >>>> as >>>>>>> Power. >>>>>> ok, thanks for this. maybe such a reference is enough for dell to >> update >>>>>> their firmware. >>>>>> >>>>>> stijn >>>>>> >>>>>>> >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>> stijn.deweirdt at ugent.be> >>>>>>> wrote: >>>>>>> >>>>>>>> hi all, >>>>>>>> >>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>> written >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from >> the >>>>>>>> nsd gpfs daemon to disk. >>>>>>>> >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>>>> same? >>>>>>>> >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says >> it's >>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>> >>>>>>>> thanks a lot, >>>>>>>> >>>>>>>> stijn >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 22:23:44 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 21:23:44 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: ok, you can't be any newer that that. i just wonder why you have 512b inodes if this is a new system ? are this raw disks in this setup or raid controllers ? 
whats the disk sector size and how was the filesystem created (mmlsfs FSNAME would show answer to the last question) on the tsdbfs i am not sure if it gave wrong results, but it would be worth a test to see whats actually on the disk . you are correct that GNR extends this to the disk, but the network part is covered by the nsdchecksums you turned on when you enable the not to be named checksum parameter do you actually still get an error from fsck ? sven On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt wrote: > hi sven, > > > before i answer the rest of your questions, can you share what version of > > GPFS exactly you are on mmfsadm dump version would be best source for > that. > it returns > Build branch "4.2.3.3 ". > > > if you have 2 inodes and you know the exact address of where they are > > stored on disk one could 'dd' them of the disk and compare if they are > > really equal. > ok, i can try that later. are you suggesting that the "tsdbfs comp" > might gave wrong results? because we ran that and got eg > > > # tsdbfs somefs comp 7:5137408 25:221785088 1024 > > Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = > 0x19:D382C00: > > All sectors identical > > > > we only support checksums when you use GNR based systems, they cover > > network as well as Disk side for that. > > the nsdchecksum code you refer to is the one i mentioned above thats only > > supported with GNR at least i am not aware that we ever claimed it to be > > supported outside of it, but i can check that. > ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, > and they are not in the same gpfs cluster. > > i thought the GNR extended the checksumming to disk, and that it was > already there for the network part. thanks for clearing this up. but > that is worse then i thought... > > stijn > > > > > sven > > > > On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt > > > wrote: > > > >> hi sven, > >> > >> the data is not corrupted. mmfsck compares 2 inodes, says they don't > >> match, but checking the data with tbdbfs reveals they are equal. > >> (one replica has to be fetched over the network; the nsds cannot access > >> all disks) > >> > >> with some nsdChksum... settings we get during this mmfsck a lot of > >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" > >> > >> ibm support says these are hardware issues, but wrt to mmfsck false > >> positives. > >> > >> anyway, our current question is: if these are hardware issues, is there > >> anything in gpfs client->nsd (on the network side) that would detect > >> such errors. ie can we trust the data (and metadata). > >> i was under the impression that client to disk is not covered, but i > >> assumed that at least client to nsd (the network part) was checksummed. > >> > >> stijn > >> > >> > >> On 08/02/2017 09:10 PM, Sven Oehme wrote: > >>> ok, i think i understand now, the data was already corrupted. the > config > >>> change i proposed only prevents a potentially known future on the wire > >>> corruption, this will not fix something that made it to the disk > already. > >>> > >>> Sven > >>> > >>> > >>> > >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be > >>> > >>> wrote: > >>> > >>>> yes ;) > >>>> > >>>> the system is in preproduction, so nothing that can't stopped/started > in > >>>> a few minutes (current setup has only 4 nsds, and no clients). > >>>> mmfsck triggers the errors very early during inode replica compare. 
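As a cross-check of the tsdbfs comparison quoted earlier, the raw on-disk comparison Sven suggests could be done roughly as follows (a sketch only: the device paths are placeholders for the two NSDs involved, the disk:sector pairs are the ones from the tsdbfs output above, and 512-byte sectors are assumed, so the sector size should be confirmed first):

# replica 1 at disk 7, sector 5137408; replica 2 at disk 25, sector 221785088
dd if=/dev/nsd7_device bs=512 skip=5137408 count=1024 of=/tmp/replica1.img
dd if=/dev/nsd25_device bs=512 skip=221785088 count=1024 of=/tmp/replica2.img

# byte-for-byte comparison of the two copies
cmp /tmp/replica1.img /tmp/replica2.img && echo "replicas identical"
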
> >>>> > >>>> > >>>> stijn > >>>> > >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>>>> How can you reproduce this so quick ? > >>>>> Did you restart all daemons after that ? > >>>>> > >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be > >>> > >>>>> wrote: > >>>>> > >>>>>> hi sven, > >>>>>> > >>>>>> > >>>>>>> the very first thing you should check is if you have this setting > >> set : > >>>>>> maybe the very first thing to check should be the faq/wiki that has > >> this > >>>>>> documented? > >>>>>> > >>>>>>> > >>>>>>> mmlsconfig envVar > >>>>>>> > >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF > 1 > >>>>>>> MLX5_USE_MUTEX 1 > >>>>>>> > >>>>>>> if that doesn't come back the way above you need to set it : > >>>>>>> > >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>>>> i just set this (wasn't set before), but problem is still present. > >>>>>> > >>>>>>> > >>>>>>> there was a problem in the Mellanox FW in various versions that was > >>>> never > >>>>>>> completely addressed (bugs where found and fixed, but it was never > >>>> fully > >>>>>>> proven to be addressed) the above environment variables turn code > on > >> in > >>>>>> the > >>>>>>> mellanox driver that prevents this potential code path from being > >> used > >>>> to > >>>>>>> begin with. > >>>>>>> > >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >>>> Scale > >>>>>>> that even you don't set this variables the problem can't happen > >> anymore > >>>>>>> until then the only choice you have is the envVar above (which btw > >>>> ships > >>>>>> as > >>>>>>> default on all ESS systems). > >>>>>>> > >>>>>>> you also should be on the latest available Mellanox FW & Drivers as > >> not > >>>>>> all > >>>>>>> versions even have the code that is activated by the environment > >>>>>> variables > >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't > >> remember > >>>>>> the > >>>>>>> exact version. There had been multiple defects opened around this > >> area, > >>>>>> the > >>>>>>> last one i remember was : > >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards > from > >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to > make > >>>>>> new one. mellanox used to allow to make your own, but they don't > >>>> anymore. > >>>>>> > >>>>>>> > >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>>>> pthread_spin_lock > >>>>>>> > >>>>>>> you may ask your mellanox representative if they can get you access > >> to > >>>>>> this > >>>>>>> defect. while it was found on ESS , means on PPC64 and with > >> ConnectX-3 > >>>>>>> cards its a general issue that affects all cards and on intel as > well > >>>> as > >>>>>>> Power. > >>>>>> ok, thanks for this. maybe such a reference is enough for dell to > >> update > >>>>>> their firmware. > >>>>>> > >>>>>> stijn > >>>>>> > >>>>>>> > >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >>>> stijn.deweirdt at ugent.be> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> hi all, > >>>>>>>> > >>>>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data > >>>> written > >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from > >> the > >>>>>>>> nsd gpfs daemon to disk. > >>>>>>>> > >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it > the > >>>>>> same? 
> >>>>>>>> > >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says > >> it's > >>>>>>>> network issue; and we see no errors anywhere...) > >>>>>>>> > >>>>>>>> thanks a lot, > >>>>>>>> > >>>>>>>> stijn > >>>>>>>> _______________________________________________ > >>>>>>>> gpfsug-discuss mailing list > >>>>>>>> gpfsug-discuss at spectrumscale.org > >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> gpfsug-discuss mailing list > >>>>>>> gpfsug-discuss at spectrumscale.org > >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sxiao at us.ibm.com Wed Aug 2 22:36:06 2017 From: sxiao at us.ibm.com (Steve Xiao) Date: Wed, 2 Aug 2017 17:36:06 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: Message-ID: The nsdChksum settings for none GNR/ESS based system is not officially supported. It will perform checksum on data transfer over the network only and can be used to help debug data corruption when network is a suspect. Did any of those "Encountered XYZ checksum errors on network I/O to NSD Client disk" warning messages resulted in disk been changed to "down" state due to IO error? If no disk IO error was reported in GPFS log, that means data was retransmitted successfully on retry. As sven said, only GNR/ESS provids the full end to end data integrity. Steve Y. Xiao -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 22:47:36 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:47:36 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: hi sven, > ok, you can't be any newer that that. 
i just wonder why you have 512b > inodes if this is a new system ? because we rsynced 100M files to it ;) it's supposed to replace another system. > are this raw disks in this setup or raid controllers ? raid (DDP on MD3460) > whats the disk sector size euhm, you mean the luns? for metadata disks (SSD in raid 1): > # parted /dev/mapper/f1v01e0g0_Dm01o0 > GNU Parted 3.1 > Using /dev/mapper/f1v01e0g0_Dm01o0 > Welcome to GNU Parted! Type 'help' to view a list of commands. > (parted) p > Model: Linux device-mapper (multipath) (dm) > Disk /dev/mapper/f1v01e0g0_Dm01o0: 219GB > Sector size (logical/physical): 512B/512B > Partition Table: gpt > Disk Flags: > > Number Start End Size File system Name Flags > 1 24.6kB 219GB 219GB GPFS: hidden for data disks (DDP) > [root at nsd01 ~]# parted /dev/mapper/f1v01e0p0_S17o0 > GNU Parted 3.1 > Using /dev/mapper/f1v01e0p0_S17o0 > Welcome to GNU Parted! Type 'help' to view a list of commands. > (parted) p > Model: Linux device-mapper (multipath) (dm) > Disk /dev/mapper/f1v01e0p0_S17o0: 35.2TB > Sector size (logical/physical): 512B/4096B > Partition Table: gpt > Disk Flags: > > Number Start End Size File system Name Flags > 1 24.6kB 35.2TB 35.2TB GPFS: hidden > > (parted) q and how was the filesystem created (mmlsfs FSNAME would show > answer to the last question) > # mmlsfs somefilesystem > flag value description > ------------------- ------------------------ ----------------------------------- > -f 16384 Minimum fragment size in bytes (system pool) > 262144 Minimum fragment size in bytes (other pools) > -i 4096 Inode size in bytes > -I 32768 Indirect block size in bytes > -m 2 Default number of metadata replicas > -M 2 Maximum number of metadata replicas > -r 1 Default number of data replicas > -R 2 Maximum number of data replicas > -j scatter Block allocation type > -D nfs4 File locking semantics in effect > -k all ACL semantics in effect > -n 850 Estimated number of nodes that will mount file system > -B 524288 Block size (system pool) > 8388608 Block size (other pools) > -Q user;group;fileset Quotas accounting enabled > user;group;fileset Quotas enforced > none Default quotas enabled > --perfileset-quota Yes Per-fileset quota enforcement > --filesetdf Yes Fileset df enabled? > -V 17.00 (4.2.3.0) File system version > --create-time Wed May 31 12:54:00 2017 File system creation time > -z No Is DMAPI enabled? > -L 4194304 Logfile size > -E No Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation option > --fastea Yes Fast external attributes enabled? > --encryption No Encryption enabled? > --inode-limit 313524224 Maximum number of inodes in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? 
> --write-cache-threshold 0 HAWC Threshold (max 65536) > --subblocks-per-full-block 32 Number of subblocks per full block > -P system;MD3260 Disk storage pools in file system > -d f0v00e0g0_Sm00o0;f0v00e0p0_S00o0;f1v01e0g0_Sm01o0;f1v01e0p0_S01o0;f0v02e0g0_Sm02o0;f0v02e0p0_S02o0;f1v03e0g0_Sm03o0;f1v03e0p0_S03o0;f0v04e0g0_Sm04o0;f0v04e0p0_S04o0; > -d f1v05e0g0_Sm05o0;f1v05e0p0_S05o0;f0v06e0g0_Sm06o0;f0v06e0p0_S06o0;f1v07e0g0_Sm07o0;f1v07e0p0_S07o0;f0v00e0g0_Sm08o1;f0v00e0p0_S08o1;f1v01e0g0_Sm09o1;f1v01e0p0_S09o1; > -d f0v02e0g0_Sm10o1;f0v02e0p0_S10o1;f1v03e0g0_Sm11o1;f1v03e0p0_S11o1;f0v04e0g0_Sm12o1;f0v04e0p0_S12o1;f1v05e0g0_Sm13o1;f1v05e0p0_S13o1;f0v06e0g0_Sm14o1;f0v06e0p0_S14o1; > -d f1v07e0g0_Sm15o1;f1v07e0p0_S15o1;f0v00e0p0_S16o0;f1v01e0p0_S17o0;f0v02e0p0_S18o0;f1v03e0p0_S19o0;f0v04e0p0_S20o0;f1v05e0p0_S21o0;f0v06e0p0_S22o0;f1v07e0p0_S23o0; > -d f0v00e0p0_S24o1;f1v01e0p0_S25o1;f0v02e0p0_S26o1;f1v03e0p0_S27o1;f0v04e0p0_S28o1;f1v05e0p0_S29o1;f0v06e0p0_S30o1;f1v07e0p0_S31o1 Disks in file system > -A no Automatic mount option > -o none Additional mount options > -T /scratch Default mount point > --mount-priority 0 > > on the tsdbfs i am not sure if it gave wrong results, but it would be worth > a test to see whats actually on the disk . ok. i'll try this tomorrow. > > you are correct that GNR extends this to the disk, but the network part is > covered by the nsdchecksums you turned on > when you enable the not to be named checksum parameter do you actually > still get an error from fsck ? hah, no, we don't. mmfsck says the filesystem is clean. we found this odd, so we already asked ibm support about this but no answer yet. stijn > > sven > > > On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt > wrote: > >> hi sven, >> >>> before i answer the rest of your questions, can you share what version of >>> GPFS exactly you are on mmfsadm dump version would be best source for >> that. >> it returns >> Build branch "4.2.3.3 ". >> >>> if you have 2 inodes and you know the exact address of where they are >>> stored on disk one could 'dd' them of the disk and compare if they are >>> really equal. >> ok, i can try that later. are you suggesting that the "tsdbfs comp" >> might gave wrong results? because we ran that and got eg >> >>> # tsdbfs somefs comp 7:5137408 25:221785088 1024 >>> Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = >> 0x19:D382C00: >>> All sectors identical >> >> >>> we only support checksums when you use GNR based systems, they cover >>> network as well as Disk side for that. >>> the nsdchecksum code you refer to is the one i mentioned above thats only >>> supported with GNR at least i am not aware that we ever claimed it to be >>> supported outside of it, but i can check that. >> ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, >> and they are not in the same gpfs cluster. >> >> i thought the GNR extended the checksumming to disk, and that it was >> already there for the network part. thanks for clearing this up. but >> that is worse then i thought... >> >> stijn >> >>> >>> sven >>> >>> On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt >> >>> wrote: >>> >>>> hi sven, >>>> >>>> the data is not corrupted. mmfsck compares 2 inodes, says they don't >>>> match, but checking the data with tbdbfs reveals they are equal. >>>> (one replica has to be fetched over the network; the nsds cannot access >>>> all disks) >>>> >>>> with some nsdChksum... 
settings we get during this mmfsck a lot of >>>> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >>>> >>>> ibm support says these are hardware issues, but wrt to mmfsck false >>>> positives. >>>> >>>> anyway, our current question is: if these are hardware issues, is there >>>> anything in gpfs client->nsd (on the network side) that would detect >>>> such errors. ie can we trust the data (and metadata). >>>> i was under the impression that client to disk is not covered, but i >>>> assumed that at least client to nsd (the network part) was checksummed. >>>> >>>> stijn >>>> >>>> >>>> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>>>> ok, i think i understand now, the data was already corrupted. the >> config >>>>> change i proposed only prevents a potentially known future on the wire >>>>> corruption, this will not fix something that made it to the disk >> already. >>>>> >>>>> Sven >>>>> >>>>> >>>>> >>>>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be >>>>> >>>>> wrote: >>>>> >>>>>> yes ;) >>>>>> >>>>>> the system is in preproduction, so nothing that can't stopped/started >> in >>>>>> a few minutes (current setup has only 4 nsds, and no clients). >>>>>> mmfsck triggers the errors very early during inode replica compare. >>>>>> >>>>>> >>>>>> stijn >>>>>> >>>>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>>>> How can you reproduce this so quick ? >>>>>>> Did you restart all daemons after that ? >>>>>>> >>>>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be >>>>> >>>>>>> wrote: >>>>>>> >>>>>>>> hi sven, >>>>>>>> >>>>>>>> >>>>>>>>> the very first thing you should check is if you have this setting >>>> set : >>>>>>>> maybe the very first thing to check should be the faq/wiki that has >>>> this >>>>>>>> documented? >>>>>>>> >>>>>>>>> >>>>>>>>> mmlsconfig envVar >>>>>>>>> >>>>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF >> 1 >>>>>>>>> MLX5_USE_MUTEX 1 >>>>>>>>> >>>>>>>>> if that doesn't come back the way above you need to set it : >>>>>>>>> >>>>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>>>> i just set this (wasn't set before), but problem is still present. >>>>>>>> >>>>>>>>> >>>>>>>>> there was a problem in the Mellanox FW in various versions that was >>>>>> never >>>>>>>>> completely addressed (bugs where found and fixed, but it was never >>>>>> fully >>>>>>>>> proven to be addressed) the above environment variables turn code >> on >>>> in >>>>>>>> the >>>>>>>>> mellanox driver that prevents this potential code path from being >>>> used >>>>>> to >>>>>>>>> begin with. >>>>>>>>> >>>>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>>>> Scale >>>>>>>>> that even you don't set this variables the problem can't happen >>>> anymore >>>>>>>>> until then the only choice you have is the envVar above (which btw >>>>>> ships >>>>>>>> as >>>>>>>>> default on all ESS systems). >>>>>>>>> >>>>>>>>> you also should be on the latest available Mellanox FW & Drivers as >>>> not >>>>>>>> all >>>>>>>>> versions even have the code that is activated by the environment >>>>>>>> variables >>>>>>>>> above, i think at a minimum you need to be at 3.4 but i don't >>>> remember >>>>>>>> the >>>>>>>>> exact version. 
There had been multiple defects opened around this >>>> area, >>>>>>>> the >>>>>>>>> last one i remember was : >>>>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards >> from >>>>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to >> make >>>>>>>> new one. mellanox used to allow to make your own, but they don't >>>>>> anymore. >>>>>>>> >>>>>>>>> >>>>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>>>> pthread_spin_lock >>>>>>>>> >>>>>>>>> you may ask your mellanox representative if they can get you access >>>> to >>>>>>>> this >>>>>>>>> defect. while it was found on ESS , means on PPC64 and with >>>> ConnectX-3 >>>>>>>>> cards its a general issue that affects all cards and on intel as >> well >>>>>> as >>>>>>>>> Power. >>>>>>>> ok, thanks for this. maybe such a reference is enough for dell to >>>> update >>>>>>>> their firmware. >>>>>>>> >>>>>>>> stijn >>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>>>> stijn.deweirdt at ugent.be> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> hi all, >>>>>>>>>> >>>>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>>>> written >>>>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from >>>> the >>>>>>>>>> nsd gpfs daemon to disk. >>>>>>>>>> >>>>>>>>>> and wrt crappy network, what about rdma on crappy network? is it >> the >>>>>>>> same? >>>>>>>>>> >>>>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says >>>> it's >>>>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>>>> >>>>>>>>>> thanks a lot, >>>>>>>>>> >>>>>>>>>> stijn >>>>>>>>>> _______________________________________________ >>>>>>>>>> gpfsug-discuss mailing list >>>>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> gpfsug-discuss mailing list >>>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> 
http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From stijn.deweirdt at ugent.be Wed Aug 2 22:53:50 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:53:50 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: Message-ID: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> hi steve, > The nsdChksum settings for none GNR/ESS based system is not officially > supported. It will perform checksum on data transfer over the network > only and can be used to help debug data corruption when network is a > suspect. i'll take not officially supported over silent bitrot any day. > > Did any of those "Encountered XYZ checksum errors on network I/O to NSD > Client disk" warning messages resulted in disk been changed to "down" > state due to IO error? no. If no disk IO error was reported in GPFS log, > that means data was retransmitted successfully on retry. we suspected as much. as sven already asked, mmfsck now reports clean filesystem. i have an ibdump of 2 involved nsds during the reported checksums, i'll have a closer look if i can spot these retries. > > As sven said, only GNR/ESS provids the full end to end data integrity. so with the silent network error, we have high probabilty that the data is corrupted. we are now looking for a test to find out what adapters are affected. we hoped that nsdperf with verify=on would tell us, but it doesn't. > > Steve Y. Xiao > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From aaron.s.knister at nasa.gov Thu Aug 3 01:48:07 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 20:48:07 -0400 Subject: [gpfsug-discuss] documentation about version compatibility Message-ID: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> Hey All, I swear that some time recently someone posted a link to some IBM documentation that outlined the recommended versions of GPFS to upgrade to/from (e.g. if you're at 3.5 get to 4.1 before going to 4.2.3). I can't for the life of me find it. Does anyone know what I'm talking about? Thanks, Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Aug 3 02:00:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 21:00:00 -0400 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: I'm a little late to the party here but I thought I'd share our recent experiences. We recently completed a mass UID number migration (half a billion inodes) and developed two tools ("luke filewalker" and the "mmilleniumfacl") to get the job done. Both luke filewalker and the mmilleniumfacl are based heavily on the code in /usr/lpp/mmfs/samples/util/tsreaddir.c and /usr/lpp/mmfs/samples/util/tsinode.c. luke filewalker targets traditional POSIX permissions whereas mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem in parallel and both but particularly the 2nd, are extremely I/O intensive on your metadata disks. 
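For anyone who only needs the plain POSIX-ownership piece and wants to stay on stock tooling, a rough policy-engine sketch along these lines can produce the per-UID file lists that feed a parallel chown (the filesystem path, UID and helper nodes below are placeholders, and the exact format of the emitted list file is worth checking before scripting against it):

# uid.pol -- list every file owned by the old UID without acting on it
EXTERNAL LIST 'olduid' EXEC ''
RULE 'byuid' LIST 'olduid' WHERE USER_ID = 1234

# scan with a few helper nodes, defer execution, and leave the matches
# (roughly /tmp/migrate.list.olduid) for the chown driver to work through
mmapplypolicy /gpfs/fs0 -P uid.pol -I defer -f /tmp/migrate -N node1,node2 -g /gpfs/fs0/.policytmp
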
The gist of luke filewalker is to scan the inode structures using the gpfs APIs and populate a mapping of inode number to gid and uid number. It then walks the filesystem in parallel using the APIs, looks up the inode number in an in-memory hash, and if appropriate changes ownership using the chown() API. The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs using the GPFS inode API so it walks the filesystem and reads the ACL of any and every file, updating the ACL entries as appropriate. I'm going to see if I can share the source code for both tools, although I don't know if I can post it here since it modified existing IBM source code. Could someone from IBM chime in here? If I were to send the code to IBM could they publish it perhaps on the wiki? -Aaron On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: > Hello, > > We're trying to change most of our users uids, is there a clean way to > migrate all of one users files with say `mmapplypolicy`? We have to change the > owner of around 273539588 files, and my estimates for runtime are around 6 days. > > What we've been doing is indexing all of the files and splitting them up by > owner which takes around an hour, and then we were locking the user out while we > chown their files. I made it multi threaded as it weirdly gave a 10% speedup > despite my expectation that multi threading access from a single node would not > give any speedup. > > Generally I'm looking for advice on how to make the chowning faster. Would > spreading the chowning processes over multiple nodes improve performance? Should > I not stat the files before running lchown on them, since lchown checks the file > before changing it? I saw mention of inodescan(), in an old gpfsug email, which > speeds up disk read access, by not guaranteeing that the data is up to date. We > have a maintenance day coming up where all users will be locked out, so the file > handles(?) from GPFS's perspective will not be able to go stale. Is there a > function with similar constraints to inodescan that I can use to speed up this > process? > > Thank you for your time, > > Luke > Storrs-HPC > University of Connecticut > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Aug 3 02:03:23 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 21:03:23 -0400 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: Oh, the one *huge* gotcha I thought I'd share-- we wrote a perl script to drive the migration and part of the perl script's process was to clone quotas from old uid numbers to the new number. I upset our GPFS cluster during a particular migration in which the user was over the grace period of the quota so after a certain point every chown() put the destination UID even further over its quota. The problem with this being that at this point every chown() operation would cause GPFS to do some cluster-wide quota accounting-related RPCs. That hurt. It's worth making sure there are no quotas defined for the destination UID numbers and if they are that the data coming from the source UID number will fit. 
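A quick pre-flight check along these lines (the filesystem name and UID numbers are made up) catches that situation before the first chown goes out:

# does the destination UID already carry explicit limits, and how much
# usage is about to move onto it from the source UID?
mmlsquota -u 54321 fs0
mmlsquota -u 12345 fs0
# if the new UID should keep the old limits, carry them over up front
mmsetquota fs0 --user 54321 --block 8T:10T --files 9M:10M
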
-Aaron On 8/2/17 9:00 PM, Aaron Knister wrote: > I'm a little late to the party here but I thought I'd share our recent > experiences. > > We recently completed a mass UID number migration (half a billion > inodes) and developed two tools ("luke filewalker" and the > "mmilleniumfacl") to get the job done. Both luke filewalker and the > mmilleniumfacl are based heavily on the code in > /usr/lpp/mmfs/samples/util/tsreaddir.c and > /usr/lpp/mmfs/samples/util/tsinode.c. > > luke filewalker targets traditional POSIX permissions whereas > mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem in > parallel and both but particularly the 2nd, are extremely I/O intensive > on your metadata disks. > > The gist of luke filewalker is to scan the inode structures using the > gpfs APIs and populate a mapping of inode number to gid and uid number. > It then walks the filesystem in parallel using the APIs, looks up the > inode number in an in-memory hash, and if appropriate changes ownership > using the chown() API. > > The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs > using the GPFS inode API so it walks the filesystem and reads the ACL of > any and every file, updating the ACL entries as appropriate. > > I'm going to see if I can share the source code for both tools, although > I don't know if I can post it here since it modified existing IBM source > code. Could someone from IBM chime in here? If I were to send the code > to IBM could they publish it perhaps on the wiki? > > -Aaron > > On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: >> Hello, >> >> We're trying to change most of our users uids, is there a clean >> way to >> migrate all of one users files with say `mmapplypolicy`? We have to >> change the >> owner of around 273539588 files, and my estimates for runtime are >> around 6 days. >> >> What we've been doing is indexing all of the files and splitting >> them up by >> owner which takes around an hour, and then we were locking the user >> out while we >> chown their files. I made it multi threaded as it weirdly gave a 10% >> speedup >> despite my expectation that multi threading access from a single node >> would not >> give any speedup. >> >> Generally I'm looking for advice on how to make the chowning >> faster. Would >> spreading the chowning processes over multiple nodes improve >> performance? Should >> I not stat the files before running lchown on them, since lchown >> checks the file >> before changing it? I saw mention of inodescan(), in an old gpfsug >> email, which >> speeds up disk read access, by not guaranteeing that the data is up to >> date. We >> have a maintenance day coming up where all users will be locked out, >> so the file >> handles(?) from GPFS's perspective will not be able to go stale. Is >> there a >> function with similar constraints to inodescan that I can use to speed >> up this >> process? >> >> Thank you for your time, >> >> Luke >> Storrs-HPC >> University of Connecticut >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From scale at us.ibm.com Thu Aug 3 06:18:46 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Thu, 3 Aug 2017 13:18:46 +0800 Subject: [gpfsug-discuss] Gpfs Memory Usaage Keeps going up and we don't know why. 
In-Reply-To: <1500906086.571.9.camel@qmul.ac.uk> References: <261384244.3866909.1500901872347.ref@mail.yahoo.com><261384244.3866909.1500901872347@mail.yahoo.com><1500903047.571.7.camel@qmul.ac.uk><1770436429.3911327.1500905445052@mail.yahoo.com> <1500906086.571.9.camel@qmul.ac.uk> Message-ID: Can you provide the output of "pmap 4444"? If there's no "pmap" command on your system, then get the memory maps of mmfsd from file of /proc/4444/maps. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Peter Childs To: "gpfsug-discuss at spectrumscale.org" Date: 07/24/2017 10:22 PM Subject: Re: [gpfsug-discuss] Gpfs Memory Usaage Keeps going up and we don't know why. Sent by: gpfsug-discuss-bounces at spectrumscale.org top but ps gives the same value. [root at dn29 ~]# ps auww -q 4444 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 4444 2.7 22.3 10537600 5472580 ? S wrote: I've had a look at mmfsadm dump malloc and it looks to agree with the output from mmdiag --memory. and does not seam to account for the excessive memory usage. The new machines do have idleSocketTimout set to 0 from what your saying it could be related to keeping that many connections between nodes working. Thanks in advance Peter. [root at dn29 ~]# mmdiag --memory === mmdiag: memory === mmfsd heap size: 2039808 bytes Statistics for MemoryPool id 1 ("Shared Segment (EPHEMERAL)") 128 bytes in use 17500049370 hard limit on memory usage 1048576 bytes committed to regions 1 number of regions 555 allocations 555 frees 0 allocation failures Statistics for MemoryPool id 2 ("Shared Segment") 42179592 bytes in use 17500049370 hard limit on memory usage 56623104 bytes committed to regions 9 number of regions 100027 allocations 79624 frees 0 allocation failures Statistics for MemoryPool id 3 ("Token Manager") 2099520 bytes in use 17500049370 hard limit on memory usage 16778240 bytes committed to regions 1 number of regions 4 allocations 0 frees 0 allocation failures On Mon, 2017-07-24 at 13:11 +0000, Jim Doherty wrote: There are 3 places that the GPFS mmfsd uses memory the pagepool plus 2 shared memory segments. To see the memory utilization of the shared memory segments run the command mmfsadm dump malloc . The statistics for memory pool id 2 is where maxFilesToCache/maxStatCache objects are and the manager nodes use memory pool id 3 to track the MFTC/MSC objects. You might want to upgrade to later PTF as there was a PTF to fix a memory leak that occurred in tscomm associated with network connection drops. On Monday, July 24, 2017 5:29 AM, Peter Childs wrote: We have two GPFS clusters. One is fairly old and running 4.2.1-2 and non CCR and the nodes run fine using up about 1.5G of memory and is consistent (GPFS pagepool is set to 1G, so that looks about right.) 
The other one is "newer" running 4.2.1-3 with CCR and the nodes keep increasing in there memory usage, starting at about 1.1G and are find for a few days however after a while they grow to 4.2G which when the node need to run real work, means the work can't be done. I'm losing track of what maybe different other than CCR, and I'm trying to find some more ideas of where to look. I'm checked all the standard things like pagepool and maxFilesToCache (set to the default of 4000), workerThreads is set to 128 on the new gpfs cluster (against default 48 on the old) I'm not sure what else to look at on this one hence why I'm asking the community. Thanks in advance Peter Childs ITS Research Storage Queen Mary University of London. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Peter Childs ITS Research Storage Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Peter Childs ITS Research Storage Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtucker at pixitmedia.com Thu Aug 3 07:42:37 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Thu, 3 Aug 2017 07:42:37 +0100 Subject: [gpfsug-discuss] documentation about version compatibility In-Reply-To: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> References: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> Message-ID: <0a283eb9-a458-bd2c-4e7b-1f46bb22e385@pixitmedia.com> Hi This is the Installation Guide of each target version under the section 'Migrating from to '. Jez On 03/08/17 01:48, Aaron Knister wrote: > Hey All, > > I swear that some time recently someone posted a link to some IBM > documentation that outlined the recommended versions of GPFS to > upgrade to/from (e.g. if you're at 3.5 get to 4.1 before going to > 4.2.3). I can't for the life of me find it. Does anyone know what I'm > talking about? > > Thanks, > Aaron > -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtucker at pixitmedia.com Thu Aug 3 07:46:36 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Thu, 3 Aug 2017 07:46:36 +0100 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: Perhaps IBM might consider letting you commit it to https://github.com/gpfsug/gpfsug-tools he says, asking out loud... It'll require a friendly IBMer to take the reins up for you. Scott? :-) Jez On 03/08/17 02:00, Aaron Knister wrote: > I'm a little late to the party here but I thought I'd share our recent > experiences. > > We recently completed a mass UID number migration (half a billion > inodes) and developed two tools ("luke filewalker" and the > "mmilleniumfacl") to get the job done. Both luke filewalker and the > mmilleniumfacl are based heavily on the code in > /usr/lpp/mmfs/samples/util/tsreaddir.c and > /usr/lpp/mmfs/samples/util/tsinode.c. > > luke filewalker targets traditional POSIX permissions whereas > mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem > in parallel and both but particularly the 2nd, are extremely I/O > intensive on your metadata disks. > > The gist of luke filewalker is to scan the inode structures using the > gpfs APIs and populate a mapping of inode number to gid and uid > number. It then walks the filesystem in parallel using the APIs, looks > up the inode number in an in-memory hash, and if appropriate changes > ownership using the chown() API. > > The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs > using the GPFS inode API so it walks the filesystem and reads the ACL > of any and every file, updating the ACL entries as appropriate. > > I'm going to see if I can share the source code for both tools, > although I don't know if I can post it here since it modified existing > IBM source code. Could someone from IBM chime in here? If I were to > send the code to IBM could they publish it perhaps on the wiki? > > -Aaron > > On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: >> Hello, >> >> We're trying to change most of our users uids, is there a clean >> way to >> migrate all of one users files with say `mmapplypolicy`? We have to >> change the >> owner of around 273539588 files, and my estimates for runtime are >> around 6 days. >> >> What we've been doing is indexing all of the files and splitting >> them up by >> owner which takes around an hour, and then we were locking the user >> out while we >> chown their files. I made it multi threaded as it weirdly gave a 10% >> speedup >> despite my expectation that multi threading access from a single node >> would not >> give any speedup. >> >> Generally I'm looking for advice on how to make the chowning >> faster. Would >> spreading the chowning processes over multiple nodes improve >> performance? Should >> I not stat the files before running lchown on them, since lchown >> checks the file >> before changing it? I saw mention of inodescan(), in an old gpfsug >> email, which >> speeds up disk read access, by not guaranteeing that the data is up >> to date. We >> have a maintenance day coming up where all users will be locked out, >> so the file >> handles(?) from GPFS's perspective will not be able to go stale. Is >> there a >> function with similar constraints to inodescan that I can use to >> speed up this >> process? 
>> >> Thank you for your time, >> >> Luke >> Storrs-HPC >> University of Connecticut >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Aug 3 09:49:26 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 03 Aug 2017 09:49:26 +0100 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: <1501750166.17548.43.camel@strath.ac.uk> On Wed, 2017-08-02 at 21:03 -0400, Aaron Knister wrote: > Oh, the one *huge* gotcha I thought I'd share-- we wrote a perl script > to drive the migration and part of the perl script's process was to > clone quotas from old uid numbers to the new number. I upset our GPFS > cluster during a particular migration in which the user was over the > grace period of the quota so after a certain point every chown() put the > destination UID even further over its quota. The problem with this being > that at this point every chown() operation would cause GPFS to do some > cluster-wide quota accounting-related RPCs. That hurt. It's worth making > sure there are no quotas defined for the destination UID numbers and if > they are that the data coming from the source UID number will fit. For similar reasons if you are doing a restore of a file system (any file system for that matter not just GPFS) for whatever reason, don't turn quotas back on till *after* the restore is complete. Well unless you can be sure a user is not going to go over quota during the restore. However as this is generally not possible to determine you end up with no quota's either set/enforced till the restore is complete. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From oehmes at gmail.com Thu Aug 3 14:06:49 2017 From: oehmes at gmail.com (Sven Oehme) Date: Thu, 03 Aug 2017 13:06:49 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: a trace during a mmfsck with the checksum parameters turned on would reveal it. the support team should be able to give you specific triggers to cut a trace during checksum errors , this way the trace is cut when the issue happens and then from the trace on server and client side one can extract which card was used on each side. 
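the rough shape of that, with placeholder node and filesystem names and without the support-supplied triggers (the exact trace classes and cut conditions should come from them), would be:

# start tracing on the nsd servers plus the client running the check
mmtracectl --start --trace=def --trace-recycle=global -N nsd01,nsd02,client01
# read-only fsck pass that reproduces the checksum messages
mmfsck somefilesystem -n -v
# stop and collect; the trcrpt files on each side can then be lined up
# with the timestamps of the checksum errors to see which hca/port was used
mmtracectl --stop -N nsd01,nsd02,client01
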
sven On Wed, Aug 2, 2017 at 2:53 PM Stijn De Weirdt wrote: > hi steve, > > > The nsdChksum settings for none GNR/ESS based system is not officially > > supported. It will perform checksum on data transfer over the network > > only and can be used to help debug data corruption when network is a > > suspect. > i'll take not officially supported over silent bitrot any day. > > > > > Did any of those "Encountered XYZ checksum errors on network I/O to NSD > > Client disk" warning messages resulted in disk been changed to "down" > > state due to IO error? > no. > > If no disk IO error was reported in GPFS log, > > that means data was retransmitted successfully on retry. > we suspected as much. as sven already asked, mmfsck now reports clean > filesystem. > i have an ibdump of 2 involved nsds during the reported checksums, i'll > have a closer look if i can spot these retries. > > > > > As sven said, only GNR/ESS provids the full end to end data integrity. > so with the silent network error, we have high probabilty that the data > is corrupted. > > we are now looking for a test to find out what adapters are affected. we > hoped that nsdperf with verify=on would tell us, but it doesn't. > > > > > Steve Y. Xiao > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamiedavis at us.ibm.com Thu Aug 3 14:11:23 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Thu, 3 Aug 2017 13:11:23 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Fri Aug 4 06:02:22 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 4 Aug 2017 01:02:22 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: 4.2.2.3 I want to think maybe this started after expanding inode space On Thu, Aug 3, 2017 at 9:11 AM, James Davis wrote: > Hey, > > Hmm, your invocation looks valid to me. What's your GPFS level? > > Cheers, > > Jamie > > > ----- Original message ----- > From: "J. Eric Wonderley" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] mmsetquota produces error > Date: Wed, Aug 2, 2017 5:03 PM > > for one of our home filesystem we get: > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > 'Invalid argument'. > > > mmedquota -j home:nathanfootest > does work however > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.s.knister at nasa.gov Fri Aug 4 09:00:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 4 Aug 2017 04:00:35 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 Message-ID: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Hey All, Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some rather disconcerting behavior. Specifically on some of the upgraded nodes GPFS will seemingly deadlock on the entire node rendering it unusable. I can't even get a session on the node (but I can trigger a crash dump via a sysrq trigger). Most blocked tasks are blocked are in cxiWaitEventWait at the top of their call trace. That's probably not very helpful in of itself but I'm curious if anyone else out there has run into this issue or if this is a known bug. (I'll open a PMR later today once I've gathered more diagnostic information). -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From eric.wonderley at vt.edu Fri Aug 4 13:58:12 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 4 Aug 2017 08:58:12 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: i actually hit this assert and turned it in to support on this version: Build branch "4.2.2.3 efix6 (987197)". i was told do to exactly what sven mentioned. i thought it strange that i did NOT hit the assert in a no pass but hit it in a yes pass. On Thu, Aug 3, 2017 at 9:06 AM, Sven Oehme wrote: > a trace during a mmfsck with the checksum parameters turned on would > reveal it. > the support team should be able to give you specific triggers to cut a > trace during checksum errors , this way the trace is cut when the issue > happens and then from the trace on server and client side one can extract > which card was used on each side. > > sven > > On Wed, Aug 2, 2017 at 2:53 PM Stijn De Weirdt > wrote: > >> hi steve, >> >> > The nsdChksum settings for none GNR/ESS based system is not officially >> > supported. It will perform checksum on data transfer over the network >> > only and can be used to help debug data corruption when network is a >> > suspect. >> i'll take not officially supported over silent bitrot any day. >> >> > >> > Did any of those "Encountered XYZ checksum errors on network I/O to NSD >> > Client disk" warning messages resulted in disk been changed to "down" >> > state due to IO error? >> no. >> >> If no disk IO error was reported in GPFS log, >> > that means data was retransmitted successfully on retry. >> we suspected as much. as sven already asked, mmfsck now reports clean >> filesystem. >> i have an ibdump of 2 involved nsds during the reported checksums, i'll >> have a closer look if i can spot these retries. >> >> > >> > As sven said, only GNR/ESS provids the full end to end data integrity. >> so with the silent network error, we have high probabilty that the data >> is corrupted. >> >> we are now looking for a test to find out what adapters are affected. we >> hoped that nsdperf with verify=on would tell us, but it doesn't. >> >> > >> > Steve Y. 
Xiao >> > >> > >> > >> > >> > >> > >> > _______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at spectrumscale.org >> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Aug 4 15:45:49 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 4 Aug 2017 16:45:49 +0200 Subject: [gpfsug-discuss] restrict user quota on specific filesets Message-ID: Hi, Is it possible to let users only write data in filesets where some quota is explicitly set ? We have independent filesets with quota defined for users that should have access in a specific fileset. The problem is when users using another fileset give eg global write access on their directories, the former users can write without limits, because it is by default 0 == no limits. Setting the quota on the file system will only restrict users quota in the root fileset, and setting quota for each user - fileset combination would be a huge mess. Setting default quotas does not work for existing users. Thank you !! Kenneth From aaron.s.knister at nasa.gov Fri Aug 4 16:02:04 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 4 Aug 2017 11:02:04 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 In-Reply-To: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> References: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Message-ID: I've narrowed the problem down to 4.1.1.16. We'll most likely be downgrading to 4.1.1.15. -Aaron On 8/4/17 4:00 AM, Aaron Knister wrote: > Hey All, > > Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? > > We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some rather > disconcerting behavior. Specifically on some of the upgraded nodes GPFS > will seemingly deadlock on the entire node rendering it unusable. I > can't even get a session on the node (but I can trigger a crash dump via > a sysrq trigger). > > Most blocked tasks are blocked are in cxiWaitEventWait at the top of > their call trace. That's probably not very helpful in of itself but I'm > curious if anyone else out there has run into this issue or if this is a > known bug. > > (I'll open a PMR later today once I've gathered more diagnostic > information). > > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jonathan.buzzard at strath.ac.uk Fri Aug 4 16:15:44 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 04 Aug 2017 16:15:44 +0100 Subject: [gpfsug-discuss] restrict user quota on specific filesets In-Reply-To: References: Message-ID: <1501859744.17548.69.camel@strath.ac.uk> On Fri, 2017-08-04 at 16:45 +0200, Kenneth Waegeman wrote: > Hi, > > Is it possible to let users only write data in filesets where some quota > is explicitly set ? > > We have independent filesets with quota defined for users that should > have access in a specific fileset. 
The problem is when users using > another fileset give eg global write access on their directories, the > former users can write without limits, because it is by default 0 == no > limits. Setting appropriate ACL's on the junction point of the fileset so that they can only write to file sets that they have permissions to is how you achieve this. I would say create groups and do it that way, but *nasty* things happen when you are a member of more than 16 supplemental groups and are using NFSv3 (NFSv4 and up is fine). So as long as that is not an issue go nuts with groups as it is much easier to manage. > Setting the quota on the file system will only restrict users quota in > the root fileset, and setting quota for each user - fileset combination > would be a huge mess. Setting default quotas does not work for existing > users. Not sure abusing the quota system for permissions a sensible approach. Put another way it was not designed with that purpose in mind so don't be surprised when you can't use it to do that. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From ilan84 at gmail.com Sun Aug 6 09:26:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 11:26:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood Message-ID: Hi guys, I see IBM spectrumscale configure the NFS via command: mmnfs Is the command mmnfs is a wrapper on top of the normal kernel NFS (Kernel VFS) ? Is it a wrapper on top of ganesha NFS ? Or it is NFS implemented by SpectrumScale team ? Thanks -- - Ilan Schwarts From S.J.Thompson at bham.ac.uk Sun Aug 6 10:10:45 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Sun, 6 Aug 2017 09:10:45 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] Sent: 06 August 2017 09:26 To: gpfsug main discussion list Subject: [gpfsug-discuss] what is mmnfs under the hood Hi guys, I see IBM spectrumscale configure the NFS via command: mmnfs Is the command mmnfs is a wrapper on top of the normal kernel NFS (Kernel VFS) ? Is it a wrapper on top of ganesha NFS ? Or it is NFS implemented by SpectrumScale team ? Thanks -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Sun Aug 6 10:42:30 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 12:42:30 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, I cannot use ganesha NFS. 
How do I make NFS exports ? just editing all nodes /etc/exports is enough ? I should i use the CNFS as described here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) wrote: > Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... > > Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. > > Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > Sent: 06 August 2017 09:26 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] what is mmnfs under the hood > > Hi guys, > > I see IBM spectrumscale configure the NFS via command: mmnfs > > Is the command mmnfs is a wrapper on top of the normal kernel NFS > (Kernel VFS) ? > Is it a wrapper on top of ganesha NFS ? > Or it is NFS implemented by SpectrumScale team ? > > > Thanks > > -- > > > - > Ilan Schwarts > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- - Ilan Schwarts From ilan84 at gmail.com Sun Aug 6 10:49:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 12:49:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. 
>> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts From S.J.Thompson at bham.ac.uk Sun Aug 6 11:54:17 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Sun, 6 Aug 2017 10:54:17 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: , Message-ID: What do you mean by cannot use mmsmb and cannot use Ganesha? Do you functionally you are not allowed to or they are not working for you? If it's the latter, then this should be resolvable. If you are under active maintenance you could try raising a ticket with IBM, though basic implementation is not really a support issue and so you may be better engaging a business partner or integrator to help you out. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] Sent: 06 August 2017 10:49 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] what is mmnfs under the hood I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. 
>> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Sun Aug 6 12:39:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 14:39:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: In my case, I cannot use nfs ganesha, this means I cannot use mmsnb since its part of "ces", if i want to use cnfs i cannot combine it with ces.. so the system architecture need to solve this issue. On Aug 6, 2017 13:54, "Simon Thompson (IT Research Support)" < S.J.Thompson at bham.ac.uk> wrote: > What do you mean by cannot use mmsmb and cannot use Ganesha? Do you > functionally you are not allowed to or they are not working for you? > > If it's the latter, then this should be resolvable. If you are under > active maintenance you could try raising a ticket with IBM, though basic > implementation is not really a support issue and so you may be better > engaging a business partner or integrator to help you out. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces@ > spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > Sent: 06 August 2017 10:49 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] what is mmnfs under the hood > > I have read this atricle: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2. > 0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm > > So, in a shortcut, CNFS cannot be used when sharing via CES. > I cannot use ganesha NFS. > > Is it possible to share a cluster via SMB and NFS without using CES ? > the nfs will be expored via CNFS but what about SMB ? i cannot use > mmsmb.. > > > On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > > I cannot use ganesha NFS. > > How do I make NFS exports ? just editing all nodes /etc/exports is > enough ? > > I should i use the CNFS as described here: > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2. 
> 2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > > wrote: > >> Under the hood, the NFS services are provided by IBM supplied Ganesha > rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle > locking, ACLs, quota etc... > >> > >> Note it's different from using the cnfs support in Spectrum Scale which > uses Kernel NFS AFAIK. Using user space Ganesha means they have control of > the NFS stack, so if something needs patching/fixing, then can roll out new > Ganesha rpms rather than having to get (e.g.) RedHat to incorporate > something into kernel NFS. > >> > >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute > the config to the nodes. > >> > >> Simon > >> ________________________________________ > >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces@ > spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > >> Sent: 06 August 2017 09:26 > >> To: gpfsug main discussion list > >> Subject: [gpfsug-discuss] what is mmnfs under the hood > >> > >> Hi guys, > >> > >> I see IBM spectrumscale configure the NFS via command: mmnfs > >> > >> Is the command mmnfs is a wrapper on top of the normal kernel NFS > >> (Kernel VFS) ? > >> Is it a wrapper on top of ganesha NFS ? > >> Or it is NFS implemented by SpectrumScale team ? > >> > >> > >> Thanks > >> > >> -- > >> > >> > >> - > >> Ilan Schwarts > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > -- > > > > > > - > > Ilan Schwarts > > > > -- > > > - > Ilan Schwarts > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Mon Aug 7 05:58:13 2017 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Mon, 7 Aug 2017 04:58:13 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: It would be nice to know why you cannot use ganesha or mmsmb. You don't have to use protocols or CES. We are migrating to CES from doing our own thing with NFS and samba on Debian. Debian does not have support for CES, so we had to roll our own. We did not use CNFS either. To get to CES we had to change OS. We did this because we valued the support. I'd say the failover works better with CES than with our solution, particularly with regards failing over and Infiniband IP address. 
-----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ilan Schwarts Sent: Sunday, 6 August 2017 7:50 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] what is mmnfs under the hood I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.sp > ectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. >> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of >> ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Mon Aug 7 14:27:07 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Mon, 7 Aug 2017 16:27:07 +0300 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Hi all, My setup is 2 nodes GPFS and 1 machine as NFS Client. All machines (3 total) run CentOS 7.2 The 3rd CentOS machine (not part of the cluster) used as NFS Client. 
I mount the NFS Client machine to one of the nodes: mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 This gives me the following: [root at CentOS7286-64 ~]# mount -v | grep gpfs 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) Now, From the Client NFS Machine, I go to the mount directory ("cd /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I use nfs4_getfacl: [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 Operation to request attribute not supported. [root at CentOS7286-64 nfs4]# >From the NODE machine i see the status: [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 flag value description ------------------- ------------------------ ----------------------------------- -f 8192 Minimum fragment size in bytes -i 4096 Inode size in bytes -I 16384 Indirect block size in bytes -m 1 Default number of metadata replicas -M 2 Maximum number of metadata replicas -r 1 Default number of data replicas -R 2 Maximum number of data replicas -j cluster Block allocation type -D nfs4 File locking semantics in effect -k nfs4 ACL semantics in effect -n 32 Estimated number of nodes that will mount file system -B 262144 Block size -Q none Quotas accounting enabled none Quotas enforced none Default quotas enabled --perfileset-quota No Per-fileset quota enforcement --filesetdf No Fileset df enabled? -V 16.00 (4.2.2.0) File system version --create-time Wed Jul 5 12:28:39 2017 File system creation time -z No Is DMAPI enabled? -L 4194304 Logfile size -E Yes Exact mtime mount option -S No Suppress atime mount option -K whenpossible Strict replica allocation option --fastea Yes Fast external attributes enabled? --encryption No Encryption enabled? --inode-limit 171840 Maximum number of inodes in all inode spaces --log-replicas 0 Number of log replicas --is4KAligned Yes is4KAligned? --rapid-repair Yes rapidRepair enabled? --write-cache-threshold 0 HAWC Threshold (max 65536) -P system Disk storage pools in file system -d nynsd1;nynsd2 Disks in file system -A yes Automatic mount option -o none Additional mount options -T /fs_gpfs01 Default mount point --mount-priority 0 Mount priority I saw this thread: https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 Is it still relevant ? Since 2014.. Thanks ! From makaplan at us.ibm.com Mon Aug 7 17:48:39 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 7 Aug 2017 12:48:39 -0400 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Indeed. You can consider and use GPFS/Spectrum Scale as "just another" file system type that can be loaded into/onto a Linux system. But you should consider the pluses and minuses of using other software subsystems that may or may not be designed to work better or inter-operate with Spectrum Scale specific features and APIs. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Mon Aug 7 18:14:41 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Mon, 7 Aug 2017 20:14:41 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Thanks for response. I am not a system engineer / storage architect. I maintain kernel module that interact with file system drivers.. so I need to configure gpfs and perform tests.. 
for example I noticed that gpfs set extended attribute does not go via VFS On Aug 7, 2017 19:48, "Marc A Kaplan" wrote: > Indeed. You can consider and use GPFS/Spectrum Scale as "just another" > file system type that can be loaded into/onto a Linux system. > > But you should consider the pluses and minuses of using other software > subsystems that may or may not be designed to work better or inter-operate > with Spectrum Scale specific features and APIs. > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Mon Aug 7 21:27:03 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 07 Aug 2017 16:27:03 -0400 Subject: [gpfsug-discuss] 'ltfsee info tapes' - Unusable tapes... Message-ID: <8652.1502137623@turing-police.cc.vt.edu> The LTFSEE docs say: https://www.ibm.com/support/knowledgecenter/en/ST9MBR_1.2.3/ltfs_ee_ltfsee_info_tapes.html "Unusable The Unusable status indicates that the tape can't be used. To change the status, remove the tape from the pool by using the ltfsee pool remove command with the -r option. Then, add the tape back into the pool by using the ltfsee pool add command." Do they really mean that? What happens to data that was on the tape? Does the 'pool add' command re-import LTFS's knowledge of what files were on that tape? It's one thing to remove/add tapes with no files on them - but I'm leery of doing it for tapes that contain migrated data, given a lack of clear statement that file index recovery is done at 'pool add' time. (We had a tape get stuck in a drive, and LTFS/EE tried to use the drive, wasn't able to load the tape because the drive was occupied, marked the tape as Unusable. Lather rinse repeat until there's no usable tapes left in the pool... but that's a different issue...) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From jamiedavis at us.ibm.com Mon Aug 7 22:10:06 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Mon, 7 Aug 2017 21:10:06 +0000 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Tue Aug 8 05:28:20 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Tue, 8 Aug 2017 07:28:20 +0300 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: Hi, The command should work from server side i know.. but isnt the scenario of: Root user, that is mounted via nfsv4 to a gpfs filesystem, cannot edit any of the mounted files/dirs acls? The acls are editable only from server side? Thanks! On Aug 8, 2017 00:10, "James Davis" wrote: > Hi Ilan, > > 1. Your command might work from the server side; you said you tried it > from the client side. Could you find anything in the docs about this? I > could not. > > 2. I can share this NFSv4-themed wrapper around mmputacl if it would be > useful to you. You would have to run it from the GPFS side, not the NFS > client side. > > Regards, > > Jamie > > # ./updateNFSv4ACL -h > Update the NFSv4 ACL governing a file's access permissions. > Appends to the existing ACL, overwriting conflicting permissions. 
> Usage: ./updateNFSv4ACL -file /path/to/file { ADD_PERM_SPEC | > DEL_PERM_SPEC }+ > ADD_PERM_SPEC: { -owningUser PERM | -owningGroup PERM | -other PERM | > -ace nameType:name:PERM:aceType } > DEL_PERM_SPEC: { -noACEFor nameType:name } > PERM: Specify a string composed of one or more of the following letters > in no particular order: > r (ead) > w (rite) > a (ppend) Must agree with write > x (execute) > d (elete) > D (elete child) Dirs only > t (read attrs) > T (write attrs) > c (read ACL) > C (write ACL) > o (change owner) > You can also provide these, but they will have no effect in GPFS: > n (read named attrs) > N (write named attrs) > y (support synchronous I/O) > > To indicate no permissions, give a - > nameType: 'user' or 'group'. > aceType: 'allow' or 'deny'. > Examples: ./updateNFSv4ACL -file /fs1/f -owningUser rtc -owningGroup > rwaxdtc -other '-' > Assign these permissions to 'owner', 'group', 'other'. > ./updateNFSv4ACL -file /fs1/f -ace 'user:pfs001:rtc:allow' > -noACEFor 'group:fvt001' > Allow user pfs001 read/read attrs/read ACL permission > Remove all ACEs (allow and deny) for group fvt001. > Notes: > Permissions you do not allow are denied by default. > See the GPFS docs for some other restrictions. > ace is short for Access Control Entry > > > ----- Original message ----- > From: Ilan Schwarts > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster > Date: Mon, Aug 7, 2017 9:27 AM > > Hi all, > My setup is 2 nodes GPFS and 1 machine as NFS Client. > All machines (3 total) run CentOS 7.2 > > The 3rd CentOS machine (not part of the cluster) used as NFS Client. > > I mount the NFS Client machine to one of the nodes: mount -t nfs > 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 > > This gives me the following: > > [root at CentOS7286-64 ~]# mount -v | grep gpfs > 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 > (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen= > 255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys, > clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) > > Now, From the Client NFS Machine, I go to the mount directory ("cd > /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I > use nfs4_getfacl: > [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 > Operation to request attribute not supported. > [root at CentOS7286-64 nfs4]# > > From the NODE machine i see the status: > [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 > flag value description > ------------------- ------------------------ ------------------------------ > ----- > -f 8192 Minimum fragment size in bytes > -i 4096 Inode size in bytes > -I 16384 Indirect block size in bytes > -m 1 Default number of metadata > replicas > -M 2 Maximum number of metadata > replicas > -r 1 Default number of data > replicas > -R 2 Maximum number of data > replicas > -j cluster Block allocation type > -D nfs4 File locking semantics in > effect > -k nfs4 ACL semantics in effect > -n 32 Estimated number of nodes > that will mount file system > -B 262144 Block size > -Q none Quotas accounting enabled > none Quotas enforced > none Default quotas enabled > --perfileset-quota No Per-fileset quota enforcement > --filesetdf No Fileset df enabled? > -V 16.00 (4.2.2.0) File system version > --create-time Wed Jul 5 12:28:39 2017 File system creation time > -z No Is DMAPI enabled? 
> -L 4194304 Logfile size > -E Yes Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation > option > --fastea Yes Fast external attributes > enabled? > --encryption No Encryption enabled? > --inode-limit 171840 Maximum number of inodes > in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? > --write-cache-threshold 0 HAWC Threshold (max 65536) > -P system Disk storage pools in file > system > -d nynsd1;nynsd2 Disks in file system > -A yes Automatic mount option > -o none Additional mount options > -T /fs_gpfs01 Default mount point > --mount-priority 0 Mount priority > > > > I saw this thread: > https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 > > Is it still relevant ? Since 2014.. > > Thanks ! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Tue Aug 8 05:50:10 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Tue, 8 Aug 2017 10:20:10 +0530 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. => /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Tue Aug 8 17:30:13 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Tue, 8 Aug 2017 22:00:13 +0530 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: (seems my earlier reply created a new topic; hence trying to reply back original thread started by Ilan Schwarts...) >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. 
=> /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/08/2017 04:30 PM Subject: gpfsug-discuss Digest, Vol 67, Issue 21 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: How to use nfs4_getfacl (or set) on GPFS cluster (Ilan Schwarts) 2. How to use nfs4_getfacl (or set) on GPFS cluster (Chetan R Kulkarni) ---------------------------------------------------------------------- Message: 1 Date: Tue, 8 Aug 2017 07:28:20 +0300 From: Ilan Schwarts To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Content-Type: text/plain; charset="utf-8" Hi, The command should work from server side i know.. but isnt the scenario of: Root user, that is mounted via nfsv4 to a gpfs filesystem, cannot edit any of the mounted files/dirs acls? The acls are editable only from server side? Thanks! On Aug 8, 2017 00:10, "James Davis" wrote: > Hi Ilan, > > 1. Your command might work from the server side; you said you tried it > from the client side. Could you find anything in the docs about this? I > could not. > > 2. I can share this NFSv4-themed wrapper around mmputacl if it would be > useful to you. You would have to run it from the GPFS side, not the NFS > client side. > > Regards, > > Jamie > > # ./updateNFSv4ACL -h > Update the NFSv4 ACL governing a file's access permissions. > Appends to the existing ACL, overwriting conflicting permissions. > Usage: ./updateNFSv4ACL -file /path/to/file { ADD_PERM_SPEC | > DEL_PERM_SPEC }+ > ADD_PERM_SPEC: { -owningUser PERM | -owningGroup PERM | -other PERM | > -ace nameType:name:PERM:aceType } > DEL_PERM_SPEC: { -noACEFor nameType:name } > PERM: Specify a string composed of one or more of the following letters > in no particular order: > r (ead) > w (rite) > a (ppend) Must agree with write > x (execute) > d (elete) > D (elete child) Dirs only > t (read attrs) > T (write attrs) > c (read ACL) > C (write ACL) > o (change owner) > You can also provide these, but they will have no effect in GPFS: > n (read named attrs) > N (write named attrs) > y (support synchronous I/O) > > To indicate no permissions, give a - > nameType: 'user' or 'group'. > aceType: 'allow' or 'deny'. > Examples: ./updateNFSv4ACL -file /fs1/f -owningUser rtc -owningGroup > rwaxdtc -other '-' > Assign these permissions to 'owner', 'group', 'other'. > ./updateNFSv4ACL -file /fs1/f -ace 'user:pfs001:rtc:allow' > -noACEFor 'group:fvt001' > Allow user pfs001 read/read attrs/read ACL permission > Remove all ACEs (allow and deny) for group fvt001. > Notes: > Permissions you do not allow are denied by default. 
> See the GPFS docs for some other restrictions. > ace is short for Access Control Entry > > > ----- Original message ----- > From: Ilan Schwarts > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster > Date: Mon, Aug 7, 2017 9:27 AM > > Hi all, > My setup is 2 nodes GPFS and 1 machine as NFS Client. > All machines (3 total) run CentOS 7.2 > > The 3rd CentOS machine (not part of the cluster) used as NFS Client. > > I mount the NFS Client machine to one of the nodes: mount -t nfs > 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 > > This gives me the following: > > [root at CentOS7286-64 ~]# mount -v | grep gpfs > 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 > (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen= > 255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys, > clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) > > Now, From the Client NFS Machine, I go to the mount directory ("cd > /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I > use nfs4_getfacl: > [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 > Operation to request attribute not supported. > [root at CentOS7286-64 nfs4]# > > From the NODE machine i see the status: > [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 > flag value description > ------------------- ------------------------ ------------------------------ > ----- > -f 8192 Minimum fragment size in bytes > -i 4096 Inode size in bytes > -I 16384 Indirect block size in bytes > -m 1 Default number of metadata > replicas > -M 2 Maximum number of metadata > replicas > -r 1 Default number of data > replicas > -R 2 Maximum number of data > replicas > -j cluster Block allocation type > -D nfs4 File locking semantics in > effect > -k nfs4 ACL semantics in effect > -n 32 Estimated number of nodes > that will mount file system > -B 262144 Block size > -Q none Quotas accounting enabled > none Quotas enforced > none Default quotas enabled > --perfileset-quota No Per-fileset quota enforcement > --filesetdf No Fileset df enabled? > -V 16.00 (4.2.2.0) File system version > --create-time Wed Jul 5 12:28:39 2017 File system creation time > -z No Is DMAPI enabled? > -L 4194304 Logfile size > -E Yes Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation > option > --fastea Yes Fast external attributes > enabled? > --encryption No Encryption enabled? > --inode-limit 171840 Maximum number of inodes > in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? > --write-cache-threshold 0 HAWC Threshold (max 65536) > -P system Disk storage pools in file > system > -d nynsd1;nynsd2 Disks in file system > -A yes Automatic mount option > -o none Additional mount options > -T /fs_gpfs01 Default mount point > --mount-priority 0 Mount priority > > > > I saw this thread: > https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 > > Is it still relevant ? Since 2014.. > > Thanks ! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170808/0e20196d/attachment-0001.html > ------------------------------ Message: 2 Date: Tue, 8 Aug 2017 10:20:10 +0530 From: "Chetan R Kulkarni" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Content-Type: text/plain; charset="us-ascii" >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. => /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. -------------- next part -------------- An HTML attachment was scrubbed... URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170808/42fbe6c2/attachment-0001.html > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 21 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stefan.dietrich at desy.de Tue Aug 8 18:16:33 2017 From: stefan.dietrich at desy.de (Dietrich, Stefan) Date: Tue, 8 Aug 2017 19:16:33 +0200 (CEST) Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS Message-ID: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Hello, I am currently trying to understand an issue with ACLs and how GPFS handles the umask. The filesystem is configured for NFS4 ACLs only (-k nfs4), filesets have been configured for chmodAndUpdateACL and the access is through a native GPFS client (v4.2.3). If I create a new file in a directory, which has an ACE with inheritance, the configured umask on the shell is completely ignored. The new file only contains ACEs from the inherited ACL. As soon as the ACE with inheritance is removed, newly created files receive the correct configured umask. Obvious downside, no ACLs anymore :( Additionally, it looks like that the specified mode bits for an open call are ignored as well. E.g. with an strace I see, that the open call includes the correct mode bits. However, the new file only has inherited ACEs. According to the NFSv4 RFC, the behavior is more or less undefined, only with NFSv4.2 umask will be added to the protocol. For GPFS, I found a section in the traditional ACL administration section, but nothing in the NFS4 ACL section of the docs. Is my current observation the intended behavior of GPFS? 
Regards, Stefan -- ------------------------------------------------------------------------ Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany ------------------------------------------------------------------------ From kkr at lbl.gov Tue Aug 8 19:33:22 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Tue, 8 Aug 2017 11:33:22 -0700 Subject: [gpfsug-discuss] Hold the Date - Spectrum Scale Day @ HPCXXL (Sept 2017, NYC) In-Reply-To: References: Message-ID: Hello, The GPFS Day of the HPCXXL conference is confirmed for Thursday, September 28th. Here is an updated URL, the agenda and registration are still being put together http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ . The GPFS Day will require registration, so we can make sure there is enough room (and coffee/food) for all attendees ?however, there will be no registration fee if you attend the GPFS Day only. I?ll send another update when the agenda is closer to settled. Cheers, Kristy > On Jul 7, 2017, at 3:32 PM, Kristy Kallback-Rose wrote: > > Hello, > > More details will be provided as they become available, but just so you can make a placeholder on your calendar, there will be a Spectrum Scale Day the week of September 25th - 29th, likely Thursday, September 28th. > > This will be a part of the larger HPCXXL meeting (https://www.spxxl.org/?q=New-York-City-2017 ). You may recall this group was formerly called SPXXL and the website is in the process of transitioning to the new name (and getting a new certificate). You will be able to attend *just* the Spectrum Scale day if that is the only portion of the event you would like to attend. > > The NYC location is great for Spectrum Scale events because many IBMers, including developers, can come in from Poughkeepsie. > > More as we get closer to the date and details are settled. > > Cheers, > Kristy -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Tue Aug 8 20:28:31 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 8 Aug 2017 14:28:31 -0500 Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS In-Reply-To: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> References: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Message-ID: Yes, that is the intended behavior. As in the section on traditional ACLs that you found, the intent is that if there is a default/inherited ACL, the object is created with that (and if there is no default/inherited ACL, then the mode and umask are the basis for the initial set of permissions). Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. 
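A quick way to see the behaviour described above from the GPFS side (the directory path is a placeholder; mmgetacl/mmputacl used as documented for NFSv4 ACLs):

# Show the parent directory's NFSv4 ACL; ACEs flagged FileInherit/DirInherit
# are the ones new objects copy.
mmgetacl -k nfs4 /fs_gpfs01/data | grep -i inherit
# With such an ACE present, the create-time mode/umask is ignored:
umask 077
touch /fs_gpfs01/data/newfile
mmgetacl -k nfs4 /fs_gpfs01/data/newfile    # ACL comes from the parent's inheritable ACEs
# Dump the ACL, strip the FileInherit/DirInherit flags, and put it back,
# and newly created files honour mode/umask again:
mmgetacl -k nfs4 -o /tmp/acl.txt /fs_gpfs01/data
# ... edit /tmp/acl.txt ...
mmputacl -i /tmp/acl.txt /fs_gpfs01/data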
From: "Dietrich, Stefan" To: gpfsug-discuss at spectrumscale.org Date: 08/08/2017 12:17 PM Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, I am currently trying to understand an issue with ACLs and how GPFS handles the umask. The filesystem is configured for NFS4 ACLs only (-k nfs4), filesets have been configured for chmodAndUpdateACL and the access is through a native GPFS client (v4.2.3). If I create a new file in a directory, which has an ACE with inheritance, the configured umask on the shell is completely ignored. The new file only contains ACEs from the inherited ACL. As soon as the ACE with inheritance is removed, newly created files receive the correct configured umask. Obvious downside, no ACLs anymore :( Additionally, it looks like that the specified mode bits for an open call are ignored as well. E.g. with an strace I see, that the open call includes the correct mode bits. However, the new file only has inherited ACEs. According to the NFSv4 RFC, the behavior is more or less undefined, only with NFSv4.2 umask will be added to the protocol. For GPFS, I found a section in the traditional ACL administration section, but nothing in the NFS4 ACL section of the docs. Is my current observation the intended behavior of GPFS? Regards, Stefan -- ------------------------------------------------------------------------ Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany ------------------------------------------------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Aug 8 22:27:20 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 8 Aug 2017 17:27:20 -0400 Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS In-Reply-To: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> References: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Message-ID: (IMO) NFSv4 ACLs are complicated. Confusing. Difficult. Befuddling. PIA. Before questioning the GPFS implementation, see how they work in other file systems. If GPFS does it differently, perhaps there is a rationale, or perhaps you've found a bug. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Wed Aug 9 11:32:32 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 9 Aug 2017 10:32:32 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: <09520659d6cb44a1bbbed066106b39a2@R01UKEXCASM223.r01.fujitsu.local> Hello Experts, Does GPFS start "down" disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in "down" state by GPFS. Will GPFS in such case start the disk? With best regards / Mit freundlichen Gr??en / Pozdrawiam Tomasz Wolski Development Engineer NDC Eternus CS HE (ET2) [cid:image002.gif at 01CE62B9.8ACFA960] FUJITSU Fujitsu Technology Solutions Sp. z o.o. Textorial Park Bldg C, ul. 
Fabryczna 17 90-344 Lodz, Poland E-mail: Tomasz.Wolski at ts.fujitsu.com Web: ts.fujitsu.com Company details: ts.fujitsu.com/imprint This communication contains information that is confidential, proprietary in nature and/or privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) or the person responsible for delivering it to the intended recipient(s), please note that any form of dissemination, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender and delete the original communication. Thank you for your cooperation. Please be advised that neither Fujitsu, its affiliates, its employees or agents accept liability for any errors, omissions or damages caused by delays of receipt or by any virus infection in this message or its attachments, or which may otherwise arise as a result of this e-mail transmission. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 2774 bytes Desc: image001.gif URL: From chris.schlipalius at pawsey.org.au Wed Aug 9 11:50:22 2017 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Wed, 9 Aug 2017 18:50:22 +0800 Subject: [gpfsug-discuss] Announcement of the next Australian SpectrumScale User Group - Half day August 2017 (Melbourne) References: Message-ID: <0190993E-B870-4A37-9671-115A1201A59D@pawsey.org.au> Hello we have a half day (afternoon) usergroup next week. Please check out the event registration link below for tickets, speakers and topics. https://goo.gl/za8g3r Regards, Chris Schlipalius Lead Organiser Spectrum Scale Usergroups Australia Senior Storage Infrastucture Specialist, The Pawsey Supercomputing Centre From Robert.Oesterlin at nuance.com Wed Aug 9 13:14:46 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 9 Aug 2017 12:14:46 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: By default, GPFS does not automatically start down disks. You could add a callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? if you wanted. If a disk is marked down, it?s better to determine why before trying to start it as it may involve other issues that need investigation. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of "Tomasz.Wolski at ts.fujitsu.com" Reply-To: gpfsug main discussion list Date: Wednesday, August 9, 2017 at 6:33 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs automatically? Does GPFS start ?down? disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in ?down? state by GPFS. Will GPFS in such case start the disk? -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Wed Aug 9 13:22:57 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 9 Aug 2017 14:22:57 +0200 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? In-Reply-To: References: Message-ID: If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. That can be quite useful for stretched clusters, where you want to replicate all blocks to both locations, and this way recover automatically. 
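To make the "downdisk" callback idea above concrete, a notify-only sketch could look like the following. The script name and path are made up, and the event and variable names are from memory of the mmaddcallback documentation, so verify them against your release before registering anything.

#!/bin/bash
# /var/mmfs/etc/disk_down_alert.sh (hypothetical): called when a disk goes down.
# $1 = file system name, $2 = disk name, supplied via --parms below.
fs="$1"; disk="$2"
logger -t gpfs-callback "NSD ${disk} in file system ${fs} was marked down"
# Deliberately not restarting the disk automatically; investigate first, as noted in this thread.
# An automatic variant would run:  mmchdisk "$fs" start -d "$disk"

# Register it once on the cluster:
mmaddcallback downdisk --command /var/mmfs/etc/disk_down_alert.sh \
    --event diskFailure --parms "%fsName %diskName"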
-jf On Wed, Aug 9, 2017 at 2:14 PM, Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: > By default, GPFS does not automatically start down disks. You could add a > callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? > if you wanted. If a disk is marked down, it?s better to determine why > before trying to start it as it may involve other issues that need > investigation. > > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > *From: * on behalf of " > Tomasz.Wolski at ts.fujitsu.com" > *Reply-To: *gpfsug main discussion list > *Date: *Wednesday, August 9, 2017 at 6:33 AM > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs > automatically? > > > > > > Does GPFS start ?down? disks in a filesystem automatically? For instance, > when connection to NSD is recovered, but it the meantime disk was put in > ?down? state by GPFS. Will GPFS in such case start the disk? > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Wed Aug 9 13:48:00 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 9 Aug 2017 12:48:00 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: <3F6EFDF7-B96B-4E89-ABFE-4EEBEE0C0878@nuance.com> Be careful here, as this does: ?When a disk experiences a failure and becomes unavailable, the recovery procedure will first attempt to restart the disk and if this fails, the disk is suspended and its data moved to other disks. ? Which may not be what you want to happen. :-) If you have disks marked down due to a transient failure, kicking of restripes to move the data off might not be the best choice. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Jan-Frode Myklebust Reply-To: gpfsug main discussion list Date: Wednesday, August 9, 2017 at 8:23 AM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] Is GPFS starting NSDs automatically? If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 9 16:04:35 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 9 Aug 2017 15:04:35 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? In-Reply-To: References: Message-ID: <44e67260b5104860adaaf8222e11e995@jumptrading.com> For non-stretch clusters, I think best practice would be to have an administrator analyze the situation and understand why the NSD was considered unavailable before attempting to start the disks back in the file system. Down NSDs are usually indicative of a serious issue. However I have seen a transient network communication problems or NSD server recovery cause a NSD Client to report a NSD as failed. I would prefer that the FS manager check first that the NSDs are actually not accessible and that there isn?t a recovery operation within the NSD Servers supporting an NSD before marking NSDs as down. Recovery should be allowed to complete and a NSD client should just wait for that to happen. NSDs being marked down can cause serious file system outages!! 
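When a disk does go down, a typical first look before any "mmchdisk start" runs along these lines (file system and NSD names are placeholders):

mmlsdisk fs_gpfs01 -e        # list disks that are not up/ready
mmlsnsd -m                   # map NSDs to local devices and their NSD servers
mmnsddiscover -d nsd1        # have the NSD servers rediscover paths to the disk
# Only once the path or server problem is understood and fixed:
mmchdisk fs_gpfs01 start -d nsd1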
We?ve also requested that a settable retry configuration setting be provided to have NSD Clients retry access to the NSD before reporting the NSD as failed (https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=104474 if you want to add a vote!). Cheers, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jan-Frode Myklebust Sent: Wednesday, August 09, 2017 7:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Is GPFS starting NSDs automatically? Note: External Email ________________________________ If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. That can be quite useful for stretched clusters, where you want to replicate all blocks to both locations, and this way recover automatically. -jf On Wed, Aug 9, 2017 at 2:14 PM, Oesterlin, Robert > wrote: By default, GPFS does not automatically start down disks. You could add a callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? if you wanted. If a disk is marked down, it?s better to determine why before trying to start it as it may involve other issues that need investigation. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: > on behalf of "Tomasz.Wolski at ts.fujitsu.com" > Reply-To: gpfsug main discussion list > Date: Wednesday, August 9, 2017 at 6:33 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs automatically? Does GPFS start ?down? disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in ?down? state by GPFS. Will GPFS in such case start the disk? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Aug 14 22:53:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 14 Aug 2017 17:53:35 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 In-Reply-To: References: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Message-ID: <5e25d9b1-13de-20b7-567d-e14601fd4bd0@nasa.gov> I was remiss in not following up with this sooner and thank you to the kind individual that shot me a direct message to ask the question. It turns out that when I asked for the fix for APAR IV96776 I got an early release of 4.1.1.16 that had a fix for the APAR but also introduced the lockup bug. IBM kindly delayed the release of 4.1.1.16 proper until they had addressed the lockup bug (APAR IV98888). 
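For anyone unsure which build they ended up with, the running level is easy to confirm; whether the package changelog lists individual APARs is an assumption here, so treat the last line as a maybe:

mmdiag --version                                  # running GPFS build level
rpm -q gpfs.base                                  # installed package version
rpm -q --changelog gpfs.base | grep -i IV98888    # only if the changelog records APARs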
As I understand it the version of 4.1.1.16 that was released via fix central should have a fix for this bug although I haven't tested it I have no reason to believe it's not fixed. -Aaron On 08/04/2017 11:02 AM, Aaron Knister wrote: > I've narrowed the problem down to 4.1.1.16. We'll most likely be > downgrading to 4.1.1.15. > > -Aaron > > On 8/4/17 4:00 AM, Aaron Knister wrote: >> Hey All, >> >> Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? >> >> We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some >> rather disconcerting behavior. Specifically on some of the upgraded >> nodes GPFS will seemingly deadlock on the entire node rendering it >> unusable. I can't even get a session on the node (but I can trigger a >> crash dump via a sysrq trigger). >> >> Most blocked tasks are blocked are in cxiWaitEventWait at the top of >> their call trace. That's probably not very helpful in of itself but >> I'm curious if anyone else out there has run into this issue or if >> this is a known bug. >> >> (I'll open a PMR later today once I've gathered more diagnostic >> information). >> >> -Aaron >> > From aaron.s.knister at nasa.gov Thu Aug 17 14:12:28 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Thu, 17 Aug 2017 13:12:28 +0000 Subject: [gpfsug-discuss] NSD Server/FS Manager Memory Requirements Message-ID: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> Hi Everyone, In the world of GPFS 4.2 is there a particular advantage to having a large amount of memory (e.g. > 64G) allocated to the pagepool on combination NSD Server/FS manager nodes? We currently have half of physical memory allocated to pagepool on these nodes. For some historical context-- we had two indicidents that drove us to increase our NSD server/FS manager pagepools. One was a weird behavior in GPFS 3.5 that was causing bouncing FS managers until we bumped the page pool from a few gigs to about half of the physical memory on the node. The other was a mass round of parallel mmfsck's of all 20 something of our filesystems. It came highly recommended to us to increase the pagepool to something very large for that. I'm curious to hear what other folks do and what the recommendations from IBM folks are. Thanks, Aaron -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Aug 17 14:43:48 2017 From: ewahl at osc.edu (Edward Wahl) Date: Thu, 17 Aug 2017 09:43:48 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: <20170817094348.37d2f51b@osc.edu> On Fri, 4 Aug 2017 01:02:22 -0400 "J. Eric Wonderley" wrote: > 4.2.2.3 > > I want to think maybe this started after expanding inode space What does 'mmlsfileset home nathanfootest -L' say? Ed > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis wrote: > > > Hey, > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > Cheers, > > > > Jamie > > > > > > ----- Original message ----- > > From: "J. Eric Wonderley" > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list > > Cc: > > Subject: [gpfsug-discuss] mmsetquota produces error > > Date: Wed, Aug 2, 2017 5:03 PM > > > > for one of our home filesystem we get: > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > 'Invalid argument'. 
> > > > > > mmedquota -j home:nathanfootest > > does work however > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From eric.wonderley at vt.edu Thu Aug 17 15:13:57 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 17 Aug 2017 10:13:57 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: <20170817094348.37d2f51b@osc.edu> References: <20170817094348.37d2f51b@osc.edu> Message-ID: The error is very repeatable... [root at cl001 ~]# mmcrfileset home setquotafoo Fileset setquotafoo created with id 61 root inode 3670407. [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo Fileset setquotafoo linked at /gpfs/home/setquotafoo [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid argument'. mmsetquota: Command failed. Examine previous error messages to determine cause. [root at cl001 ~]# mmlsfileset home setquotafoo -L Filesets in file system 'home': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 61 3670407 0 Thu Aug 17 10:10:54 2017 0 0 0 On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > On Fri, 4 Aug 2017 01:02:22 -0400 > "J. Eric Wonderley" wrote: > > > 4.2.2.3 > > > > I want to think maybe this started after expanding inode space > > What does 'mmlsfileset home nathanfootest -L' say? > > Ed > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > > > Hey, > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > Cheers, > > > > > > Jamie > > > > > > > > > ----- Original message ----- > > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > To: gpfsug main discussion list > > > Cc: > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > for one of our home filesystem we get: > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > 'Invalid argument'. > > > > > > > > > mmedquota -j home:nathanfootest > > > does work however > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Aug 17 15:20:06 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 17 Aug 2017 14:20:06 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: I?ve just done exactly that and can?t reproduce it in my prod environment. Running 4.2.3-2 though. 
[root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L Filesets in file system 'gpfs': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 251 8408295 0 Thu Aug 17 15:17:18 2017 0 0 0 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of J. Eric Wonderley Sent: 17 August 2017 15:14 To: Edward Wahl Cc: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmsetquota produces error The error is very repeatable... [root at cl001 ~]# mmcrfileset home setquotafoo Fileset setquotafoo created with id 61 root inode 3670407. [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo Fileset setquotafoo linked at /gpfs/home/setquotafoo [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid argument'. mmsetquota: Command failed. Examine previous error messages to determine cause. [root at cl001 ~]# mmlsfileset home setquotafoo -L Filesets in file system 'home': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 61 3670407 0 Thu Aug 17 10:10:54 2017 0 0 0 On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl > wrote: On Fri, 4 Aug 2017 01:02:22 -0400 "J. Eric Wonderley" > wrote: > 4.2.2.3 > > I want to think maybe this started after expanding inode space What does 'mmlsfileset home nathanfootest -L' say? Ed > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > Hey, > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > Cheers, > > > > Jamie > > > > > > ----- Original message ----- > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list > > > Cc: > > Subject: [gpfsug-discuss] mmsetquota produces error > > Date: Wed, Aug 2, 2017 5:03 PM > > > > for one of our home filesystem we get: > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > 'Invalid argument'. > > > > > > mmedquota -j home:nathanfootest > > does work however > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamiedavis at us.ibm.com Thu Aug 17 15:30:19 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Thu, 17 Aug 2017 14:30:19 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: , <20170817094348.37d2f51b@osc.edu> Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Thu Aug 17 15:34:26 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 17 Aug 2017 10:34:26 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: I recently opened a pmr on this issue(24603,442,000)...I'll keep this thread posted on results. On Thu, Aug 17, 2017 at 10:30 AM, James Davis wrote: > I've also tried on our in-house latest release and cannot recreate it. > > I'll ask around to see who's running a 4.2.2 cluster I can look at. 
> > > ----- Original message ----- > From: "Sobey, Richard A" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list , > "Edward Wahl" > Cc: > Subject: Re: [gpfsug-discuss] mmsetquota produces error > Date: Thu, Aug 17, 2017 10:20 AM > > > I?ve just done exactly that and can?t reproduce it in my prod environment. > Running 4.2.3-2 though. > > > > [root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L > > Filesets in file system 'gpfs': > > Name Id RootInode ParentId > Created InodeSpace MaxInodes AllocInodes > Comment > > setquotafoo 251 8408295 0 Thu Aug 17 > 15:17:18 2017 0 0 0 > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] *On Behalf Of *J. Eric Wonderley > *Sent:* 17 August 2017 15:14 > *To:* Edward Wahl > *Cc:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] mmsetquota produces error > > > > The error is very repeatable... > [root at cl001 ~]# mmcrfileset home setquotafoo > Fileset setquotafoo created with id 61 root inode 3670407. > [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo > Fileset setquotafoo linked at /gpfs/home/setquotafoo > [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files > 10M:10M > tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid > argument'. > mmsetquota: Command failed. Examine previous error messages to determine > cause. > [root at cl001 ~]# mmlsfileset home setquotafoo -L > Filesets in file system 'home': > Name Id RootInode ParentId > Created InodeSpace MaxInodes AllocInodes > Comment > setquotafoo 61 3670407 0 Thu Aug 17 > 10:10:54 2017 0 0 0 > > > > On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > > On Fri, 4 Aug 2017 01:02:22 -0400 > "J. Eric Wonderley" wrote: > > > 4.2.2.3 > > > > I want to think maybe this started after expanding inode space > > What does 'mmlsfileset home nathanfootest -L' say? > > Ed > > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > > > Hey, > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > Cheers, > > > > > > Jamie > > > > > > > > > ----- Original message ----- > > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > To: gpfsug main discussion list > > > Cc: > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > for one of our home filesystem we get: > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > 'Invalid argument'. 
> > > > > > > > > mmedquota -j home:nathanfootest > > > does work however > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Aug 17 15:50:27 2017 From: ewahl at osc.edu (Edward Wahl) Date: Thu, 17 Aug 2017 10:50:27 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: <20170817105027.316ce609@osc.edu> We're running 4.2.2.3 (well technically 4.2.2.3 efix21 (1028007) since yesterday" and we use filesets extensively for everything and I cannot reproduce this. I would guess this is somehow an inode issue, but... ?? checked the logs for the FS creation and looked for odd errors? So this fileset is not a stand-alone, Is there anything odd about the mmlsfileset for the root fileset? mmlsfileset gpfs root -L can you create files in the Junction directory? Does the increase in inodes show up? nothing weird from 'mmdf gpfs -m' ? none of your metadata NSDs are offline? Ed On Thu, 17 Aug 2017 10:34:26 -0400 "J. Eric Wonderley" wrote: > I recently opened a pmr on this issue(24603,442,000)...I'll keep this > thread posted on results. > > On Thu, Aug 17, 2017 at 10:30 AM, James Davis wrote: > > > I've also tried on our in-house latest release and cannot recreate it. > > > > I'll ask around to see who's running a 4.2.2 cluster I can look at. > > > > > > ----- Original message ----- > > From: "Sobey, Richard A" > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list , > > "Edward Wahl" > > Cc: > > Subject: Re: [gpfsug-discuss] mmsetquota produces error > > Date: Thu, Aug 17, 2017 10:20 AM > > > > > > I?ve just done exactly that and can?t reproduce it in my prod environment. > > Running 4.2.3-2 though. > > > > > > > > [root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L > > > > Filesets in file system 'gpfs': > > > > Name Id RootInode ParentId > > Created InodeSpace MaxInodes AllocInodes > > Comment > > > > setquotafoo 251 8408295 0 Thu Aug 17 > > 15:17:18 2017 0 0 0 > > > > > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] *On Behalf Of *J. Eric Wonderley > > *Sent:* 17 August 2017 15:14 > > *To:* Edward Wahl > > *Cc:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] mmsetquota produces error > > > > > > > > The error is very repeatable... > > [root at cl001 ~]# mmcrfileset home setquotafoo > > Fileset setquotafoo created with id 61 root inode 3670407. 
> > [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo > > Fileset setquotafoo linked at /gpfs/home/setquotafoo > > [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files > > 10M:10M > > tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid > > argument'. > > mmsetquota: Command failed. Examine previous error messages to determine > > cause. > > [root at cl001 ~]# mmlsfileset home setquotafoo -L > > Filesets in file system 'home': > > Name Id RootInode ParentId > > Created InodeSpace MaxInodes AllocInodes > > Comment > > setquotafoo 61 3670407 0 Thu Aug 17 > > 10:10:54 2017 0 0 0 > > > > > > > > On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > > > > On Fri, 4 Aug 2017 01:02:22 -0400 > > "J. Eric Wonderley" wrote: > > > > > 4.2.2.3 > > > > > > I want to think maybe this started after expanding inode space > > > > What does 'mmlsfileset home nathanfootest -L' say? > > > > Ed > > > > > > > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > > wrote: > > > > > > > Hey, > > > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > > > Cheers, > > > > > > > > Jamie > > > > > > > > > > > > ----- Original message ----- > > > > From: "J. Eric Wonderley" > > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > To: gpfsug main discussion list > > > > Cc: > > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > > > for one of our home filesystem we get: > > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > > 'Invalid argument'. > > > > > > > > > > > > mmedquota -j home:nathanfootest > > > > does work however > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > -- > > > > Ed Wahl > > Ohio Supercomputer Center > > 614-292-9302 > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From alex.chekholko at gmail.com Thu Aug 17 19:11:39 2017 From: alex.chekholko at gmail.com (Alex Chekholko) Date: Thu, 17 Aug 2017 18:11:39 +0000 Subject: [gpfsug-discuss] NSD Server/FS Manager Memory Requirements In-Reply-To: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> References: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> Message-ID: Hi Aaron, What would be the advantage of decreasing the pagepool size? Regards, Alex On Thu, Aug 17, 2017 at 6:12 AM Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > In the world of GPFS 4.2 is there a particular advantage to having a large > amount of memory (e.g. > 64G) allocated to the pagepool on combination NSD > Server/FS manager nodes? We currently have half of physical memory > allocated to pagepool on these nodes. 
> > For some historical context-- we had two indicidents that drove us to > increase our NSD server/FS manager pagepools. One was a weird behavior in > GPFS 3.5 that was causing bouncing FS managers until we bumped the page > pool from a few gigs to about half of the physical memory on the node. The > other was a mass round of parallel mmfsck's of all 20 something of our > filesystems. It came highly recommended to us to increase the pagepool to > something very large for that. > > I'm curious to hear what other folks do and what the recommendations from > IBM folks are. > > Thanks, > Aaron > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Aug 19 02:07:29 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Fri, 18 Aug 2017 21:07:29 -0400 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. Message-ID: <35574.1503104849@turing-police.cc.vt.edu> So for a variety of reasons, we had accumulated some 45 tapes that had found ways to get out of Valid status. I've cleaned up most of them, but I'm stuck on a few corner cases. Case 1: l% tfsee info tapes | sort | grep -C 1 'Not Sup' AV0186JD Valid TS1150(J5) 9022 0 56 vbi_tapes VTC 1148 - AV0187JD Not Supported TS1150(J5) 9022 2179 37 vbi_tapes VTC 1149 - AV0188JD Valid TS1150(J5) 9022 1559 67 vbi_tapes VTC 1150 - -- AV0540JD Valid TS1150(J5) 9022 9022 0 vtti_tapes VTC 1607 - AV0541JD Not Supported TS1150(J5) 9022 1797 6 vtti_tapes VTC 1606 - AV0542JD Valid TS1150(J5) 9022 9022 0 vtti_tapes VTC 1605 - How the heck does *that* happen? And how do you fix it? Case 2: The docs say that for 'Invalid', you need to add it to the pool with -c. % ltfsee pool remove -p arc_tapes -l ISB -t AI0084JD; ltfsee pool add -c -p arc_tapes -l ISB -t AI0084JD GLESL043I(01052): Removing tape AI0084JD from storage pool arc_tapes. GLESL041E(01129): Tape AI0084JD does not exist in storage pool arc_tapes or is in an invalid state. Specify a valid tape ID. GLESL042I(00809): Adding tape AI0084JD to storage pool arc_tapes. (Not sure why the last 2 messages got out of order..) % ltfsee info tapes | grep AI0084JD AI0084JD Invalid LTFS TS1150 0 0 0 - ISB 1262 - What do you do if adding it with -c doesn't work? Time to reformat the tape? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From makaplan at us.ibm.com Sat Aug 19 16:45:48 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 19 Aug 2017 11:45:48 -0400 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. In-Reply-To: <35574.1503104849@turing-police.cc.vt.edu> References: <35574.1503104849@turing-police.cc.vt.edu> Message-ID: I'm kinda curious... I've noticed a few message on this subject -- so I went to the doc.... The doc seems to indicate there are some circumstances where removing the tape with the appropriate command and options and then adding it back will result in the files on the tape becoming available again... But, of course, tapes are not 100% (nothing is), so no guarantee. Perhaps the rigamarole of removing and adding back is compensating for software glitch (bug!) 
-- Logically seems it shouldn't be necessary -- either the tape is readable or not -- the system should be able to do retries and error correction without removing -- but worth a shot. (I'm a gpfs guy, but not an LTFS/EE/tape guy) -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Sat Aug 19 20:05:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Sat, 19 Aug 2017 20:05:05 +0100 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. In-Reply-To: References: <35574.1503104849@turing-police.cc.vt.edu> Message-ID: On 19/08/17 16:45, Marc A Kaplan wrote: > I'm kinda curious... I've noticed a few message on this subject -- so I > went to the doc.... > > The doc seems to indicate there are some circumstances where removing > the tape with the appropriate command and options and then adding it > back will result in the files on the tape becoming available again... > But, of course, tapes are not 100% (nothing is), so no guarantee. > Perhaps the rigamarole of removing and adding back is compensating for > software glitch (bug!) -- Logically seems it shouldn't be necessary -- > either the tape is readable or not -- the system should be able to do > retries and error correction without removing -- but worth a shot. > > (I'm a gpfs guy, but not an LTFS/EE/tape guy) > Well with a TSM based HSM there are all sorts of reasons for a tape being marked "offline". Usually it's because there has been some sort of problem with the tape library in my experience. Say there is a problem with the gripper and the library is unable to get the tape, it will mark it as unavailable. Of course issues with reading data from the tape would be another reasons. Typically beyond a number of errors TSM would mark the tape as bad, which is why you always have a copy pool. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From aaron.s.knister at nasa.gov Sun Aug 20 21:02:36 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sun, 20 Aug 2017 16:02:36 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: <30bfb6ca-3d86-08ab-0eec-06def4a2f6db@nasa.gov> I think it would be a huge advantage to support these mysterious nsdChecksum settings for us non GNR folks. Even if the checksums aren't being stored on disk I would think the ability to protect against network-level corruption would be valuable enough to warrant its support. I've created RFE 109269 to request this. We'll see what IBM says. If this is valuable to other folks then please vote for the RFE. -Aaron On 8/2/17 5:53 PM, Stijn De Weirdt wrote: > hi steve, > >> The nsdChksum settings for none GNR/ESS based system is not officially >> supported. It will perform checksum on data transfer over the network >> only and can be used to help debug data corruption when network is a >> suspect. > i'll take not officially supported over silent bitrot any day. > >> Did any of those "Encountered XYZ checksum errors on network I/O to NSD >> Client disk" warning messages resulted in disk been changed to "down" >> state due to IO error? > no. > > If no disk IO error was reported in GPFS log, >> that means data was retransmitted successfully on retry. > we suspected as much. as sven already asked, mmfsck now reports clean > filesystem. 
> i have an ibdump of 2 involved nsds during the reported checksums, i'll > have a closer look if i can spot these retries. > >> As sven said, only GNR/ESS provids the full end to end data integrity. > so with the silent network error, we have high probabilty that the data > is corrupted. > > we are now looking for a test to find out what adapters are affected. we > hoped that nsdperf with verify=on would tell us, but it doesn't. > >> Steve Y. Xiao >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From evan.koutsandreou at adventone.com Mon Aug 21 04:05:40 2017 From: evan.koutsandreou at adventone.com (Evan Koutsandreou) Date: Mon, 21 Aug 2017 03:05:40 +0000 Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing Message-ID: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> Hi - I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. Thank you From mweil at wustl.edu Mon Aug 21 20:54:27 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 21 Aug 2017 14:54:27 -0500 Subject: [gpfsug-discuss] pmcollector node In-Reply-To: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> References: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> Message-ID: <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> any input on this Thanks On 7/5/17 10:51 AM, Matt Weil wrote: > Hello all, > > Question on the requirements on pmcollector node/s for a 500+ node > cluster. Is there a sizing guide? What specifics should we scale? > CPU Disks memory? > > Thanks > > Matt > From kkr at lbl.gov Mon Aug 21 23:33:36 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 21 Aug 2017 15:33:36 -0700 Subject: [gpfsug-discuss] Hold the Date - Spectrum Scale Day @ HPCXXL (Sept 2017, NYC) In-Reply-To: References: Message-ID: <6EF4187F-D8A1-4927-9E4F-4DF703DA04F5@lbl.gov> If you plan on attending the GPFS Day, please use the HPCXXL registration form (link to Eventbrite registration at the link below). The GPFS day is a free event, but you *must* register so we can make sure there are enough seats and food available. If you would like to speak or suggest a topic, please let me know. http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ The agenda is still being worked on, here are some likely topics: --RoadMap/Updates --"New features - New Bugs? (Julich) --GPFS + Openstack (CSCS) --ORNL Update on Spider3-related GPFS work --ANL Site Update --File Corruption Session Best, Kristy > On Aug 8, 2017, at 11:33 AM, Kristy Kallback-Rose wrote: > > Hello, > > The GPFS Day of the HPCXXL conference is confirmed for Thursday, September 28th. Here is an updated URL, the agenda and registration are still being put together http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ . 
The GPFS Day will require registration, so we can make sure there is enough room (and coffee/food) for all attendees ?however, there will be no registration fee if you attend the GPFS Day only. > > I?ll send another update when the agenda is closer to settled. > > Cheers, > Kristy > >> On Jul 7, 2017, at 3:32 PM, Kristy Kallback-Rose > wrote: >> >> Hello, >> >> More details will be provided as they become available, but just so you can make a placeholder on your calendar, there will be a Spectrum Scale Day the week of September 25th - 29th, likely Thursday, September 28th. >> >> This will be a part of the larger HPCXXL meeting (https://www.spxxl.org/?q=New-York-City-2017 ). You may recall this group was formerly called SPXXL and the website is in the process of transitioning to the new name (and getting a new certificate). You will be able to attend *just* the Spectrum Scale day if that is the only portion of the event you would like to attend. >> >> The NYC location is great for Spectrum Scale events because many IBMers, including developers, can come in from Poughkeepsie. >> >> More as we get closer to the date and details are settled. >> >> Cheers, >> Kristy > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Aug 22 04:03:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 21 Aug 2017 23:03:35 -0400 Subject: [gpfsug-discuss] multicluster security Message-ID: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> Hi Everyone, I have a theoretical question about GPFS multiclusters and security. Let's say I have clusters A and B. Cluster A is exporting a filesystem as read-only to cluster B. Where does the authorization burden lay? Meaning, does the security rely on mmfsd in cluster B to behave itself and enforce the conditions of the multi-cluster export? Could someone using the credentials on a compromised node in cluster B just start sending arbitrary nsd read/write commands to the nsds from cluster A (or something along those lines)? Do the NSD servers in cluster A do any sort of sanity or security checking on the I/O requests coming from cluster B to the NSDs they're serving to exported filesystems? I imagine any enforcement would go out the window with shared disks in a multi-cluster environment since a compromised node could just "dd" over the LUNs. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From kkr at lbl.gov Tue Aug 22 05:52:58 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 21 Aug 2017 21:52:58 -0700 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Message-ID: Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From NSCHULD at de.ibm.com Tue Aug 22 08:44:28 2017 From: NSCHULD at de.ibm.com (Norbert Schuld) Date: Tue, 22 Aug 2017 09:44:28 +0200 Subject: [gpfsug-discuss] pmcollector node In-Reply-To: <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> References: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> Message-ID: Above ~100 nodes the answer is "it depends" but memory is certainly the main factor. Important parts for the estimation are the number of nodes, filesystems, NSDs, NFS & SMB shares and the frequency (aka period) with which measurements are made. For a lot of sensors today the default is 1/sec which is quite high. Depending on your needs 1/ 10 sec might do or even 1/min. With just guessing on some numbers I end up with ~24-32 GB RAM needed in total and about the same number for disk space. If you want HA double the number, then divide by the number of collector nodes used in the federation setup. Place the collectors on nodes which do not play an additional important part in your cluster, then CPU should not be an issue. Mit freundlichen Gr??en / Kind regards Norbert Schuld From: Matt Weil To: gpfsug-discuss at spectrumscale.org Date: 21/08/2017 21:54 Subject: Re: [gpfsug-discuss] pmcollector node Sent by: gpfsug-discuss-bounces at spectrumscale.org any input on this Thanks On 7/5/17 10:51 AM, Matt Weil wrote: > Hello all, > > Question on the requirements on pmcollector node/s for a 500+ node > cluster. Is there a sizing guide? What specifics should we scale? > CPU Disks memory? > > Thanks > > Matt > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Jochen.Zeller at sva.de Tue Aug 22 12:09:31 2017 From: Jochen.Zeller at sva.de (Zeller, Jochen) Date: Tue, 22 Aug 2017 11:09:31 +0000 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss Message-ID: Dear community, this morning I started in a good mood, until I've checked my mailbox. Again a reported bug in Spectrum Scale that could lead to data loss. During the last year I was looking for a stable Scale version, and each time I've thought: "Yes, this one is stable and without serious data loss bugs" - a few day later, IBM announced a new APAR with possible data loss for this version. I am supporting many clients in central Europe. They store databases, backup data, life science data, video data, results of technical computing, do HPC on the file systems, etc. Some of them had to change their Scale version nearly monthly during the last year to prevent running in one of the serious data loss bugs in Scale. From my perspective, it was and is a shame to inform clients about new reported bugs right after the last update. From client perspective, it was and is a lot of work and planning to do to get a new downtime for updates. And their internal customers are not satisfied with those many downtimes of the clusters and applications. For me, it seems that Scale development is working on features for a specific project or client, to achieve special requirements. But they forgot the existing clients, using Scale for storing important data or running important workloads on it. 
To make us more visible, I've used the IBM recommended way to notify about mandatory enhancements, the less favored RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334 If you like, vote for more reliability in Scale. I hope this a good way to show development and responsible persons that we have trouble and are not satisfied with the quality of the releases. Regards, Jochen -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From stockf at us.ibm.com Tue Aug 22 13:31:52 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 22 Aug 2017 08:31:52 -0400 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata"bug In-Reply-To: References: Message-ID: My understanding is that the problem is not with the policy engine scanning but with the commands that move data, for example mmrestripefs. So if you are using the policy engine for other purposes you are not impacted by the problem. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Kristy Kallback-Rose To: gpfsug main discussion list Date: 08/22/2017 12:53 AM Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Sent by: gpfsug-discuss-bounces at spectrumscale.org Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=L6hGADgajb-s1ezkPaD4wQhytCTKnUBGorgQEbmlEzk&s=nDmkF6EvhbMgktl3Oks3UkCb-2-cwR1QLEpOi6qeea4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From sxiao at us.ibm.com Tue Aug 22 14:51:25 2017 From: sxiao at us.ibm.com (Steve Xiao) Date: Tue, 22 Aug 2017 09:51:25 -0400 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" In-Reply-To: References: Message-ID: ILM policy engine scans of metadata is safe and will not trigger the problem. Steve Y. Xiao -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Tue Aug 22 15:06:00 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 22 Aug 2017 14:06:00 +0000 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug In-Reply-To: References: Message-ID: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> Can anyone tell us when a normal PTF release (4.2.3-4 ??) will be made available that will fix this issue? Trying to decide if I should roll an e-fix or just wait for a normal release, thanks! 
-Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Monday, August 21, 2017 11:53 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Note: External Email ________________________________ Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Tue Aug 22 15:27:36 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Tue, 22 Aug 2017 16:27:36 +0200 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata"bug In-Reply-To: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> References: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: From aaron.knister at gmail.com Tue Aug 22 15:37:06 2017 From: aaron.knister at gmail.com (Aaron Knister) Date: Tue, 22 Aug 2017 10:37:06 -0400 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: Hi Jochen, I share your concern about data loss bugs and I too have found it troubling especially since the 4.2 stream is in my immediate future (although I would have rather stayed on 4.1 due to my perception of stability/integrity issues in 4.2). By and large 4.1 has been *extremely* stable for me. While not directly related to the stability concerns, I'm curious as to why your customer sites are requiring downtime to do the upgrades? While, of course, individual servers need to be taken offline to update GPFS the collective should be able to stay up. Perhaps your customer environments just don't lend themselves to that. It occurs to me that some of these bugs sound serious (and indeed I believe this one is) I recently found myself jumping prematurely into an update for the metanode filesize corruption bug that as it turns out that while very scary sounding is not necessarily a particularly common bug (if I understand correctly). 
Perhaps it would be helpful if IBM could clarify the believed risk of these updates or give us some indication if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild". I could imagine IBM legal wanting to avoid a situation where IBM indicates something is low risk but someone hits it and it eats data. Although many companies do this with security patches so perhaps it's a non-issue. >From my perspective I don't think existing customers are being "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt to an ever-changing world and I think these features are necessary and useful. Perhaps Scale would benefit from more resources being dedicated to QA/Testing which isn't a particularly sexy thing-- it doesn't result in any new shiny features for customers (although "not eating your data" is a feature I find really attractive). Anyway, I hope IBM can find a way to minimize the frequency of these bugs. Personally speaking, I'm pretty convinced, it's not for lack of capability or dedication on the part of the great folks actually writing the code. -Aaron On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen wrote: > Dear community, > > this morning I started in a good mood, until I?ve checked my mailbox. > Again a reported bug in Spectrum Scale that could lead to data loss. During > the last year I was looking for a stable Scale version, and each time I?ve > thought: ?Yes, this one is stable and without serious data loss bugs? - a > few day later, IBM announced a new APAR with possible data loss for this > version. > > I am supporting many clients in central Europe. They store databases, > backup data, life science data, video data, results of technical computing, > do HPC on the file systems, etc. Some of them had to change their Scale > version nearly monthly during the last year to prevent running in one of > the serious data loss bugs in Scale. From my perspective, it was and is a > shame to inform clients about new reported bugs right after the last > update. From client perspective, it was and is a lot of work and planning > to do to get a new downtime for updates. And their internal customers are > not satisfied with those many downtimes of the clusters and applications. > > For me, it seems that Scale development is working on features for a > specific project or client, to achieve special requirements. But they > forgot the existing clients, using Scale for storing important data or > running important workloads on it. > > To make us more visible, I?ve used the IBM recommended way to notify about > mandatory enhancements, the less favored RFE: > > > *http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334* > > > If you like, vote for more reliability in Scale. > > I hope this a good way to show development and responsible persons that we > have trouble and are not satisfied with the quality of the releases. > > > Regards, > > Jochen > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kkr at lbl.gov Tue Aug 22 16:24:46 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Tue, 22 Aug 2017 08:24:46 -0700 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" In-Reply-To: References: Message-ID: Thanks we just wanted to confirm given the use of the word "scanning" in describing the trigger. On Aug 22, 2017 6:51 AM, "Steve Xiao" wrote: > ILM policy engine scans of metadatais safe and will not trigger the > problem. > > > Steve Y. Xiao > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Aug 22 17:45:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 22 Aug 2017 12:45:00 -0400 Subject: [gpfsug-discuss] Fwd: FLASH: IBM Spectrum Scale (GPFS): RDMA-enabled network adapter failure on the NSD server may result in file IO error (2017.06.30) In-Reply-To: References: <487469581.449569.1498832342497.JavaMail.webinst@w30112> <2689cf86-eca2-dab6-c6aa-7fc54d923e55@nasa.gov> Message-ID: <3b16ad01-4d83-8106-f2e2-110364f31566@nasa.gov> (I'm slowly catching up on a backlog of e-mail, sorry for the delayed reply). Thanks, Sven. I recognize the complexity and appreciate your explanation. In my mind I had envisioned either the block integrity information being stored as a new metadata structure or stored leveraging T10-DIX/DIF (perhaps configurable on a per-pool basis) to pass the checksums down to the RAID controller. I would quite like to run GNR as software on generic hardware and in fact voted, along with 26 other customers, on an RFE (https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090) requesting this but the request was declined. I think customers spoke pretty loudly there and IBM gave it the kibosh. -Aaron On 06/30/2017 02:25 PM, Sven Oehme wrote: > > end-to-end data integrity is very important and the reason it hasn't > been done in Scale is not because its not important, its because its > very hard to do without impacting performance in a very dramatic way. > > imagine your raid controller blocksize is 1mb and your filesystem > blocksize is 1MB . if your application does a 1 MB write this ends up > being a perfect full block , full track de-stage to your raid layer > and everything works fine and fast. as soon as you add checksum > support you need to add data somehow into this, means your 1MB is no > longer 1 MB but 1 MB+checksum. > > to store this additional data you have multiple options, inline , > outside the data block or some combination ,the net is either you need > to do more physical i/o's to different places to get both the data and > the corresponding checksum or your per block on disc structure becomes > bigger than than what your application reads/or writes, both put > massive burden on the Storage layer as e.g. a 1 MB write will now, > even the blocks are all aligned from the application down to the raid > layer, cause a read/modify/write on the raid layer as the data is > bigger than the physical track size. > > so to get end-to-end checksum in Scale outside of ESS the best way is > to get GNR as SW to run on generic HW, this is what people should vote > for as RFE if they need that functionality. beside end-to-end > checksums you get read/write cache and acceleration , fast rebuild and > many other goodies as a added bonus. 
> > Sven > > > On Fri, Jun 30, 2017 at 10:53 AM Aaron Knister > > wrote: > > In fact the answer was quite literally "no": > > https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=84523 > (the RFE was declined and the answer was that the "function is already > available in GNR environments"). > > Regarding GNR, see this RFE request > https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090 > requesting the use of GNR outside of an ESS/GSS environment. It's > interesting to note this is the highest voted Public RFE for GPFS > that I > can see, at least. It too was declined. > > -Aaron > > On 6/30/17 1:41 PM, Aaron Knister wrote: > > Thanks Olaf, that's good to know (and is kind of what I > suspected). I've > > requested a number of times this capability for those of us who > can't > > use or aren't using GNR and the answer is effectively "no". This > > response is curious to me because I'm sure IBM doesn't believe > that data > > integrity is only important and of value to customers who > purchase their > > hardware *and* software. > > > > -Aaron > > > > On Fri, Jun 30, 2017 at 1:37 PM, Olaf Weiser > > > >> > wrote: > > > > yes.. in case of GNR (GPFS native raid) .. we do end-to-end > > check-summing ... client --> server --> downToDisk > > GNR writes down a chksum to disk (to all pdisks /all "raid" > segments > > ) so that dropped writes can be detected as well as miss-done > > writes (bit flips..) > > > > > > > > From: Aaron Knister > > >> > > To: gpfsug main discussion list > > > >> > > Date: 06/30/2017 07:15 PM > > Subject: [gpfsug-discuss] Fwd: FLASH: IBM Spectrum Scale (GPFS): > > RDMA-enabled network adapter failure on the NSD server may > result in > > file IO error (2017.06.30) > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > ------------------------------------------------------------------------ > > > > > > > > I'm curious to know why this doesn't affect GSS/ESS? Is it a > feature of > > the additional check-summing done on those platforms? > > > > > > -------- Forwarded Message -------- > > Subject: FLASH: IBM Spectrum Scale (GPFS): RDMA-enabled network > > adapter > > failure on the NSD server may result in file IO error > (2017.06.30) > > Date: Fri, 30 Jun 2017 14:19:02 +0000 > > From: IBM My Notifications > > > >> > > To: aaron.s.knister at nasa.gov > > > > > > > > > > > > My Notifications for Storage - 30 Jun 2017 > > > > Dear Subscriber (aaron.s.knister at nasa.gov > > > >), > > > > Here are your updates from IBM My Notifications. > > > > Your support Notifications display in English by default. > Machine > > translation based on your IBM profile > > language setting is added if you specify this option in My > defaults > > within My Notifications. > > (Note: Not all languages are available at this time, and the > English > > version always takes precedence > > over the machine translated version.) > > > > > ------------------------------------------------------------------------------ > > 1. 
IBM Spectrum Scale > > > > - TITLE: IBM Spectrum Scale (GPFS): RDMA-enabled network adapter > > failure > > on the NSD server may result in file IO error > > - URL: > > > http://www.ibm.com/support/docview.wss?uid=ssg1S1010233&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E > > > > > - ABSTRACT: IBM has identified an issue with all IBM GPFS > and IBM > > Spectrum Scale versions where the NSD server is enabled to > use RDMA for > > file IO and the storage used in your GPFS cluster accessed > via NSD > > servers (not fully SAN accessible) includes anything other > than IBM > > Elastic Storage Server (ESS) or GPFS Storage Server (GSS); > under these > > conditions, when the RDMA-enabled network adapter fails, the > issue may > > result in undetected data corruption for file write or read > operations. > > > > > ------------------------------------------------------------------------------ > > Manage your My Notifications subscriptions, or send > questions and > > comments. > > - Subscribe or Unsubscribe - > > https://www.ibm.com/support/mynotifications > > > > - Feedback - > > > https://www-01.ibm.com/support/feedback/techFeedbackCardContentMyNotifications.html > > > > > > > - Follow us on Twitter - https://twitter.com/IBMStorageSupt > > > > > > > > > > To ensure proper delivery please add > mynotify at stg.events.ihost.com > > > to > > your address book. > > You received this email because you are subscribed to IBM My > > Notifications as: > > aaron.s.knister at nasa.gov > > > > > > Please do not reply to this message as it is generated by an > automated > > service machine. > > > > (C) International Business Machines Corporation 2017. All rights > > reserved. > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Aug 23 05:40:19 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 23 Aug 2017 00:40:19 -0400 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: Aaron, IBM's policy is to issue a flash when such data corruption/loss problem has been identified, even if the problem has never been encountered by any customer. In fact, most of the flashes have been the result of internal test activity, even though the discovery took place after the affected versions/PTFs have already been released. 
This is the case of two of the recent flashes: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293 http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 The flashes normally do not indicate the risk level that a given problem has of being hit, since there are just too many variables at play, given that clusters and workloads vary significantly. The first issue above appears to be uncommon (and potentially rare). The second issue seems to have a higher probability of occurring -- and as described in the flash, the problem is triggered by failures being encountered while running one of the commands listed in the "Users Affected" section of the writeup. I don't think precise recommendations could be given on if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild" since different clusters, configuration, or workload may drastically affect the the likelihood of hitting the problem. On the other hand, when coming up with the text for the flash, the team attempts to provide as much information as possible/available on the known triggers and mitigation circumstances. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Aaron Knister To: gpfsug main discussion list Date: 08/22/2017 10:37 AM Subject: Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Jochen, I share your concern about data loss bugs and I too have found it troubling especially since the 4.2 stream is in my immediate future (although I would have rather stayed on 4.1 due to my perception of stability/integrity issues in 4.2). By and large 4.1 has been *extremely* stable for me. While not directly related to the stability concerns, I'm curious as to why your customer sites are requiring downtime to do the upgrades? While, of course, individual servers need to be taken offline to update GPFS the collective should be able to stay up. Perhaps your customer environments just don't lend themselves to that. It occurs to me that some of these bugs sound serious (and indeed I believe this one is) I recently found myself jumping prematurely into an update for the metanode filesize corruption bug that as it turns out that while very scary sounding is not necessarily a particularly common bug (if I understand correctly). Perhaps it would be helpful if IBM could clarify the believed risk of these updates or give us some indication if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild". I could imagine IBM legal wanting to avoid a situation where IBM indicates something is low risk but someone hits it and it eats data. Although many companies do this with security patches so perhaps it's a non-issue. From my perspective I don't think existing customers are being "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt to an ever-changing world and I think these features are necessary and useful. Perhaps Scale would benefit from more resources being dedicated to QA/Testing which isn't a particularly sexy thing-- it doesn't result in any new shiny features for customers (although "not eating your data" is a feature I find really attractive). Anyway, I hope IBM can find a way to minimize the frequency of these bugs. 
Personally speaking, I'm pretty convinced, it's not for lack of capability or dedication on the part of the great folks actually writing the code. -Aaron On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen wrote: Dear community, this morning I started in a good mood, until I?ve checked my mailbox. Again a reported bug in Spectrum Scale that could lead to data loss. During the last year I was looking for a stable Scale version, and each time I?ve thought: ?Yes, this one is stable and without serious data loss bugs? - a few day later, IBM announced a new APAR with possible data loss for this version. I am supporting many clients in central Europe. They store databases, backup data, life science data, video data, results of technical computing, do HPC on the file systems, etc. Some of them had to change their Scale version nearly monthly during the last year to prevent running in one of the serious data loss bugs in Scale. From my perspective, it was and is a shame to inform clients about new reported bugs right after the last update. From client perspective, it was and is a lot of work and planning to do to get a new downtime for updates. And their internal customers are not satisfied with those many downtimes of the clusters and applications. For me, it seems that Scale development is working on features for a specific project or client, to achieve special requirements. But they forgot the existing clients, using Scale for storing important data or running important workloads on it. To make us more visible, I?ve used the IBM recommended way to notify about mandatory enhancements, the less favored RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334 If you like, vote for more reliability in Scale. I hope this a good way to show development and responsible persons that we have trouble and are not satisfied with the quality of the releases. Regards, Jochen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=Nh-z-CGPni6b-k9jTdJfWNw6-jtvc8OJgjogfIyp498&s=Vsf2AaMf7b7F6Qv3lGZ9-xBciF9gdfuqnb206aVG-Go&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 23 11:11:37 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 10:11:37 +0000 Subject: [gpfsug-discuss] AFM weirdness Message-ID: We're using an AFM cache from our HPC nodes to access data in another GPFS cluster, mostly this seems to be working fine, but we've just come across an interesting problem with a user using gfortran from the GCC 5.2.0 toolset. When linking their code, they get a "no space left on device" error back from the linker. If we do this on a node that mounts the file-system directly (I.e. Not via AFM cache), then it works fine. We tried with GCC 4.5 based tools and it works OK, but the difference there is that 4.x uses ld and 5x uses ld.gold. 
If we strike the ld.gold when using AFM, we see: stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 unlink("program") = 0 open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on device) Vs when running directly on the file-system: stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 unlink("program") = 0 open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 fallocate(30, 0, 0, 248480) = 0 Anyone seen anything like this before? ... Actually I'm about to go off and see if its a function of AFM, or maybe something to do with the FS in use (I.e. Make a local directory on the filesystem on the "AFM" FS and see if that works ...) Thanks Simon From S.J.Thompson at bham.ac.uk Wed Aug 23 11:17:58 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 10:17:58 +0000 Subject: [gpfsug-discuss] AFM weirdness Message-ID: OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From vpuvvada at in.ibm.com Wed Aug 23 13:36:33 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Wed, 23 Aug 2017 18:06:33 +0530 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. 
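A quick way to confirm this outside the linker is to drive the preallocation call directly (illustrative only -- this assumes util-linux's fallocate(1) utility is installed, and the two paths are placeholders for a directory inside an AFM cache fileset and a non-AFM directory on the same file system):

cd /path/inside/afm/cache/fileset
fallocate -l 248480 testfile     # expected to fail with "No space left on device", matching the fallocate() failure in the strace output
cd /path/on/non-afm/directory
fallocate -l 248480 testfile     # expected to succeed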
~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 23 14:01:55 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 13:01:55 +0000 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: I've got a PMR open about this ... Will email you the number directly. Looking at the man page for ld.gold, it looks to set '--posix-fallocate' by default. In fact, testing with '-Xlinker -no-posix-fallocate' does indeed make the code compile. 
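For anyone else who hits this, the workaround amounts to passing that gold option through the compiler driver at link time, along these lines (illustrative -- the object and output names are placeholders, and the flag is exactly the one tested above):

gfortran -o program program.o -Xlinker -no-posix-fallocate

or, for build systems that pick up LDFLAGS:

export LDFLAGS="-Xlinker -no-posix-fallocate"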
Simon From: "vpuvvada at in.ibm.com" > Date: Wednesday, 23 August 2017 at 13:36 To: "gpfsug-discuss at spectrumscale.org" >, Simon Thompson > Subject: Re: [gpfsug-discuss] AFM weirdness I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: gpfsug main discussion list > Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" on behalf of S.J.Thompson at bham.ac.uk> wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Aug 24 13:56:49 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 24 Aug 2017 08:56:49 -0400 Subject: [gpfsug-discuss] Again! 
Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: <12c154d2-8095-408e-ac7e-e654b1448a25@nasa.gov> Thanks Felipe, and everything you said makes sense and I think holds true to my experiences concerning different workloads affecting likelihood of hitting various problems (especially being one of only a handful of sites that hit that 301 SGpanic error from several years back). Perhaps language as subtle as "internal testing revealed" vs "based on reports from customer sites" could be used? But then again I imagine you could encounter a case where you discover something in testing that a customer site subsequently experiences which might limit the usefulness of the wording. I still think it's useful to know if an issue has been exacerbated or triggered by in the wild workloads vs what I imagine to be quite rigorous lab testing perhaps deigned to shake out certain bugs. -Aaron On 8/23/17 12:40 AM, Felipe Knop wrote: > Aaron, > > IBM's policy is to issue a flash when such data corruption/loss > problem has been identified, even if the problem has never been > encountered by any customer. In fact, most of the flashes have been > the result of internal test activity, even though the discovery took > place after the affected versions/PTFs have already been released. > ?This is the case of two of the recent flashes: > > http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293 > > http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 > > The flashes normally do not indicate the risk level that a given > problem has of being hit, since there are just too many variables at > play, given that clusters and workloads vary significantly. > > The first issue above appears to be uncommon (and potentially rare). > ?The second issue seems to have a higher probability of occurring -- > and as described in the flash, the problem is triggered by failures > being encountered while running one of the commands listed in the > "Users Affected" section of the writeup. > > I don't think precise recommendations could be given on > > ?if the bugs fall in the category of "drop everything and patch *now*" > or "this is a theoretically nasty bug but we've yet to see it in the wild" > > since different clusters, configuration, or workload may drastically > affect the the likelihood of hitting the problem. ?On the other hand, > when coming up with the text for the flash, the team attempts to > provide as much information as possible/available on the known > triggers and mitigation circumstances. > > ? Felipe > > ---- > Felipe Knop ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 ?T/L 293-9314 > > > > > > From: ? ? ? ?Aaron Knister > To: ? ? ? ?gpfsug main discussion list > Date: ? ? ? ?08/22/2017 10:37 AM > Subject: ? ? ? ?Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale > could lead to data loss > Sent by: ? ? ? ?gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi Jochen, > > I share your concern about data loss bugs and I too have found it > troubling especially since the 4.2 stream is in my immediate future > (although I would have rather stayed on 4.1 due to my perception of > stability/integrity issues in 4.2). By and large 4.1 has been > *extremely* stable for me. 
> > While not directly related to the stability concerns, I'm curious as > to why your customer sites are requiring downtime to do the upgrades? > While, of course, individual servers need to be taken offline to > update GPFS the collective should be able to stay up. Perhaps your > customer environments just don't lend themselves to that.? > > It occurs to me that some of these bugs sound serious (and indeed I > believe this one is) I recently found myself jumping prematurely into > an update for the metanode filesize corruption bug that as it turns > out that while very scary sounding is not necessarily a particularly > common bug (if I understand correctly). Perhaps it would be helpful if > IBM could clarify the believed risk of these updates or give us some > indication if the bugs fall in the category of "drop everything and > patch *now*" or "this is a theoretically nasty bug but we've yet to > see it in the wild". I could imagine IBM legal wanting to avoid a > situation where IBM indicates something is low risk but someone hits > it and it eats data. Although many companies do this with security > patches so perhaps it's a non-issue. > > From my perspective I don't think existing customers are being > "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt > to an ever-changing world and I think these features are necessary and > useful. Perhaps Scale would benefit from more resources being > dedicated to QA/Testing which isn't a particularly sexy thing-- it > doesn't result in any new shiny features for customers (although "not > eating your data" is a feature I find really attractive). > > Anyway, I hope IBM can find a way to minimize the frequency of these > bugs. Personally speaking, I'm pretty convinced, it's not for lack of > capability or dedication on the part of the great folks actually > writing the code. > > -Aaron > > On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen > <_Jochen.Zeller at sva.de_ > wrote: > Dear community, > ? > this morning I started in a good mood, until I?ve checked my mailbox. > Again a reported bug in Spectrum Scale that could lead to data loss. > During the last year I was looking for a stable Scale version, and > each time I?ve thought: ?Yes, this one is stable and without serious > data loss bugs? - a few day later, IBM announced a new APAR with > possible data loss for this version. > ? > I am supporting many clients in central Europe. They store databases, > backup data, life science data, video data, results of technical > computing, do HPC on the file systems, etc. Some of them had to change > their Scale version nearly monthly during the last year to prevent > running in one of the serious data loss bugs in Scale. From my > perspective, it was and is a shame to inform clients about new > reported bugs right after the last update. From client perspective, it > was and is a lot of work and planning to do to get a new downtime for > updates. And their internal customers are not satisfied with those > many downtimes of the clusters and applications. > ? > For me, it seems that Scale development is working on features for a > specific project or client, to achieve special requirements. But they > forgot the existing clients, using Scale for storing important data or > running important workloads on it. > ? > To make us more visible, I?ve used the IBM recommended way to notify > about mandatory enhancements, the less favored RFE: > ? > _http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334_ > ? 
> If you like, vote for more reliability in Scale. > ? > I hope this a good way to show development and responsible persons > that we have trouble and are not satisfied with the quality of the > releases. > ? > ? > Regards, > ? > Jochen > ? > ? > ? > ? > ? > ? > ? > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at _spectrumscale.org_ > _ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=Nh-z-CGPni6b-k9jTdJfWNw6-jtvc8OJgjogfIyp498&s=Vsf2AaMf7b7F6Qv3lGZ9-xBciF9gdfuqnb206aVG-Go&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Aug 25 08:44:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Fri, 25 Aug 2017 07:44:35 +0000 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: So as Venkat says, AFM doesn't support using fallocate() to preallocate space. So why aren't other people seeing this ... Well ... We use EasyBuild to build our HPC cluster software including the compiler tool chains. This enables the new linker ld.gold by default rather than the "old" ld. Interestingly we don't seem to have seen this with C code being compiled, only fortran. We can work around it by using the options to gfortran I mention below. There is a mention to this limitation at: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_afmlimitations.htm We aren;t directly calling gpfs_prealloc, but I guess the linker is indirectly calling it by making a call to posix_fallocate. I do have a new problem with AFM where the data written to the cache differs from that replicated back to home... I'm beginning to think I don't like the decision to use AFM! Given the data written back to HOME is corrupt, I think this is definitely PMR time. But ... If you have Abaqus on you system and are using AFM, I'd be interested to see if someone else sees the same issue as us! Simon From: > on behalf of Simon Thompson > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 23 August 2017 at 14:01 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] AFM weirdness I've got a PMR open about this ... Will email you the number directly. Looking at the man page for ld.gold, it looks to set '--posix-fallocate' by default. In fact, testing with '-Xlinker -no-posix-fallocate' does indeed make the code compile. Simon From: "vpuvvada at in.ibm.com" > Date: Wednesday, 23 August 2017 at 13:36 To: "gpfsug-discuss at spectrumscale.org" >, Simon Thompson > Subject: Re: [gpfsug-discuss] AFM weirdness I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. 
~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: gpfsug main discussion list > Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" on behalf of S.J.Thompson at bham.ac.uk> wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From kums at us.ibm.com Fri Aug 25 22:36:39 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 25 Aug 2017 17:36:39 -0400 Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing In-Reply-To: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> References: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> Message-ID: Hi, >>I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? >> I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. 
Please ensure that all the recommended FPO settings (e.g. allowWriteAffinity=yes in the FPO storage pool, readReplicaPolicy=local, restripeOnDiskFailure=yes) are set properly. Please find the FPO Best practices/tunings, in the links below: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Big%20Data%20Best%20practices https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ab5c2792-feef-4a3a-a21b-d22c6f5d728a/attachment/80d5c300-7b39-4d6e-9596-84934fcc4638/media/Deploying_a_big_data_solution_using_IBM_Spectrum_Scale_v1.7.5.pdf >> For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). >> Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. With FPO, GPFS metadata (-m) and data replication (-r) needs to be enabled. The Write-affinity-Depth (WAD) setting defines the policy for directing writes. It indicates that the node writing the data directs the write to disks on its own node for the first copy and to the disks on other nodes for the second and third copies (if specified). readReplicaPolicy=local will enable the policy to read replicas from local disks. At the minimum, ensure that the networking used for GPFS is sized properly and has bandwidth 2X or 3X that of the local disk speeds to ensure FPO write bandwidth is not being constrained by GPFS replication over the network. For example, if 24 x Drives in RAID-0 results in ~4.8 GB/s (assuming ~200MB/s per drive) and GPFS metadata/data replication is set to 3 (-m 3 -r 3) then for optimal FPO write bandwidth, we need to ensure the network-interconnect between the FPO nodes is non-blocking/high-speed and can sustain ~14.4 GB/s ( data_replication_factor * local_storage_bandwidth). One possibility, is minimum of 2 x EDR Infiniband (configure GPFS verbsRdma/verbsPorts) or bonded 40GigE between the FPO nodes (for GPFS daemon-to-daemon communication). Application reads requiring FPO reads from remote GPFS node would as well benefit from high-speed network-interconnect between the FPO nodes. Regards, -Kums From: Evan Koutsandreou To: "gpfsug-discuss at spectrumscale.org" Date: 08/20/2017 11:06 PM Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi - I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. Thank you _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
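To make the sizing arithmetic above explicit, here is a small worked example (the ~200 MB/s per-drive figure and the -m 3 -r 3 replication factor are the assumptions stated in the reply above):

# local JBOD bandwidth per node : 24 drives x ~200 MB/s  = ~4.8 GB/s
# network needed per node       : 3 replicas x ~4.8 GB/s = ~14.4 GB/s
awk 'BEGIN { drives=24; mbps=200; repl=3; printf "~%.1f GB/s per node\n", repl*drives*mbps/1000 }'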
URL: From scale at us.ibm.com Fri Aug 25 23:41:53 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 25 Aug 2017 18:41:53 -0400 Subject: [gpfsug-discuss] multicluster security In-Reply-To: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> References: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> Message-ID: Hi Aaron, If cluster A uses the mmauth command to grant a file system read-only access to a remote cluster B, nodes on cluster B can only mount that file system with read-only access. But the only checking being done at the RPC level is the TLS authentication. This should prevent non-root users from initiating RPCs, since TLS authentication requires access to the local cluster's private key. However, a root user on cluster B, having access to cluster B's private key, might be able to craft RPCs that may allow one to work around the checks which are implemented at the file system level. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 08/21/2017 11:04 PM Subject: [gpfsug-discuss] multicluster security Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Everyone, I have a theoretical question about GPFS multiclusters and security. Let's say I have clusters A and B. Cluster A is exporting a filesystem as read-only to cluster B. Where does the authorization burden lay? Meaning, does the security rely on mmfsd in cluster B to behave itself and enforce the conditions of the multi-cluster export? Could someone using the credentials on a compromised node in cluster B just start sending arbitrary nsd read/write commands to the nsds from cluster A (or something along those lines)? Do the NSD servers in cluster A do any sort of sanity or security checking on the I/O requests coming from cluster B to the NSDs they're serving to exported filesystems? I imagine any enforcement would go out the window with shared disks in a multi-cluster environment since a compromised node could just "dd" over the LUNs. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=oK_bEPbjuD7j6qLTHbe7HM4ujUlpcNYtX3tMW2QC7_w&s=BliMQ0pToLIIiO1jfyUp2Q3icewcONrcmHpsIj_hMtY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sat Aug 26 20:39:58 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sat, 26 Aug 2017 19:39:58 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Message-ID: Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Sun Aug 27 01:35:06 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Sat, 26 Aug 2017 20:35:06 -0400 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sun Aug 27 14:32:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sun, 27 Aug 2017 13:32:20 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> Fred / All, Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? Kevin On Aug 26, 2017, at 7:35 PM, Frederick Stock > wrote: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
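For anyone juggling PTFs and efixes, a quick way to confirm the level a node is actually running before and after an update is to check on each node (illustrative):

mmdiag --version
rpm -qa | grep gpfs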
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kums at us.ibm.com Sun Aug 27 23:07:17 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Sun, 27 Aug 2017 18:07:17 -0400 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> References: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> Message-ID: Hi Kevin, >> Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? I presume, by "mmrestripefs data loss bug" you are referring to APAR IV98609 (link below)? If yes, 4.2.3.4 contains the fix for APAR IV98609. http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 Problems fixed in GPFS 4.2.3.4 (details in link below): https://www.ibm.com/developerworks/community/forums/html/topic?id=f3705faa-b6aa-415c-a3e6-1fe9d8293db1&ps=25 * This update addresses the following APARs: IV98545 IV98609 IV98640 IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 IV99059 IV99060 IV99062 IV99063. Regards, -Kums From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/27/2017 09:32 AM Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Fred / All, Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? Kevin On Aug 26, 2017, at 7:35 PM, Frederick Stock wrote: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=McIf98wfiVqHU8ZygezLrQ&m=0rUCqrbJ4Ny44Rmr8x8HvX5q4yqS-4tkN02fiIm9ttg&s=FYfr0P3sVBhnGGsj33W-A9JoDj7X300yTt5D4y5rpJY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Aug 28 13:26:35 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 28 Aug 2017 08:26:35 -0400 Subject: [gpfsug-discuss] sas avago/lsi hba reseller recommendation Message-ID: We have several avago/lsi 9305-16e that I believe came from Advanced HPC. Can someone recommend a another reseller of these hbas or a contact with Advance HPC? -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Aug 28 13:36:16 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Mon, 28 Aug 2017 12:36:16 +0000 Subject: [gpfsug-discuss] sas avago/lsi hba reseller recommendation In-Reply-To: References: Message-ID: <28676C04-60E6-4AB6-8FEF-24EA719E8786@nasa.gov> Hi Eric, I shot you an email directly with contact info. -Aaron On August 28, 2017 at 08:26:56 EDT, J. Eric Wonderley wrote: We have several avago/lsi 9305-16e that I believe came from Advanced HPC. Can someone recommend a another reseller of these hbas or a contact with Advance HPC? -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Aug 29 15:30:25 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 29 Aug 2017 14:30:25 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? 
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Aug 29 16:53:51 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 29 Aug 2017 15:53:51 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. 
Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Aug 29 18:52:41 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 29 Aug 2017 17:52:41 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: , Message-ID: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Buterbaugh, Kevin L Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 14:54:41 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 13:54:41 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! 
? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Wed Aug 30 14:56:29 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 30 Aug 2017 13:56:29 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 30 15:06:00 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 30 Aug 2017 14:06:00 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Oh, the first one looks like the AFM issue I mentioned a couple of days back with Abaqus ... (if you use Abaqus on your AFM cache, then this is for you!) Simon From: > on behalf of "Buterbaugh, Kevin L" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 30 August 2017 at 14:54 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From:gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. 
I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Wed Aug 30 15:12:30 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 30 Aug 2017 14:12:30 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! 
Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 30 15:21:09 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 30 Aug 2017 14:21:09 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? 
Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. 
The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 15:28:07 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 14:28:07 +0000 Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> References: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> Message-ID: <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> Hi Bryan, NO - it has the fix for the mmrestripefs data loss bug, but you need the efix on top of 4.2.3-4 for the mmadddisk / mmdeldisk issue. Let me take this opportunity to also explain a workaround that has worked for us so far for that issue - the basic problem is two-fold (on our cluster, at least). First, the /var/mmfs/gen/mmsdrfs file isn't making it out to all nodes all the time. That is simple enough to fix (mmrefresh -fa) and verify that it's fixed (md5sum /var/mmfs/gen/mmsdrfs). Second, however - and this is the real problem - some nodes are never actually rereading that file and therefore have incorrect information *in memory*. This has been especially problematic for us as we are replacing a batch of 80 8 TB drives with bad firmware. I am therefore deleting and subsequently recreating NSDs *with the same name*. If a client node still has the "old" information in memory then it unmounts the filesystem when I try to mmadddisk the new NSD.
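A minimal sketch of that first step, assuming mmdsh is available for the fan-out (any ssh loop over the node list would do the same job):

mmrefresh -fa                              # push the current mmsdrfs configuration data out to all nodes
mmdsh -N all md5sum /var/mmfs/gen/mmsdrfs  # the checksum should now match on every node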
In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C493f1f9e41e343324f1508d4efb25f4f%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636396996783027614&sdata=qYxCMMg9O31LzFg%2FQkCdQg8vV%2FgL2AuRk%2B6V2j76c7Y%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 30 15:30:07 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 30 Aug 2017 14:30:07 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> References: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> Message-ID: <48dac1a1fc6945fdb0d8e94cb7269e3a@jumptrading.com> Thanks for the excellent description? I have my PMR open for the e-fix, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, August 30, 2017 9:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Hi Bryan, NO - it has the fix for the mmrestripefs data loss bug, but you need the efix on top of 4.2.3-4 for the mmadddisk / mmdeldisk issue. Let me take this opportunity to also explain a workaround that has worked for us so far for that issue ? the basic problem is two-fold (on our cluster, at least). First, the /var/mmfs/gen/mmsdrfs file isn?t making it out to all nodes all the time. That is simple enough to fix (mmrefresh -fa) and verify that it?s fixed (md5sum /var/mmfs/gen/mmsdrfs). Second, however - and this is the real problem ? some nodes are never actually rereading that file and therefore have incorrect information *in memory*. This has been especially problematic for us as we are replacing a batch of 80 8 TB drives with bad firmware. I am therefore deleting and subsequently recreating NSDs *with the same name*. If a client node still has the ?old? information in memory then it unmounts the filesystem when I try to mmadddisk the new NSD. 
The workaround is to identify those nodes (mmfsadm dump nsd and grep for the identifier of the NSD(s) in question) and force them to reread the info (tsctl rereadnsd). HTH? Kevin On Aug 30, 2017, at 9:21 AM, Bryan Banister > wrote: Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C493f1f9e41e343324f1508d4efb25f4f%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636396996783027614&sdata=qYxCMMg9O31LzFg%2FQkCdQg8vV%2FgL2AuRk%2B6V2j76c7Y%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 20:26:41 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 19:26:41 +0000 Subject: [gpfsug-discuss] Permissions issue in GPFS 4.2.3-4? Message-ID: Hi All, We have a script that takes the output of mmlsfs and mmlsquota and formats a users? GPFS quota usage into something a little ?nicer? than what mmlsquota displays (and doesn?t display 50 irrelevant lines of output for filesets they don?t have access to). After upgrading to 4.2.3-4 over the weekend it started throwing errors it hadn?t before: awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) mmlsfs: Unexpected error from awk. Return code: 2 awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) mmlsfs: Unexpected error from awk. Return code: 2 Home (user): 11.82G 30G 40G 10807 200000 300000 awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) mmlsquota: Unexpected error from awk. Return code: 2 It didn?t take long to track down that the mmfs.cfg.show file had permissions of 600 and a chmod 644 of it (on our login gateways only, which is the only place users run that script anyway) fixed the problem. So I just wanted to see if this was a known issue in 4.2.3-4? Notice that the error appears to be coming from the GPFS commands my script runs, not my script itself ? I sure don?t call awk! ;-) Thanks? Kevin ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Wed Aug 30 20:34:46 2017 From: david_johnson at brown.edu (David Johnson) Date: Wed, 30 Aug 2017 15:34:46 -0400 Subject: [gpfsug-discuss] Permissions issue in GPFS 4.2.3-4? In-Reply-To: References: Message-ID: <13019F3B-AF64-4D92-AAB1-4CF3A635383C@brown.edu> We ran into this back in mid February. Never really got a satisfactory answer how it got this way, the thought was that a bunch of nodes were expelled during an mmchconfig, and the files ended up with the wrong permissions. ? ddj > On Aug 30, 2017, at 3:26 PM, Buterbaugh, Kevin L wrote: > > Hi All, > > We have a script that takes the output of mmlsfs and mmlsquota and formats a users? GPFS quota usage into something a little ?nicer? than what mmlsquota displays (and doesn?t display 50 irrelevant lines of output for filesets they don?t have access to). After upgrading to 4.2.3-4 over the weekend it started throwing errors it hadn?t before: > > awk: cmd. 
line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) > mmlsfs: Unexpected error from awk. Return code: 2 > awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) > mmlsfs: Unexpected error from awk. Return code: 2 > Home (user): 11.82G 30G 40G 10807 200000 300000 > awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied) > mmlsquota: Unexpected error from awk. Return code: 2 > > It didn?t take long to track down that the mmfs.cfg.show file had permissions of 600 and a chmod 644 of it (on our login gateways only, which is the only place users run that script anyway) fixed the problem. > > So I just wanted to see if this was a known issue in 4.2.3-4? Notice that the error appears to be coming from the GPFS commands my script runs, not my script itself ? I sure don?t call awk! ;-) > > Thanks? > > Kevin > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL:
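For anyone hitting the same symptom, the check and interim fix described above amounts to roughly the following on the affected login nodes (the chmod is the local workaround from the report, not an official fix):

ls -l /var/mmfs/gen/mmfs.cfg.show      # shows mode 600 on affected nodes
chmod 644 /var/mmfs/gen/mmfs.cfg.show  # lets unprivileged users run mmlsfs / mmlsquota again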
From laurence at qsplace.co.uk Tue Aug 1 23:40:38 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Tue, 1 Aug 2017 23:40:38 +0100 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: References: Message-ID: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> Could you please give a break down of the commands that you have used to configure/setup the CES services? Which guide did you follow? and what version of GPFS/SS are you currently running -- Lauz On 01/08/2017 16:34, Ilan Schwarts wrote: > > Yes I succeeded to make smb share. But only the user i put in the > command can write files to it. Others can read only. > > How can i enable write it to all domain users? The group. > And what about the error when enabling nfs? > > On Aug 1, 2017 17:24, "Sobey, Richard A" > wrote: > > You must have nfs4 Acl semantics only to create smb exports. > > Mmchfs -k parameter as I recall. > > On 1 Aug 2017 7:16 am, Ilan Schwarts > wrote: > > Hi, > I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes > spectrum > scale cluster (CentOS). > I dont need NFSv4 ACLS enabled, but i dont mind them to be if its > mandatory for the NFSv4 to work. > > I have created the domain user "fwuser" in the Active Directory > (domain=LH20), it is in group Domain users, Domain Admins, Backup > Operators and administrators.
> > In the linux machine im with user ilanwalk (sudoer) > > [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser > uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) > groups=12000513(LH20\domain > users),12001603(LH20\fwuser),12000572(LH20\denied rodc password > replication group),12000512(LH20\domain > admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) > > > and when trying to add smb export: > [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share > /fs_gpfs01 --option "admin users=LH20\fwuser" > mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file > system that does not enforce NFSv4 ACLs. > > > [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs > fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) > > > > Also, when trying to enable NFS i get: > [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS > Failed to enable NFS service. Ensure file authentication is > removed > prior enabling service. > > > What am I missing ? > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Wed Aug 2 05:33:02 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Wed, 2 Aug 2017 07:33:02 +0300 Subject: [gpfsug-discuss] spectrum scale add nfs/smb exports In-Reply-To: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> References: <1c58642a-1cc0-9cd4-e920-90bdf63d8346@qsplace.co.uk> Message-ID: Hi, I use SpectrumScale 4.2.2 I have configured the CES as in documentation: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_setcessharedroot.htm This means i did the following:
mmchconfig cesSharedRoot=/fs_gpfs01
mmchnode --ces-enable -N LH20-GPFS1,LH20-GPFS2
Thank you
Some output:
[root at LH20-GPFS1 ~]# mmces state show -a
NODE        AUTH     BLOCK     NETWORK   AUTH_OBJ  NFS       OBJ       SMB      CES
LH20-GPFS1  HEALTHY  DISABLED  DEGRADED  DISABLED  DISABLED  DISABLED  HEALTHY  DEGRADED
LH20-GPFS2  HEALTHY  DISABLED  DEGRADED  DISABLED  DISABLED  DISABLED  HEALTHY  DEGRADED
[root at LH20-GPFS1 ~]#
[root at LH20-GPFS1 ~]# mmces node list
 Node Name    Node Flags   Node Groups
-----------------------------------------------------------------
 1 LH20-GPFS1   none
 3 LH20-GPFS2   none
[root at LH20-GPFS1 ~]# mmlscluster --ces
GPFS cluster information
========================
  GPFS cluster name:  LH20-GPFS1
  GPFS cluster id:    10777108240438931454
Cluster Export Services global parameters
-----------------------------------------
  Shared root directory:        /fs_gpfs01
  Enabled Services:             SMB
  Log level:                    0
  Address distribution policy:  even-coverage
 Node  Daemon node name   IP address     CES IP address list
-----------------------------------------------------------------------
   1   LH20-GPFS1         10.10.158.61   None
   3   LH20-GPFS2         10.10.158.62   None
On Wed, Aug 2, 2017 at 1:40 AM, Laurence Horrocks-Barlow wrote: > Could you please give a break down of the commands that you have used to > configure/setup the CES services? > > Which guide did you follow?
and what version of GPFS/SS are you currently > running > > -- Lauz > > > On 01/08/2017 16:34, Ilan Schwarts wrote: > > Yes I succeeded to make smb share. But only the user i put in the command > can write files to it. Others can read only. > > How can i enable write it to all domain users? The group. > And what about the error when enabling nfs? > > On Aug 1, 2017 17:24, "Sobey, Richard A" wrote: >> >> You must have nfs4 Acl semantics only to create smb exports. >> >> Mmchfs -k parameter as I recall. >> >> On 1 Aug 2017 7:16 am, Ilan Schwarts wrote: >> >> Hi, >> I need to set up SMB/NFSv3/NFSv4 exports on my GPFS 2 nodes spectrum >> scale cluster (CentOS). >> I dont need NFSv4 ACLS enabled, but i dont mind them to be if its >> mandatory for the NFSv4 to work. >> >> I have created the domain user "fwuser" in the Active Directory >> (domain=LH20), it is in group Domain users, Domain Admins, Backup >> Operators and administrators. >> >> In the linux machine im with user ilanwalk (sudoer) >> >> [root at LH20-GPFS1 ilanwalk]# id LH20\\fwuser >> uid=12001603(LH20\fwuser) gid=12000513(LH20\domain users) >> groups=12000513(LH20\domain >> users),12001603(LH20\fwuser),12000572(LH20\denied rodc password >> replication group),12000512(LH20\domain >> admins),11000545(BUILTIN\users),11000544(BUILTIN\administrators) >> >> >> and when trying to add smb export: >> [root at LH20-GPFS1 ilanwalk]# mmsmb export add auditAdminfs0Share >> /fs_gpfs01 --option "admin users=LH20\fwuser" >> mmsmb export add: [E] The specified path, /fs_gpfs01, is in a file >> system that does not enforce NFSv4 ACLs. >> >> >> [root at LH20-GPFS1 ilanwalk]# mount -v | grep gpfs >> fs_gpfs01 on /fs_gpfs01 type gpfs (rw,relatime) >> >> >> >> Also, when trying to enable NFS i get: >> [root at LH20-GPFS1 ilanwalk]# mmces service enable NFS >> Failed to enable NFS service. Ensure file authentication is removed >> prior enabling service. >> >> >> What am I missing ? >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- - Ilan Schwarts From john.hearns at asml.com Wed Aug 2 10:49:36 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 09:49:36 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. 
Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn't work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Renar.Grunenberg at huk-coburg.de Wed Aug 2 11:01:20 2017 From: Renar.Grunenberg at huk-coburg.de (Grunenberg, Renar) Date: Wed, 2 Aug 2017 10:01:20 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. 
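A quick way to see whether a given node has that backlevel combination is to compare the two version strings directly; the minimum fixed levels are in the forum thread linked above, so treat those as the reference rather than anything stated here:

rpm -q systemd    # systemd package level - the mount-unit/device binding behaviour changed across RHEL 7 updates
mmdiag --version  # reports the Spectrum Scale build the daemon is running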
________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From john.hearns at asml.com Wed Aug 2 11:50:29 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 10:50:29 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> References: <6635d6c952884f10a6ed749d2b1307a1@SMXRF105.msg.hukrf.de> Message-ID: Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. 
Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. 
To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: From cgirda at wustl.edu Wed Aug 2 13:07:05 2017 From: cgirda at wustl.edu (Chakravarthy Girda) Date: Wed, 2 Aug 2017 17:37:05 +0530 Subject: [gpfsug-discuss] Modify template variables on pre-built grafana dashboard. In-Reply-To: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de> References: <200a086c1740448da544e667c03887e5@SMXRF105.msg.hukrf.de> Message-ID: <978835cd-4e29-4207-9936-6c95159356a3@wustl.edu> Hi, I successfully created the bridge port and imported the pre-built grafana dashboard. https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/a180eb7e-9161-4e07-a6e4-35a0a076f7b3/attachment/5e9a5886-5bd9-4a6f-919e-bc66d16760cf/media/default%20dashboards%20set.zip Some graphs are updating but not all, so it looks like I need to update the template variables. I need some help/instructions on how to evaluate those default variables on the CLI so that I can fix them. E.g. in the "File Systems View": Variable ( gpfsMetrics_fs1 ) --> Query ( gpfsMetrics_fs1 ) Regex ( /.*[^gpfs_fs_inode_used|gpfs_fs_inode_alloc|gpfs_fs_inode_free|gpfs_fs_inode_max]/ ) Questions: * How can I execute the above query and regex to fix the issues? * Is there any documentation on the CLI options? Thank you Chakri -------------- next part -------------- An HTML attachment was scrubbed... URL: From truongv at us.ibm.com Wed Aug 2 15:35:12 2017 From: truongv at us.ibm.com (Truong Vu) Date: Wed, 2 Aug 2017 10:35:12 -0400 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: This sounds like a known problem that was fixed. If you don't have the fix, have you checked out the workaround in FAQ 2.4? Tru.
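For reference, the stop decision can usually be traced from the generated mount unit itself. A rough diagnostic sketch, using the unit and mount point names from the log messages earlier in this thread (adjust to your own, and treat the FAQ workaround as the supported route):

systemd-escape -p --suffix=mount /hpc/gpfstest   # confirm the unit name systemd derives from the mount point
systemctl status hpc-gpfstest.mount              # recent journal lines show what stopped the unit and when
systemctl show -p BindsTo hpc-gpfstest.mount     # the dev-*.device unit the mount is bound to
systemctl daemon-reload                          # the temporary workaround mentioned earlier in the thread

If the BindsTo target is inactive, that matches the "bound to inactive unit" message and points at the device unit rather than at GPFS itself.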
From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/02/2017 06:51 AM Subject: gpfsug-discuss Digest, Vol 67, Issue 4 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Systemd will not allow the mount of a filesystem (John Hearns) ---------------------------------------------------------------------- Message: 1 Date: Wed, 2 Aug 2017 10:50:29 +0000 From: John Hearns To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: Content-Type: text/plain; charset="utf-8" Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 < https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fcommunity%2Fforums%2Fhtml%2Ftopic%3Fid%3D00104bb5-acf5-4036-93ba-29ea7b1d43b7%26ps%3D25&data=01%7C01%7Cjohn.hearns%40asml.com%7Caf48038c0f334674b53208d4d98d739e%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=XuRlV4%2BRTilLfWD5NTK7n08m6IzjAmZ5mZOwUTNplSQ%3D&reserved=0 > Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. 
________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org< mailto:gpfsug-discuss-bounces at spectrumscale.org> [ mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741< https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fsystemd%2Fsystemd%2Fissues%2F1741&data=01%7C01%7Cjohn.hearns%40asml.com%7Caf48038c0f334674b53208d4d98d739e%7Caf73baa8f5944eb2a39d93e96cad61fc%7C1&sdata=MNPDZ4bKsQBtYiz0j6SMI%2FCsKmnMbrc7kD6LMh0FQBw%3D&reserved=0 > However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. 
Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170802/c0c43ae8/attachment.html > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 4 ********************************************* -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From john.hearns at asml.com Wed Aug 2 15:49:15 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 14:49:15 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: Truong, thankyou for responding. The discussion which Renar referred to discussed system version 208, and suggested upgrading this. The system I am working on at the moment has systemd version 219, and there is only a slight minor number upgrade available. I should say that the temporary fix suggested in that discussion did work for me. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Truong Vu Sent: Wednesday, August 02, 2017 4:35 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem This sounds like a known problem that was fixed. If you don't have the fix, have you checkout the around in the FAQ 2.4? Tru. [Inactive hide details for gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gp]gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/02/2017 06:51 AM Subject: gpfsug-discuss Digest, Vol 67, Issue 4 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Systemd will not allow the mount of a filesystem (John Hearns) ---------------------------------------------------------------------- Message: 1 Date: Wed, 2 Aug 2017 10:50:29 +0000 From: John Hearns > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: > Content-Type: text/plain; charset="utf-8" Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. 
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de> ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. 
-- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 4 ********************************************* -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From john.hearns at asml.com Wed Aug 2 16:19:27 2017 From: john.hearns at asml.com (John Hearns) Date: Wed, 2 Aug 2017 15:19:27 +0000 Subject: [gpfsug-discuss] Systemd will not allow the mount of a filesystem In-Reply-To: References: Message-ID: Truong, thanks again for the response. I shall implement what is suggested in the FAQ. As we are in polite company I shall maintain a smiley face when mentioning systemd From: John Hearns Sent: Wednesday, August 02, 2017 4:49 PM To: 'gpfsug main discussion list' Subject: RE: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Truong, thankyou for responding. The discussion which Renar referred to discussed system version 208, and suggested upgrading this. The system I am working on at the moment has systemd version 219, and there is only a slight minor number upgrade available. I should say that the temporary fix suggested in that discussion did work for me. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Truong Vu Sent: Wednesday, August 02, 2017 4:35 PM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem This sounds like a known problem that was fixed. If you don't have the fix, have you checkout the around in the FAQ 2.4? Tru. [Inactive hide details for gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gp]gpfsug-discuss-request---08/02/2017 06:51:02 AM---Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/02/2017 06:51 AM Subject: gpfsug-discuss Digest, Vol 67, Issue 4 Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: Systemd will not allow the mount of a filesystem (John Hearns) ---------------------------------------------------------------------- Message: 1 Date: Wed, 2 Aug 2017 10:50:29 +0000 From: John Hearns > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Message-ID: > Content-Type: text/plain; charset="utf-8" Thankyou Renar. In fact the tests I am running are in fact tests of a version upgrade before we do this on our production cluster.. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Grunenberg, Renar Sent: Wednesday, August 02, 2017 12:01 PM To: 'gpfsug main discussion list' > Subject: Re: [gpfsug-discuss] Systemd will not allow the mount of a filesystem Hallo John, you are on a backlevel Spectrum Scale Release and a backlevel Systemd package. Please see here: https://www.ibm.com/developerworks/community/forums/html/topic?id=00104bb5-acf5-4036-93ba-29ea7b1d43b7&ps=25 Renar Grunenberg Abteilung Informatik ? 
Betrieb HUK-COBURG Bahnhofsplatz 96444 Coburg Telefon: 09561 96-44110 Telefax: 09561 96-44104 E-Mail: Renar.Grunenberg at huk-coburg.de Internet: www.huk.de> ________________________________ HUK-COBURG Haftpflicht-Unterst?tzungs-Kasse kraftfahrender Beamter Deutschlands a. G. in Coburg Reg.-Gericht Coburg HRB 100; St.-Nr. 9212/101/00021 Sitz der Gesellschaft: Bahnhofsplatz, 96444 Coburg Vorsitzender des Aufsichtsrats: Prof. Dr. Heinrich R. Schradin. Vorstand: Klaus-J?rgen Heitmann (Sprecher), Stefan Gronbach, Dr. Hans Olav Her?y, Dr. J?rg Rheinl?nder (stv.), Sarah R?ssler, Daniel Thomas (stv.). ________________________________ Diese Nachricht enth?lt vertrauliche und/oder rechtlich gesch?tzte Informationen. Wenn Sie nicht der richtige Adressat sind oder diese Nachricht irrt?mlich erhalten haben, informieren Sie bitte sofort den Absender und vernichten Sie diese Nachricht. Das unerlaubte Kopieren sowie die unbefugte Weitergabe dieser Nachricht ist nicht gestattet. This information may contain confidential and/or privileged information. If you are not the intended recipient (or have received this information in error) please notify the sender immediately and destroy this information. Any unauthorized copying, disclosure or distribution of the material in this information is strictly forbidden. ________________________________ Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von John Hearns Gesendet: Mittwoch, 2. August 2017 11:50 An: gpfsug main discussion list > Betreff: [gpfsug-discuss] Systemd will not allow the mount of a filesystem I am setting up a filesystem for some tests, so this is not mission critical. This is on an OS with systemd When I create a new filesystem, named gpfstest, then mmmount it the filesystem is logged as being mounted then immediately dismounted. Having fought with this for several hours I now find this in the system messages file: Aug 2 10:36:56 tosmn001 systemd: Unit hpc-gpfstest.mount is bound to inactive unit dev-gpfstest.device. Stopping, too. I stopped then started gpfs. I have run a systemctl daemon-reload I created a new filesystem, using the same physical disk, with a new filesystem name, testtest, and a new mountpoint. Aug 2 11:03:50 tosmn001 systemd: Unit hpc-testtest.mount is bound to inactive unit dev-testtest.device. Stopping, too. GPFS itself logs: Wed Aug 2 11:03:50.824 2017: [I] Command: successful mount testtest Wed Aug 2 11:03:50.837 2017: [I] Command: unmount testtest Wed Aug 2 11:03:51.192 2017: [I] Command: successful unmount testtest If anyone else has seen this behavior please let me know. I found this issue https://github.com/systemd/systemd/issues/1741 However the only suggested fix is a system daemon-reload, and if this doesn?t work ??? Also if this is a stupid mistake on my part, I pro-actively hang my head in shame. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. 
If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 4 ********************************************* -- The information contained in this communication and any attachments is confidential and may be privileged, and is for the sole use of the intended recipient(s). Any unauthorized review, use, disclosure or distribution is prohibited. Unless explicitly stated otherwise in the body of this communication or the attachment thereto (if any), the information is provided on an AS-IS basis without any express or implied warranties or liabilities. To the extent you are relying on this information, you are doing so at your own risk. If you are not the intended recipient, please notify the sender immediately by replying to this message and destroy all copies of this message and any attachments. Neither the sender nor the company/group of companies he or she represents shall be liable for the proper and complete transmission of the information contained in this communication, or for any delay in its receipt. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 105 bytes Desc: image001.gif URL: From stijn.deweirdt at ugent.be Wed Aug 2 16:57:55 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 17:57:55 +0200 Subject: [gpfsug-discuss] data integrity documentation Message-ID: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> hi all, is there any documentation wrt data integrity in spectrum scale: assuming a crappy network, does gpfs garantee somehow that data written by client ends up safe in the nsd gpfs daemon; and similarly from the nsd gpfs daemon to disk. and wrt crappy network, what about rdma on crappy network? is it the same? (we are hunting down a crappy infiniband issue; ibm support says it's network issue; and we see no errors anywhere...) 
thanks a lot, stijn From eric.wonderley at vt.edu Wed Aug 2 17:15:12 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Wed, 2 Aug 2017 12:15:12 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: No guarantee...unless you are using ess/gss solution. Crappy network will get you loads of expels and occasional fscks. Which I guess beats data loss and recovery from backup. YOu probably have a network issue...they can be subtle. Gpfs is a very extremely thorough network tester. Eric On Wed, Aug 2, 2017 at 11:57 AM, Stijn De Weirdt wrote: > hi all, > > is there any documentation wrt data integrity in spectrum scale: > assuming a crappy network, does gpfs garantee somehow that data written > by client ends up safe in the nsd gpfs daemon; and similarly from the > nsd gpfs daemon to disk. > > and wrt crappy network, what about rdma on crappy network? is it the same? > > (we are hunting down a crappy infiniband issue; ibm support says it's > network issue; and we see no errors anywhere...) > > thanks a lot, > > stijn > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Aug 2 17:26:29 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 16:26:29 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: the very first thing you should check is if you have this setting set : mmlsconfig envVar envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 MLX5_USE_MUTEX 1 if that doesn't come back the way above you need to set it : mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" there was a problem in the Mellanox FW in various versions that was never completely addressed (bugs where found and fixed, but it was never fully proven to be addressed) the above environment variables turn code on in the mellanox driver that prevents this potential code path from being used to begin with. in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale that even you don't set this variables the problem can't happen anymore until then the only choice you have is the envVar above (which btw ships as default on all ESS systems). you also should be on the latest available Mellanox FW & Drivers as not all versions even have the code that is activated by the environment variables above, i think at a minimum you need to be at 3.4 but i don't remember the exact version. There had been multiple defects opened around this area, the last one i remember was : 00154843 : ESS ConnectX-3 performance issue - spinning on pthread_spin_lock you may ask your mellanox representative if they can get you access to this defect. while it was found on ESS , means on PPC64 and with ConnectX-3 cards its a general issue that affects all cards and on intel as well as Power. 
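For anyone wanting to apply this, the check-set-verify cycle looks roughly like the following. This is only a sketch, and mmshutdown/mmstartup below restart GPFS on all nodes, so it belongs in a maintenance window or on a test cluster:

mmlsconfig envVar    # check what is currently set
mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1"
mmshutdown -a        # the environment is only picked up when mmfsd starts,
mmstartup -a         # so the daemons have to be restarted after the change
mmlsconfig envVar    # confirm the new values are in place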
On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt wrote: > hi all, > > is there any documentation wrt data integrity in spectrum scale: > assuming a crappy network, does gpfs garantee somehow that data written > by client ends up safe in the nsd gpfs daemon; and similarly from the > nsd gpfs daemon to disk. > > and wrt crappy network, what about rdma on crappy network? is it the same? > > (we are hunting down a crappy infiniband issue; ibm support says it's > network issue; and we see no errors anywhere...) > > thanks a lot, > > stijn > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Aug 2 17:26:29 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 16:26:29 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: the very first thing you should check is if you have this setting set : mmlsconfig envVar envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 MLX5_USE_MUTEX 1 if that doesn't come back the way above you need to set it : mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" there was a problem in the Mellanox FW in various versions that was never completely addressed (bugs where found and fixed, but it was never fully proven to be addressed) the above environment variables turn code on in the mellanox driver that prevents this potential code path from being used to begin with. in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale that even you don't set this variables the problem can't happen anymore until then the only choice you have is the envVar above (which btw ships as default on all ESS systems). you also should be on the latest available Mellanox FW & Drivers as not all versions even have the code that is activated by the environment variables above, i think at a minimum you need to be at 3.4 but i don't remember the exact version. There had been multiple defects opened around this area, the last one i remember was : 00154843 : ESS ConnectX-3 performance issue - spinning on pthread_spin_lock you may ask your mellanox representative if they can get you access to this defect. while it was found on ESS , means on PPC64 and with ConnectX-3 cards its a general issue that affects all cards and on intel as well as Power. On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt wrote: > hi all, > > is there any documentation wrt data integrity in spectrum scale: > assuming a crappy network, does gpfs garantee somehow that data written > by client ends up safe in the nsd gpfs daemon; and similarly from the > nsd gpfs daemon to disk. > > and wrt crappy network, what about rdma on crappy network? is it the same? > > (we are hunting down a crappy infiniband issue; ibm support says it's > network issue; and we see no errors anywhere...) > > thanks a lot, > > stijn > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From stijn.deweirdt at ugent.be Wed Aug 2 19:38:13 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:38:13 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: <2518112e-0311-09c6-4f24-daa2f18bd80c@ugent.be> > No guarantee...unless you are using ess/gss solution. ok, so crappy network == corrupt data? hmmm, that is really a pity on 2017... > > Crappy network will get you loads of expels and occasional fscks. Which I > guess beats data loss and recovery from backup. if only we had errors like that. with the current issue mmfsck is the only tool that seems to trigger them (and setting some of the nsdChksum config flags reports checksum errors in the log files). but nsdperf with verify=on reports nothing. > > YOu probably have a network issue...they can be subtle. Gpfs is a very > extremely thorough network tester. we know ;) stijn > > > Eric > > On Wed, Aug 2, 2017 at 11:57 AM, Stijn De Weirdt > wrote: > >> hi all, >> >> is there any documentation wrt data integrity in spectrum scale: >> assuming a crappy network, does gpfs garantee somehow that data written >> by client ends up safe in the nsd gpfs daemon; and similarly from the >> nsd gpfs daemon to disk. >> >> and wrt crappy network, what about rdma on crappy network? is it the same? >> >> (we are hunting down a crappy infiniband issue; ibm support says it's >> network issue; and we see no errors anywhere...) >> >> thanks a lot, >> >> stijn >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From stijn.deweirdt at ugent.be Wed Aug 2 19:43:51 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:43:51 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> Message-ID: <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> hi sven, > the very first thing you should check is if you have this setting set : maybe the very first thing to check should be the faq/wiki that has this documented? > > mmlsconfig envVar > > envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > MLX5_USE_MUTEX 1 > > if that doesn't come back the way above you need to set it : > > mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" i just set this (wasn't set before), but problem is still present. > > there was a problem in the Mellanox FW in various versions that was never > completely addressed (bugs where found and fixed, but it was never fully > proven to be addressed) the above environment variables turn code on in the > mellanox driver that prevents this potential code path from being used to > begin with. > > in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale > that even you don't set this variables the problem can't happen anymore > until then the only choice you have is the envVar above (which btw ships as > default on all ESS systems). 
> > you also should be on the latest available Mellanox FW & Drivers as not all > versions even have the code that is activated by the environment variables > above, i think at a minimum you need to be at 3.4 but i don't remember the > exact version. There had been multiple defects opened around this area, the > last one i remember was : we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from dell, and the fw is a bit behind. i'm trying to convince dell to make new one. mellanox used to allow to make your own, but they don't anymore. > > 00154843 : ESS ConnectX-3 performance issue - spinning on pthread_spin_lock > > you may ask your mellanox representative if they can get you access to this > defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > cards its a general issue that affects all cards and on intel as well as > Power. ok, thanks for this. maybe such a reference is enough for dell to update their firmware. stijn > > On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt > wrote: > >> hi all, >> >> is there any documentation wrt data integrity in spectrum scale: >> assuming a crappy network, does gpfs garantee somehow that data written >> by client ends up safe in the nsd gpfs daemon; and similarly from the >> nsd gpfs daemon to disk. >> >> and wrt crappy network, what about rdma on crappy network? is it the same? >> >> (we are hunting down a crappy infiniband issue; ibm support says it's >> network issue; and we see no errors anywhere...) >> >> thanks a lot, >> >> stijn >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 19:47:52 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 18:47:52 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> Message-ID: How can you reproduce this so quick ? Did you restart all daemons after that ? On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt wrote: > hi sven, > > > > the very first thing you should check is if you have this setting set : > maybe the very first thing to check should be the faq/wiki that has this > documented? > > > > > mmlsconfig envVar > > > > envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > > MLX5_USE_MUTEX 1 > > > > if that doesn't come back the way above you need to set it : > > > > mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > > MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > i just set this (wasn't set before), but problem is still present. > > > > > there was a problem in the Mellanox FW in various versions that was never > > completely addressed (bugs where found and fixed, but it was never fully > > proven to be addressed) the above environment variables turn code on in > the > > mellanox driver that prevents this potential code path from being used to > > begin with. > > > > in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale > > that even you don't set this variables the problem can't happen anymore > > until then the only choice you have is the envVar above (which btw ships > as > > default on all ESS systems). 
> > > > you also should be on the latest available Mellanox FW & Drivers as not > all > > versions even have the code that is activated by the environment > variables > > above, i think at a minimum you need to be at 3.4 but i don't remember > the > > exact version. There had been multiple defects opened around this area, > the > > last one i remember was : > we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > dell, and the fw is a bit behind. i'm trying to convince dell to make > new one. mellanox used to allow to make your own, but they don't anymore. > > > > > 00154843 : ESS ConnectX-3 performance issue - spinning on > pthread_spin_lock > > > > you may ask your mellanox representative if they can get you access to > this > > defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > > cards its a general issue that affects all cards and on intel as well as > > Power. > ok, thanks for this. maybe such a reference is enough for dell to update > their firmware. > > stijn > > > > > On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt > > wrote: > > > >> hi all, > >> > >> is there any documentation wrt data integrity in spectrum scale: > >> assuming a crappy network, does gpfs garantee somehow that data written > >> by client ends up safe in the nsd gpfs daemon; and similarly from the > >> nsd gpfs daemon to disk. > >> > >> and wrt crappy network, what about rdma on crappy network? is it the > same? > >> > >> (we are hunting down a crappy infiniband issue; ibm support says it's > >> network issue; and we see no errors anywhere...) > >> > >> thanks a lot, > >> > >> stijn > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 19:53:09 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 20:53:09 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> Message-ID: <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> yes ;) the system is in preproduction, so nothing that can't stopped/started in a few minutes (current setup has only 4 nsds, and no clients). mmfsck triggers the errors very early during inode replica compare. stijn On 08/02/2017 08:47 PM, Sven Oehme wrote: > How can you reproduce this so quick ? > Did you restart all daemons after that ? > > On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > wrote: > >> hi sven, >> >> >>> the very first thing you should check is if you have this setting set : >> maybe the very first thing to check should be the faq/wiki that has this >> documented? 
>> >>> >>> mmlsconfig envVar >>> >>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>> MLX5_USE_MUTEX 1 >>> >>> if that doesn't come back the way above you need to set it : >>> >>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >> i just set this (wasn't set before), but problem is still present. >> >>> >>> there was a problem in the Mellanox FW in various versions that was never >>> completely addressed (bugs where found and fixed, but it was never fully >>> proven to be addressed) the above environment variables turn code on in >> the >>> mellanox driver that prevents this potential code path from being used to >>> begin with. >>> >>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in Scale >>> that even you don't set this variables the problem can't happen anymore >>> until then the only choice you have is the envVar above (which btw ships >> as >>> default on all ESS systems). >>> >>> you also should be on the latest available Mellanox FW & Drivers as not >> all >>> versions even have the code that is activated by the environment >> variables >>> above, i think at a minimum you need to be at 3.4 but i don't remember >> the >>> exact version. There had been multiple defects opened around this area, >> the >>> last one i remember was : >> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >> dell, and the fw is a bit behind. i'm trying to convince dell to make >> new one. mellanox used to allow to make your own, but they don't anymore. >> >>> >>> 00154843 : ESS ConnectX-3 performance issue - spinning on >> pthread_spin_lock >>> >>> you may ask your mellanox representative if they can get you access to >> this >>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>> cards its a general issue that affects all cards and on intel as well as >>> Power. >> ok, thanks for this. maybe such a reference is enough for dell to update >> their firmware. >> >> stijn >> >>> >>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt >>> wrote: >>> >>>> hi all, >>>> >>>> is there any documentation wrt data integrity in spectrum scale: >>>> assuming a crappy network, does gpfs garantee somehow that data written >>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>> nsd gpfs daemon to disk. >>>> >>>> and wrt crappy network, what about rdma on crappy network? is it the >> same? >>>> >>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>> network issue; and we see no errors anywhere...) 
>>>> >>>> thanks a lot, >>>> >>>> stijn >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 20:10:07 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 19:10:07 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> Message-ID: ok, i think i understand now, the data was already corrupted. the config change i proposed only prevents a potentially known future on the wire corruption, this will not fix something that made it to the disk already. Sven On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt wrote: > yes ;) > > the system is in preproduction, so nothing that can't stopped/started in > a few minutes (current setup has only 4 nsds, and no clients). > mmfsck triggers the errors very early during inode replica compare. > > > stijn > > On 08/02/2017 08:47 PM, Sven Oehme wrote: > > How can you reproduce this so quick ? > > Did you restart all daemons after that ? > > > > On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > > wrote: > > > >> hi sven, > >> > >> > >>> the very first thing you should check is if you have this setting set : > >> maybe the very first thing to check should be the faq/wiki that has this > >> documented? > >> > >>> > >>> mmlsconfig envVar > >>> > >>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>> MLX5_USE_MUTEX 1 > >>> > >>> if that doesn't come back the way above you need to set it : > >>> > >>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >> i just set this (wasn't set before), but problem is still present. > >> > >>> > >>> there was a problem in the Mellanox FW in various versions that was > never > >>> completely addressed (bugs where found and fixed, but it was never > fully > >>> proven to be addressed) the above environment variables turn code on in > >> the > >>> mellanox driver that prevents this potential code path from being used > to > >>> begin with. > >>> > >>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > Scale > >>> that even you don't set this variables the problem can't happen anymore > >>> until then the only choice you have is the envVar above (which btw > ships > >> as > >>> default on all ESS systems). > >>> > >>> you also should be on the latest available Mellanox FW & Drivers as not > >> all > >>> versions even have the code that is activated by the environment > >> variables > >>> above, i think at a minimum you need to be at 3.4 but i don't remember > >> the > >>> exact version. 
There had been multiple defects opened around this area, > >> the > >>> last one i remember was : > >> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >> dell, and the fw is a bit behind. i'm trying to convince dell to make > >> new one. mellanox used to allow to make your own, but they don't > anymore. > >> > >>> > >>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >> pthread_spin_lock > >>> > >>> you may ask your mellanox representative if they can get you access to > >> this > >>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > >>> cards its a general issue that affects all cards and on intel as well > as > >>> Power. > >> ok, thanks for this. maybe such a reference is enough for dell to update > >> their firmware. > >> > >> stijn > >> > >>> > >>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be> > >>> wrote: > >>> > >>>> hi all, > >>>> > >>>> is there any documentation wrt data integrity in spectrum scale: > >>>> assuming a crappy network, does gpfs garantee somehow that data > written > >>>> by client ends up safe in the nsd gpfs daemon; and similarly from the > >>>> nsd gpfs daemon to disk. > >>>> > >>>> and wrt crappy network, what about rdma on crappy network? is it the > >> same? > >>>> > >>>> (we are hunting down a crappy infiniband issue; ibm support says it's > >>>> network issue; and we see no errors anywhere...) > >>>> > >>>> thanks a lot, > >>>> > >>>> stijn > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 20:20:14 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 21:20:14 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> Message-ID: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> hi sven, the data is not corrupted. mmfsck compares 2 inodes, says they don't match, but checking the data with tbdbfs reveals they are equal. (one replica has to be fetched over the network; the nsds cannot access all disks) with some nsdChksum... settings we get during this mmfsck a lot of "Encountered XYZ checksum errors on network I/O to NSD Client disk" ibm support says these are hardware issues, but wrt to mmfsck false positives. anyway, our current question is: if these are hardware issues, is there anything in gpfs client->nsd (on the network side) that would detect such errors. 
ie can we trust the data (and metadata). i was under the impression that client to disk is not covered, but i assumed that at least client to nsd (the network part) was checksummed. stijn On 08/02/2017 09:10 PM, Sven Oehme wrote: > ok, i think i understand now, the data was already corrupted. the config > change i proposed only prevents a potentially known future on the wire > corruption, this will not fix something that made it to the disk already. > > Sven > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > wrote: > >> yes ;) >> >> the system is in preproduction, so nothing that can't stopped/started in >> a few minutes (current setup has only 4 nsds, and no clients). >> mmfsck triggers the errors very early during inode replica compare. >> >> >> stijn >> >> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>> How can you reproduce this so quick ? >>> Did you restart all daemons after that ? >>> >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >>> wrote: >>> >>>> hi sven, >>>> >>>> >>>>> the very first thing you should check is if you have this setting set : >>>> maybe the very first thing to check should be the faq/wiki that has this >>>> documented? >>>> >>>>> >>>>> mmlsconfig envVar >>>>> >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>> MLX5_USE_MUTEX 1 >>>>> >>>>> if that doesn't come back the way above you need to set it : >>>>> >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>> i just set this (wasn't set before), but problem is still present. >>>> >>>>> >>>>> there was a problem in the Mellanox FW in various versions that was >> never >>>>> completely addressed (bugs where found and fixed, but it was never >> fully >>>>> proven to be addressed) the above environment variables turn code on in >>>> the >>>>> mellanox driver that prevents this potential code path from being used >> to >>>>> begin with. >>>>> >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >> Scale >>>>> that even you don't set this variables the problem can't happen anymore >>>>> until then the only choice you have is the envVar above (which btw >> ships >>>> as >>>>> default on all ESS systems). >>>>> >>>>> you also should be on the latest available Mellanox FW & Drivers as not >>>> all >>>>> versions even have the code that is activated by the environment >>>> variables >>>>> above, i think at a minimum you need to be at 3.4 but i don't remember >>>> the >>>>> exact version. There had been multiple defects opened around this area, >>>> the >>>>> last one i remember was : >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>> new one. mellanox used to allow to make your own, but they don't >> anymore. >>>> >>>>> >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>> pthread_spin_lock >>>>> >>>>> you may ask your mellanox representative if they can get you access to >>>> this >>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>>>> cards its a general issue that affects all cards and on intel as well >> as >>>>> Power. >>>> ok, thanks for this. maybe such a reference is enough for dell to update >>>> their firmware. 
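As a side note on the firmware discussion just above: the installed OFED release and the firmware each HCA is actually running can be read with the standard tools that ship with MLNX OFED, which helps when comparing notes with a vendor. Output labels differ slightly between mlx4 and mlx5 devices.

  ofed_info -s                            # installed OFED release, e.g. MLNX_OFED_LINUX-4.1-...
  ibv_devinfo | grep -E 'hca_id|fw_ver'   # firmware level per adapter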
>>>> >>>> stijn >>>> >>>>> >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be> >>>>> wrote: >>>>> >>>>>> hi all, >>>>>> >>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>> assuming a crappy network, does gpfs garantee somehow that data >> written >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>>>> nsd gpfs daemon to disk. >>>>>> >>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>> same? >>>>>> >>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>>>> network issue; and we see no errors anywhere...) >>>>>> >>>>>> thanks a lot, >>>>>> >>>>>> stijn >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From ewahl at osc.edu Wed Aug 2 21:11:53 2017 From: ewahl at osc.edu (Edward Wahl) Date: Wed, 2 Aug 2017 16:11:53 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: <20170802161153.4eea6f61@osc.edu> What version of GPFS? Are you generating a patch file? Try using this before your mmfsck: mmdsh -N mmfsadm test fsck usePatchQueue 0 my notes say all, but I would have only had NSD nodes up at the time. Supposedly the mmfsck mess in 4.1 and 4.2.x was fixed in 4.2.2.3. I won't know for sure until late August. Ed On Wed, 2 Aug 2017 21:20:14 +0200 Stijn De Weirdt wrote: > hi sven, > > the data is not corrupted. mmfsck compares 2 inodes, says they don't > match, but checking the data with tbdbfs reveals they are equal. > (one replica has to be fetched over the network; the nsds cannot access > all disks) > > with some nsdChksum... settings we get during this mmfsck a lot of > "Encountered XYZ checksum errors on network I/O to NSD Client disk" > > ibm support says these are hardware issues, but wrt to mmfsck false > positives. > > anyway, our current question is: if these are hardware issues, is there > anything in gpfs client->nsd (on the network side) that would detect > such errors. ie can we trust the data (and metadata). > i was under the impression that client to disk is not covered, but i > assumed that at least client to nsd (the network part) was checksummed. 
> > stijn > > > On 08/02/2017 09:10 PM, Sven Oehme wrote: > > ok, i think i understand now, the data was already corrupted. the config > > change i proposed only prevents a potentially known future on the wire > > corruption, this will not fix something that made it to the disk already. > > > > Sven > > > > > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > > wrote: > > > >> yes ;) > >> > >> the system is in preproduction, so nothing that can't stopped/started in > >> a few minutes (current setup has only 4 nsds, and no clients). > >> mmfsck triggers the errors very early during inode replica compare. > >> > >> > >> stijn > >> > >> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>> How can you reproduce this so quick ? > >>> Did you restart all daemons after that ? > >>> > >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > >>> wrote: > >>> > >>>> hi sven, > >>>> > >>>> > >>>>> the very first thing you should check is if you have this setting > >>>>> set : > >>>> maybe the very first thing to check should be the faq/wiki that has this > >>>> documented? > >>>> > >>>>> > >>>>> mmlsconfig envVar > >>>>> > >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>>>> MLX5_USE_MUTEX 1 > >>>>> > >>>>> if that doesn't come back the way above you need to set it : > >>>>> > >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>> i just set this (wasn't set before), but problem is still present. > >>>> > >>>>> > >>>>> there was a problem in the Mellanox FW in various versions that was > >> never > >>>>> completely addressed (bugs where found and fixed, but it was never > >> fully > >>>>> proven to be addressed) the above environment variables turn code on > >>>>> in > >>>> the > >>>>> mellanox driver that prevents this potential code path from being used > >> to > >>>>> begin with. > >>>>> > >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >> Scale > >>>>> that even you don't set this variables the problem can't happen anymore > >>>>> until then the only choice you have is the envVar above (which btw > >> ships > >>>> as > >>>>> default on all ESS systems). > >>>>> > >>>>> you also should be on the latest available Mellanox FW & Drivers as > >>>>> not > >>>> all > >>>>> versions even have the code that is activated by the environment > >>>> variables > >>>>> above, i think at a minimum you need to be at 3.4 but i don't remember > >>>> the > >>>>> exact version. There had been multiple defects opened around this > >>>>> area, > >>>> the > >>>>> last one i remember was : > >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make > >>>> new one. mellanox used to allow to make your own, but they don't > >> anymore. > >>>> > >>>>> > >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>> pthread_spin_lock > >>>>> > >>>>> you may ask your mellanox representative if they can get you access to > >>>> this > >>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 > >>>>> cards its a general issue that affects all cards and on intel as well > >> as > >>>>> Power. > >>>> ok, thanks for this. maybe such a reference is enough for dell to update > >>>> their firmware. 
> >>>> > >>>> stijn > >>>> > >>>>> > >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >> stijn.deweirdt at ugent.be> > >>>>> wrote: > >>>>> > >>>>>> hi all, > >>>>>> > >>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>> assuming a crappy network, does gpfs garantee somehow that data > >> written > >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the > >>>>>> nsd gpfs daemon to disk. > >>>>>> > >>>>>> and wrt crappy network, what about rdma on crappy network? is it the > >>>> same? > >>>>>> > >>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's > >>>>>> network issue; and we see no errors anywhere...) > >>>>>> > >>>>>> thanks a lot, > >>>>>> > >>>>>> stijn > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From stijn.deweirdt at ugent.be Wed Aug 2 21:38:29 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 22:38:29 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <20170802161153.4eea6f61@osc.edu> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> <20170802161153.4eea6f61@osc.edu> Message-ID: <393b54ec-ec6a-040b-ef04-6076632db60c@ugent.be> hi ed, On 08/02/2017 10:11 PM, Edward Wahl wrote: > What version of GPFS? Are you generating a patch file? 4.2.3 series, now we run 4.2.3.3 to be clear, right now we use mmfsck to trigger the chksum issue hoping we can find the actual "hardware" issue. we know by elimination which HCAs to avoid, so we do not get the checksum errors. but to consider that a fix, we need to know if the data written by the client can be trusted due to these silent hw errors. > > Try using this before your mmfsck: > > mmdsh -N mmfsadm test fsck usePatchQueue 0 mmchmgr somefs nsdXYZ mmfsck somefs -Vn -m -N nsdXYZ -t /var/tmp/ the idea is to force everything as much as possible on one node, accessing the other failure group is forced over network > > my notes say all, but I would have only had NSD nodes up at the time. 
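Putting Ed's pre-step and Stijn's invocation together, the reproduction recipe under discussion amounts to the sketch below. 'somefs' and 'nsdXYZ' are the placeholders from Stijn's mail, and mmfsadm test is an undocumented service-level switch, so this is strictly a diagnostic run on an idle, preproduction file system.

  mmdsh -N all mmfsadm test fsck usePatchQueue 0   # Ed's step, with the 'all' his notes mention filled in
  mmchmgr somefs nsdXYZ                            # pin the file system manager to one NSD server
  mmfsck somefs -Vn -m -N nsdXYZ -t /var/tmp/      # run the check from that node only, so the second
                                                   # replica is forced across the network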
> Supposedly the mmfsck mess in 4.1 and 4.2.x was fixed in 4.2.2.3. we had the "pleasure" last to have mmfsck segfaulting while we were trying to recover a filesystem, at least that was certainly fixed ;) stijn > I won't know for sure until late August. > > Ed > > > On Wed, 2 Aug 2017 21:20:14 +0200 > Stijn De Weirdt wrote: > >> hi sven, >> >> the data is not corrupted. mmfsck compares 2 inodes, says they don't >> match, but checking the data with tbdbfs reveals they are equal. >> (one replica has to be fetched over the network; the nsds cannot access >> all disks) >> >> with some nsdChksum... settings we get during this mmfsck a lot of >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >> >> ibm support says these are hardware issues, but wrt to mmfsck false >> positives. >> >> anyway, our current question is: if these are hardware issues, is there >> anything in gpfs client->nsd (on the network side) that would detect >> such errors. ie can we trust the data (and metadata). >> i was under the impression that client to disk is not covered, but i >> assumed that at least client to nsd (the network part) was checksummed. >> >> stijn >> >> >> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>> ok, i think i understand now, the data was already corrupted. the config >>> change i proposed only prevents a potentially known future on the wire >>> corruption, this will not fix something that made it to the disk already. >>> >>> Sven >>> >>> >>> >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt >>> wrote: >>> >>>> yes ;) >>>> >>>> the system is in preproduction, so nothing that can't stopped/started in >>>> a few minutes (current setup has only 4 nsds, and no clients). >>>> mmfsck triggers the errors very early during inode replica compare. >>>> >>>> >>>> stijn >>>> >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>> How can you reproduce this so quick ? >>>>> Did you restart all daemons after that ? >>>>> >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >>>>> wrote: >>>>> >>>>>> hi sven, >>>>>> >>>>>> >>>>>>> the very first thing you should check is if you have this setting >>>>>>> set : >>>>>> maybe the very first thing to check should be the faq/wiki that has this >>>>>> documented? >>>>>> >>>>>>> >>>>>>> mmlsconfig envVar >>>>>>> >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>>>> MLX5_USE_MUTEX 1 >>>>>>> >>>>>>> if that doesn't come back the way above you need to set it : >>>>>>> >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>> i just set this (wasn't set before), but problem is still present. >>>>>> >>>>>>> >>>>>>> there was a problem in the Mellanox FW in various versions that was >>>> never >>>>>>> completely addressed (bugs where found and fixed, but it was never >>>> fully >>>>>>> proven to be addressed) the above environment variables turn code on >>>>>>> in >>>>>> the >>>>>>> mellanox driver that prevents this potential code path from being used >>>> to >>>>>>> begin with. >>>>>>> >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>> Scale >>>>>>> that even you don't set this variables the problem can't happen anymore >>>>>>> until then the only choice you have is the envVar above (which btw >>>> ships >>>>>> as >>>>>>> default on all ESS systems). 
>>>>>>> >>>>>>> you also should be on the latest available Mellanox FW & Drivers as >>>>>>> not >>>>>> all >>>>>>> versions even have the code that is activated by the environment >>>>>> variables >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't remember >>>>>> the >>>>>>> exact version. There had been multiple defects opened around this >>>>>>> area, >>>>>> the >>>>>>> last one i remember was : >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>>>> new one. mellanox used to allow to make your own, but they don't >>>> anymore. >>>>>> >>>>>>> >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>> pthread_spin_lock >>>>>>> >>>>>>> you may ask your mellanox representative if they can get you access to >>>>>> this >>>>>>> defect. while it was found on ESS , means on PPC64 and with ConnectX-3 >>>>>>> cards its a general issue that affects all cards and on intel as well >>>> as >>>>>>> Power. >>>>>> ok, thanks for this. maybe such a reference is enough for dell to update >>>>>> their firmware. >>>>>> >>>>>> stijn >>>>>> >>>>>>> >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>> stijn.deweirdt at ugent.be> >>>>>>> wrote: >>>>>>> >>>>>>>> hi all, >>>>>>>> >>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>> written >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from the >>>>>>>> nsd gpfs daemon to disk. >>>>>>>> >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>>>> same? >>>>>>>> >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says it's >>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>> >>>>>>>> thanks a lot, >>>>>>>> >>>>>>>> stijn >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > From eric.wonderley at vt.edu Wed Aug 2 22:02:20 2017 From: eric.wonderley at vt.edu (J. 
Eric Wonderley) Date: Wed, 2 Aug 2017 17:02:20 -0400 Subject: [gpfsug-discuss] mmsetquota produces error Message-ID: for one of our home filesystem we get: mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'nathanfootest' error (22): 'Invalid argument'. mmedquota -j home:nathanfootest does work however -------------- next part -------------- An HTML attachment was scrubbed... URL: From oehmes at gmail.com Wed Aug 2 22:05:18 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 21:05:18 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: before i answer the rest of your questions, can you share what version of GPFS exactly you are on mmfsadm dump version would be best source for that. if you have 2 inodes and you know the exact address of where they are stored on disk one could 'dd' them of the disk and compare if they are really equal. we only support checksums when you use GNR based systems, they cover network as well as Disk side for that. the nsdchecksum code you refer to is the one i mentioned above thats only supported with GNR at least i am not aware that we ever claimed it to be supported outside of it, but i can check that. sven On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt wrote: > hi sven, > > the data is not corrupted. mmfsck compares 2 inodes, says they don't > match, but checking the data with tbdbfs reveals they are equal. > (one replica has to be fetched over the network; the nsds cannot access > all disks) > > with some nsdChksum... settings we get during this mmfsck a lot of > "Encountered XYZ checksum errors on network I/O to NSD Client disk" > > ibm support says these are hardware issues, but wrt to mmfsck false > positives. > > anyway, our current question is: if these are hardware issues, is there > anything in gpfs client->nsd (on the network side) that would detect > such errors. ie can we trust the data (and metadata). > i was under the impression that client to disk is not covered, but i > assumed that at least client to nsd (the network part) was checksummed. > > stijn > > > On 08/02/2017 09:10 PM, Sven Oehme wrote: > > ok, i think i understand now, the data was already corrupted. the config > > change i proposed only prevents a potentially known future on the wire > > corruption, this will not fix something that made it to the disk already. > > > > Sven > > > > > > > > On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt > > > wrote: > > > >> yes ;) > >> > >> the system is in preproduction, so nothing that can't stopped/started in > >> a few minutes (current setup has only 4 nsds, and no clients). > >> mmfsck triggers the errors very early during inode replica compare. > >> > >> > >> stijn > >> > >> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>> How can you reproduce this so quick ? > >>> Did you restart all daemons after that ? > >>> > >>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt > > >>> wrote: > >>> > >>>> hi sven, > >>>> > >>>> > >>>>> the very first thing you should check is if you have this setting > set : > >>>> maybe the very first thing to check should be the faq/wiki that has > this > >>>> documented? 
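Coming back to the mmsetquota failure Eric reports above ('Could not get id of fileset'): one cheap thing to rule out before opening a PMR is whether the fileset name resolves at all on that file system, i.e. whether it exists and is linked. A possible check, reusing the names from his mail:

  mmlsfileset home -L              # is 'nathanfootest' listed at all, and does it have an id?
  mmlsfileset home nathanfootest   # it should show up as Linked with a junction path

If the fileset does show up as linked, the per-fileset syntax in his mail matches the documented form, so the failure would need a closer look; the fact that mmedquota -j still works points the same way.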
> >>>> > >>>>> > >>>>> mmlsconfig envVar > >>>>> > >>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 > >>>>> MLX5_USE_MUTEX 1 > >>>>> > >>>>> if that doesn't come back the way above you need to set it : > >>>>> > >>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>> i just set this (wasn't set before), but problem is still present. > >>>> > >>>>> > >>>>> there was a problem in the Mellanox FW in various versions that was > >> never > >>>>> completely addressed (bugs where found and fixed, but it was never > >> fully > >>>>> proven to be addressed) the above environment variables turn code on > in > >>>> the > >>>>> mellanox driver that prevents this potential code path from being > used > >> to > >>>>> begin with. > >>>>> > >>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >> Scale > >>>>> that even you don't set this variables the problem can't happen > anymore > >>>>> until then the only choice you have is the envVar above (which btw > >> ships > >>>> as > >>>>> default on all ESS systems). > >>>>> > >>>>> you also should be on the latest available Mellanox FW & Drivers as > not > >>>> all > >>>>> versions even have the code that is activated by the environment > >>>> variables > >>>>> above, i think at a minimum you need to be at 3.4 but i don't > remember > >>>> the > >>>>> exact version. There had been multiple defects opened around this > area, > >>>> the > >>>>> last one i remember was : > >>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from > >>>> dell, and the fw is a bit behind. i'm trying to convince dell to make > >>>> new one. mellanox used to allow to make your own, but they don't > >> anymore. > >>>> > >>>>> > >>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>> pthread_spin_lock > >>>>> > >>>>> you may ask your mellanox representative if they can get you access > to > >>>> this > >>>>> defect. while it was found on ESS , means on PPC64 and with > ConnectX-3 > >>>>> cards its a general issue that affects all cards and on intel as well > >> as > >>>>> Power. > >>>> ok, thanks for this. maybe such a reference is enough for dell to > update > >>>> their firmware. > >>>> > >>>> stijn > >>>> > >>>>> > >>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >> stijn.deweirdt at ugent.be> > >>>>> wrote: > >>>>> > >>>>>> hi all, > >>>>>> > >>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>> assuming a crappy network, does gpfs garantee somehow that data > >> written > >>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from > the > >>>>>> nsd gpfs daemon to disk. > >>>>>> > >>>>>> and wrt crappy network, what about rdma on crappy network? is it the > >>>> same? > >>>>>> > >>>>>> (we are hunting down a crappy infiniband issue; ibm support says > it's > >>>>>> network issue; and we see no errors anywhere...) 
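On the 'we see no errors anywhere' point quoted above: the infiniband-diags tools that come with OFED can at least rule out link-level trouble, because per-port counters (symbol errors, link resets, packet discards) often show problems that never make it into GPFS logs or syslog. A possible starting point, run from any node with fabric access; the lid and port below are placeholders for a suspect link:

  ibqueryerrors               # sweep the fabric and list ports with non-zero error counters
  perfquery -x <lid> <port>   # extended counters for one specific port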
> >>>>>> > >>>>>> thanks a lot, > >>>>>> > >>>>>> stijn > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 22:14:45 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:14:45 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: hi sven, > before i answer the rest of your questions, can you share what version of > GPFS exactly you are on mmfsadm dump version would be best source for that. it returns Build branch "4.2.3.3 ". > if you have 2 inodes and you know the exact address of where they are > stored on disk one could 'dd' them of the disk and compare if they are > really equal. ok, i can try that later. are you suggesting that the "tsdbfs comp" might gave wrong results? because we ran that and got eg > # tsdbfs somefs comp 7:5137408 25:221785088 1024 > Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = 0x19:D382C00: > All sectors identical > we only support checksums when you use GNR based systems, they cover > network as well as Disk side for that. > the nsdchecksum code you refer to is the one i mentioned above thats only > supported with GNR at least i am not aware that we ever claimed it to be > supported outside of it, but i can check that. ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, and they are not in the same gpfs cluster. i thought the GNR extended the checksumming to disk, and that it was already there for the network part. thanks for clearing this up. but that is worse then i thought... stijn > > sven > > On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt > wrote: > >> hi sven, >> >> the data is not corrupted. mmfsck compares 2 inodes, says they don't >> match, but checking the data with tbdbfs reveals they are equal. >> (one replica has to be fetched over the network; the nsds cannot access >> all disks) >> >> with some nsdChksum... 
settings we get during this mmfsck a lot of >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >> >> ibm support says these are hardware issues, but wrt to mmfsck false >> positives. >> >> anyway, our current question is: if these are hardware issues, is there >> anything in gpfs client->nsd (on the network side) that would detect >> such errors. ie can we trust the data (and metadata). >> i was under the impression that client to disk is not covered, but i >> assumed that at least client to nsd (the network part) was checksummed. >> >> stijn >> >> >> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>> ok, i think i understand now, the data was already corrupted. the config >>> change i proposed only prevents a potentially known future on the wire >>> corruption, this will not fix something that made it to the disk already. >>> >>> Sven >>> >>> >>> >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt >> >>> wrote: >>> >>>> yes ;) >>>> >>>> the system is in preproduction, so nothing that can't stopped/started in >>>> a few minutes (current setup has only 4 nsds, and no clients). >>>> mmfsck triggers the errors very early during inode replica compare. >>>> >>>> >>>> stijn >>>> >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>> How can you reproduce this so quick ? >>>>> Did you restart all daemons after that ? >>>>> >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt >> >>>>> wrote: >>>>> >>>>>> hi sven, >>>>>> >>>>>> >>>>>>> the very first thing you should check is if you have this setting >> set : >>>>>> maybe the very first thing to check should be the faq/wiki that has >> this >>>>>> documented? >>>>>> >>>>>>> >>>>>>> mmlsconfig envVar >>>>>>> >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF 1 >>>>>>> MLX5_USE_MUTEX 1 >>>>>>> >>>>>>> if that doesn't come back the way above you need to set it : >>>>>>> >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>> i just set this (wasn't set before), but problem is still present. >>>>>> >>>>>>> >>>>>>> there was a problem in the Mellanox FW in various versions that was >>>> never >>>>>>> completely addressed (bugs where found and fixed, but it was never >>>> fully >>>>>>> proven to be addressed) the above environment variables turn code on >> in >>>>>> the >>>>>>> mellanox driver that prevents this potential code path from being >> used >>>> to >>>>>>> begin with. >>>>>>> >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>> Scale >>>>>>> that even you don't set this variables the problem can't happen >> anymore >>>>>>> until then the only choice you have is the envVar above (which btw >>>> ships >>>>>> as >>>>>>> default on all ESS systems). >>>>>>> >>>>>>> you also should be on the latest available Mellanox FW & Drivers as >> not >>>>>> all >>>>>>> versions even have the code that is activated by the environment >>>>>> variables >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't >> remember >>>>>> the >>>>>>> exact version. There had been multiple defects opened around this >> area, >>>>>> the >>>>>>> last one i remember was : >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards from >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to make >>>>>> new one. mellanox used to allow to make your own, but they don't >>>> anymore. 
>>>>>> >>>>>>> >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>> pthread_spin_lock >>>>>>> >>>>>>> you may ask your mellanox representative if they can get you access >> to >>>>>> this >>>>>>> defect. while it was found on ESS , means on PPC64 and with >> ConnectX-3 >>>>>>> cards its a general issue that affects all cards and on intel as well >>>> as >>>>>>> Power. >>>>>> ok, thanks for this. maybe such a reference is enough for dell to >> update >>>>>> their firmware. >>>>>> >>>>>> stijn >>>>>> >>>>>>> >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>> stijn.deweirdt at ugent.be> >>>>>>> wrote: >>>>>>> >>>>>>>> hi all, >>>>>>>> >>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>> written >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from >> the >>>>>>>> nsd gpfs daemon to disk. >>>>>>>> >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it the >>>>>> same? >>>>>>>> >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says >> it's >>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>> >>>>>>>> thanks a lot, >>>>>>>> >>>>>>>> stijn >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From oehmes at gmail.com Wed Aug 2 22:23:44 2017 From: oehmes at gmail.com (Sven Oehme) Date: Wed, 02 Aug 2017 21:23:44 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: ok, you can't be any newer that that. i just wonder why you have 512b inodes if this is a new system ? are this raw disks in this setup or raid controllers ? 
whats the disk sector size and how was the filesystem created (mmlsfs FSNAME would show answer to the last question) on the tsdbfs i am not sure if it gave wrong results, but it would be worth a test to see whats actually on the disk . you are correct that GNR extends this to the disk, but the network part is covered by the nsdchecksums you turned on when you enable the not to be named checksum parameter do you actually still get an error from fsck ? sven On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt wrote: > hi sven, > > > before i answer the rest of your questions, can you share what version of > > GPFS exactly you are on mmfsadm dump version would be best source for > that. > it returns > Build branch "4.2.3.3 ". > > > if you have 2 inodes and you know the exact address of where they are > > stored on disk one could 'dd' them of the disk and compare if they are > > really equal. > ok, i can try that later. are you suggesting that the "tsdbfs comp" > might gave wrong results? because we ran that and got eg > > > # tsdbfs somefs comp 7:5137408 25:221785088 1024 > > Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = > 0x19:D382C00: > > All sectors identical > > > > we only support checksums when you use GNR based systems, they cover > > network as well as Disk side for that. > > the nsdchecksum code you refer to is the one i mentioned above thats only > > supported with GNR at least i am not aware that we ever claimed it to be > > supported outside of it, but i can check that. > ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, > and they are not in the same gpfs cluster. > > i thought the GNR extended the checksumming to disk, and that it was > already there for the network part. thanks for clearing this up. but > that is worse then i thought... > > stijn > > > > > sven > > > > On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt > > > wrote: > > > >> hi sven, > >> > >> the data is not corrupted. mmfsck compares 2 inodes, says they don't > >> match, but checking the data with tbdbfs reveals they are equal. > >> (one replica has to be fetched over the network; the nsds cannot access > >> all disks) > >> > >> with some nsdChksum... settings we get during this mmfsck a lot of > >> "Encountered XYZ checksum errors on network I/O to NSD Client disk" > >> > >> ibm support says these are hardware issues, but wrt to mmfsck false > >> positives. > >> > >> anyway, our current question is: if these are hardware issues, is there > >> anything in gpfs client->nsd (on the network side) that would detect > >> such errors. ie can we trust the data (and metadata). > >> i was under the impression that client to disk is not covered, but i > >> assumed that at least client to nsd (the network part) was checksummed. > >> > >> stijn > >> > >> > >> On 08/02/2017 09:10 PM, Sven Oehme wrote: > >>> ok, i think i understand now, the data was already corrupted. the > config > >>> change i proposed only prevents a potentially known future on the wire > >>> corruption, this will not fix something that made it to the disk > already. > >>> > >>> Sven > >>> > >>> > >>> > >>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be > >>> > >>> wrote: > >>> > >>>> yes ;) > >>>> > >>>> the system is in preproduction, so nothing that can't stopped/started > in > >>>> a few minutes (current setup has only 4 nsds, and no clients). > >>>> mmfsck triggers the errors very early during inode replica compare. 
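For the 'see what is actually on the disk' test Sven suggests: taking the disk:sector addresses from the tsdbfs comp output quoted earlier (7:5137408 and 25:221785088, 1024 sectors each), a raw comparison could look like the sketch below. The device paths are placeholders: the GPFS disk numbers first have to be mapped to NSDs and then to the local block devices (mmlsnsd -m gives that last step), and 512-byte sectors are assumed since that is the unit tsdbfs reports in.

  mmlsnsd -m                                                         # NSD name to local device mapping
  dd if=/dev/mapper/diskA bs=512 skip=5137408 count=1024 status=none | md5sum
  dd if=/dev/mapper/diskB bs=512 skip=221785088 count=1024 status=none | md5sum

Matching checksums here, with mmfsck still flagging the pair as different, would point at the network/compare path rather than at what is stored on disk, which is the direction the discussion is heading anyway.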
> >>>> > >>>> > >>>> stijn > >>>> > >>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: > >>>>> How can you reproduce this so quick ? > >>>>> Did you restart all daemons after that ? > >>>>> > >>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt < > stijn.deweirdt at ugent.be > >>> > >>>>> wrote: > >>>>> > >>>>>> hi sven, > >>>>>> > >>>>>> > >>>>>>> the very first thing you should check is if you have this setting > >> set : > >>>>>> maybe the very first thing to check should be the faq/wiki that has > >> this > >>>>>> documented? > >>>>>> > >>>>>>> > >>>>>>> mmlsconfig envVar > >>>>>>> > >>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF > 1 > >>>>>>> MLX5_USE_MUTEX 1 > >>>>>>> > >>>>>>> if that doesn't come back the way above you need to set it : > >>>>>>> > >>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 > >>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" > >>>>>> i just set this (wasn't set before), but problem is still present. > >>>>>> > >>>>>>> > >>>>>>> there was a problem in the Mellanox FW in various versions that was > >>>> never > >>>>>>> completely addressed (bugs where found and fixed, but it was never > >>>> fully > >>>>>>> proven to be addressed) the above environment variables turn code > on > >> in > >>>>>> the > >>>>>>> mellanox driver that prevents this potential code path from being > >> used > >>>> to > >>>>>>> begin with. > >>>>>>> > >>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in > >>>> Scale > >>>>>>> that even you don't set this variables the problem can't happen > >> anymore > >>>>>>> until then the only choice you have is the envVar above (which btw > >>>> ships > >>>>>> as > >>>>>>> default on all ESS systems). > >>>>>>> > >>>>>>> you also should be on the latest available Mellanox FW & Drivers as > >> not > >>>>>> all > >>>>>>> versions even have the code that is activated by the environment > >>>>>> variables > >>>>>>> above, i think at a minimum you need to be at 3.4 but i don't > >> remember > >>>>>> the > >>>>>>> exact version. There had been multiple defects opened around this > >> area, > >>>>>> the > >>>>>>> last one i remember was : > >>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards > from > >>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to > make > >>>>>> new one. mellanox used to allow to make your own, but they don't > >>>> anymore. > >>>>>> > >>>>>>> > >>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on > >>>>>> pthread_spin_lock > >>>>>>> > >>>>>>> you may ask your mellanox representative if they can get you access > >> to > >>>>>> this > >>>>>>> defect. while it was found on ESS , means on PPC64 and with > >> ConnectX-3 > >>>>>>> cards its a general issue that affects all cards and on intel as > well > >>>> as > >>>>>>> Power. > >>>>>> ok, thanks for this. maybe such a reference is enough for dell to > >> update > >>>>>> their firmware. > >>>>>> > >>>>>> stijn > >>>>>> > >>>>>>> > >>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < > >>>> stijn.deweirdt at ugent.be> > >>>>>>> wrote: > >>>>>>> > >>>>>>>> hi all, > >>>>>>>> > >>>>>>>> is there any documentation wrt data integrity in spectrum scale: > >>>>>>>> assuming a crappy network, does gpfs garantee somehow that data > >>>> written > >>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from > >> the > >>>>>>>> nsd gpfs daemon to disk. > >>>>>>>> > >>>>>>>> and wrt crappy network, what about rdma on crappy network? is it > the > >>>>>> same? 
> >>>>>>>> > >>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says > >> it's > >>>>>>>> network issue; and we see no errors anywhere...) > >>>>>>>> > >>>>>>>> thanks a lot, > >>>>>>>> > >>>>>>>> stijn > >>>>>>>> _______________________________________________ > >>>>>>>> gpfsug-discuss mailing list > >>>>>>>> gpfsug-discuss at spectrumscale.org > >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> > >>>>>>> _______________________________________________ > >>>>>>> gpfsug-discuss mailing list > >>>>>>> gpfsug-discuss at spectrumscale.org > >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>>> > >>>>>> _______________________________________________ > >>>>>> gpfsug-discuss mailing list > >>>>>> gpfsug-discuss at spectrumscale.org > >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>>> > >>>>> > >>>>> > >>>>> > >>>>> _______________________________________________ > >>>>> gpfsug-discuss mailing list > >>>>> gpfsug-discuss at spectrumscale.org > >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>>> > >>>> _______________________________________________ > >>>> gpfsug-discuss mailing list > >>>> gpfsug-discuss at spectrumscale.org > >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>>> > >>> > >>> > >>> > >>> _______________________________________________ > >>> gpfsug-discuss mailing list > >>> gpfsug-discuss at spectrumscale.org > >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>> > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From sxiao at us.ibm.com Wed Aug 2 22:36:06 2017 From: sxiao at us.ibm.com (Steve Xiao) Date: Wed, 2 Aug 2017 17:36:06 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: Message-ID: The nsdChksum settings for none GNR/ESS based system is not officially supported. It will perform checksum on data transfer over the network only and can be used to help debug data corruption when network is a suspect. Did any of those "Encountered XYZ checksum errors on network I/O to NSD Client disk" warning messages resulted in disk been changed to "down" state due to IO error? If no disk IO error was reported in GPFS log, that means data was retransmitted successfully on retry. As sven said, only GNR/ESS provids the full end to end data integrity. Steve Y. Xiao -------------- next part -------------- An HTML attachment was scrubbed... URL: From stijn.deweirdt at ugent.be Wed Aug 2 22:47:36 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:47:36 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <5a6a0823-62c6-e7ab-1005-239abc98a811@ugent.be> <15ec1be0-0256-26af-d5f5-ab9d8bfec9e7@ugent.be> <6e28e24a-719d-8ef3-7204-722dc31156e0@ugent.be> <4bac973e-7a0b-8500-4804-06879d3739fe@ugent.be> Message-ID: hi sven, > ok, you can't be any newer that that. 
i just wonder why you have 512b > inodes if this is a new system ? because we rsynced 100M files to it ;) it's supposed to replace another system. > are this raw disks in this setup or raid controllers ? raid (DDP on MD3460) > whats the disk sector size euhm, you mean the luns? for metadata disks (SSD in raid 1): > # parted /dev/mapper/f1v01e0g0_Dm01o0 > GNU Parted 3.1 > Using /dev/mapper/f1v01e0g0_Dm01o0 > Welcome to GNU Parted! Type 'help' to view a list of commands. > (parted) p > Model: Linux device-mapper (multipath) (dm) > Disk /dev/mapper/f1v01e0g0_Dm01o0: 219GB > Sector size (logical/physical): 512B/512B > Partition Table: gpt > Disk Flags: > > Number Start End Size File system Name Flags > 1 24.6kB 219GB 219GB GPFS: hidden for data disks (DDP) > [root at nsd01 ~]# parted /dev/mapper/f1v01e0p0_S17o0 > GNU Parted 3.1 > Using /dev/mapper/f1v01e0p0_S17o0 > Welcome to GNU Parted! Type 'help' to view a list of commands. > (parted) p > Model: Linux device-mapper (multipath) (dm) > Disk /dev/mapper/f1v01e0p0_S17o0: 35.2TB > Sector size (logical/physical): 512B/4096B > Partition Table: gpt > Disk Flags: > > Number Start End Size File system Name Flags > 1 24.6kB 35.2TB 35.2TB GPFS: hidden > > (parted) q and how was the filesystem created (mmlsfs FSNAME would show > answer to the last question) > # mmlsfs somefilesystem > flag value description > ------------------- ------------------------ ----------------------------------- > -f 16384 Minimum fragment size in bytes (system pool) > 262144 Minimum fragment size in bytes (other pools) > -i 4096 Inode size in bytes > -I 32768 Indirect block size in bytes > -m 2 Default number of metadata replicas > -M 2 Maximum number of metadata replicas > -r 1 Default number of data replicas > -R 2 Maximum number of data replicas > -j scatter Block allocation type > -D nfs4 File locking semantics in effect > -k all ACL semantics in effect > -n 850 Estimated number of nodes that will mount file system > -B 524288 Block size (system pool) > 8388608 Block size (other pools) > -Q user;group;fileset Quotas accounting enabled > user;group;fileset Quotas enforced > none Default quotas enabled > --perfileset-quota Yes Per-fileset quota enforcement > --filesetdf Yes Fileset df enabled? > -V 17.00 (4.2.3.0) File system version > --create-time Wed May 31 12:54:00 2017 File system creation time > -z No Is DMAPI enabled? > -L 4194304 Logfile size > -E No Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation option > --fastea Yes Fast external attributes enabled? > --encryption No Encryption enabled? > --inode-limit 313524224 Maximum number of inodes in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? 
> --write-cache-threshold 0 HAWC Threshold (max 65536) > --subblocks-per-full-block 32 Number of subblocks per full block > -P system;MD3260 Disk storage pools in file system > -d f0v00e0g0_Sm00o0;f0v00e0p0_S00o0;f1v01e0g0_Sm01o0;f1v01e0p0_S01o0;f0v02e0g0_Sm02o0;f0v02e0p0_S02o0;f1v03e0g0_Sm03o0;f1v03e0p0_S03o0;f0v04e0g0_Sm04o0;f0v04e0p0_S04o0; > -d f1v05e0g0_Sm05o0;f1v05e0p0_S05o0;f0v06e0g0_Sm06o0;f0v06e0p0_S06o0;f1v07e0g0_Sm07o0;f1v07e0p0_S07o0;f0v00e0g0_Sm08o1;f0v00e0p0_S08o1;f1v01e0g0_Sm09o1;f1v01e0p0_S09o1; > -d f0v02e0g0_Sm10o1;f0v02e0p0_S10o1;f1v03e0g0_Sm11o1;f1v03e0p0_S11o1;f0v04e0g0_Sm12o1;f0v04e0p0_S12o1;f1v05e0g0_Sm13o1;f1v05e0p0_S13o1;f0v06e0g0_Sm14o1;f0v06e0p0_S14o1; > -d f1v07e0g0_Sm15o1;f1v07e0p0_S15o1;f0v00e0p0_S16o0;f1v01e0p0_S17o0;f0v02e0p0_S18o0;f1v03e0p0_S19o0;f0v04e0p0_S20o0;f1v05e0p0_S21o0;f0v06e0p0_S22o0;f1v07e0p0_S23o0; > -d f0v00e0p0_S24o1;f1v01e0p0_S25o1;f0v02e0p0_S26o1;f1v03e0p0_S27o1;f0v04e0p0_S28o1;f1v05e0p0_S29o1;f0v06e0p0_S30o1;f1v07e0p0_S31o1 Disks in file system > -A no Automatic mount option > -o none Additional mount options > -T /scratch Default mount point > --mount-priority 0 > > on the tsdbfs i am not sure if it gave wrong results, but it would be worth > a test to see whats actually on the disk . ok. i'll try this tomorrow. > > you are correct that GNR extends this to the disk, but the network part is > covered by the nsdchecksums you turned on > when you enable the not to be named checksum parameter do you actually > still get an error from fsck ? hah, no, we don't. mmfsck says the filesystem is clean. we found this odd, so we already asked ibm support about this but no answer yet. stijn > > sven > > > On Wed, Aug 2, 2017 at 2:14 PM Stijn De Weirdt > wrote: > >> hi sven, >> >>> before i answer the rest of your questions, can you share what version of >>> GPFS exactly you are on mmfsadm dump version would be best source for >> that. >> it returns >> Build branch "4.2.3.3 ". >> >>> if you have 2 inodes and you know the exact address of where they are >>> stored on disk one could 'dd' them of the disk and compare if they are >>> really equal. >> ok, i can try that later. are you suggesting that the "tsdbfs comp" >> might gave wrong results? because we ran that and got eg >> >>> # tsdbfs somefs comp 7:5137408 25:221785088 1024 >>> Comparing 1024 sectors at 7:5137408 = 0x7:4E6400 and 25:221785088 = >> 0x19:D382C00: >>> All sectors identical >> >> >>> we only support checksums when you use GNR based systems, they cover >>> network as well as Disk side for that. >>> the nsdchecksum code you refer to is the one i mentioned above thats only >>> supported with GNR at least i am not aware that we ever claimed it to be >>> supported outside of it, but i can check that. >> ok, maybe i'm a bit consfused. we have a GNR too, but it's not this one, >> and they are not in the same gpfs cluster. >> >> i thought the GNR extended the checksumming to disk, and that it was >> already there for the network part. thanks for clearing this up. but >> that is worse then i thought... >> >> stijn >> >>> >>> sven >>> >>> On Wed, Aug 2, 2017 at 12:20 PM Stijn De Weirdt >> >>> wrote: >>> >>>> hi sven, >>>> >>>> the data is not corrupted. mmfsck compares 2 inodes, says they don't >>>> match, but checking the data with tbdbfs reveals they are equal. >>>> (one replica has to be fetched over the network; the nsds cannot access >>>> all disks) >>>> >>>> with some nsdChksum... 
settings we get during this mmfsck a lot of >>>> "Encountered XYZ checksum errors on network I/O to NSD Client disk" >>>> >>>> ibm support says these are hardware issues, but wrt to mmfsck false >>>> positives. >>>> >>>> anyway, our current question is: if these are hardware issues, is there >>>> anything in gpfs client->nsd (on the network side) that would detect >>>> such errors. ie can we trust the data (and metadata). >>>> i was under the impression that client to disk is not covered, but i >>>> assumed that at least client to nsd (the network part) was checksummed. >>>> >>>> stijn >>>> >>>> >>>> On 08/02/2017 09:10 PM, Sven Oehme wrote: >>>>> ok, i think i understand now, the data was already corrupted. the >> config >>>>> change i proposed only prevents a potentially known future on the wire >>>>> corruption, this will not fix something that made it to the disk >> already. >>>>> >>>>> Sven >>>>> >>>>> >>>>> >>>>> On Wed, Aug 2, 2017 at 11:53 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be >>>>> >>>>> wrote: >>>>> >>>>>> yes ;) >>>>>> >>>>>> the system is in preproduction, so nothing that can't stopped/started >> in >>>>>> a few minutes (current setup has only 4 nsds, and no clients). >>>>>> mmfsck triggers the errors very early during inode replica compare. >>>>>> >>>>>> >>>>>> stijn >>>>>> >>>>>> On 08/02/2017 08:47 PM, Sven Oehme wrote: >>>>>>> How can you reproduce this so quick ? >>>>>>> Did you restart all daemons after that ? >>>>>>> >>>>>>> On Wed, Aug 2, 2017, 11:43 AM Stijn De Weirdt < >> stijn.deweirdt at ugent.be >>>>> >>>>>>> wrote: >>>>>>> >>>>>>>> hi sven, >>>>>>>> >>>>>>>> >>>>>>>>> the very first thing you should check is if you have this setting >>>> set : >>>>>>>> maybe the very first thing to check should be the faq/wiki that has >>>> this >>>>>>>> documented? >>>>>>>> >>>>>>>>> >>>>>>>>> mmlsconfig envVar >>>>>>>>> >>>>>>>>> envVar MLX4_POST_SEND_PREFER_BF 0 MLX4_USE_MUTEX 1 MLX5_SHUT_UP_BF >> 1 >>>>>>>>> MLX5_USE_MUTEX 1 >>>>>>>>> >>>>>>>>> if that doesn't come back the way above you need to set it : >>>>>>>>> >>>>>>>>> mmchconfig envVar="MLX4_POST_SEND_PREFER_BF=0 MLX5_SHUT_UP_BF=1 >>>>>>>>> MLX5_USE_MUTEX=1 MLX4_USE_MUTEX=1" >>>>>>>> i just set this (wasn't set before), but problem is still present. >>>>>>>> >>>>>>>>> >>>>>>>>> there was a problem in the Mellanox FW in various versions that was >>>>>> never >>>>>>>>> completely addressed (bugs where found and fixed, but it was never >>>>>> fully >>>>>>>>> proven to be addressed) the above environment variables turn code >> on >>>> in >>>>>>>> the >>>>>>>>> mellanox driver that prevents this potential code path from being >>>> used >>>>>> to >>>>>>>>> begin with. >>>>>>>>> >>>>>>>>> in Spectrum Scale 4.2.4 (not yet released) we added a workaround in >>>>>> Scale >>>>>>>>> that even you don't set this variables the problem can't happen >>>> anymore >>>>>>>>> until then the only choice you have is the envVar above (which btw >>>>>> ships >>>>>>>> as >>>>>>>>> default on all ESS systems). >>>>>>>>> >>>>>>>>> you also should be on the latest available Mellanox FW & Drivers as >>>> not >>>>>>>> all >>>>>>>>> versions even have the code that is activated by the environment >>>>>>>> variables >>>>>>>>> above, i think at a minimum you need to be at 3.4 but i don't >>>> remember >>>>>>>> the >>>>>>>>> exact version. 
There had been multiple defects opened around this >>>> area, >>>>>>>> the >>>>>>>>> last one i remember was : >>>>>>>> we run mlnx ofed 4.1, fw is not the latest, but we have edr cards >> from >>>>>>>> dell, and the fw is a bit behind. i'm trying to convince dell to >> make >>>>>>>> new one. mellanox used to allow to make your own, but they don't >>>>>> anymore. >>>>>>>> >>>>>>>>> >>>>>>>>> 00154843 : ESS ConnectX-3 performance issue - spinning on >>>>>>>> pthread_spin_lock >>>>>>>>> >>>>>>>>> you may ask your mellanox representative if they can get you access >>>> to >>>>>>>> this >>>>>>>>> defect. while it was found on ESS , means on PPC64 and with >>>> ConnectX-3 >>>>>>>>> cards its a general issue that affects all cards and on intel as >> well >>>>>> as >>>>>>>>> Power. >>>>>>>> ok, thanks for this. maybe such a reference is enough for dell to >>>> update >>>>>>>> their firmware. >>>>>>>> >>>>>>>> stijn >>>>>>>> >>>>>>>>> >>>>>>>>> On Wed, Aug 2, 2017 at 8:58 AM Stijn De Weirdt < >>>>>> stijn.deweirdt at ugent.be> >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> hi all, >>>>>>>>>> >>>>>>>>>> is there any documentation wrt data integrity in spectrum scale: >>>>>>>>>> assuming a crappy network, does gpfs garantee somehow that data >>>>>> written >>>>>>>>>> by client ends up safe in the nsd gpfs daemon; and similarly from >>>> the >>>>>>>>>> nsd gpfs daemon to disk. >>>>>>>>>> >>>>>>>>>> and wrt crappy network, what about rdma on crappy network? is it >> the >>>>>>>> same? >>>>>>>>>> >>>>>>>>>> (we are hunting down a crappy infiniband issue; ibm support says >>>> it's >>>>>>>>>> network issue; and we see no errors anywhere...) >>>>>>>>>> >>>>>>>>>> thanks a lot, >>>>>>>>>> >>>>>>>>>> stijn >>>>>>>>>> _______________________________________________ >>>>>>>>>> gpfsug-discuss mailing list >>>>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> >>>>>>>>> _______________________________________________ >>>>>>>>> gpfsug-discuss mailing list >>>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>>> >>>>>>>> _______________________________________________ >>>>>>>> gpfsug-discuss mailing list >>>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>>> >>>>>>> >>>>>>> >>>>>>> >>>>>>> _______________________________________________ >>>>>>> gpfsug-discuss mailing list >>>>>>> gpfsug-discuss at spectrumscale.org >>>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>>> >>>>>> _______________________________________________ >>>>>> gpfsug-discuss mailing list >>>>>> gpfsug-discuss at spectrumscale.org >>>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>>> >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>> _______________________________________________ >>>> gpfsug-discuss mailing list >>>> gpfsug-discuss at spectrumscale.org >>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> 
http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From stijn.deweirdt at ugent.be Wed Aug 2 22:53:50 2017 From: stijn.deweirdt at ugent.be (Stijn De Weirdt) Date: Wed, 2 Aug 2017 23:53:50 +0200 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: Message-ID: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> hi steve, > The nsdChksum settings for none GNR/ESS based system is not officially > supported. It will perform checksum on data transfer over the network > only and can be used to help debug data corruption when network is a > suspect. i'll take not officially supported over silent bitrot any day. > > Did any of those "Encountered XYZ checksum errors on network I/O to NSD > Client disk" warning messages resulted in disk been changed to "down" > state due to IO error? no. If no disk IO error was reported in GPFS log, > that means data was retransmitted successfully on retry. we suspected as much. as sven already asked, mmfsck now reports clean filesystem. i have an ibdump of 2 involved nsds during the reported checksums, i'll have a closer look if i can spot these retries. > > As sven said, only GNR/ESS provids the full end to end data integrity. so with the silent network error, we have high probabilty that the data is corrupted. we are now looking for a test to find out what adapters are affected. we hoped that nsdperf with verify=on would tell us, but it doesn't. > > Steve Y. Xiao > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > From aaron.s.knister at nasa.gov Thu Aug 3 01:48:07 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 20:48:07 -0400 Subject: [gpfsug-discuss] documentation about version compatibility Message-ID: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> Hey All, I swear that some time recently someone posted a link to some IBM documentation that outlined the recommended versions of GPFS to upgrade to/from (e.g. if you're at 3.5 get to 4.1 before going to 4.2.3). I can't for the life of me find it. Does anyone know what I'm talking about? Thanks, Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Aug 3 02:00:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 21:00:00 -0400 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: I'm a little late to the party here but I thought I'd share our recent experiences. We recently completed a mass UID number migration (half a billion inodes) and developed two tools ("luke filewalker" and the "mmilleniumfacl") to get the job done. Both luke filewalker and the mmilleniumfacl are based heavily on the code in /usr/lpp/mmfs/samples/util/tsreaddir.c and /usr/lpp/mmfs/samples/util/tsinode.c. luke filewalker targets traditional POSIX permissions whereas mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem in parallel and both but particularly the 2nd, are extremely I/O intensive on your metadata disks. 
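(Before the details: stripped of the GPFS-specific parts, the idea is simply "build an old-to-new UID map, then walk the tree and chown in parallel". A naive shell equivalent is sketched below purely for orientation -- the path, the uid_map.txt format and the parallelism are made up, and this find-per-UID approach is exactly the slow, stat-heavy path the tools described next avoid by driving the inode scan APIs directly.)

# uid_map.txt holds "<old_uid> <new_uid>" per line (hypothetical format);
# assumes the old and new UID ranges do not overlap
while read -r olduid newuid; do
    find /gpfs/fs1 -xdev -uid "$olduid" -print0 | \
        xargs -0 -r -P 8 -n 100 chown -h "$newuid"
done < uid_map.txt
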
The gist of luke filewalker is to scan the inode structures using the gpfs APIs and populate a mapping of inode number to gid and uid number. It then walks the filesystem in parallel using the APIs, looks up the inode number in an in-memory hash, and if appropriate changes ownership using the chown() API. The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs using the GPFS inode API so it walks the filesystem and reads the ACL of any and every file, updating the ACL entries as appropriate. I'm going to see if I can share the source code for both tools, although I don't know if I can post it here since it modified existing IBM source code. Could someone from IBM chime in here? If I were to send the code to IBM could they publish it perhaps on the wiki? -Aaron On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: > Hello, > > We're trying to change most of our users uids, is there a clean way to > migrate all of one users files with say `mmapplypolicy`? We have to change the > owner of around 273539588 files, and my estimates for runtime are around 6 days. > > What we've been doing is indexing all of the files and splitting them up by > owner which takes around an hour, and then we were locking the user out while we > chown their files. I made it multi threaded as it weirdly gave a 10% speedup > despite my expectation that multi threading access from a single node would not > give any speedup. > > Generally I'm looking for advice on how to make the chowning faster. Would > spreading the chowning processes over multiple nodes improve performance? Should > I not stat the files before running lchown on them, since lchown checks the file > before changing it? I saw mention of inodescan(), in an old gpfsug email, which > speeds up disk read access, by not guaranteeing that the data is up to date. We > have a maintenance day coming up where all users will be locked out, so the file > handles(?) from GPFS's perspective will not be able to go stale. Is there a > function with similar constraints to inodescan that I can use to speed up this > process? > > Thank you for your time, > > Luke > Storrs-HPC > University of Connecticut > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Thu Aug 3 02:03:23 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 2 Aug 2017 21:03:23 -0400 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: Oh, the one *huge* gotcha I thought I'd share-- we wrote a perl script to drive the migration and part of the perl script's process was to clone quotas from old uid numbers to the new number. I upset our GPFS cluster during a particular migration in which the user was over the grace period of the quota so after a certain point every chown() put the destination UID even further over its quota. The problem with this being that at this point every chown() operation would cause GPFS to do some cluster-wide quota accounting-related RPCs. That hurt. It's worth making sure there are no quotas defined for the destination UID numbers and if they are that the data coming from the source UID number will fit. 
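(A rough pre-flight check for that gotcha -- confirm none of the destination UIDs already carry a quota entry before the chowns start -- could be as simple as the lines below. The device name and map file are placeholders, and note that mmrepquota prints resolved user names where it can, so match on names instead if the new UIDs already exist in the passwd database.)

mmrepquota -u gpfs0 > current_user_quotas.txt
grep -wFf <(awk '{print $2}' uid_map.txt) current_user_quotas.txt
# or, for a single destination user: mmlsquota -u <username> gpfs0
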
-Aaron On 8/2/17 9:00 PM, Aaron Knister wrote: > I'm a little late to the party here but I thought I'd share our recent > experiences. > > We recently completed a mass UID number migration (half a billion > inodes) and developed two tools ("luke filewalker" and the > "mmilleniumfacl") to get the job done. Both luke filewalker and the > mmilleniumfacl are based heavily on the code in > /usr/lpp/mmfs/samples/util/tsreaddir.c and > /usr/lpp/mmfs/samples/util/tsinode.c. > > luke filewalker targets traditional POSIX permissions whereas > mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem in > parallel and both but particularly the 2nd, are extremely I/O intensive > on your metadata disks. > > The gist of luke filewalker is to scan the inode structures using the > gpfs APIs and populate a mapping of inode number to gid and uid number. > It then walks the filesystem in parallel using the APIs, looks up the > inode number in an in-memory hash, and if appropriate changes ownership > using the chown() API. > > The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs > using the GPFS inode API so it walks the filesystem and reads the ACL of > any and every file, updating the ACL entries as appropriate. > > I'm going to see if I can share the source code for both tools, although > I don't know if I can post it here since it modified existing IBM source > code. Could someone from IBM chime in here? If I were to send the code > to IBM could they publish it perhaps on the wiki? > > -Aaron > > On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: >> Hello, >> >> We're trying to change most of our users uids, is there a clean >> way to >> migrate all of one users files with say `mmapplypolicy`? We have to >> change the >> owner of around 273539588 files, and my estimates for runtime are >> around 6 days. >> >> What we've been doing is indexing all of the files and splitting >> them up by >> owner which takes around an hour, and then we were locking the user >> out while we >> chown their files. I made it multi threaded as it weirdly gave a 10% >> speedup >> despite my expectation that multi threading access from a single node >> would not >> give any speedup. >> >> Generally I'm looking for advice on how to make the chowning >> faster. Would >> spreading the chowning processes over multiple nodes improve >> performance? Should >> I not stat the files before running lchown on them, since lchown >> checks the file >> before changing it? I saw mention of inodescan(), in an old gpfsug >> email, which >> speeds up disk read access, by not guaranteeing that the data is up to >> date. We >> have a maintenance day coming up where all users will be locked out, >> so the file >> handles(?) from GPFS's perspective will not be able to go stale. Is >> there a >> function with similar constraints to inodescan that I can use to speed >> up this >> process? >> >> Thank you for your time, >> >> Luke >> Storrs-HPC >> University of Connecticut >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From scale at us.ibm.com Thu Aug 3 06:18:46 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Thu, 3 Aug 2017 13:18:46 +0800 Subject: [gpfsug-discuss] Gpfs Memory Usaage Keeps going up and we don't know why. 
In-Reply-To: <1500906086.571.9.camel@qmul.ac.uk> References: <261384244.3866909.1500901872347.ref@mail.yahoo.com><261384244.3866909.1500901872347@mail.yahoo.com><1500903047.571.7.camel@qmul.ac.uk><1770436429.3911327.1500905445052@mail.yahoo.com> <1500906086.571.9.camel@qmul.ac.uk> Message-ID: Can you provide the output of "pmap 4444"? If there's no "pmap" command on your system, then get the memory maps of mmfsd from file of /proc/4444/maps. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Peter Childs To: "gpfsug-discuss at spectrumscale.org" Date: 07/24/2017 10:22 PM Subject: Re: [gpfsug-discuss] Gpfs Memory Usaage Keeps going up and we don't know why. Sent by: gpfsug-discuss-bounces at spectrumscale.org top but ps gives the same value. [root at dn29 ~]# ps auww -q 4444 USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 4444 2.7 22.3 10537600 5472580 ? S wrote: I've had a look at mmfsadm dump malloc and it looks to agree with the output from mmdiag --memory. and does not seam to account for the excessive memory usage. The new machines do have idleSocketTimout set to 0 from what your saying it could be related to keeping that many connections between nodes working. Thanks in advance Peter. [root at dn29 ~]# mmdiag --memory === mmdiag: memory === mmfsd heap size: 2039808 bytes Statistics for MemoryPool id 1 ("Shared Segment (EPHEMERAL)") 128 bytes in use 17500049370 hard limit on memory usage 1048576 bytes committed to regions 1 number of regions 555 allocations 555 frees 0 allocation failures Statistics for MemoryPool id 2 ("Shared Segment") 42179592 bytes in use 17500049370 hard limit on memory usage 56623104 bytes committed to regions 9 number of regions 100027 allocations 79624 frees 0 allocation failures Statistics for MemoryPool id 3 ("Token Manager") 2099520 bytes in use 17500049370 hard limit on memory usage 16778240 bytes committed to regions 1 number of regions 4 allocations 0 frees 0 allocation failures On Mon, 2017-07-24 at 13:11 +0000, Jim Doherty wrote: There are 3 places that the GPFS mmfsd uses memory the pagepool plus 2 shared memory segments. To see the memory utilization of the shared memory segments run the command mmfsadm dump malloc . The statistics for memory pool id 2 is where maxFilesToCache/maxStatCache objects are and the manager nodes use memory pool id 3 to track the MFTC/MSC objects. You might want to upgrade to later PTF as there was a PTF to fix a memory leak that occurred in tscomm associated with network connection drops. On Monday, July 24, 2017 5:29 AM, Peter Childs wrote: We have two GPFS clusters. One is fairly old and running 4.2.1-2 and non CCR and the nodes run fine using up about 1.5G of memory and is consistent (GPFS pagepool is set to 1G, so that looks about right.) 
The other one is "newer" running 4.2.1-3 with CCR and the nodes keep increasing in there memory usage, starting at about 1.1G and are find for a few days however after a while they grow to 4.2G which when the node need to run real work, means the work can't be done. I'm losing track of what maybe different other than CCR, and I'm trying to find some more ideas of where to look. I'm checked all the standard things like pagepool and maxFilesToCache (set to the default of 4000), workerThreads is set to 128 on the new gpfs cluster (against default 48 on the old) I'm not sure what else to look at on this one hence why I'm asking the community. Thanks in advance Peter Childs ITS Research Storage Queen Mary University of London. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Peter Childs ITS Research Storage Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Peter Childs ITS Research Storage Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtucker at pixitmedia.com Thu Aug 3 07:42:37 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Thu, 3 Aug 2017 07:42:37 +0100 Subject: [gpfsug-discuss] documentation about version compatibility In-Reply-To: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> References: <6c34bd11-ee8e-f483-7030-dd2e836b42b9@nasa.gov> Message-ID: <0a283eb9-a458-bd2c-4e7b-1f46bb22e385@pixitmedia.com> Hi This is the Installation Guide of each target version under the section 'Migrating from to '. Jez On 03/08/17 01:48, Aaron Knister wrote: > Hey All, > > I swear that some time recently someone posted a link to some IBM > documentation that outlined the recommended versions of GPFS to > upgrade to/from (e.g. if you're at 3.5 get to 4.1 before going to > 4.2.3). I can't for the life of me find it. Does anyone know what I'm > talking about? > > Thanks, > Aaron > -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From jtucker at pixitmedia.com Thu Aug 3 07:46:36 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Thu, 3 Aug 2017 07:46:36 +0100 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: Perhaps IBM might consider letting you commit it to https://github.com/gpfsug/gpfsug-tools he says, asking out loud... It'll require a friendly IBMer to take the reins up for you. Scott? :-) Jez On 03/08/17 02:00, Aaron Knister wrote: > I'm a little late to the party here but I thought I'd share our recent > experiences. > > We recently completed a mass UID number migration (half a billion > inodes) and developed two tools ("luke filewalker" and the > "mmilleniumfacl") to get the job done. Both luke filewalker and the > mmilleniumfacl are based heavily on the code in > /usr/lpp/mmfs/samples/util/tsreaddir.c and > /usr/lpp/mmfs/samples/util/tsinode.c. > > luke filewalker targets traditional POSIX permissions whereas > mmilleniumfacl targets posix ACLs. Both tools traverse the filesystem > in parallel and both but particularly the 2nd, are extremely I/O > intensive on your metadata disks. > > The gist of luke filewalker is to scan the inode structures using the > gpfs APIs and populate a mapping of inode number to gid and uid > number. It then walks the filesystem in parallel using the APIs, looks > up the inode number in an in-memory hash, and if appropriate changes > ownership using the chown() API. > > The mmilleniumfacl doesn't have the luxury of scanning for POSIX ACLs > using the GPFS inode API so it walks the filesystem and reads the ACL > of any and every file, updating the ACL entries as appropriate. > > I'm going to see if I can share the source code for both tools, > although I don't know if I can post it here since it modified existing > IBM source code. Could someone from IBM chime in here? If I were to > send the code to IBM could they publish it perhaps on the wiki? > > -Aaron > > On 6/30/17 11:20 AM, hpc-luke at uconn.edu wrote: >> Hello, >> >> We're trying to change most of our users uids, is there a clean >> way to >> migrate all of one users files with say `mmapplypolicy`? We have to >> change the >> owner of around 273539588 files, and my estimates for runtime are >> around 6 days. >> >> What we've been doing is indexing all of the files and splitting >> them up by >> owner which takes around an hour, and then we were locking the user >> out while we >> chown their files. I made it multi threaded as it weirdly gave a 10% >> speedup >> despite my expectation that multi threading access from a single node >> would not >> give any speedup. >> >> Generally I'm looking for advice on how to make the chowning >> faster. Would >> spreading the chowning processes over multiple nodes improve >> performance? Should >> I not stat the files before running lchown on them, since lchown >> checks the file >> before changing it? I saw mention of inodescan(), in an old gpfsug >> email, which >> speeds up disk read access, by not guaranteeing that the data is up >> to date. We >> have a maintenance day coming up where all users will be locked out, >> so the file >> handles(?) from GPFS's perspective will not be able to go stale. Is >> there a >> function with similar constraints to inodescan that I can use to >> speed up this >> process? 
>> >> Thank you for your time, >> >> Luke >> Storrs-HPC >> University of Connecticut >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- *Jez Tucker* Head of Research and Development, Pixit Media 07764193820 | jtucker at pixitmedia.com www.pixitmedia.com | Tw:@pixitmedia.com -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan.buzzard at strath.ac.uk Thu Aug 3 09:49:26 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Thu, 03 Aug 2017 09:49:26 +0100 Subject: [gpfsug-discuss] Mass UID migration suggestions In-Reply-To: References: <59566c3b.kqOSy9e2kg80XZ/k%hpc-luke@uconn.edu> Message-ID: <1501750166.17548.43.camel@strath.ac.uk> On Wed, 2017-08-02 at 21:03 -0400, Aaron Knister wrote: > Oh, the one *huge* gotcha I thought I'd share-- we wrote a perl script > to drive the migration and part of the perl script's process was to > clone quotas from old uid numbers to the new number. I upset our GPFS > cluster during a particular migration in which the user was over the > grace period of the quota so after a certain point every chown() put the > destination UID even further over its quota. The problem with this being > that at this point every chown() operation would cause GPFS to do some > cluster-wide quota accounting-related RPCs. That hurt. It's worth making > sure there are no quotas defined for the destination UID numbers and if > they are that the data coming from the source UID number will fit. For similar reasons if you are doing a restore of a file system (any file system for that matter not just GPFS) for whatever reason, don't turn quotas back on till *after* the restore is complete. Well unless you can be sure a user is not going to go over quota during the restore. However as this is generally not possible to determine you end up with no quota's either set/enforced till the restore is complete. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From oehmes at gmail.com Thu Aug 3 14:06:49 2017 From: oehmes at gmail.com (Sven Oehme) Date: Thu, 03 Aug 2017 13:06:49 +0000 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: a trace during a mmfsck with the checksum parameters turned on would reveal it. the support team should be able to give you specific triggers to cut a trace during checksum errors , this way the trace is cut when the issue happens and then from the trace on server and client side one can extract which card was used on each side. 
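(purely as an illustration -- the exact trace levels and stop-on-error triggers should come from the support team -- manually bracketing the mmfsck run with a trace on the involved nodes would look roughly like:

mmtracectl --start -N nsd01,nsd02
mmfsck somefilesystem -nv
mmtracectl --stop -N nsd01,nsd02

with the node list and mmfsck flags above being placeholders)
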
sven On Wed, Aug 2, 2017 at 2:53 PM Stijn De Weirdt wrote: > hi steve, > > > The nsdChksum settings for none GNR/ESS based system is not officially > > supported. It will perform checksum on data transfer over the network > > only and can be used to help debug data corruption when network is a > > suspect. > i'll take not officially supported over silent bitrot any day. > > > > > Did any of those "Encountered XYZ checksum errors on network I/O to NSD > > Client disk" warning messages resulted in disk been changed to "down" > > state due to IO error? > no. > > If no disk IO error was reported in GPFS log, > > that means data was retransmitted successfully on retry. > we suspected as much. as sven already asked, mmfsck now reports clean > filesystem. > i have an ibdump of 2 involved nsds during the reported checksums, i'll > have a closer look if i can spot these retries. > > > > > As sven said, only GNR/ESS provids the full end to end data integrity. > so with the silent network error, we have high probabilty that the data > is corrupted. > > we are now looking for a test to find out what adapters are affected. we > hoped that nsdperf with verify=on would tell us, but it doesn't. > > > > > Steve Y. Xiao > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamiedavis at us.ibm.com Thu Aug 3 14:11:23 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Thu, 3 Aug 2017 13:11:23 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Fri Aug 4 06:02:22 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 4 Aug 2017 01:02:22 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: 4.2.2.3 I want to think maybe this started after expanding inode space On Thu, Aug 3, 2017 at 9:11 AM, James Davis wrote: > Hey, > > Hmm, your invocation looks valid to me. What's your GPFS level? > > Cheers, > > Jamie > > > ----- Original message ----- > From: "J. Eric Wonderley" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] mmsetquota produces error > Date: Wed, Aug 2, 2017 5:03 PM > > for one of our home filesystem we get: > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > 'Invalid argument'. > > > mmedquota -j home:nathanfootest > does work however > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From aaron.s.knister at nasa.gov Fri Aug 4 09:00:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 4 Aug 2017 04:00:35 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 Message-ID: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Hey All, Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some rather disconcerting behavior. Specifically on some of the upgraded nodes GPFS will seemingly deadlock on the entire node rendering it unusable. I can't even get a session on the node (but I can trigger a crash dump via a sysrq trigger). Most blocked tasks are blocked are in cxiWaitEventWait at the top of their call trace. That's probably not very helpful in of itself but I'm curious if anyone else out there has run into this issue or if this is a known bug. (I'll open a PMR later today once I've gathered more diagnostic information). -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From eric.wonderley at vt.edu Fri Aug 4 13:58:12 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 4 Aug 2017 08:58:12 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: i actually hit this assert and turned it in to support on this version: Build branch "4.2.2.3 efix6 (987197)". i was told do to exactly what sven mentioned. i thought it strange that i did NOT hit the assert in a no pass but hit it in a yes pass. On Thu, Aug 3, 2017 at 9:06 AM, Sven Oehme wrote: > a trace during a mmfsck with the checksum parameters turned on would > reveal it. > the support team should be able to give you specific triggers to cut a > trace during checksum errors , this way the trace is cut when the issue > happens and then from the trace on server and client side one can extract > which card was used on each side. > > sven > > On Wed, Aug 2, 2017 at 2:53 PM Stijn De Weirdt > wrote: > >> hi steve, >> >> > The nsdChksum settings for none GNR/ESS based system is not officially >> > supported. It will perform checksum on data transfer over the network >> > only and can be used to help debug data corruption when network is a >> > suspect. >> i'll take not officially supported over silent bitrot any day. >> >> > >> > Did any of those "Encountered XYZ checksum errors on network I/O to NSD >> > Client disk" warning messages resulted in disk been changed to "down" >> > state due to IO error? >> no. >> >> If no disk IO error was reported in GPFS log, >> > that means data was retransmitted successfully on retry. >> we suspected as much. as sven already asked, mmfsck now reports clean >> filesystem. >> i have an ibdump of 2 involved nsds during the reported checksums, i'll >> have a closer look if i can spot these retries. >> >> > >> > As sven said, only GNR/ESS provids the full end to end data integrity. >> so with the silent network error, we have high probabilty that the data >> is corrupted. >> >> we are now looking for a test to find out what adapters are affected. we >> hoped that nsdperf with verify=on would tell us, but it doesn't. >> >> > >> > Steve Y. 
Xiao >> > >> > >> > >> > >> > >> > >> > _______________________________________________ >> > gpfsug-discuss mailing list >> > gpfsug-discuss at spectrumscale.org >> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Aug 4 15:45:49 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 4 Aug 2017 16:45:49 +0200 Subject: [gpfsug-discuss] restrict user quota on specific filesets Message-ID: Hi, Is it possible to let users only write data in filesets where some quota is explicitly set ? We have independent filesets with quota defined for users that should have access in a specific fileset. The problem is when users using another fileset give eg global write access on their directories, the former users can write without limits, because it is by default 0 == no limits. Setting the quota on the file system will only restrict users quota in the root fileset, and setting quota for each user - fileset combination would be a huge mess. Setting default quotas does not work for existing users. Thank you !! Kenneth From aaron.s.knister at nasa.gov Fri Aug 4 16:02:04 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 4 Aug 2017 11:02:04 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 In-Reply-To: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> References: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Message-ID: I've narrowed the problem down to 4.1.1.16. We'll most likely be downgrading to 4.1.1.15. -Aaron On 8/4/17 4:00 AM, Aaron Knister wrote: > Hey All, > > Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? > > We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some rather > disconcerting behavior. Specifically on some of the upgraded nodes GPFS > will seemingly deadlock on the entire node rendering it unusable. I > can't even get a session on the node (but I can trigger a crash dump via > a sysrq trigger). > > Most blocked tasks are blocked are in cxiWaitEventWait at the top of > their call trace. That's probably not very helpful in of itself but I'm > curious if anyone else out there has run into this issue or if this is a > known bug. > > (I'll open a PMR later today once I've gathered more diagnostic > information). > > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From jonathan.buzzard at strath.ac.uk Fri Aug 4 16:15:44 2017 From: jonathan.buzzard at strath.ac.uk (Jonathan Buzzard) Date: Fri, 04 Aug 2017 16:15:44 +0100 Subject: [gpfsug-discuss] restrict user quota on specific filesets In-Reply-To: References: Message-ID: <1501859744.17548.69.camel@strath.ac.uk> On Fri, 2017-08-04 at 16:45 +0200, Kenneth Waegeman wrote: > Hi, > > Is it possible to let users only write data in filesets where some quota > is explicitly set ? > > We have independent filesets with quota defined for users that should > have access in a specific fileset. 
The problem is when users using > another fileset give eg global write access on their directories, the > former users can write without limits, because it is by default 0 == no > limits. Setting appropriate ACL's on the junction point of the fileset so that they can only write to file sets that they have permissions to is how you achieve this. I would say create groups and do it that way, but *nasty* things happen when you are a member of more than 16 supplemental groups and are using NFSv3 (NFSv4 and up is fine). So as long as that is not an issue go nuts with groups as it is much easier to manage. > Setting the quota on the file system will only restrict users quota in > the root fileset, and setting quota for each user - fileset combination > would be a huge mess. Setting default quotas does not work for existing > users. Not sure abusing the quota system for permissions a sensible approach. Put another way it was not designed with that purpose in mind so don't be surprised when you can't use it to do that. JAB. -- Jonathan A. Buzzard Tel: +44141-5483420 HPC System Administrator, ARCHIE-WeSt. University of Strathclyde, John Anderson Building, Glasgow. G4 0NG From ilan84 at gmail.com Sun Aug 6 09:26:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 11:26:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood Message-ID: Hi guys, I see IBM spectrumscale configure the NFS via command: mmnfs Is the command mmnfs is a wrapper on top of the normal kernel NFS (Kernel VFS) ? Is it a wrapper on top of ganesha NFS ? Or it is NFS implemented by SpectrumScale team ? Thanks -- - Ilan Schwarts From S.J.Thompson at bham.ac.uk Sun Aug 6 10:10:45 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Sun, 6 Aug 2017 09:10:45 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] Sent: 06 August 2017 09:26 To: gpfsug main discussion list Subject: [gpfsug-discuss] what is mmnfs under the hood Hi guys, I see IBM spectrumscale configure the NFS via command: mmnfs Is the command mmnfs is a wrapper on top of the normal kernel NFS (Kernel VFS) ? Is it a wrapper on top of ganesha NFS ? Or it is NFS implemented by SpectrumScale team ? Thanks -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Sun Aug 6 10:42:30 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 12:42:30 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, I cannot use ganesha NFS. 
How do I make NFS exports ? just editing all nodes /etc/exports is enough ? I should i use the CNFS as described here: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) wrote: > Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... > > Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. > > Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > Sent: 06 August 2017 09:26 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] what is mmnfs under the hood > > Hi guys, > > I see IBM spectrumscale configure the NFS via command: mmnfs > > Is the command mmnfs is a wrapper on top of the normal kernel NFS > (Kernel VFS) ? > Is it a wrapper on top of ganesha NFS ? > Or it is NFS implemented by SpectrumScale team ? > > > Thanks > > -- > > > - > Ilan Schwarts > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- - Ilan Schwarts From ilan84 at gmail.com Sun Aug 6 10:49:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 12:49:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. 
>> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts From S.J.Thompson at bham.ac.uk Sun Aug 6 11:54:17 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Sun, 6 Aug 2017 10:54:17 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: , Message-ID: What do you mean by cannot use mmsmb and cannot use Ganesha? Do you functionally you are not allowed to or they are not working for you? If it's the latter, then this should be resolvable. If you are under active maintenance you could try raising a ticket with IBM, though basic implementation is not really a support issue and so you may be better engaging a business partner or integrator to help you out. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] Sent: 06 August 2017 10:49 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] what is mmnfs under the hood I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. 
>> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Sun Aug 6 12:39:56 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Sun, 6 Aug 2017 14:39:56 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: In my case, I cannot use nfs ganesha, this means I cannot use mmsnb since its part of "ces", if i want to use cnfs i cannot combine it with ces.. so the system architecture need to solve this issue. On Aug 6, 2017 13:54, "Simon Thompson (IT Research Support)" < S.J.Thompson at bham.ac.uk> wrote: > What do you mean by cannot use mmsmb and cannot use Ganesha? Do you > functionally you are not allowed to or they are not working for you? > > If it's the latter, then this should be resolvable. If you are under > active maintenance you could try raising a ticket with IBM, though basic > implementation is not really a support issue and so you may be better > engaging a business partner or integrator to help you out. > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces@ > spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > Sent: 06 August 2017 10:49 > To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] what is mmnfs under the hood > > I have read this atricle: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2. > 0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm > > So, in a shortcut, CNFS cannot be used when sharing via CES. > I cannot use ganesha NFS. > > Is it possible to share a cluster via SMB and NFS without using CES ? > the nfs will be expored via CNFS but what about SMB ? i cannot use > mmsmb.. > > > On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > > I cannot use ganesha NFS. > > How do I make NFS exports ? just editing all nodes /etc/exports is > enough ? > > I should i use the CNFS as described here: > > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2. 
> 2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > > wrote: > >> Under the hood, the NFS services are provided by IBM supplied Ganesha > rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle > locking, ACLs, quota etc... > >> > >> Note it's different from using the cnfs support in Spectrum Scale which > uses Kernel NFS AFAIK. Using user space Ganesha means they have control of > the NFS stack, so if something needs patching/fixing, then can roll out new > Ganesha rpms rather than having to get (e.g.) RedHat to incorporate > something into kernel NFS. > >> > >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute > the config to the nodes. > >> > >> Simon > >> ________________________________________ > >> From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces@ > spectrumscale.org] on behalf of ilan84 at gmail.com [ilan84 at gmail.com] > >> Sent: 06 August 2017 09:26 > >> To: gpfsug main discussion list > >> Subject: [gpfsug-discuss] what is mmnfs under the hood > >> > >> Hi guys, > >> > >> I see IBM spectrumscale configure the NFS via command: mmnfs > >> > >> Is the command mmnfs is a wrapper on top of the normal kernel NFS > >> (Kernel VFS) ? > >> Is it a wrapper on top of ganesha NFS ? > >> Or it is NFS implemented by SpectrumScale team ? > >> > >> > >> Thanks > >> > >> -- > >> > >> > >> - > >> Ilan Schwarts > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >> _______________________________________________ > >> gpfsug-discuss mailing list > >> gpfsug-discuss at spectrumscale.org > >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > -- > > > > > > - > > Ilan Schwarts > > > > -- > > > - > Ilan Schwarts > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Greg.Lehmann at csiro.au Mon Aug 7 05:58:13 2017 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Mon, 7 Aug 2017 04:58:13 +0000 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: It would be nice to know why you cannot use ganesha or mmsmb. You don't have to use protocols or CES. We are migrating to CES from doing our own thing with NFS and samba on Debian. Debian does not have support for CES, so we had to roll our own. We did not use CNFS either. To get to CES we had to change OS. We did this because we valued the support. I'd say the failover works better with CES than with our solution, particularly with regards failing over and Infiniband IP address. 
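For anyone weighing up a similar move, a few read-only commands give a quick picture of what CES is managing and how its IP failover behaves. The output is of course site-specific, and mmhealth is only available on 4.2.1 and later, so treat this as a sketch rather than a checklist:

# which protocol services are enabled, and their state on each protocol node
mmces service list -a
mmces state show -a

# where the floating CES IPs currently live (they move when a node fails)
mmces address list

# overall protocol-node health, on 4.2.1 or newer
mmhealth node show CES

Watching mmces address list before and after taking a protocol node down is a simple way to confirm the failover behaviour described above, including an IP that rides on an Infiniband interface.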
-----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Ilan Schwarts Sent: Sunday, 6 August 2017 7:50 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] what is mmnfs under the hood I have read this atricle: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_ces_migrationcnfstoces.htm So, in a shortcut, CNFS cannot be used when sharing via CES. I cannot use ganesha NFS. Is it possible to share a cluster via SMB and NFS without using CES ? the nfs will be expored via CNFS but what about SMB ? i cannot use mmsmb.. On Sun, Aug 6, 2017 at 12:42 PM, Ilan Schwarts wrote: > I have gpfs (spectrum scale 4.2.2.0) and I wish to create NFS exports, > I cannot use ganesha NFS. > How do I make NFS exports ? just editing all nodes /etc/exports is enough ? > I should i use the CNFS as described here: > https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.sp > ectrum.scale.v4r22.doc/bl1adv_cnfssetup.htm > > > > > On Sun, Aug 6, 2017 at 12:10 PM, Simon Thompson (IT Research Support) > wrote: >> Under the hood, the NFS services are provided by IBM supplied Ganesha rpms. It's then fully supported by IBM, e.g. the GPFS VFS later to handle locking, ACLs, quota etc... >> >> Note it's different from using the cnfs support in Spectrum Scale which uses Kernel NFS AFAIK. Using user space Ganesha means they have control of the NFS stack, so if something needs patching/fixing, then can roll out new Ganesha rpms rather than having to get (e.g.) RedHat to incorporate something into kernel NFS. >> >> Mmnfs is a wrapper round the config of Ganesha using CCR to distribute the config to the nodes. >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of >> ilan84 at gmail.com [ilan84 at gmail.com] >> Sent: 06 August 2017 09:26 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] what is mmnfs under the hood >> >> Hi guys, >> >> I see IBM spectrumscale configure the NFS via command: mmnfs >> >> Is the command mmnfs is a wrapper on top of the normal kernel NFS >> (Kernel VFS) ? >> Is it a wrapper on top of ganesha NFS ? >> Or it is NFS implemented by SpectrumScale team ? >> >> >> Thanks >> >> -- >> >> >> - >> Ilan Schwarts >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > > - > Ilan Schwarts -- - Ilan Schwarts _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From ilan84 at gmail.com Mon Aug 7 14:27:07 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Mon, 7 Aug 2017 16:27:07 +0300 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Hi all, My setup is 2 nodes GPFS and 1 machine as NFS Client. All machines (3 total) run CentOS 7.2 The 3rd CentOS machine (not part of the cluster) used as NFS Client. 
I mount the NFS Client machine to one of the nodes: mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 This gives me the following: [root at CentOS7286-64 ~]# mount -v | grep gpfs 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen=255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys,clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) Now, From the Client NFS Machine, I go to the mount directory ("cd /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I use nfs4_getfacl: [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 Operation to request attribute not supported. [root at CentOS7286-64 nfs4]# >From the NODE machine i see the status: [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 flag value description ------------------- ------------------------ ----------------------------------- -f 8192 Minimum fragment size in bytes -i 4096 Inode size in bytes -I 16384 Indirect block size in bytes -m 1 Default number of metadata replicas -M 2 Maximum number of metadata replicas -r 1 Default number of data replicas -R 2 Maximum number of data replicas -j cluster Block allocation type -D nfs4 File locking semantics in effect -k nfs4 ACL semantics in effect -n 32 Estimated number of nodes that will mount file system -B 262144 Block size -Q none Quotas accounting enabled none Quotas enforced none Default quotas enabled --perfileset-quota No Per-fileset quota enforcement --filesetdf No Fileset df enabled? -V 16.00 (4.2.2.0) File system version --create-time Wed Jul 5 12:28:39 2017 File system creation time -z No Is DMAPI enabled? -L 4194304 Logfile size -E Yes Exact mtime mount option -S No Suppress atime mount option -K whenpossible Strict replica allocation option --fastea Yes Fast external attributes enabled? --encryption No Encryption enabled? --inode-limit 171840 Maximum number of inodes in all inode spaces --log-replicas 0 Number of log replicas --is4KAligned Yes is4KAligned? --rapid-repair Yes rapidRepair enabled? --write-cache-threshold 0 HAWC Threshold (max 65536) -P system Disk storage pools in file system -d nynsd1;nynsd2 Disks in file system -A yes Automatic mount option -o none Additional mount options -T /fs_gpfs01 Default mount point --mount-priority 0 Mount priority I saw this thread: https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 Is it still relevant ? Since 2014.. Thanks ! From makaplan at us.ibm.com Mon Aug 7 17:48:39 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 7 Aug 2017 12:48:39 -0400 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Indeed. You can consider and use GPFS/Spectrum Scale as "just another" file system type that can be loaded into/onto a Linux system. But you should consider the pluses and minuses of using other software subsystems that may or may not be designed to work better or inter-operate with Spectrum Scale specific features and APIs. -------------- next part -------------- An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Mon Aug 7 18:14:41 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Mon, 7 Aug 2017 20:14:41 +0300 Subject: [gpfsug-discuss] what is mmnfs under the hood In-Reply-To: References: Message-ID: Thanks for response. I am not a system engineer / storage architect. I maintain kernel module that interact with file system drivers.. so I need to configure gpfs and perform tests.. 
for example I noticed that gpfs set extended attribute does not go via VFS On Aug 7, 2017 19:48, "Marc A Kaplan" wrote: > Indeed. You can consider and use GPFS/Spectrum Scale as "just another" > file system type that can be loaded into/onto a Linux system. > > But you should consider the pluses and minuses of using other software > subsystems that may or may not be designed to work better or inter-operate > with Spectrum Scale specific features and APIs. > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Mon Aug 7 21:27:03 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 07 Aug 2017 16:27:03 -0400 Subject: [gpfsug-discuss] 'ltfsee info tapes' - Unusable tapes... Message-ID: <8652.1502137623@turing-police.cc.vt.edu> The LTFSEE docs say: https://www.ibm.com/support/knowledgecenter/en/ST9MBR_1.2.3/ltfs_ee_ltfsee_info_tapes.html "Unusable The Unusable status indicates that the tape can't be used. To change the status, remove the tape from the pool by using the ltfsee pool remove command with the -r option. Then, add the tape back into the pool by using the ltfsee pool add command." Do they really mean that? What happens to data that was on the tape? Does the 'pool add' command re-import LTFS's knowledge of what files were on that tape? It's one thing to remove/add tapes with no files on them - but I'm leery of doing it for tapes that contain migrated data, given a lack of clear statement that file index recovery is done at 'pool add' time. (We had a tape get stuck in a drive, and LTFS/EE tried to use the drive, wasn't able to load the tape because the drive was occupied, marked the tape as Unusable. Lather rinse repeat until there's no usable tapes left in the pool... but that's a different issue...) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From jamiedavis at us.ibm.com Mon Aug 7 22:10:06 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Mon, 7 Aug 2017 21:10:06 +0000 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From ilan84 at gmail.com Tue Aug 8 05:28:20 2017 From: ilan84 at gmail.com (Ilan Schwarts) Date: Tue, 8 Aug 2017 07:28:20 +0300 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: Hi, The command should work from server side i know.. but isnt the scenario of: Root user, that is mounted via nfsv4 to a gpfs filesystem, cannot edit any of the mounted files/dirs acls? The acls are editable only from server side? Thanks! On Aug 8, 2017 00:10, "James Davis" wrote: > Hi Ilan, > > 1. Your command might work from the server side; you said you tried it > from the client side. Could you find anything in the docs about this? I > could not. > > 2. I can share this NFSv4-themed wrapper around mmputacl if it would be > useful to you. You would have to run it from the GPFS side, not the NFS > client side. > > Regards, > > Jamie > > # ./updateNFSv4ACL -h > Update the NFSv4 ACL governing a file's access permissions. > Appends to the existing ACL, overwriting conflicting permissions. 
> Usage: ./updateNFSv4ACL -file /path/to/file { ADD_PERM_SPEC | > DEL_PERM_SPEC }+ > ADD_PERM_SPEC: { -owningUser PERM | -owningGroup PERM | -other PERM | > -ace nameType:name:PERM:aceType } > DEL_PERM_SPEC: { -noACEFor nameType:name } > PERM: Specify a string composed of one or more of the following letters > in no particular order: > r (ead) > w (rite) > a (ppend) Must agree with write > x (execute) > d (elete) > D (elete child) Dirs only > t (read attrs) > T (write attrs) > c (read ACL) > C (write ACL) > o (change owner) > You can also provide these, but they will have no effect in GPFS: > n (read named attrs) > N (write named attrs) > y (support synchronous I/O) > > To indicate no permissions, give a - > nameType: 'user' or 'group'. > aceType: 'allow' or 'deny'. > Examples: ./updateNFSv4ACL -file /fs1/f -owningUser rtc -owningGroup > rwaxdtc -other '-' > Assign these permissions to 'owner', 'group', 'other'. > ./updateNFSv4ACL -file /fs1/f -ace 'user:pfs001:rtc:allow' > -noACEFor 'group:fvt001' > Allow user pfs001 read/read attrs/read ACL permission > Remove all ACEs (allow and deny) for group fvt001. > Notes: > Permissions you do not allow are denied by default. > See the GPFS docs for some other restrictions. > ace is short for Access Control Entry > > > ----- Original message ----- > From: Ilan Schwarts > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster > Date: Mon, Aug 7, 2017 9:27 AM > > Hi all, > My setup is 2 nodes GPFS and 1 machine as NFS Client. > All machines (3 total) run CentOS 7.2 > > The 3rd CentOS machine (not part of the cluster) used as NFS Client. > > I mount the NFS Client machine to one of the nodes: mount -t nfs > 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 > > This gives me the following: > > [root at CentOS7286-64 ~]# mount -v | grep gpfs > 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 > (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen= > 255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys, > clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) > > Now, From the Client NFS Machine, I go to the mount directory ("cd > /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I > use nfs4_getfacl: > [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 > Operation to request attribute not supported. > [root at CentOS7286-64 nfs4]# > > From the NODE machine i see the status: > [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 > flag value description > ------------------- ------------------------ ------------------------------ > ----- > -f 8192 Minimum fragment size in bytes > -i 4096 Inode size in bytes > -I 16384 Indirect block size in bytes > -m 1 Default number of metadata > replicas > -M 2 Maximum number of metadata > replicas > -r 1 Default number of data > replicas > -R 2 Maximum number of data > replicas > -j cluster Block allocation type > -D nfs4 File locking semantics in > effect > -k nfs4 ACL semantics in effect > -n 32 Estimated number of nodes > that will mount file system > -B 262144 Block size > -Q none Quotas accounting enabled > none Quotas enforced > none Default quotas enabled > --perfileset-quota No Per-fileset quota enforcement > --filesetdf No Fileset df enabled? > -V 16.00 (4.2.2.0) File system version > --create-time Wed Jul 5 12:28:39 2017 File system creation time > -z No Is DMAPI enabled? 
> -L 4194304 Logfile size > -E Yes Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation > option > --fastea Yes Fast external attributes > enabled? > --encryption No Encryption enabled? > --inode-limit 171840 Maximum number of inodes > in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? > --write-cache-threshold 0 HAWC Threshold (max 65536) > -P system Disk storage pools in file > system > -d nynsd1;nynsd2 Disks in file system > -A yes Automatic mount option > -o none Additional mount options > -T /fs_gpfs01 Default mount point > --mount-priority 0 Mount priority > > > > I saw this thread: > https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 > > Is it still relevant ? Since 2014.. > > Thanks ! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Tue Aug 8 05:50:10 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Tue, 8 Aug 2017 10:20:10 +0530 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. => /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chetkulk at in.ibm.com Tue Aug 8 17:30:13 2017 From: chetkulk at in.ibm.com (Chetan R Kulkarni) Date: Tue, 8 Aug 2017 22:00:13 +0530 Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster In-Reply-To: References: Message-ID: (seems my earlier reply created a new topic; hence trying to reply back original thread started by Ilan Schwarts...) >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. 
=> /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. From: gpfsug-discuss-request at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Date: 08/08/2017 04:30 PM Subject: gpfsug-discuss Digest, Vol 67, Issue 21 Sent by: gpfsug-discuss-bounces at spectrumscale.org Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. Re: How to use nfs4_getfacl (or set) on GPFS cluster (Ilan Schwarts) 2. How to use nfs4_getfacl (or set) on GPFS cluster (Chetan R Kulkarni) ---------------------------------------------------------------------- Message: 1 Date: Tue, 8 Aug 2017 07:28:20 +0300 From: Ilan Schwarts To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Content-Type: text/plain; charset="utf-8" Hi, The command should work from server side i know.. but isnt the scenario of: Root user, that is mounted via nfsv4 to a gpfs filesystem, cannot edit any of the mounted files/dirs acls? The acls are editable only from server side? Thanks! On Aug 8, 2017 00:10, "James Davis" wrote: > Hi Ilan, > > 1. Your command might work from the server side; you said you tried it > from the client side. Could you find anything in the docs about this? I > could not. > > 2. I can share this NFSv4-themed wrapper around mmputacl if it would be > useful to you. You would have to run it from the GPFS side, not the NFS > client side. > > Regards, > > Jamie > > # ./updateNFSv4ACL -h > Update the NFSv4 ACL governing a file's access permissions. > Appends to the existing ACL, overwriting conflicting permissions. > Usage: ./updateNFSv4ACL -file /path/to/file { ADD_PERM_SPEC | > DEL_PERM_SPEC }+ > ADD_PERM_SPEC: { -owningUser PERM | -owningGroup PERM | -other PERM | > -ace nameType:name:PERM:aceType } > DEL_PERM_SPEC: { -noACEFor nameType:name } > PERM: Specify a string composed of one or more of the following letters > in no particular order: > r (ead) > w (rite) > a (ppend) Must agree with write > x (execute) > d (elete) > D (elete child) Dirs only > t (read attrs) > T (write attrs) > c (read ACL) > C (write ACL) > o (change owner) > You can also provide these, but they will have no effect in GPFS: > n (read named attrs) > N (write named attrs) > y (support synchronous I/O) > > To indicate no permissions, give a - > nameType: 'user' or 'group'. > aceType: 'allow' or 'deny'. > Examples: ./updateNFSv4ACL -file /fs1/f -owningUser rtc -owningGroup > rwaxdtc -other '-' > Assign these permissions to 'owner', 'group', 'other'. > ./updateNFSv4ACL -file /fs1/f -ace 'user:pfs001:rtc:allow' > -noACEFor 'group:fvt001' > Allow user pfs001 read/read attrs/read ACL permission > Remove all ACEs (allow and deny) for group fvt001. > Notes: > Permissions you do not allow are denied by default. 
> See the GPFS docs for some other restrictions. > ace is short for Access Control Entry > > > ----- Original message ----- > From: Ilan Schwarts > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster > Date: Mon, Aug 7, 2017 9:27 AM > > Hi all, > My setup is 2 nodes GPFS and 1 machine as NFS Client. > All machines (3 total) run CentOS 7.2 > > The 3rd CentOS machine (not part of the cluster) used as NFS Client. > > I mount the NFS Client machine to one of the nodes: mount -t nfs > 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 > > This gives me the following: > > [root at CentOS7286-64 ~]# mount -v | grep gpfs > 10.10.158.61:/fs_gpfs01/nfs on /mnt/nfs4 type nfs4 > (rw,relatime,vers=4.0,rsize=524288,wsize=524288,namlen= > 255,hard,proto=tcp,port=0,timeo=600,retrans=2,sec=sys, > clientaddr=10.10.149.188,local_lock=none,addr=10.10.158.61) > > Now, From the Client NFS Machine, I go to the mount directory ("cd > /mnt/nfs4") and try to set an acl. Since NFSv4 should be supported, I > use nfs4_getfacl: > [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 > Operation to request attribute not supported. > [root at CentOS7286-64 nfs4]# > > From the NODE machine i see the status: > [root at LH20-GPFS1 fs_gpfs01]# mmlsfs fs_gpfs01 > flag value description > ------------------- ------------------------ ------------------------------ > ----- > -f 8192 Minimum fragment size in bytes > -i 4096 Inode size in bytes > -I 16384 Indirect block size in bytes > -m 1 Default number of metadata > replicas > -M 2 Maximum number of metadata > replicas > -r 1 Default number of data > replicas > -R 2 Maximum number of data > replicas > -j cluster Block allocation type > -D nfs4 File locking semantics in > effect > -k nfs4 ACL semantics in effect > -n 32 Estimated number of nodes > that will mount file system > -B 262144 Block size > -Q none Quotas accounting enabled > none Quotas enforced > none Default quotas enabled > --perfileset-quota No Per-fileset quota enforcement > --filesetdf No Fileset df enabled? > -V 16.00 (4.2.2.0) File system version > --create-time Wed Jul 5 12:28:39 2017 File system creation time > -z No Is DMAPI enabled? > -L 4194304 Logfile size > -E Yes Exact mtime mount option > -S No Suppress atime mount option > -K whenpossible Strict replica allocation > option > --fastea Yes Fast external attributes > enabled? > --encryption No Encryption enabled? > --inode-limit 171840 Maximum number of inodes > in all inode spaces > --log-replicas 0 Number of log replicas > --is4KAligned Yes is4KAligned? > --rapid-repair Yes rapidRepair enabled? > --write-cache-threshold 0 HAWC Threshold (max 65536) > -P system Disk storage pools in file > system > -d nynsd1;nynsd2 Disks in file system > -A yes Automatic mount option > -o none Additional mount options > -T /fs_gpfs01 Default mount point > --mount-priority 0 Mount priority > > > > I saw this thread: > https://serverfault.com/questions/655112/nfsv4-acls-on-gpfs/722200 > > Is it still relevant ? Since 2014.. > > Thanks ! > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170808/0e20196d/attachment-0001.html > ------------------------------ Message: 2 Date: Tue, 8 Aug 2017 10:20:10 +0530 From: "Chetan R Kulkarni" To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] How to use nfs4_getfacl (or set) on GPFS cluster Message-ID: Content-Type: text/plain; charset="us-ascii" >> # mount -t nfs 10.10.158.61:/fs_gpfs01/nfs /mnt/nfs4 >> [root at CentOS7286-64 nfs4]# nfs4_getfacl mydir11 >> Operation to request attribute not supported. >> [root at CentOS7286-64 nfs4]# On my test setup (rhel7.3 nodes gpfs cluster and rhel7.2 nfs client); I can successfully read nfsv4 acls (nfs4_getfacl). Can you please try following on your setup? 1> capture network packets for above failure and check what does nfs server return to GETATTR ? => tcpdump -i any host 10.10.158.61 -w /tmp/getfacl.cap &; nfs4_getfacl mydir11; kill %1 2> Also check nfs4_getfacl version is up to date. => /usr/bin/nfs4_getfacl -H 3> If above doesn't help; then make sure you have sufficient nfsv4 acls to read acls (as per my understanding; for reading nfsv4 acls; one needs EXEC_SEARCH on /fs_gpfs01/nfs and READ_ACL on /fs_gpfs01/nfs/mydir11). => mmgetacl -k nfs4 /fs_gpfs01/nfs => mmgetacl -k nfs4 /fs_gpfs01/nfs/mydir11 Thanks, Chetan. -------------- next part -------------- An HTML attachment was scrubbed... URL: < http://gpfsug.org/pipermail/gpfsug-discuss/attachments/20170808/42fbe6c2/attachment-0001.html > ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 67, Issue 21 ********************************************** -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From stefan.dietrich at desy.de Tue Aug 8 18:16:33 2017 From: stefan.dietrich at desy.de (Dietrich, Stefan) Date: Tue, 8 Aug 2017 19:16:33 +0200 (CEST) Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS Message-ID: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Hello, I am currently trying to understand an issue with ACLs and how GPFS handles the umask. The filesystem is configured for NFS4 ACLs only (-k nfs4), filesets have been configured for chmodAndUpdateACL and the access is through a native GPFS client (v4.2.3). If I create a new file in a directory, which has an ACE with inheritance, the configured umask on the shell is completely ignored. The new file only contains ACEs from the inherited ACL. As soon as the ACE with inheritance is removed, newly created files receive the correct configured umask. Obvious downside, no ACLs anymore :( Additionally, it looks like that the specified mode bits for an open call are ignored as well. E.g. with an strace I see, that the open call includes the correct mode bits. However, the new file only has inherited ACEs. According to the NFSv4 RFC, the behavior is more or less undefined, only with NFSv4.2 umask will be added to the protocol. For GPFS, I found a section in the traditional ACL administration section, but nothing in the NFS4 ACL section of the docs. Is my current observation the intended behavior of GPFS? 
Regards, Stefan -- ------------------------------------------------------------------------ Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany ------------------------------------------------------------------------ From kkr at lbl.gov Tue Aug 8 19:33:22 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Tue, 8 Aug 2017 11:33:22 -0700 Subject: [gpfsug-discuss] Hold the Date - Spectrum Scale Day @ HPCXXL (Sept 2017, NYC) In-Reply-To: References: Message-ID: Hello, The GPFS Day of the HPCXXL conference is confirmed for Thursday, September 28th. Here is an updated URL, the agenda and registration are still being put together http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ . The GPFS Day will require registration, so we can make sure there is enough room (and coffee/food) for all attendees ?however, there will be no registration fee if you attend the GPFS Day only. I?ll send another update when the agenda is closer to settled. Cheers, Kristy > On Jul 7, 2017, at 3:32 PM, Kristy Kallback-Rose wrote: > > Hello, > > More details will be provided as they become available, but just so you can make a placeholder on your calendar, there will be a Spectrum Scale Day the week of September 25th - 29th, likely Thursday, September 28th. > > This will be a part of the larger HPCXXL meeting (https://www.spxxl.org/?q=New-York-City-2017 ). You may recall this group was formerly called SPXXL and the website is in the process of transitioning to the new name (and getting a new certificate). You will be able to attend *just* the Spectrum Scale day if that is the only portion of the event you would like to attend. > > The NYC location is great for Spectrum Scale events because many IBMers, including developers, can come in from Poughkeepsie. > > More as we get closer to the date and details are settled. > > Cheers, > Kristy -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Tue Aug 8 20:28:31 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Tue, 8 Aug 2017 14:28:31 -0500 Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS In-Reply-To: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> References: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Message-ID: Yes, that is the intended behavior. As in the section on traditional ACLs that you found, the intent is that if there is a default/inherited ACL, the object is created with that (and if there is no default/inherited ACL, then the mode and umask are the basis for the initial set of permissions). Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. 
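A quick way to see, and if desired change, that behaviour is to look at the parent directory's NFSv4 ACL directly on a GPFS node. The path below is hypothetical, and the usual caveat applies that mmputacl replaces the whole ACL, so capture it to a file first:

# entries flagged FileInherit/DirInherit are the ones copied to new objects,
# which is why the shell umask appears to be ignored
mmgetacl -k nfs4 /gpfs/fs1/shared

# save the ACL, edit it (e.g. drop the inherit flags), and put it back ...
mmgetacl -k nfs4 -o /tmp/shared.acl /gpfs/fs1/shared
mmputacl -i /tmp/shared.acl /gpfs/fs1/shared

# ... or do the same interactively in $EDITOR
mmeditacl -k nfs4 /gpfs/fs1/shared

With no inheritable entries left on the directory, newly created files fall back to the mode bits plus umask, matching the behaviour Stefan observed once the inherited ACE was removed.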
From: "Dietrich, Stefan" To: gpfsug-discuss at spectrumscale.org Date: 08/08/2017 12:17 PM Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, I am currently trying to understand an issue with ACLs and how GPFS handles the umask. The filesystem is configured for NFS4 ACLs only (-k nfs4), filesets have been configured for chmodAndUpdateACL and the access is through a native GPFS client (v4.2.3). If I create a new file in a directory, which has an ACE with inheritance, the configured umask on the shell is completely ignored. The new file only contains ACEs from the inherited ACL. As soon as the ACE with inheritance is removed, newly created files receive the correct configured umask. Obvious downside, no ACLs anymore :( Additionally, it looks like that the specified mode bits for an open call are ignored as well. E.g. with an strace I see, that the open call includes the correct mode bits. However, the new file only has inherited ACEs. According to the NFSv4 RFC, the behavior is more or less undefined, only with NFSv4.2 umask will be added to the protocol. For GPFS, I found a section in the traditional ACL administration section, but nothing in the NFS4 ACL section of the docs. Is my current observation the intended behavior of GPFS? Regards, Stefan -- ------------------------------------------------------------------------ Stefan Dietrich Deutsches Elektronen-Synchrotron (IT-Systems) Ein Forschungszentrum der Helmholtz-Gemeinschaft Notkestr. 85 phone: +49-40-8998-4696 22607 Hamburg e-mail: stefan.dietrich at desy.de Germany ------------------------------------------------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Tue Aug 8 22:27:20 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 8 Aug 2017 17:27:20 -0400 Subject: [gpfsug-discuss] NFS4 ACLs and umask on GPFS In-Reply-To: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> References: <1818402257.4746864.1502212593085.JavaMail.zimbra@desy.de> Message-ID: (IMO) NFSv4 ACLs are complicated. Confusing. Difficult. Befuddling. PIA. Before questioning the GPFS implementation, see how they work in other file systems. If GPFS does it differently, perhaps there is a rationale, or perhaps you've found a bug. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Tomasz.Wolski at ts.fujitsu.com Wed Aug 9 11:32:32 2017 From: Tomasz.Wolski at ts.fujitsu.com (Tomasz.Wolski at ts.fujitsu.com) Date: Wed, 9 Aug 2017 10:32:32 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: <09520659d6cb44a1bbbed066106b39a2@R01UKEXCASM223.r01.fujitsu.local> Hello Experts, Does GPFS start "down" disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in "down" state by GPFS. Will GPFS in such case start the disk? With best regards / Mit freundlichen Gr??en / Pozdrawiam Tomasz Wolski Development Engineer NDC Eternus CS HE (ET2) [cid:image002.gif at 01CE62B9.8ACFA960] FUJITSU Fujitsu Technology Solutions Sp. z o.o. Textorial Park Bldg C, ul. 
Fabryczna 17 90-344 Lodz, Poland E-mail: Tomasz.Wolski at ts.fujitsu.com Web: ts.fujitsu.com Company details: ts.fujitsu.com/imprint This communication contains information that is confidential, proprietary in nature and/or privileged. It is for the exclusive use of the intended recipient(s). If you are not the intended recipient(s) or the person responsible for delivering it to the intended recipient(s), please note that any form of dissemination, distribution or copying of this communication is strictly prohibited and may be unlawful. If you have received this communication in error, please immediately notify the sender and delete the original communication. Thank you for your cooperation. Please be advised that neither Fujitsu, its affiliates, its employees or agents accept liability for any errors, omissions or damages caused by delays of receipt or by any virus infection in this message or its attachments, or which may otherwise arise as a result of this e-mail transmission. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 2774 bytes Desc: image001.gif URL: From chris.schlipalius at pawsey.org.au Wed Aug 9 11:50:22 2017 From: chris.schlipalius at pawsey.org.au (Chris Schlipalius) Date: Wed, 9 Aug 2017 18:50:22 +0800 Subject: [gpfsug-discuss] Announcement of the next Australian SpectrumScale User Group - Half day August 2017 (Melbourne) References: Message-ID: <0190993E-B870-4A37-9671-115A1201A59D@pawsey.org.au> Hello, we have a half day (afternoon) usergroup next week. Please check out the event registration link below for tickets, speakers and topics. https://goo.gl/za8g3r Regards, Chris Schlipalius Lead Organiser Spectrum Scale Usergroups Australia Senior Storage Infrastructure Specialist, The Pawsey Supercomputing Centre From Robert.Oesterlin at nuance.com Wed Aug 9 13:14:46 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 9 Aug 2017 12:14:46 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: By default, GPFS does not automatically start down disks. You could add a callback "downdisk" via mmaddcallback that could trigger a "mmchdisk start" if you wanted. If a disk is marked down, it's better to determine why before trying to start it, as it may involve other issues that need investigation. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of "Tomasz.Wolski at ts.fujitsu.com" Reply-To: gpfsug main discussion list Date: Wednesday, August 9, 2017 at 6:33 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs automatically? Does GPFS start "down" disks in a filesystem automatically? For instance, when connection to the NSD is recovered, but in the meantime the disk was put in "down" state by GPFS. Will GPFS in such case start the disk? -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Wed Aug 9 13:22:57 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 9 Aug 2017 14:22:57 +0200 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? In-Reply-To: References: Message-ID: If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. That can be quite useful for stretched clusters, where you want to replicate all blocks to both locations, and this way recover automatically.
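A sketch of the kind of callback described above is shown here. The script path is a placeholder, and the event name and parameter variables are taken from my reading of the mmaddcallback documentation, so verify them against the level you run before relying on this:

# fire a script on the file system manager when a disk goes down; the script
# can page an admin, or decide whether an automatic "mmchdisk <fs> start" is safe
mmaddcallback diskDownHandler \
    --command /usr/local/sbin/disk_down_handler.sh \
    --event diskFailure \
    --parms "%eventName %fsName %diskName"

# confirm what is registered (restripeOnDiskFailure=yes adds its own callbacks)
mmlscallback

As noted above, blindly restarting a down disk hides whatever made it go down in the first place, so a callback like this is often better used to alert than to auto-start.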
-jf On Wed, Aug 9, 2017 at 2:14 PM, Oesterlin, Robert < Robert.Oesterlin at nuance.com> wrote: > By default, GPFS does not automatically start down disks. You could add a > callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? > if you wanted. If a disk is marked down, it?s better to determine why > before trying to start it as it may involve other issues that need > investigation. > > > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > > > > > > > > *From: * on behalf of " > Tomasz.Wolski at ts.fujitsu.com" > *Reply-To: *gpfsug main discussion list > *Date: *Wednesday, August 9, 2017 at 6:33 AM > *To: *"gpfsug-discuss at spectrumscale.org" > > *Subject: *[EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs > automatically? > > > > > > Does GPFS start ?down? disks in a filesystem automatically? For instance, > when connection to NSD is recovered, but it the meantime disk was put in > ?down? state by GPFS. Will GPFS in such case start the disk? > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Wed Aug 9 13:48:00 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 9 Aug 2017 12:48:00 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? Message-ID: <3F6EFDF7-B96B-4E89-ABFE-4EEBEE0C0878@nuance.com> Be careful here, as this does: ?When a disk experiences a failure and becomes unavailable, the recovery procedure will first attempt to restart the disk and if this fails, the disk is suspended and its data moved to other disks. ? Which may not be what you want to happen. :-) If you have disks marked down due to a transient failure, kicking of restripes to move the data off might not be the best choice. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Jan-Frode Myklebust Reply-To: gpfsug main discussion list Date: Wednesday, August 9, 2017 at 8:23 AM To: gpfsug main discussion list Subject: [EXTERNAL] Re: [gpfsug-discuss] Is GPFS starting NSDs automatically? If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 9 16:04:35 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 9 Aug 2017 15:04:35 +0000 Subject: [gpfsug-discuss] Is GPFS starting NSDs automatically? In-Reply-To: References: Message-ID: <44e67260b5104860adaaf8222e11e995@jumptrading.com> For non-stretch clusters, I think best practice would be to have an administrator analyze the situation and understand why the NSD was considered unavailable before attempting to start the disks back in the file system. Down NSDs are usually indicative of a serious issue. However I have seen a transient network communication problems or NSD server recovery cause a NSD Client to report a NSD as failed. I would prefer that the FS manager check first that the NSDs are actually not accessible and that there isn?t a recovery operation within the NSD Servers supporting an NSD before marking NSDs as down. Recovery should be allowed to complete and a NSD client should just wait for that to happen. NSDs being marked down can cause serious file system outages!! 
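When an administrator does decide the outage was transient, the manual path looks roughly like the following; the file system name gpfs1 is a placeholder:

# list only the disks that are not up/ready, and check NSD server paths
mmlsdisk gpfs1 -e
mmlsnsd -X

# have the NSD servers rediscover paths once the underlying problem is fixed
mmnsddiscover -a

# bring the down disks back; recovery of the disks runs as part of the start
mmchdisk gpfs1 start -a
mmlsdisk gpfs1 -e

This keeps the decision with a person, which is the point being made here: the disks only come back once someone has confirmed the NSDs are genuinely reachable again.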
We?ve also requested that a settable retry configuration setting be provided to have NSD Clients retry access to the NSD before reporting the NSD as failed (https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=104474 if you want to add a vote!). Cheers, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jan-Frode Myklebust Sent: Wednesday, August 09, 2017 7:23 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Is GPFS starting NSDs automatically? Note: External Email ________________________________ If you do a "mmchconfig restripeOnDiskFailure=yes", such a callback will be added for node-join events. That can be quite useful for stretched clusters, where you want to replicate all blocks to both locations, and this way recover automatically. -jf On Wed, Aug 9, 2017 at 2:14 PM, Oesterlin, Robert > wrote: By default, GPFS does not automatically start down disks. You could add a callback ?downdisk? via mmaddcallback that could trigger a ?mmchdisk start? if you wanted. If a disk is marked down, it?s better to determine why before trying to start it as it may involve other issues that need investigation. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: > on behalf of "Tomasz.Wolski at ts.fujitsu.com" > Reply-To: gpfsug main discussion list > Date: Wednesday, August 9, 2017 at 6:33 AM To: "gpfsug-discuss at spectrumscale.org" > Subject: [EXTERNAL] [gpfsug-discuss] Is GPFS starting NSDs automatically? Does GPFS start ?down? disks in a filesystem automatically? For instance, when connection to NSD is recovered, but it the meantime disk was put in ?down? state by GPFS. Will GPFS in such case start the disk? _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Aug 14 22:53:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 14 Aug 2017 17:53:35 -0400 Subject: [gpfsug-discuss] node lockups in gpfs > 4.1.1.14 In-Reply-To: References: <8049b68d-449e-859a-ecca-61e759269fba@nasa.gov> Message-ID: <5e25d9b1-13de-20b7-567d-e14601fd4bd0@nasa.gov> I was remiss in not following up with this sooner and thank you to the kind individual that shot me a direct message to ask the question. It turns out that when I asked for the fix for APAR IV96776 I got an early release of 4.1.1.16 that had a fix for the APAR but also introduced the lockup bug. IBM kindly delayed the release of 4.1.1.16 proper until they had addressed the lockup bug (APAR IV98888). 
As I understand it the version of 4.1.1.16 that was released via fix central should have a fix for this bug although I haven't tested it I have no reason to believe it's not fixed. -Aaron On 08/04/2017 11:02 AM, Aaron Knister wrote: > I've narrowed the problem down to 4.1.1.16. We'll most likely be > downgrading to 4.1.1.15. > > -Aaron > > On 8/4/17 4:00 AM, Aaron Knister wrote: >> Hey All, >> >> Anyone seen any strange behavior running either 4.1.1.15 or 4.1.1.16? >> >> We are mid upgrade to 4.1.1.16 from 4.1.1.14 and have seen some >> rather disconcerting behavior. Specifically on some of the upgraded >> nodes GPFS will seemingly deadlock on the entire node rendering it >> unusable. I can't even get a session on the node (but I can trigger a >> crash dump via a sysrq trigger). >> >> Most blocked tasks are blocked are in cxiWaitEventWait at the top of >> their call trace. That's probably not very helpful in of itself but >> I'm curious if anyone else out there has run into this issue or if >> this is a known bug. >> >> (I'll open a PMR later today once I've gathered more diagnostic >> information). >> >> -Aaron >> > From aaron.s.knister at nasa.gov Thu Aug 17 14:12:28 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Thu, 17 Aug 2017 13:12:28 +0000 Subject: [gpfsug-discuss] NSD Server/FS Manager Memory Requirements Message-ID: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> Hi Everyone, In the world of GPFS 4.2 is there a particular advantage to having a large amount of memory (e.g. > 64G) allocated to the pagepool on combination NSD Server/FS manager nodes? We currently have half of physical memory allocated to pagepool on these nodes. For some historical context-- we had two indicidents that drove us to increase our NSD server/FS manager pagepools. One was a weird behavior in GPFS 3.5 that was causing bouncing FS managers until we bumped the page pool from a few gigs to about half of the physical memory on the node. The other was a mass round of parallel mmfsck's of all 20 something of our filesystems. It came highly recommended to us to increase the pagepool to something very large for that. I'm curious to hear what other folks do and what the recommendations from IBM folks are. Thanks, Aaron -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Aug 17 14:43:48 2017 From: ewahl at osc.edu (Edward Wahl) Date: Thu, 17 Aug 2017 09:43:48 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: Message-ID: <20170817094348.37d2f51b@osc.edu> On Fri, 4 Aug 2017 01:02:22 -0400 "J. Eric Wonderley" wrote: > 4.2.2.3 > > I want to think maybe this started after expanding inode space What does 'mmlsfileset home nathanfootest -L' say? Ed > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis wrote: > > > Hey, > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > Cheers, > > > > Jamie > > > > > > ----- Original message ----- > > From: "J. Eric Wonderley" > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list > > Cc: > > Subject: [gpfsug-discuss] mmsetquota produces error > > Date: Wed, Aug 2, 2017 5:03 PM > > > > for one of our home filesystem we get: > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > 'Invalid argument'. 
> > > > > > mmedquota -j home:nathanfootest > > does work however > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From eric.wonderley at vt.edu Thu Aug 17 15:13:57 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 17 Aug 2017 10:13:57 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: <20170817094348.37d2f51b@osc.edu> References: <20170817094348.37d2f51b@osc.edu> Message-ID: The error is very repeatable... [root at cl001 ~]# mmcrfileset home setquotafoo Fileset setquotafoo created with id 61 root inode 3670407. [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo Fileset setquotafoo linked at /gpfs/home/setquotafoo [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid argument'. mmsetquota: Command failed. Examine previous error messages to determine cause. [root at cl001 ~]# mmlsfileset home setquotafoo -L Filesets in file system 'home': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 61 3670407 0 Thu Aug 17 10:10:54 2017 0 0 0 On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > On Fri, 4 Aug 2017 01:02:22 -0400 > "J. Eric Wonderley" wrote: > > > 4.2.2.3 > > > > I want to think maybe this started after expanding inode space > > What does 'mmlsfileset home nathanfootest -L' say? > > Ed > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > > > Hey, > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > Cheers, > > > > > > Jamie > > > > > > > > > ----- Original message ----- > > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > To: gpfsug main discussion list > > > Cc: > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > for one of our home filesystem we get: > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > 'Invalid argument'. > > > > > > > > > mmedquota -j home:nathanfootest > > > does work however > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Thu Aug 17 15:20:06 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Thu, 17 Aug 2017 14:20:06 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: I?ve just done exactly that and can?t reproduce it in my prod environment. Running 4.2.3-2 though. 
[root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L Filesets in file system 'gpfs': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 251 8408295 0 Thu Aug 17 15:17:18 2017 0 0 0 From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of J. Eric Wonderley Sent: 17 August 2017 15:14 To: Edward Wahl Cc: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmsetquota produces error The error is very repeatable... [root at cl001 ~]# mmcrfileset home setquotafoo Fileset setquotafoo created with id 61 root inode 3670407. [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo Fileset setquotafoo linked at /gpfs/home/setquotafoo [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files 10M:10M tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid argument'. mmsetquota: Command failed. Examine previous error messages to determine cause. [root at cl001 ~]# mmlsfileset home setquotafoo -L Filesets in file system 'home': Name Id RootInode ParentId Created InodeSpace MaxInodes AllocInodes Comment setquotafoo 61 3670407 0 Thu Aug 17 10:10:54 2017 0 0 0 On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl > wrote: On Fri, 4 Aug 2017 01:02:22 -0400 "J. Eric Wonderley" > wrote: > 4.2.2.3 > > I want to think maybe this started after expanding inode space What does 'mmlsfileset home nathanfootest -L' say? Ed > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > Hey, > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > Cheers, > > > > Jamie > > > > > > ----- Original message ----- > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list > > > Cc: > > Subject: [gpfsug-discuss] mmsetquota produces error > > Date: Wed, Aug 2, 2017 5:03 PM > > > > for one of our home filesystem we get: > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > 'Invalid argument'. > > > > > > mmedquota -j home:nathanfootest > > does work however > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 -------------- next part -------------- An HTML attachment was scrubbed... URL: From jamiedavis at us.ibm.com Thu Aug 17 15:30:19 2017 From: jamiedavis at us.ibm.com (James Davis) Date: Thu, 17 Aug 2017 14:30:19 +0000 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: , <20170817094348.37d2f51b@osc.edu> Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Thu Aug 17 15:34:26 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 17 Aug 2017 10:34:26 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: I recently opened a pmr on this issue(24603,442,000)...I'll keep this thread posted on results. On Thu, Aug 17, 2017 at 10:30 AM, James Davis wrote: > I've also tried on our in-house latest release and cannot recreate it. > > I'll ask around to see who's running a 4.2.2 cluster I can look at. 
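None of this explains the error itself (the PMR is the right place for that), but for anyone hitting the same thing, a few checks help narrow down whether the quota subsystem even knows about the fileset. The commands below assume the file system is called home, as in the output above:

# is quota enforcement on, and are per-fileset quotas enabled?
mmlsfs home -Q --perfileset-quota

# does the fileset show up in the fileset quota report at all?
mmrepquota -j home | grep setquotafoo

# re-sync quota accounting, then retry the failing command
mmcheckquota home
mmsetquota home:setquotafoo --block 10T:10T --files 10M:10M

If mmedquota -j works while mmsetquota does not, as reported earlier in the thread, that output is useful context to attach to the PMR.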
> > > ----- Original message ----- > From: "Sobey, Richard A" > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list , > "Edward Wahl" > Cc: > Subject: Re: [gpfsug-discuss] mmsetquota produces error > Date: Thu, Aug 17, 2017 10:20 AM > > > I?ve just done exactly that and can?t reproduce it in my prod environment. > Running 4.2.3-2 though. > > > > [root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L > > Filesets in file system 'gpfs': > > Name Id RootInode ParentId > Created InodeSpace MaxInodes AllocInodes > Comment > > setquotafoo 251 8408295 0 Thu Aug 17 > 15:17:18 2017 0 0 0 > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > bounces at spectrumscale.org] *On Behalf Of *J. Eric Wonderley > *Sent:* 17 August 2017 15:14 > *To:* Edward Wahl > *Cc:* gpfsug main discussion list > *Subject:* Re: [gpfsug-discuss] mmsetquota produces error > > > > The error is very repeatable... > [root at cl001 ~]# mmcrfileset home setquotafoo > Fileset setquotafoo created with id 61 root inode 3670407. > [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo > Fileset setquotafoo linked at /gpfs/home/setquotafoo > [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files > 10M:10M > tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid > argument'. > mmsetquota: Command failed. Examine previous error messages to determine > cause. > [root at cl001 ~]# mmlsfileset home setquotafoo -L > Filesets in file system 'home': > Name Id RootInode ParentId > Created InodeSpace MaxInodes AllocInodes > Comment > setquotafoo 61 3670407 0 Thu Aug 17 > 10:10:54 2017 0 0 0 > > > > On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > > On Fri, 4 Aug 2017 01:02:22 -0400 > "J. Eric Wonderley" wrote: > > > 4.2.2.3 > > > > I want to think maybe this started after expanding inode space > > What does 'mmlsfileset home nathanfootest -L' say? > > Ed > > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > wrote: > > > > > Hey, > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > Cheers, > > > > > > Jamie > > > > > > > > > ----- Original message ----- > > > From: "J. Eric Wonderley" > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > To: gpfsug main discussion list > > > Cc: > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > for one of our home filesystem we get: > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > 'Invalid argument'. 
> > > > > > > > > mmedquota -j home:nathanfootest > > > does work however > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > _______________________________________________ > > > gpfsug-discuss mailing list > > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Aug 17 15:50:27 2017 From: ewahl at osc.edu (Edward Wahl) Date: Thu, 17 Aug 2017 10:50:27 -0400 Subject: [gpfsug-discuss] mmsetquota produces error In-Reply-To: References: <20170817094348.37d2f51b@osc.edu> Message-ID: <20170817105027.316ce609@osc.edu> We're running 4.2.2.3 (well technically 4.2.2.3 efix21 (1028007) since yesterday" and we use filesets extensively for everything and I cannot reproduce this. I would guess this is somehow an inode issue, but... ?? checked the logs for the FS creation and looked for odd errors? So this fileset is not a stand-alone, Is there anything odd about the mmlsfileset for the root fileset? mmlsfileset gpfs root -L can you create files in the Junction directory? Does the increase in inodes show up? nothing weird from 'mmdf gpfs -m' ? none of your metadata NSDs are offline? Ed On Thu, 17 Aug 2017 10:34:26 -0400 "J. Eric Wonderley" wrote: > I recently opened a pmr on this issue(24603,442,000)...I'll keep this > thread posted on results. > > On Thu, Aug 17, 2017 at 10:30 AM, James Davis wrote: > > > I've also tried on our in-house latest release and cannot recreate it. > > > > I'll ask around to see who's running a 4.2.2 cluster I can look at. > > > > > > ----- Original message ----- > > From: "Sobey, Richard A" > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > To: gpfsug main discussion list , > > "Edward Wahl" > > Cc: > > Subject: Re: [gpfsug-discuss] mmsetquota produces error > > Date: Thu, Aug 17, 2017 10:20 AM > > > > > > I?ve just done exactly that and can?t reproduce it in my prod environment. > > Running 4.2.3-2 though. > > > > > > > > [root at icgpfs01 ~]# mmlsfileset gpfs setquotafoo -L > > > > Filesets in file system 'gpfs': > > > > Name Id RootInode ParentId > > Created InodeSpace MaxInodes AllocInodes > > Comment > > > > setquotafoo 251 8408295 0 Thu Aug 17 > > 15:17:18 2017 0 0 0 > > > > > > > > *From:* gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss- > > bounces at spectrumscale.org] *On Behalf Of *J. Eric Wonderley > > *Sent:* 17 August 2017 15:14 > > *To:* Edward Wahl > > *Cc:* gpfsug main discussion list > > *Subject:* Re: [gpfsug-discuss] mmsetquota produces error > > > > > > > > The error is very repeatable... > > [root at cl001 ~]# mmcrfileset home setquotafoo > > Fileset setquotafoo created with id 61 root inode 3670407. 
> > [root at cl001 ~]# mmlinkfileset home setquotafoo -J /gpfs/home/setquotafoo > > Fileset setquotafoo linked at /gpfs/home/setquotafoo > > [root at cl001 ~]# mmsetquota home:setquotafoo --block 10T:10T --files > > 10M:10M > > tssetquota: Could not get id of fileset 'setquotafoo' error (22): 'Invalid > > argument'. > > mmsetquota: Command failed. Examine previous error messages to determine > > cause. > > [root at cl001 ~]# mmlsfileset home setquotafoo -L > > Filesets in file system 'home': > > Name Id RootInode ParentId > > Created InodeSpace MaxInodes AllocInodes > > Comment > > setquotafoo 61 3670407 0 Thu Aug 17 > > 10:10:54 2017 0 0 0 > > > > > > > > On Thu, Aug 17, 2017 at 9:43 AM, Edward Wahl wrote: > > > > On Fri, 4 Aug 2017 01:02:22 -0400 > > "J. Eric Wonderley" wrote: > > > > > 4.2.2.3 > > > > > > I want to think maybe this started after expanding inode space > > > > What does 'mmlsfileset home nathanfootest -L' say? > > > > Ed > > > > > > > > > > > > On Thu, Aug 3, 2017 at 9:11 AM, James Davis > > wrote: > > > > > > > Hey, > > > > > > > > Hmm, your invocation looks valid to me. What's your GPFS level? > > > > > > > > Cheers, > > > > > > > > Jamie > > > > > > > > > > > > ----- Original message ----- > > > > From: "J. Eric Wonderley" > > > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > To: gpfsug main discussion list > > > > Cc: > > > > Subject: [gpfsug-discuss] mmsetquota produces error > > > > Date: Wed, Aug 2, 2017 5:03 PM > > > > > > > > for one of our home filesystem we get: > > > > mmsetquota home:nathanfootest --block 10T:10T --files 10M:10M > > > > tssetquota: Could not get id of fileset 'nathanfootest' error (22): > > > > 'Invalid argument'. > > > > > > > > > > > > mmedquota -j home:nathanfootest > > > > does work however > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > > > > > > > _______________________________________________ > > > > gpfsug-discuss mailing list > > > > gpfsug-discuss at spectrumscale.org > > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > -- > > > > Ed Wahl > > Ohio Supercomputer Center > > 614-292-9302 > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From alex.chekholko at gmail.com Thu Aug 17 19:11:39 2017 From: alex.chekholko at gmail.com (Alex Chekholko) Date: Thu, 17 Aug 2017 18:11:39 +0000 Subject: [gpfsug-discuss] NSD Server/FS Manager Memory Requirements In-Reply-To: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> References: <13308C14-FDB5-4CEC-8B81-673858BFCF61@nasa.gov> Message-ID: Hi Aaron, What would be the advantage of decreasing the pagepool size? Regards, Alex On Thu, Aug 17, 2017 at 6:12 AM Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Hi Everyone, > > In the world of GPFS 4.2 is there a particular advantage to having a large > amount of memory (e.g. > 64G) allocated to the pagepool on combination NSD > Server/FS manager nodes? We currently have half of physical memory > allocated to pagepool on these nodes. 
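(For context, a hedged sketch of how a different pagepool could be scoped to just these nodes; the node class, node names and the 64G figure below are invented examples, not a tuning recommendation:)

# sketch only - names and sizes are illustrative
mmcrnodeclass nsdmanagers -N nsd01,nsd02,nsd03,nsd04   # group the combined NSD server / FS manager nodes
mmlsconfig pagepool                                    # what is configured per node/class today
mmdiag --memory                                        # what mmfsd is actually using on this node
mmchconfig pagepool=64G -N nsdmanagers                 # takes effect when mmfsd restarts on those nodes
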
> > For some historical context-- we had two indicidents that drove us to > increase our NSD server/FS manager pagepools. One was a weird behavior in > GPFS 3.5 that was causing bouncing FS managers until we bumped the page > pool from a few gigs to about half of the physical memory on the node. The > other was a mass round of parallel mmfsck's of all 20 something of our > filesystems. It came highly recommended to us to increase the pagepool to > something very large for that. > > I'm curious to hear what other folks do and what the recommendations from > IBM folks are. > > Thanks, > Aaron > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Aug 19 02:07:29 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Fri, 18 Aug 2017 21:07:29 -0400 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. Message-ID: <35574.1503104849@turing-police.cc.vt.edu> So for a variety of reasons, we had accumulated some 45 tapes that had found ways to get out of Valid status. I've cleaned up most of them, but I'm stuck on a few corner cases. Case 1: l% tfsee info tapes | sort | grep -C 1 'Not Sup' AV0186JD Valid TS1150(J5) 9022 0 56 vbi_tapes VTC 1148 - AV0187JD Not Supported TS1150(J5) 9022 2179 37 vbi_tapes VTC 1149 - AV0188JD Valid TS1150(J5) 9022 1559 67 vbi_tapes VTC 1150 - -- AV0540JD Valid TS1150(J5) 9022 9022 0 vtti_tapes VTC 1607 - AV0541JD Not Supported TS1150(J5) 9022 1797 6 vtti_tapes VTC 1606 - AV0542JD Valid TS1150(J5) 9022 9022 0 vtti_tapes VTC 1605 - How the heck does *that* happen? And how do you fix it? Case 2: The docs say that for 'Invalid', you need to add it to the pool with -c. % ltfsee pool remove -p arc_tapes -l ISB -t AI0084JD; ltfsee pool add -c -p arc_tapes -l ISB -t AI0084JD GLESL043I(01052): Removing tape AI0084JD from storage pool arc_tapes. GLESL041E(01129): Tape AI0084JD does not exist in storage pool arc_tapes or is in an invalid state. Specify a valid tape ID. GLESL042I(00809): Adding tape AI0084JD to storage pool arc_tapes. (Not sure why the last 2 messages got out of order..) % ltfsee info tapes | grep AI0084JD AI0084JD Invalid LTFS TS1150 0 0 0 - ISB 1262 - What do you do if adding it with -c doesn't work? Time to reformat the tape? -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From makaplan at us.ibm.com Sat Aug 19 16:45:48 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sat, 19 Aug 2017 11:45:48 -0400 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. In-Reply-To: <35574.1503104849@turing-police.cc.vt.edu> References: <35574.1503104849@turing-police.cc.vt.edu> Message-ID: I'm kinda curious... I've noticed a few message on this subject -- so I went to the doc.... The doc seems to indicate there are some circumstances where removing the tape with the appropriate command and options and then adding it back will result in the files on the tape becoming available again... But, of course, tapes are not 100% (nothing is), so no guarantee. Perhaps the rigamarole of removing and adding back is compensating for software glitch (bug!) 
-- Logically seems it shouldn't be necessary -- either the tape is readable or not -- the system should be able to do retries and error correction without removing -- but worth a shot. (I'm a gpfs guy, but not an LTFS/EE/tape guy) -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Sat Aug 19 20:05:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Sat, 19 Aug 2017 20:05:05 +0100 Subject: [gpfsug-discuss] LTFS/EE - fixing bad tapes.. In-Reply-To: References: <35574.1503104849@turing-police.cc.vt.edu> Message-ID: On 19/08/17 16:45, Marc A Kaplan wrote: > I'm kinda curious... I've noticed a few message on this subject -- so I > went to the doc.... > > The doc seems to indicate there are some circumstances where removing > the tape with the appropriate command and options and then adding it > back will result in the files on the tape becoming available again... > But, of course, tapes are not 100% (nothing is), so no guarantee. > Perhaps the rigamarole of removing and adding back is compensating for > software glitch (bug!) -- Logically seems it shouldn't be necessary -- > either the tape is readable or not -- the system should be able to do > retries and error correction without removing -- but worth a shot. > > (I'm a gpfs guy, but not an LTFS/EE/tape guy) > Well with a TSM based HSM there are all sorts of reasons for a tape being marked "offline". Usually it's because there has been some sort of problem with the tape library in my experience. Say there is a problem with the gripper and the library is unable to get the tape, it will mark it as unavailable. Of course issues with reading data from the tape would be another reasons. Typically beyond a number of errors TSM would mark the tape as bad, which is why you always have a copy pool. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From aaron.s.knister at nasa.gov Sun Aug 20 21:02:36 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Sun, 20 Aug 2017 16:02:36 -0400 Subject: [gpfsug-discuss] data integrity documentation In-Reply-To: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> References: <45257e80-4815-f7bc-1341-32dc50f29c54@ugent.be> Message-ID: <30bfb6ca-3d86-08ab-0eec-06def4a2f6db@nasa.gov> I think it would be a huge advantage to support these mysterious nsdChecksum settings for us non GNR folks. Even if the checksums aren't being stored on disk I would think the ability to protect against network-level corruption would be valuable enough to warrant its support. I've created RFE 109269 to request this. We'll see what IBM says. If this is valuable to other folks then please vote for the RFE. -Aaron On 8/2/17 5:53 PM, Stijn De Weirdt wrote: > hi steve, > >> The nsdChksum settings for none GNR/ESS based system is not officially >> supported. It will perform checksum on data transfer over the network >> only and can be used to help debug data corruption when network is a >> suspect. > i'll take not officially supported over silent bitrot any day. > >> Did any of those "Encountered XYZ checksum errors on network I/O to NSD >> Client disk" warning messages resulted in disk been changed to "down" >> state due to IO error? > no. > > If no disk IO error was reported in GPFS log, >> that means data was retransmitted successfully on retry. > we suspected as much. as sven already asked, mmfsck now reports clean > filesystem. 
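(One quick, hedged way to see which nodes are logging these warnings, assuming the default log location and that mmdsh works in your cluster; the grep pattern is just a fragment of the message quoted above:)

# Count the checksum warnings per node; the highest counts point at the
# suspect NSD servers and, from there, the suspect adapters.
mmdsh -N all 'grep -c "checksum errors on network I/O" /var/adm/ras/mmfs.log.latest 2>/dev/null' \
  | sort -t: -k2 -rn
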
> i have an ibdump of 2 involved nsds during the reported checksums, i'll > have a closer look if i can spot these retries. > >> As sven said, only GNR/ESS provids the full end to end data integrity. > so with the silent network error, we have high probabilty that the data > is corrupted. > > we are now looking for a test to find out what adapters are affected. we > hoped that nsdperf with verify=on would tell us, but it doesn't. > >> Steve Y. Xiao >> >> >> >> >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From evan.koutsandreou at adventone.com Mon Aug 21 04:05:40 2017 From: evan.koutsandreou at adventone.com (Evan Koutsandreou) Date: Mon, 21 Aug 2017 03:05:40 +0000 Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing Message-ID: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> Hi - I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. Thank you From mweil at wustl.edu Mon Aug 21 20:54:27 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 21 Aug 2017 14:54:27 -0500 Subject: [gpfsug-discuss] pmcollector node In-Reply-To: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> References: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> Message-ID: <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> any input on this Thanks On 7/5/17 10:51 AM, Matt Weil wrote: > Hello all, > > Question on the requirements on pmcollector node/s for a 500+ node > cluster. Is there a sizing guide? What specifics should we scale? > CPU Disks memory? > > Thanks > > Matt > From kkr at lbl.gov Mon Aug 21 23:33:36 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 21 Aug 2017 15:33:36 -0700 Subject: [gpfsug-discuss] Hold the Date - Spectrum Scale Day @ HPCXXL (Sept 2017, NYC) In-Reply-To: References: Message-ID: <6EF4187F-D8A1-4927-9E4F-4DF703DA04F5@lbl.gov> If you plan on attending the GPFS Day, please use the HPCXXL registration form (link to Eventbrite registration at the link below). The GPFS day is a free event, but you *must* register so we can make sure there are enough seats and food available. If you would like to speak or suggest a topic, please let me know. http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ The agenda is still being worked on, here are some likely topics: --RoadMap/Updates --"New features - New Bugs? (Julich) --GPFS + Openstack (CSCS) --ORNL Update on Spider3-related GPFS work --ANL Site Update --File Corruption Session Best, Kristy > On Aug 8, 2017, at 11:33 AM, Kristy Kallback-Rose wrote: > > Hello, > > The GPFS Day of the HPCXXL conference is confirmed for Thursday, September 28th. Here is an updated URL, the agenda and registration are still being put together http://hpcxxl.org/summer-2017-meeting-september-24-29-new-york-city/ . 
The GPFS Day will require registration, so we can make sure there is enough room (and coffee/food) for all attendees ?however, there will be no registration fee if you attend the GPFS Day only. > > I?ll send another update when the agenda is closer to settled. > > Cheers, > Kristy > >> On Jul 7, 2017, at 3:32 PM, Kristy Kallback-Rose > wrote: >> >> Hello, >> >> More details will be provided as they become available, but just so you can make a placeholder on your calendar, there will be a Spectrum Scale Day the week of September 25th - 29th, likely Thursday, September 28th. >> >> This will be a part of the larger HPCXXL meeting (https://www.spxxl.org/?q=New-York-City-2017 ). You may recall this group was formerly called SPXXL and the website is in the process of transitioning to the new name (and getting a new certificate). You will be able to attend *just* the Spectrum Scale day if that is the only portion of the event you would like to attend. >> >> The NYC location is great for Spectrum Scale events because many IBMers, including developers, can come in from Poughkeepsie. >> >> More as we get closer to the date and details are settled. >> >> Cheers, >> Kristy > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Aug 22 04:03:35 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Mon, 21 Aug 2017 23:03:35 -0400 Subject: [gpfsug-discuss] multicluster security Message-ID: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> Hi Everyone, I have a theoretical question about GPFS multiclusters and security. Let's say I have clusters A and B. Cluster A is exporting a filesystem as read-only to cluster B. Where does the authorization burden lay? Meaning, does the security rely on mmfsd in cluster B to behave itself and enforce the conditions of the multi-cluster export? Could someone using the credentials on a compromised node in cluster B just start sending arbitrary nsd read/write commands to the nsds from cluster A (or something along those lines)? Do the NSD servers in cluster A do any sort of sanity or security checking on the I/O requests coming from cluster B to the NSDs they're serving to exported filesystems? I imagine any enforcement would go out the window with shared disks in a multi-cluster environment since a compromised node could just "dd" over the LUNs. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From kkr at lbl.gov Tue Aug 22 05:52:58 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Mon, 21 Aug 2017 21:52:58 -0700 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Message-ID: Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From NSCHULD at de.ibm.com Tue Aug 22 08:44:28 2017 From: NSCHULD at de.ibm.com (Norbert Schuld) Date: Tue, 22 Aug 2017 09:44:28 +0200 Subject: [gpfsug-discuss] pmcollector node In-Reply-To: <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> References: <390fca36-e49a-7255-1ecf-a467a6ee92a5@wustl.edu> <45fd3eab-34e7-d90d-c436-ae31ac4a11d5@wustl.edu> Message-ID: Above ~100 nodes the answer is "it depends" but memory is certainly the main factor. Important parts for the estimation are the number of nodes, filesystems, NSDs, NFS & SMB shares and the frequency (aka period) with which measurements are made. For a lot of sensors today the default is 1/sec which is quite high. Depending on your needs 1/ 10 sec might do or even 1/min. With just guessing on some numbers I end up with ~24-32 GB RAM needed in total and about the same number for disk space. If you want HA double the number, then divide by the number of collector nodes used in the federation setup. Place the collectors on nodes which do not play an additional important part in your cluster, then CPU should not be an issue. Mit freundlichen Gr??en / Kind regards Norbert Schuld From: Matt Weil To: gpfsug-discuss at spectrumscale.org Date: 21/08/2017 21:54 Subject: Re: [gpfsug-discuss] pmcollector node Sent by: gpfsug-discuss-bounces at spectrumscale.org any input on this Thanks On 7/5/17 10:51 AM, Matt Weil wrote: > Hello all, > > Question on the requirements on pmcollector node/s for a 500+ node > cluster. Is there a sizing guide? What specifics should we scale? > CPU Disks memory? > > Thanks > > Matt > _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Jochen.Zeller at sva.de Tue Aug 22 12:09:31 2017 From: Jochen.Zeller at sva.de (Zeller, Jochen) Date: Tue, 22 Aug 2017 11:09:31 +0000 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss Message-ID: Dear community, this morning I started in a good mood, until I've checked my mailbox. Again a reported bug in Spectrum Scale that could lead to data loss. During the last year I was looking for a stable Scale version, and each time I've thought: "Yes, this one is stable and without serious data loss bugs" - a few day later, IBM announced a new APAR with possible data loss for this version. I am supporting many clients in central Europe. They store databases, backup data, life science data, video data, results of technical computing, do HPC on the file systems, etc. Some of them had to change their Scale version nearly monthly during the last year to prevent running in one of the serious data loss bugs in Scale. From my perspective, it was and is a shame to inform clients about new reported bugs right after the last update. From client perspective, it was and is a lot of work and planning to do to get a new downtime for updates. And their internal customers are not satisfied with those many downtimes of the clusters and applications. For me, it seems that Scale development is working on features for a specific project or client, to achieve special requirements. But they forgot the existing clients, using Scale for storing important data or running important workloads on it. 
To make us more visible, I've used the IBM recommended way to notify about mandatory enhancements, the less favored RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334 If you like, vote for more reliability in Scale. I hope this a good way to show development and responsible persons that we have trouble and are not satisfied with the quality of the releases. Regards, Jochen -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From stockf at us.ibm.com Tue Aug 22 13:31:52 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 22 Aug 2017 08:31:52 -0400 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata"bug In-Reply-To: References: Message-ID: My understanding is that the problem is not with the policy engine scanning but with the commands that move data, for example mmrestripefs. So if you are using the policy engine for other purposes you are not impacted by the problem. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Kristy Kallback-Rose To: gpfsug main discussion list Date: 08/22/2017 12:53 AM Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Sent by: gpfsug-discuss-bounces at spectrumscale.org Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=L6hGADgajb-s1ezkPaD4wQhytCTKnUBGorgQEbmlEzk&s=nDmkF6EvhbMgktl3Oks3UkCb-2-cwR1QLEpOi6qeea4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From sxiao at us.ibm.com Tue Aug 22 14:51:25 2017 From: sxiao at us.ibm.com (Steve Xiao) Date: Tue, 22 Aug 2017 09:51:25 -0400 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" In-Reply-To: References: Message-ID: ILM policy engine scans of metadata is safe and will not trigger the problem. Steve Y. Xiao -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Tue Aug 22 15:06:00 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 22 Aug 2017 14:06:00 +0000 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug In-Reply-To: References: Message-ID: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> Can anyone tell us when a normal PTF release (4.2.3-4 ??) will be made available that will fix this issue? Trying to decide if I should roll an e-fix or just wait for a normal release, thanks! 
-Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kristy Kallback-Rose Sent: Monday, August 21, 2017 11:53 PM To: gpfsug main discussion list Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" bug Note: External Email ________________________________ Can someone comment as to whether the bug below could also be tickled by ILM policy engine scans of metadata? We are wanting to know if we should disable ILM scans until we have the patch. Thanks, Kristy https://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Tue Aug 22 15:27:36 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Tue, 22 Aug 2017 16:27:36 +0200 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata"bug In-Reply-To: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> References: <6c371b5ac22242c5844eda9b195810e3@jumptrading.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: From aaron.knister at gmail.com Tue Aug 22 15:37:06 2017 From: aaron.knister at gmail.com (Aaron Knister) Date: Tue, 22 Aug 2017 10:37:06 -0400 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: Hi Jochen, I share your concern about data loss bugs and I too have found it troubling especially since the 4.2 stream is in my immediate future (although I would have rather stayed on 4.1 due to my perception of stability/integrity issues in 4.2). By and large 4.1 has been *extremely* stable for me. While not directly related to the stability concerns, I'm curious as to why your customer sites are requiring downtime to do the upgrades? While, of course, individual servers need to be taken offline to update GPFS the collective should be able to stay up. Perhaps your customer environments just don't lend themselves to that. It occurs to me that some of these bugs sound serious (and indeed I believe this one is) I recently found myself jumping prematurely into an update for the metanode filesize corruption bug that as it turns out that while very scary sounding is not necessarily a particularly common bug (if I understand correctly). 
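(For reference, a rough sketch of the node-at-a-time approach being described, assuming the old and new PTF levels are allowed to coexist in one cluster and quorum survives a single node being down; the node name is a placeholder:)

NODE=nsd01
mmshutdown -N $NODE                      # stops GPFS (and unmounts) on that node only
# ...install the updated gpfs.* packages on $NODE with yum/rpm/apt...
ssh $NODE /usr/lpp/mmfs/bin/mmbuildgpl   # rebuild the portability layer if kernel or GPFS level changed
mmstartup -N $NODE
mmgetstate -N $NODE                      # wait for "active" before moving on to the next node
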
Perhaps it would be helpful if IBM could clarify the believed risk of these updates or give us some indication if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild". I could imagine IBM legal wanting to avoid a situation where IBM indicates something is low risk but someone hits it and it eats data. Although many companies do this with security patches so perhaps it's a non-issue. >From my perspective I don't think existing customers are being "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt to an ever-changing world and I think these features are necessary and useful. Perhaps Scale would benefit from more resources being dedicated to QA/Testing which isn't a particularly sexy thing-- it doesn't result in any new shiny features for customers (although "not eating your data" is a feature I find really attractive). Anyway, I hope IBM can find a way to minimize the frequency of these bugs. Personally speaking, I'm pretty convinced, it's not for lack of capability or dedication on the part of the great folks actually writing the code. -Aaron On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen wrote: > Dear community, > > this morning I started in a good mood, until I?ve checked my mailbox. > Again a reported bug in Spectrum Scale that could lead to data loss. During > the last year I was looking for a stable Scale version, and each time I?ve > thought: ?Yes, this one is stable and without serious data loss bugs? - a > few day later, IBM announced a new APAR with possible data loss for this > version. > > I am supporting many clients in central Europe. They store databases, > backup data, life science data, video data, results of technical computing, > do HPC on the file systems, etc. Some of them had to change their Scale > version nearly monthly during the last year to prevent running in one of > the serious data loss bugs in Scale. From my perspective, it was and is a > shame to inform clients about new reported bugs right after the last > update. From client perspective, it was and is a lot of work and planning > to do to get a new downtime for updates. And their internal customers are > not satisfied with those many downtimes of the clusters and applications. > > For me, it seems that Scale development is working on features for a > specific project or client, to achieve special requirements. But they > forgot the existing clients, using Scale for storing important data or > running important workloads on it. > > To make us more visible, I?ve used the IBM recommended way to notify about > mandatory enhancements, the less favored RFE: > > > *http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334* > > > If you like, vote for more reliability in Scale. > > I hope this a good way to show development and responsible persons that we > have trouble and are not satisfied with the quality of the releases. > > > Regards, > > Jochen > > > > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kkr at lbl.gov Tue Aug 22 16:24:46 2017 From: kkr at lbl.gov (Kristy Kallback-Rose) Date: Tue, 22 Aug 2017 08:24:46 -0700 Subject: [gpfsug-discuss] Question regarding "scanning file system metadata" In-Reply-To: References: Message-ID: Thanks we just wanted to confirm given the use of the word "scanning" in describing the trigger. On Aug 22, 2017 6:51 AM, "Steve Xiao" wrote: > ILM policy engine scans of metadatais safe and will not trigger the > problem. > > > Steve Y. Xiao > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Tue Aug 22 17:45:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Tue, 22 Aug 2017 12:45:00 -0400 Subject: [gpfsug-discuss] Fwd: FLASH: IBM Spectrum Scale (GPFS): RDMA-enabled network adapter failure on the NSD server may result in file IO error (2017.06.30) In-Reply-To: References: <487469581.449569.1498832342497.JavaMail.webinst@w30112> <2689cf86-eca2-dab6-c6aa-7fc54d923e55@nasa.gov> Message-ID: <3b16ad01-4d83-8106-f2e2-110364f31566@nasa.gov> (I'm slowly catching up on a backlog of e-mail, sorry for the delayed reply). Thanks, Sven. I recognize the complexity and appreciate your explanation. In my mind I had envisioned either the block integrity information being stored as a new metadata structure or stored leveraging T10-DIX/DIF (perhaps configurable on a per-pool basis) to pass the checksums down to the RAID controller. I would quite like to run GNR as software on generic hardware and in fact voted, along with 26 other customers, on an RFE (https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090) requesting this but the request was declined. I think customers spoke pretty loudly there and IBM gave it the kibosh. -Aaron On 06/30/2017 02:25 PM, Sven Oehme wrote: > > end-to-end data integrity is very important and the reason it hasn't > been done in Scale is not because its not important, its because its > very hard to do without impacting performance in a very dramatic way. > > imagine your raid controller blocksize is 1mb and your filesystem > blocksize is 1MB . if your application does a 1 MB write this ends up > being a perfect full block , full track de-stage to your raid layer > and everything works fine and fast. as soon as you add checksum > support you need to add data somehow into this, means your 1MB is no > longer 1 MB but 1 MB+checksum. > > to store this additional data you have multiple options, inline , > outside the data block or some combination ,the net is either you need > to do more physical i/o's to different places to get both the data and > the corresponding checksum or your per block on disc structure becomes > bigger than than what your application reads/or writes, both put > massive burden on the Storage layer as e.g. a 1 MB write will now, > even the blocks are all aligned from the application down to the raid > layer, cause a read/modify/write on the raid layer as the data is > bigger than the physical track size. > > so to get end-to-end checksum in Scale outside of ESS the best way is > to get GNR as SW to run on generic HW, this is what people should vote > for as RFE if they need that functionality. beside end-to-end > checksums you get read/write cache and acceleration , fast rebuild and > many other goodies as a added bonus. 
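(To make the alignment argument above concrete, a toy calculation; the 64-byte checksum size is an assumed figure purely for illustration:)

block=$((1024*1024))    # 1 MiB filesystem / application block
cksum=64                # hypothetical per-block checksum, in bytes
stripe=$((1024*1024))   # RAID full-stripe (track) size
echo "payload plus checksum: $((block + cksum)) bytes"
echo "spill past one full stripe: $((block + cksum - stripe)) bytes"
# even a perfectly aligned 1 MiB write now overflows the stripe by 64 bytes,
# so the RAID layer must read-modify-write instead of doing a full-stripe destage
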
> > Sven > > > On Fri, Jun 30, 2017 at 10:53 AM Aaron Knister > > wrote: > > In fact the answer was quite literally "no": > > https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=84523 > (the RFE was declined and the answer was that the "function is already > available in GNR environments"). > > Regarding GNR, see this RFE request > https://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=95090 > requesting the use of GNR outside of an ESS/GSS environment. It's > interesting to note this is the highest voted Public RFE for GPFS > that I > can see, at least. It too was declined. > > -Aaron > > On 6/30/17 1:41 PM, Aaron Knister wrote: > > Thanks Olaf, that's good to know (and is kind of what I > suspected). I've > > requested a number of times this capability for those of us who > can't > > use or aren't using GNR and the answer is effectively "no". This > > response is curious to me because I'm sure IBM doesn't believe > that data > > integrity is only important and of value to customers who > purchase their > > hardware *and* software. > > > > -Aaron > > > > On Fri, Jun 30, 2017 at 1:37 PM, Olaf Weiser > > > >> > wrote: > > > > yes.. in case of GNR (GPFS native raid) .. we do end-to-end > > check-summing ... client --> server --> downToDisk > > GNR writes down a chksum to disk (to all pdisks /all "raid" > segments > > ) so that dropped writes can be detected as well as miss-done > > writes (bit flips..) > > > > > > > > From: Aaron Knister > > >> > > To: gpfsug main discussion list > > > >> > > Date: 06/30/2017 07:15 PM > > Subject: [gpfsug-discuss] Fwd: FLASH: IBM Spectrum Scale (GPFS): > > RDMA-enabled network adapter failure on the NSD server may > result in > > file IO error (2017.06.30) > > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > > > > ------------------------------------------------------------------------ > > > > > > > > I'm curious to know why this doesn't affect GSS/ESS? Is it a > feature of > > the additional check-summing done on those platforms? > > > > > > -------- Forwarded Message -------- > > Subject: FLASH: IBM Spectrum Scale (GPFS): RDMA-enabled network > > adapter > > failure on the NSD server may result in file IO error > (2017.06.30) > > Date: Fri, 30 Jun 2017 14:19:02 +0000 > > From: IBM My Notifications > > > >> > > To: aaron.s.knister at nasa.gov > > > > > > > > > > > > My Notifications for Storage - 30 Jun 2017 > > > > Dear Subscriber (aaron.s.knister at nasa.gov > > > >), > > > > Here are your updates from IBM My Notifications. > > > > Your support Notifications display in English by default. > Machine > > translation based on your IBM profile > > language setting is added if you specify this option in My > defaults > > within My Notifications. > > (Note: Not all languages are available at this time, and the > English > > version always takes precedence > > over the machine translated version.) > > > > > ------------------------------------------------------------------------------ > > 1. 
IBM Spectrum Scale > > > > - TITLE: IBM Spectrum Scale (GPFS): RDMA-enabled network adapter > > failure > > on the NSD server may result in file IO error > > - URL: > > > http://www.ibm.com/support/docview.wss?uid=ssg1S1010233&myns=s033&mynp=OCSTXKQY&mynp=OCSWJ00&mync=E&cm_sp=s033-_-OCSTXKQY-OCSWJ00-_-E > > > > > - ABSTRACT: IBM has identified an issue with all IBM GPFS > and IBM > > Spectrum Scale versions where the NSD server is enabled to > use RDMA for > > file IO and the storage used in your GPFS cluster accessed > via NSD > > servers (not fully SAN accessible) includes anything other > than IBM > > Elastic Storage Server (ESS) or GPFS Storage Server (GSS); > under these > > conditions, when the RDMA-enabled network adapter fails, the > issue may > > result in undetected data corruption for file write or read > operations. > > > > > ------------------------------------------------------------------------------ > > Manage your My Notifications subscriptions, or send > questions and > > comments. > > - Subscribe or Unsubscribe - > > https://www.ibm.com/support/mynotifications > > > > - Feedback - > > > https://www-01.ibm.com/support/feedback/techFeedbackCardContentMyNotifications.html > > > > > > > - Follow us on Twitter - https://twitter.com/IBMStorageSupt > > > > > > > > > > To ensure proper delivery please add > mynotify at stg.events.ihost.com > > > to > > your address book. > > You received this email because you are subscribed to IBM My > > Notifications as: > > aaron.s.knister at nasa.gov > > > > > > Please do not reply to this message as it is generated by an > automated > > service machine. > > > > (C) International Business Machines Corporation 2017. All rights > > reserved. > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Wed Aug 23 05:40:19 2017 From: knop at us.ibm.com (Felipe Knop) Date: Wed, 23 Aug 2017 00:40:19 -0400 Subject: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: Aaron, IBM's policy is to issue a flash when such data corruption/loss problem has been identified, even if the problem has never been encountered by any customer. In fact, most of the flashes have been the result of internal test activity, even though the discovery took place after the affected versions/PTFs have already been released. 
This is the case of two of the recent flashes: http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293 http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 The flashes normally do not indicate the risk level that a given problem has of being hit, since there are just too many variables at play, given that clusters and workloads vary significantly. The first issue above appears to be uncommon (and potentially rare). The second issue seems to have a higher probability of occurring -- and as described in the flash, the problem is triggered by failures being encountered while running one of the commands listed in the "Users Affected" section of the writeup. I don't think precise recommendations could be given on if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild" since different clusters, configuration, or workload may drastically affect the the likelihood of hitting the problem. On the other hand, when coming up with the text for the flash, the team attempts to provide as much information as possible/available on the known triggers and mitigation circumstances. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: Aaron Knister To: gpfsug main discussion list Date: 08/22/2017 10:37 AM Subject: Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale could lead to data loss Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Jochen, I share your concern about data loss bugs and I too have found it troubling especially since the 4.2 stream is in my immediate future (although I would have rather stayed on 4.1 due to my perception of stability/integrity issues in 4.2). By and large 4.1 has been *extremely* stable for me. While not directly related to the stability concerns, I'm curious as to why your customer sites are requiring downtime to do the upgrades? While, of course, individual servers need to be taken offline to update GPFS the collective should be able to stay up. Perhaps your customer environments just don't lend themselves to that. It occurs to me that some of these bugs sound serious (and indeed I believe this one is) I recently found myself jumping prematurely into an update for the metanode filesize corruption bug that as it turns out that while very scary sounding is not necessarily a particularly common bug (if I understand correctly). Perhaps it would be helpful if IBM could clarify the believed risk of these updates or give us some indication if the bugs fall in the category of "drop everything and patch *now*" or "this is a theoretically nasty bug but we've yet to see it in the wild". I could imagine IBM legal wanting to avoid a situation where IBM indicates something is low risk but someone hits it and it eats data. Although many companies do this with security patches so perhaps it's a non-issue. From my perspective I don't think existing customers are being "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt to an ever-changing world and I think these features are necessary and useful. Perhaps Scale would benefit from more resources being dedicated to QA/Testing which isn't a particularly sexy thing-- it doesn't result in any new shiny features for customers (although "not eating your data" is a feature I find really attractive). Anyway, I hope IBM can find a way to minimize the frequency of these bugs. 
Personally speaking, I'm pretty convinced, it's not for lack of capability or dedication on the part of the great folks actually writing the code. -Aaron On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen wrote: Dear community, this morning I started in a good mood, until I?ve checked my mailbox. Again a reported bug in Spectrum Scale that could lead to data loss. During the last year I was looking for a stable Scale version, and each time I?ve thought: ?Yes, this one is stable and without serious data loss bugs? - a few day later, IBM announced a new APAR with possible data loss for this version. I am supporting many clients in central Europe. They store databases, backup data, life science data, video data, results of technical computing, do HPC on the file systems, etc. Some of them had to change their Scale version nearly monthly during the last year to prevent running in one of the serious data loss bugs in Scale. From my perspective, it was and is a shame to inform clients about new reported bugs right after the last update. From client perspective, it was and is a lot of work and planning to do to get a new downtime for updates. And their internal customers are not satisfied with those many downtimes of the clusters and applications. For me, it seems that Scale development is working on features for a specific project or client, to achieve special requirements. But they forgot the existing clients, using Scale for storing important data or running important workloads on it. To make us more visible, I?ve used the IBM recommended way to notify about mandatory enhancements, the less favored RFE: http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334 If you like, vote for more reliability in Scale. I hope this a good way to show development and responsible persons that we have trouble and are not satisfied with the quality of the releases. Regards, Jochen _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=Nh-z-CGPni6b-k9jTdJfWNw6-jtvc8OJgjogfIyp498&s=Vsf2AaMf7b7F6Qv3lGZ9-xBciF9gdfuqnb206aVG-Go&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 23 11:11:37 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 10:11:37 +0000 Subject: [gpfsug-discuss] AFM weirdness Message-ID: We're using an AFM cache from our HPC nodes to access data in another GPFS cluster, mostly this seems to be working fine, but we've just come across an interesting problem with a user using gfortran from the GCC 5.2.0 toolset. When linking their code, they get a "no space left on device" error back from the linker. If we do this on a node that mounts the file-system directly (I.e. Not via AFM cache), then it works fine. We tried with GCC 4.5 based tools and it works OK, but the difference there is that 4.x uses ld and 5x uses ld.gold. 
If we strike the ld.gold when using AFM, we see: stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 unlink("program") = 0 open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on device) Vs when running directly on the file-system: stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 unlink("program") = 0 open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 fallocate(30, 0, 0, 248480) = 0 Anyone seen anything like this before? ... Actually I'm about to go off and see if its a function of AFM, or maybe something to do with the FS in use (I.e. Make a local directory on the filesystem on the "AFM" FS and see if that works ...) Thanks Simon From S.J.Thompson at bham.ac.uk Wed Aug 23 11:17:58 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 10:17:58 +0000 Subject: [gpfsug-discuss] AFM weirdness Message-ID: OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From vpuvvada at in.ibm.com Wed Aug 23 13:36:33 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Wed, 23 Aug 2017 18:06:33 +0530 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. 
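(A quick way to confirm which paths are affected: the util-linux fallocate tool issues the same fallocate() call seen in the strace above, so it works as a one-line test; the paths below are placeholders for an AFM cache fileset and a plain GPFS directory:)

fallocate -l 1M /fs_afm_cache/somedir/testfile && echo "preallocation OK" \
  || echo "preallocation refused (ENOSPC expected on an AFM cache fileset)"
fallocate -l 1M /fs_gpfs_direct/somedir/testfile && echo "preallocation OK"
rm -f /fs_afm_cache/somedir/testfile /fs_gpfs_direct/somedir/testfile
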
~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" To: gpfsug main discussion list Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 23 14:01:55 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 23 Aug 2017 13:01:55 +0000 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: I've got a PMR open about this ... Will email you the number directly. Looking at the man page for ld.gold, it looks to set '--posix-fallocate' by default. In fact, testing with '-Xlinker -no-posix-fallocate' does indeed make the code compile. 
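(As a possible interim workaround for builds that must run out of the AFM cache, the flag can be passed through the compiler driver; whether a given makefile honours LDFLAGS is an assumption:)

gfortran -o program program.o -Xlinker -no-posix-fallocate
# or, for makefile-driven builds that respect LDFLAGS:
# LDFLAGS="-Xlinker -no-posix-fallocate" make
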
Simon From: "vpuvvada at in.ibm.com" > Date: Wednesday, 23 August 2017 at 13:36 To: "gpfsug-discuss at spectrumscale.org" >, Simon Thompson > Subject: Re: [gpfsug-discuss] AFM weirdness I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. ~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: gpfsug main discussion list > Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" on behalf of S.J.Thompson at bham.ac.uk> wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Aug 24 13:56:49 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 24 Aug 2017 08:56:49 -0400 Subject: [gpfsug-discuss] Again! 
Using IBM Spectrum Scale could lead to data loss In-Reply-To: References: Message-ID: <12c154d2-8095-408e-ac7e-e654b1448a25@nasa.gov> Thanks Felipe, and everything you said makes sense and I think holds true to my experiences concerning different workloads affecting likelihood of hitting various problems (especially being one of only a handful of sites that hit that 301 SGpanic error from several years back). Perhaps language as subtle as "internal testing revealed" vs "based on reports from customer sites" could be used? But then again I imagine you could encounter a case where you discover something in testing that a customer site subsequently experiences which might limit the usefulness of the wording. I still think it's useful to know if an issue has been exacerbated or triggered by in the wild workloads vs what I imagine to be quite rigorous lab testing perhaps deigned to shake out certain bugs. -Aaron On 8/23/17 12:40 AM, Felipe Knop wrote: > Aaron, > > IBM's policy is to issue a flash when such data corruption/loss > problem has been identified, even if the problem has never been > encountered by any customer. In fact, most of the flashes have been > the result of internal test activity, even though the discovery took > place after the affected versions/PTFs have already been released. > ?This is the case of two of the recent flashes: > > http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010293 > > http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 > > The flashes normally do not indicate the risk level that a given > problem has of being hit, since there are just too many variables at > play, given that clusters and workloads vary significantly. > > The first issue above appears to be uncommon (and potentially rare). > ?The second issue seems to have a higher probability of occurring -- > and as described in the flash, the problem is triggered by failures > being encountered while running one of the commands listed in the > "Users Affected" section of the writeup. > > I don't think precise recommendations could be given on > > ?if the bugs fall in the category of "drop everything and patch *now*" > or "this is a theoretically nasty bug but we've yet to see it in the wild" > > since different clusters, configuration, or workload may drastically > affect the the likelihood of hitting the problem. ?On the other hand, > when coming up with the text for the flash, the team attempts to > provide as much information as possible/available on the known > triggers and mitigation circumstances. > > ? Felipe > > ---- > Felipe Knop ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? knop at us.ibm.com > GPFS Development and Security > IBM Systems > IBM Building 008 > 2455 South Rd, Poughkeepsie, NY 12601 > (845) 433-9314 ?T/L 293-9314 > > > > > > From: ? ? ? ?Aaron Knister > To: ? ? ? ?gpfsug main discussion list > Date: ? ? ? ?08/22/2017 10:37 AM > Subject: ? ? ? ?Re: [gpfsug-discuss] Again! Using IBM Spectrum Scale > could lead to data loss > Sent by: ? ? ? ?gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi Jochen, > > I share your concern about data loss bugs and I too have found it > troubling especially since the 4.2 stream is in my immediate future > (although I would have rather stayed on 4.1 due to my perception of > stability/integrity issues in 4.2). By and large 4.1 has been > *extremely* stable for me. 
> > While not directly related to the stability concerns, I'm curious as > to why your customer sites are requiring downtime to do the upgrades? > While, of course, individual servers need to be taken offline to > update GPFS the collective should be able to stay up. Perhaps your > customer environments just don't lend themselves to that.? > > It occurs to me that some of these bugs sound serious (and indeed I > believe this one is) I recently found myself jumping prematurely into > an update for the metanode filesize corruption bug that as it turns > out that while very scary sounding is not necessarily a particularly > common bug (if I understand correctly). Perhaps it would be helpful if > IBM could clarify the believed risk of these updates or give us some > indication if the bugs fall in the category of "drop everything and > patch *now*" or "this is a theoretically nasty bug but we've yet to > see it in the wild". I could imagine IBM legal wanting to avoid a > situation where IBM indicates something is low risk but someone hits > it and it eats data. Although many companies do this with security > patches so perhaps it's a non-issue. > > From my perspective I don't think existing customers are being > "forgotten". I think IBM is pushing hard to help Spectrum Scale adapt > to an ever-changing world and I think these features are necessary and > useful. Perhaps Scale would benefit from more resources being > dedicated to QA/Testing which isn't a particularly sexy thing-- it > doesn't result in any new shiny features for customers (although "not > eating your data" is a feature I find really attractive). > > Anyway, I hope IBM can find a way to minimize the frequency of these > bugs. Personally speaking, I'm pretty convinced, it's not for lack of > capability or dedication on the part of the great folks actually > writing the code. > > -Aaron > > On Tue, Aug 22, 2017 at 7:09 AM, Zeller, Jochen > <_Jochen.Zeller at sva.de_ > wrote: > Dear community, > ? > this morning I started in a good mood, until I?ve checked my mailbox. > Again a reported bug in Spectrum Scale that could lead to data loss. > During the last year I was looking for a stable Scale version, and > each time I?ve thought: ?Yes, this one is stable and without serious > data loss bugs? - a few day later, IBM announced a new APAR with > possible data loss for this version. > ? > I am supporting many clients in central Europe. They store databases, > backup data, life science data, video data, results of technical > computing, do HPC on the file systems, etc. Some of them had to change > their Scale version nearly monthly during the last year to prevent > running in one of the serious data loss bugs in Scale. From my > perspective, it was and is a shame to inform clients about new > reported bugs right after the last update. From client perspective, it > was and is a lot of work and planning to do to get a new downtime for > updates. And their internal customers are not satisfied with those > many downtimes of the clusters and applications. > ? > For me, it seems that Scale development is working on features for a > specific project or client, to achieve special requirements. But they > forgot the existing clients, using Scale for storing important data or > running important workloads on it. > ? > To make us more visible, I?ve used the IBM recommended way to notify > about mandatory enhancements, the less favored RFE: > ? > _http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe&CR_ID=109334_ > ? 
> If you like, vote for more reliability in Scale. > ? > I hope this a good way to show development and responsible persons > that we have trouble and are not satisfied with the quality of the > releases. > ? > ? > Regards, > ? > Jochen > ? > ? > ? > ? > ? > ? > ? > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at _spectrumscale.org_ > _ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=oNT2koCZX0xmWlSlLblR9Q&m=Nh-z-CGPni6b-k9jTdJfWNw6-jtvc8OJgjogfIyp498&s=Vsf2AaMf7b7F6Qv3lGZ9-xBciF9gdfuqnb206aVG-Go&e= > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Fri Aug 25 08:44:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Fri, 25 Aug 2017 07:44:35 +0000 Subject: [gpfsug-discuss] AFM weirdness In-Reply-To: References: Message-ID: So as Venkat says, AFM doesn't support using fallocate() to preallocate space. So why aren't other people seeing this ... Well ... We use EasyBuild to build our HPC cluster software including the compiler tool chains. This enables the new linker ld.gold by default rather than the "old" ld. Interestingly we don't seem to have seen this with C code being compiled, only fortran. We can work around it by using the options to gfortran I mention below. There is a mention to this limitation at: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1ins_afmlimitations.htm We aren;t directly calling gpfs_prealloc, but I guess the linker is indirectly calling it by making a call to posix_fallocate. I do have a new problem with AFM where the data written to the cache differs from that replicated back to home... I'm beginning to think I don't like the decision to use AFM! Given the data written back to HOME is corrupt, I think this is definitely PMR time. But ... If you have Abaqus on you system and are using AFM, I'd be interested to see if someone else sees the same issue as us! Simon From: > on behalf of Simon Thompson > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 23 August 2017 at 14:01 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] AFM weirdness I've got a PMR open about this ... Will email you the number directly. Looking at the man page for ld.gold, it looks to set '--posix-fallocate' by default. In fact, testing with '-Xlinker -no-posix-fallocate' does indeed make the code compile. Simon From: "vpuvvada at in.ibm.com" > Date: Wednesday, 23 August 2017 at 13:36 To: "gpfsug-discuss at spectrumscale.org" >, Simon Thompson > Subject: Re: [gpfsug-discuss] AFM weirdness I believe this error is result of preallocation failure, but traces are needed to confirm this. AFM caching modes does not support preallocation of blocks (ex. using fallocate()). This feature is supported only in AFM DR. 
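
On the separate problem Simon mentions above, where data written to the AFM cache differs from what is replicated back to home, a rough way to quantify the damage before (or alongside) the PMR is to checksum the same tree on both sides and compare. A sketch, assuming both paths are visible from one node and the cache has finished flushing; both paths are hypothetical examples:

#!/usr/bin/env python3
# Rough integrity check between an AFM cache fileset and its home copy.
# Walks both trees, hashes every regular file, and reports mismatches.
import hashlib, os, sys

def tree_hashes(root):
    out = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            h = hashlib.md5()
            try:
                with open(full, "rb") as f:
                    for chunk in iter(lambda: f.read(1 << 20), b""):
                        h.update(chunk)
                out[rel] = h.hexdigest()
            except OSError as e:
                print(f"skip {rel}: {e}", file=sys.stderr)
    return out

cache = tree_hashes(sys.argv[1] if len(sys.argv) > 1 else "/fs_afm_cache/project")  # hypothetical
home  = tree_hashes(sys.argv[2] if len(sys.argv) > 2 else "/fs_home/project")       # hypothetical

for rel in sorted(set(cache) | set(home)):
    if cache.get(rel) != home.get(rel):
        print(f"MISMATCH {rel}: cache={cache.get(rel)} home={home.get(rel)}")

Anything it flags as a mismatch is a candidate file to attach to the PMR.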
~Venkat (vpuvvada at in.ibm.com) From: "Simon Thompson (IT Research Support)" > To: gpfsug main discussion list > Date: 08/23/2017 03:48 PM Subject: Re: [gpfsug-discuss] AFM weirdness Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ OK so I checked and if I run directly on the "AFM" FS in a different "non AFM" directory, it works fine, so its something AFM related ... Simon On 23/08/2017, 11:11, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support)" on behalf of S.J.Thompson at bham.ac.uk> wrote: >We're using an AFM cache from our HPC nodes to access data in another GPFS >cluster, mostly this seems to be working fine, but we've just come across >an interesting problem with a user using gfortran from the GCC 5.2.0 >toolset. > >When linking their code, they get a "no space left on device" error back >from the linker. If we do this on a node that mounts the file-system >directly (I.e. Not via AFM cache), then it works fine. > >We tried with GCC 4.5 based tools and it works OK, but the difference >there is that 4.x uses ld and 5x uses ld.gold. > >If we strike the ld.gold when using AFM, we see: > >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = -1 ENOSPC (No space left on >device) > > > >Vs when running directly on the file-system: >stat("program", {st_mode=S_IFREG|0775, st_size=248480, ...}) = 0 >unlink("program") = 0 >open("program", O_RDWR|O_CREAT|O_TRUNC|O_CLOEXEC, 0777) = 30 >fstat(30, {st_mode=S_IFREG|0775, st_size=0, ...}) = 0 >fallocate(30, 0, 0, 248480) = 0 > > > >Anyone seen anything like this before? > >... Actually I'm about to go off and see if its a function of AFM, or >maybe something to do with the FS in use (I.e. Make a local directory on >the filesystem on the "AFM" FS and see if that works ...) > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=92LOlNh2yLzrrGTDA7HnfF8LFr55zGxghLZtvZcZD7A&m=UqTzoU-bx454OgyeB4f0Nrruvs7yYAxFutzIe2eKmnc&s=8E5opHyyAwomLS8kdxpvKCvf6sdKBLlfZvx6wDdaZy4&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From kums at us.ibm.com Fri Aug 25 22:36:39 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 25 Aug 2017 17:36:39 -0400 Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing In-Reply-To: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> References: <8704990D-C56E-4A57-B919-A90A491C85DB@adventone.com> Message-ID: Hi, >>I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? >> I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. 
Please ensure that all the recommended FPO settings (e.g. allowWriteAffinity=yes in the FPO storage pool, readReplicaPolicy=local, restripeOnDiskFailure=yes) are set properly. Please find the FPO Best practices/tunings, in the links below: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20%28GPFS%29/page/Big%20Data%20Best%20practices https://www.ibm.com/developerworks/community/wikis/form/anonymous/api/wiki/fa32927c-e904-49cc-a4cc-870bcc8e307c/page/ab5c2792-feef-4a3a-a21b-d22c6f5d728a/attachment/80d5c300-7b39-4d6e-9596-84934fcc4638/media/Deploying_a_big_data_solution_using_IBM_Spectrum_Scale_v1.7.5.pdf >> For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). >> Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. With FPO, GPFS metadata (-m) and data replication (-r) needs to be enabled. The Write-affinity-Depth (WAD) setting defines the policy for directing writes. It indicates that the node writing the data directs the write to disks on its own node for the first copy and to the disks on other nodes for the second and third copies (if specified). readReplicaPolicy=local will enable the policy to read replicas from local disks. At the minimum, ensure that the networking used for GPFS is sized properly and has bandwidth 2X or 3X that of the local disk speeds to ensure FPO write bandwidth is not being constrained by GPFS replication over the network. For example, if 24 x Drives in RAID-0 results in ~4.8 GB/s (assuming ~200MB/s per drive) and GPFS metadata/data replication is set to 3 (-m 3 -r 3) then for optimal FPO write bandwidth, we need to ensure the network-interconnect between the FPO nodes is non-blocking/high-speed and can sustain ~14.4 GB/s ( data_replication_factor * local_storage_bandwidth). One possibility, is minimum of 2 x EDR Infiniband (configure GPFS verbsRdma/verbsPorts) or bonded 40GigE between the FPO nodes (for GPFS daemon-to-daemon communication). Application reads requiring FPO reads from remote GPFS node would as well benefit from high-speed network-interconnect between the FPO nodes. Regards, -Kums From: Evan Koutsandreou To: "gpfsug-discuss at spectrumscale.org" Date: 08/20/2017 11:06 PM Subject: [gpfsug-discuss] Shared nothing (FPO) throughout / bandwidth sizing Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi - I was wondering if there are any good performance sizing guides for a spectrum scale shared nothing architecture (FPO)? For example, each node might consist of 24x storage drives (locally attached JBOD, no RAID array). I don't have any production experience using spectrum scale in a "shared nothing configuration " and was hoping for bandwidth / throughput sizing guidance. Given a particular node configuration I want to be in a position to calculate the maximum bandwidth / throughput. Thank you _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
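
For reuse with other node configurations, here is a small sketch of the arithmetic above. It simply applies the rule of thumb stated in the reply (required inter-node network bandwidth is roughly the data replication factor times the aggregate local disk bandwidth); the drive count and per-drive throughput are assumptions you supply:

#!/usr/bin/env python3
# Back-of-the-envelope FPO sizing: how much node-to-node network bandwidth
# is needed so GPFS replication does not throttle local write bandwidth.
# Rule of thumb from the reply above: network >= replicas * local_bandwidth.

def fpo_network_requirement(drives_per_node, mb_per_s_per_drive, data_replicas):
    local_gb_s = drives_per_node * mb_per_s_per_drive / 1000.0  # aggregate local write speed
    needed_gb_s = data_replicas * local_gb_s                    # replication traffic to sustain
    return local_gb_s, needed_gb_s

# Example from the thread: 24 drives at ~200 MB/s each, -m 3 -r 3
local, needed = fpo_network_requirement(24, 200, 3)
print(f"local disk bandwidth : ~{local:.1f} GB/s")
print(f"network to sustain   : ~{needed:.1f} GB/s")

With 24 drives at ~200 MB/s and replication of 3 it reproduces the ~4.8 GB/s local and ~14.4 GB/s network figures quoted above.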
URL: From scale at us.ibm.com Fri Aug 25 23:41:53 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 25 Aug 2017 18:41:53 -0400 Subject: [gpfsug-discuss] multicluster security In-Reply-To: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> References: <83936033-ce82-0a9b-3714-1dbea4c317db@nasa.gov> Message-ID: Hi Aaron, If cluster A uses the mmauth command to grant a file system read-only access to a remote cluster B, nodes on cluster B can only mount that file system with read-only access. But the only checking being done at the RPC level is the TLS authentication. This should prevent non-root users from initiating RPCs, since TLS authentication requires access to the local cluster's private key. However, a root user on cluster B, having access to cluster B's private key, might be able to craft RPCs that may allow one to work around the checks which are implemented at the file system level. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: Aaron Knister To: gpfsug main discussion list Date: 08/21/2017 11:04 PM Subject: [gpfsug-discuss] multicluster security Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Everyone, I have a theoretical question about GPFS multiclusters and security. Let's say I have clusters A and B. Cluster A is exporting a filesystem as read-only to cluster B. Where does the authorization burden lay? Meaning, does the security rely on mmfsd in cluster B to behave itself and enforce the conditions of the multi-cluster export? Could someone using the credentials on a compromised node in cluster B just start sending arbitrary nsd read/write commands to the nsds from cluster A (or something along those lines)? Do the NSD servers in cluster A do any sort of sanity or security checking on the I/O requests coming from cluster B to the NSDs they're serving to exported filesystems? I imagine any enforcement would go out the window with shared disks in a multi-cluster environment since a compromised node could just "dd" over the LUNs. Thanks! -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=IbxtjdkPAM2Sbon4Lbbi4w&m=oK_bEPbjuD7j6qLTHbe7HM4ujUlpcNYtX3tMW2QC7_w&s=BliMQ0pToLIIiO1jfyUp2Q3icewcONrcmHpsIj_hMtY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sat Aug 26 20:39:58 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sat, 26 Aug 2017 19:39:58 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Message-ID: Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From stockf at us.ibm.com Sun Aug 27 01:35:06 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Sat, 26 Aug 2017 20:35:06 -0400 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sun Aug 27 14:32:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sun, 27 Aug 2017 13:32:20 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> Fred / All, Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? Kevin On Aug 26, 2017, at 7:35 PM, Frederick Stock > wrote: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kums at us.ibm.com Sun Aug 27 23:07:17 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Sun, 27 Aug 2017 18:07:17 -0400 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> References: <08F87840-CA69-4DC9-A792-C5C27D9F998E@vanderbilt.edu> Message-ID: Hi Kevin, >> Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? I presume, by "mmrestripefs data loss bug" you are referring to APAR IV98609 (link below)? If yes, 4.2.3.4 contains the fix for APAR IV98609. http://www-01.ibm.com/support/docview.wss?uid=ssg1S1010487 Problems fixed in GPFS 4.2.3.4 (details in link below): https://www.ibm.com/developerworks/community/forums/html/topic?id=f3705faa-b6aa-415c-a3e6-1fe9d8293db1&ps=25 * This update addresses the following APARs: IV98545 IV98609 IV98640 IV98641 IV98643 IV98683 IV98684 IV98685 IV98686 IV98687 IV98701 IV99044 IV99059 IV99060 IV99062 IV99063. Regards, -Kums From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/27/2017 09:32 AM Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Fred / All, Thanks - important followup question ? does 4.2.3.4 contain the fix for the mmrestripefs data loss bug that was announced last week? Thanks again? Kevin On Aug 26, 2017, at 7:35 PM, Frederick Stock wrote: The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=McIf98wfiVqHU8ZygezLrQ&m=0rUCqrbJ4Ny44Rmr8x8HvX5q4yqS-4tkN02fiIm9ttg&s=FYfr0P3sVBhnGGsj33W-A9JoDj7X300yTt5D4y5rpJY&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Aug 28 13:26:35 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 28 Aug 2017 08:26:35 -0400 Subject: [gpfsug-discuss] sas avago/lsi hba reseller recommendation Message-ID: We have several avago/lsi 9305-16e that I believe came from Advanced HPC. Can someone recommend a another reseller of these hbas or a contact with Advance HPC? -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Mon Aug 28 13:36:16 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Mon, 28 Aug 2017 12:36:16 +0000 Subject: [gpfsug-discuss] sas avago/lsi hba reseller recommendation In-Reply-To: References: Message-ID: <28676C04-60E6-4AB6-8FEF-24EA719E8786@nasa.gov> Hi Eric, I shot you an email directly with contact info. -Aaron On August 28, 2017 at 08:26:56 EDT, J. Eric Wonderley wrote: We have several avago/lsi 9305-16e that I believe came from Advanced HPC. Can someone recommend a another reseller of these hbas or a contact with Advance HPC? -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Aug 29 15:30:25 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 29 Aug 2017 14:30:25 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? 
From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Aug 29 16:53:51 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 29 Aug 2017 15:53:51 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. 
Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Aug 29 18:52:41 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 29 Aug 2017 17:52:41 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: , Message-ID: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Buterbaugh, Kevin L Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 14:54:41 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 13:54:41 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! 
? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Wed Aug 30 14:56:29 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 30 Aug 2017 13:56:29 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Aug 30 15:06:00 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 30 Aug 2017 14:06:00 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Oh, the first one looks like the AFM issue I mentioned a couple of days back with Abaqus ... (if you use Abaqus on your AFM cache, then this is for you!) Simon From: > on behalf of "Buterbaugh, Kevin L" > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Wednesday, 30 August 2017 at 14:54 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From:gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. 
I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Wed Aug 30 15:12:30 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Wed, 30 Aug 2017 14:12:30 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! 
Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 30 15:21:09 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 30 Aug 2017 14:21:09 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: References: Message-ID: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? 
Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. 
The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 15:28:07 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 30 Aug 2017 14:28:07 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> References: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> Message-ID: <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> Hi Bryan, NO - it has the fix for the mmrestripefs data loss bug, but you need the efix on top of 4.2.3-4 for the mmadddisk / mmdeldisk issue. Let me take this opportunity to also explain a workaround that has worked for us so far for that issue ? the basic problem is two-fold (on our cluster, at least). First, the /var/mmfs/gen/mmsdrfs file isn?t making it out to all nodes all the time. That is simple enough to fix (mmrefresh -fa) and verify that it?s fixed (md5sum /var/mmfs/gen/mmsdrfs). Second, however - and this is the real problem ? some nodes are never actually rereading that file and therefore have incorrect information *in memory*. This has been especially problematic for us as we are replacing a batch of 80 8 TB drives with bad firmware. I am therefore deleting and subsequently recreating NSDs *with the same name*. If a client node still has the ?old? information in memory then it unmounts the filesystem when I try to mmadddisk the new NSD. The workaround is to identify those nodes (mmfsadm dump nsd and grep for the identifier of the NSD(s) in question) and force them to reread the info (tsctl rereadnsd). HTH? Kevin On Aug 30, 2017, at 9:21 AM, Bryan Banister > wrote: Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. 
In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C493f1f9e41e343324f1508d4efb25f4f%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636396996783027614&sdata=qYxCMMg9O31LzFg%2FQkCdQg8vV%2FgL2AuRk%2B6V2j76c7Y%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Aug 30 15:30:07 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 30 Aug 2017 14:30:07 +0000 Subject: [gpfsug-discuss] GPFS 4.2.3.4 question In-Reply-To: <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> References: <4ca675e4cf744ebba3371ade206e3f54@jumptrading.com> <34A83CEB-E203-41DE-BF3A-F2BACD42D731@vanderbilt.edu> Message-ID: <48dac1a1fc6945fdb0d8e94cb7269e3a@jumptrading.com> Thanks for the excellent description? I have my PMR open for the e-fix, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: Wednesday, August 30, 2017 9:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Hi Bryan, NO - it has the fix for the mmrestripefs data loss bug, but you need the efix on top of 4.2.3-4 for the mmadddisk / mmdeldisk issue. Let me take this opportunity to also explain a workaround that has worked for us so far for that issue ? the basic problem is two-fold (on our cluster, at least). First, the /var/mmfs/gen/mmsdrfs file isn?t making it out to all nodes all the time. That is simple enough to fix (mmrefresh -fa) and verify that it?s fixed (md5sum /var/mmfs/gen/mmsdrfs). Second, however - and this is the real problem ? some nodes are never actually rereading that file and therefore have incorrect information *in memory*. This has been especially problematic for us as we are replacing a batch of 80 8 TB drives with bad firmware. I am therefore deleting and subsequently recreating NSDs *with the same name*. If a client node still has the ?old? information in memory then it unmounts the filesystem when I try to mmadddisk the new NSD. 
The workaround is to identify those nodes (mmfsadm dump nsd and grep for the identifier of the NSD(s) in question) and force them to reread the info (tsctl rereadnsd). HTH? Kevin On Aug 30, 2017, at 9:21 AM, Bryan Banister > wrote: Ok, I?m completely confused? You?re saying 4.2.3-4 *has* the fix for adding/deleting NSDs? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: Wednesday, August 30, 2017 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Note: External Email ________________________________ Aha, I?ve just realised what you actually said, having seen Simon?s response and twigged. The defect 1020461 matches what IBM has told me in my PMR about adding/deleting NSDs. I?m not sure why the description mentions networking though! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sobey, Richard A Sent: 30 August 2017 14:56 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question No worries, I?ve got it sorted and hopefully about to grab the 4.2.3-4 efix2. Cheers for your help! Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Buterbaugh, Kevin L Sent: 30 August 2017 14:55 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Well, I?m not sure, which is why it?s taken me a while to respond. In the README that comes with the efix it lists: Defect APAR Description 1032655 None AFM: Fix Truncate filtering Write incorrectly 1020461 None FS can't be mounted after weird networking error That 1st one is obviously not it and that 2nd one doesn?t reference mmadddisk / mmdeldisk. Plus neither show an APAR number. Sorry I can?t be of more help? Kevin On Aug 29, 2017, at 12:52 PM, Sobey, Richard A > wrote: Thanks Kevin, that's good to know. Is there an apar I need to quote in my pmr? Get Outlook for Android ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org > on behalf of Buterbaugh, Kevin L > Sent: Tuesday, August 29, 2017 4:53:51 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question Hi Richard, Since I upgraded my cluster to GPFS 4.2.3.4 over the weekend IBM created an efix for it for the NSD deletion / creation fix. I?m sure they?ll give it to you, too? ;-) Kevin On Aug 29, 2017, at 9:30 AM, Sobey, Richard A > wrote: So I can upgrade to 4.2.3-4 to get the mmrestripe fix, or 4.2.3-3 efix3 to get the NSD deletion and creation fix? Not great when on Monday I?m doing a load of all this. What?s the recommendation? Is there a one size fits all patch? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: 27 August 2017 01:35 To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] GPFS 4.2.3.4 question The only change missing is the change delivered in 4.2.3 PTF3 efix3 which was provided on August 22. The problem had to do with NSD deletion and creation. 
Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 08/26/2017 03:40 PM Subject: [gpfsug-discuss] GPFS 4.2.3.4 question Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, Does anybody know if GPFS 4.2.3.4, which came out today, contains all the patches that are in GPFS 4.2.3.3 efix3? If anybody does, and can respond, I?d greatly appreciate it. Our cluster is in a very, very bad state right now and we may need to just take it down and bring it back up. I was already planning on rolling out GPFS 4.2.3.3 efix 3 over the next few weeks anyway, so if I can just go to 4.2.3.4 that would be great? Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=p_1XEUyoJ7-VJxF_w8h9gJh8_Wj0Pey73LCLLoxodpw&m=7r9GsD1C2HiY4j21vPYIoQPHXePHxeMhzQeaw_ne4lM&s=-SFnqoJw--FN3wqClEEBGa9-XSLljgSseIU_SxGoWy0&e= _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C493f1f9e41e343324f1508d4efb25f4f%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636396996783027614&sdata=qYxCMMg9O31LzFg%2FQkCdQg8vV%2FgL2AuRk%2B6V2j76c7Y%3D&reserved=0 ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product.
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Kevin.Buterbaugh at Vanderbilt.Edu Wed Aug 30 20:26:41 2017
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Wed, 30 Aug 2017 19:26:41 +0000
Subject: [gpfsug-discuss] Permissions issue in GPFS 4.2.3-4?
Message-ID: 

Hi All,

We have a script that takes the output of mmlsfs and mmlsquota and formats a users' GPFS quota usage into something a little "nicer" than what mmlsquota displays (and doesn't display 50 irrelevant lines of output for filesets they don't have access to). After upgrading to 4.2.3-4 over the weekend it started throwing errors it hadn't before:

awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied)
mmlsfs: Unexpected error from awk. Return code: 2
awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied)
mmlsfs: Unexpected error from awk. Return code: 2
Home (user): 11.82G 30G 40G 10807 200000 300000
awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied)
mmlsquota: Unexpected error from awk. Return code: 2

It didn't take long to track down that the mmfs.cfg.show file had permissions of 600 and a chmod 644 of it (on our login gateways only, which is the only place users run that script anyway) fixed the problem.

So I just wanted to see if this was a known issue in 4.2.3-4? Notice that the error appears to be coming from the GPFS commands my script runs, not my script itself ... I sure don't call awk! ;-)

Thanks...

Kevin
--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From david_johnson at brown.edu Wed Aug 30 20:34:46 2017
From: david_johnson at brown.edu (David Johnson)
Date: Wed, 30 Aug 2017 15:34:46 -0400
Subject: [gpfsug-discuss] Permissions issue in GPFS 4.2.3-4?
In-Reply-To: 
References: 
Message-ID: <13019F3B-AF64-4D92-AAB1-4CF3A635383C@brown.edu>

We ran into this back in mid February. Never really got a satisfactory answer how it got this way, the thought was that a bunch of nodes were expelled during an mmchconfig, and the files ended up with the wrong permissions.
  -- ddj

> On Aug 30, 2017, at 3:26 PM, Buterbaugh, Kevin L wrote:
> 
> Hi All,
> 
> We have a script that takes the output of mmlsfs and mmlsquota and formats a users' GPFS quota usage into something a little "nicer" than what mmlsquota displays (and doesn't display 50 irrelevant lines of output for filesets they don't have access to). After upgrading to 4.2.3-4 over the weekend it started throwing errors it hadn't before:
> 
> awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied)
> mmlsfs: Unexpected error from awk. Return code: 2
> awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied)
> mmlsfs: Unexpected error from awk. Return code: 2
> Home (user): 11.82G 30G 40G 10807 200000 300000
> awk: cmd. line:11: fatal: cannot open file `/var/mmfs/gen/mmfs.cfg.show' for reading (Permission denied)
> mmlsquota: Unexpected error from awk. Return code: 2
> 
> It didn't take long to track down that the mmfs.cfg.show file had permissions of 600 and a chmod 644 of it (on our login gateways only, which is the only place users run that script anyway) fixed the problem.
> 
> So I just wanted to see if this was a known issue in 4.2.3-4? Notice that the error appears to be coming from the GPFS commands my script runs, not my script itself ... I sure don't call awk! ;-)
> 
> Thanks...
> 
> Kevin
> --
> Kevin Buterbaugh - Senior System Administrator
> Vanderbilt University - Advanced Computing Center for Research and Education
> Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
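A minimal, unofficial sketch of the fix described in this thread is below: it checks the mode of /var/mmfs/gen/mmfs.cfg.show on the nodes where users run the quota-reporting wrapper and restores 644 where needed. The node names are placeholders, it assumes root ssh access to each node, and whether 644 is acceptable is a site decision; sites that prefer the GPFS-supplied remote shell wrapper could run the same chmod via mmdsh instead, if available.

    #!/bin/bash
    # Sketch: report and repair the permissions on mmfs.cfg.show on the
    # login/gateway nodes where users run mmlsfs/mmlsquota wrappers.
    # NODES is a placeholder list; assumes root ssh to each node.
    CFG=/var/mmfs/gen/mmfs.cfg.show
    NODES="login1 login2"

    for n in $NODES; do
        mode=$(ssh "$n" stat -c '%a' "$CFG" 2>/dev/null)
        if [ -z "$mode" ]; then
            echo "$n: could not stat $CFG"
        elif [ "$mode" != "644" ]; then
            echo "$n: $CFG is mode $mode, resetting to 644"
            ssh "$n" chmod 644 "$CFG"
        else
            echo "$n: $CFG is already 644"
        fi
    done

If the reply above is right that the permissions changed as a side effect of an mmchconfig while nodes were expelled, the same check may need to be repeated after future configuration changes or PTF updates.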
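Separately, going back to the mmsdrfs / stale-NSD workaround described in the "GPFS 4.2.3.4 question" thread earlier in this digest: the sketch below simply strings those manual steps together and is not an IBM-provided procedure. The node list and the old NSD identifier are placeholders, root ssh to the client nodes and the standard /usr/lpp/mmfs/bin paths are assumed, and the output of mmfsadm dump is not a stable interface, so treat the grep as a rough heuristic for spotting nodes that still hold the old descriptor in memory.

    #!/bin/bash
    # Sketch of the manual workaround quoted above:
    #  1. resync the cluster configuration file (mmsdrfs) and compare checksums
    #  2. find client nodes whose in-memory NSD table still shows the old NSD
    #  3. have those nodes reread the NSD descriptors before retrying mmadddisk
    # NODES and OLDNSD are placeholders for your client nodes and the old NSD
    # name or identifier.
    NODES="client1 client2 client3"
    OLDNSD="old_nsd_identifier"

    /usr/lpp/mmfs/bin/mmrefresh -fa        # step 1: push mmsdrfs out to all nodes
    ref=$(md5sum /var/mmfs/gen/mmsdrfs | awk '{print $1}')

    for n in $NODES; do
        # compare each node's copy of mmsdrfs against the local one
        sum=$(ssh "$n" md5sum /var/mmfs/gen/mmsdrfs | awk '{print $1}')
        [ "$sum" = "$ref" ] || echo "$n: mmsdrfs checksum differs ($sum vs $ref)"

        # steps 2 and 3: if the old identifier still appears in the daemon's NSD
        # dump, ask that node to reread the NSD descriptors
        if ssh "$n" /usr/lpp/mmfs/bin/mmfsadm dump nsd 2>/dev/null | grep -q "$OLDNSD"; then
            echo "$n: $OLDNSD still in 'mmfsadm dump nsd', running tsctl rereadnsd"
            ssh "$n" /usr/lpp/mmfs/bin/tsctl rereadnsd
        fi
    done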