From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 5 17:40:30 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 5 Apr 2017 16:40:30 +0000 Subject: [gpfsug-discuss] Can't delete filesystem Message-ID: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> Hi All, First off, I can open a PMR on this if I need to... I am trying to delete a GPFS filesystem but mmdelfs is telling me that the filesystem is still mounted on 14 nodes and therefore can't be deleted. 10 of those nodes are my 10 GPFS servers and they have an "internal mount" still mounted. IIRC, it's the other 4 (client) nodes I need to concentrate on - i.e. once those other 4 clients no longer have it mounted the internal mounts will resolve themselves. Correct me if I'm wrong on that, please. So, I have gone to all of the 4 clients and none of them say they have it mounted according to either "df" or "mount". I've gone ahead and run both "mmunmount" and "umount -l" on the filesystem anyway, but the mmdelfs still fails saying that they have it mounted. What do I need to do to resolve this issue on those 4 clients? Thanks... Kevin -- Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Apr 5 17:47:36 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Wed, 5 Apr 2017 16:47:36 +0000 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> Message-ID: Do you have ILM (dsmrecalld and friends) running? They can also stop the filesystem being released (e.g. mmshutdown fails if they are up). Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] Sent: 05 April 2017 17:40 To: gpfsug main discussion list Subject: [gpfsug-discuss] Can't delete filesystem Hi All, First off, I can open a PMR on this if I need to... I am trying to delete a GPFS filesystem but mmdelfs is telling me that the filesystem is still mounted on 14 nodes and therefore can't be deleted. 10 of those nodes are my 10 GPFS servers and they have an "internal mount" still mounted. IIRC, it's the other 4 (client) nodes I need to concentrate on - i.e. once those other 4 clients no longer have it mounted the internal mounts will resolve themselves. Correct me if I'm wrong on that, please. So, I have gone to all of the 4 clients and none of them say they have it mounted according to either "df" or "mount". I've gone ahead and run both "mmunmount" and "umount -l" on the filesystem anyway, but the mmdelfs still fails saying that they have it mounted. What do I need to do to resolve this issue on those 4 clients? Thanks... Kevin --
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 From valdis.kletnieks at vt.edu Wed Apr 5 17:54:16 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Wed, 05 Apr 2017 12:54:16 -0400 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> Message-ID: <7103.1491411256@turing-police.cc.vt.edu> On Wed, 05 Apr 2017 16:40:30 -0000, "Buterbaugh, Kevin L" said: > So, I have gone to all of the 4 clients and none of them say they have it > mounted according to either "df" or "mount". I've gone ahead and run both > "mmunmount" and "umount -l" on the filesystem anyway, but the mmdelfs still > fails saying that they have it mounted. I've over the years seen this a few times. Doing an 'mmshutdown/mmstartup' pair on the offending nodes has always cleared it up. I probably should have opened a PMR, but it always seems to happen when I'm up to in alligators with other issues. (Am I the only person who wonders why all complex software packages contain alligator-detector routines? :) -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 484 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 5 17:54:14 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 5 Apr 2017 16:54:14 +0000 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> Message-ID: <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> Hi Simon, No, I do not. Let me also add that this is a filesystem that I migrated users off of and to another GPFS filesystem. I moved the last users this morning and then ran an "mmunmount" across the whole cluster via mmdsh. Therefore, if the simple solution is to use the "-p" option to mmdelfs I'm fine with that. I'm just not sure what the right course of action is at this point. Thanks again... Kevin > On Apr 5, 2017, at 11:47 AM, Simon Thompson (Research Computing - IT Services) wrote: > > Do you have ILM (dsmrecalld and friends) running? > > They can also stop the filesystem being released (e.g. mmshutdown fails if they are up). > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] > Sent: 05 April 2017 17:40 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Can't delete filesystem > > Hi All, > > First off, I can open a PMR on this if I need to... > > I am trying to delete a GPFS filesystem but mmdelfs is telling me that the filesystem is still mounted on 14 nodes and therefore can't be deleted. 10 of those nodes are my 10 GPFS servers and they have an "internal mount" still mounted. IIRC, it's the other 4 (client) nodes I need to concentrate on - i.e. once those other 4 clients no longer have it mounted the internal mounts will resolve themselves. Correct me if I'm wrong on that, please. > > So, I have gone to all of the 4 clients and none of them say they have it mounted according to either "df" or "mount". I've gone ahead and run both "mmunmount" and "umount -l"
on the filesystem anyway, but the mmdelfs still fails saying that they have it mounted. > > What do I need to do to resolve this issue on those 4 clients? Thanks? > > Kevin > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Wed Apr 5 22:51:15 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 05 Apr 2017 21:51:15 +0000 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> Message-ID: Maybe try mmumount -f on the remaining 4 nodes? -jf ons. 5. apr. 2017 kl. 18.54 skrev Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu>: > Hi Simon, > > No, I do not. > > Let me also add that this is a filesystem that I migrated users off of and > to another GPFS filesystem. I moved the last users this morning and then > ran an ?mmunmount? across the whole cluster via mmdsh. Therefore, if the > simple solution is to use the ?-p? option to mmdelfs I?m fine with that. > I?m just not sure what the right course of action is at this point. > > Thanks again? > > Kevin > > > On Apr 5, 2017, at 11:47 AM, Simon Thompson (Research Computing - IT > Services) wrote: > > > > Do you have ILM (dsmrecalld and friends) running? > > > > They can also stop the filesystem being released (e.g. mmshutdown fails > if they are up). > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin > L [Kevin.Buterbaugh at Vanderbilt.Edu] > > Sent: 05 April 2017 17:40 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] Can't delete filesystem > > > > Hi All, > > > > First off, I can open a PMR on this if I need to? > > > > I am trying to delete a GPFS filesystem but mmdelfs is telling me that > the filesystem is still mounted on 14 nodes and therefore can?t be > deleted. 10 of those nodes are my 10 GPFS servers and they have an > ?internal mount? still mounted. IIRC, it?s the other 4 (client) nodes I > need to concentrate on ? i.e. once those other 4 clients no longer have it > mounted the internal mounts will resolve themselves. Correct me if I?m > wrong on that, please. > > > > So, I have gone to all of the 4 clients and none of them say they have > it mounted according to either ?df? or ?mount?. I?ve gone ahead and run > both ?mmunmount? and ?umount -l? on the filesystem anyway, but the mmdelfs > still fails saying that they have it mounted. > > > > What do I need to do to resolve this issue on those 4 clients? Thanks? > > > > Kevin > > > > ? 
> > Kevin Buterbaugh - Senior System Administrator > > Vanderbilt University - Advanced Computing Center for Research and Education > > Kevin.Buterbaugh at vanderbilt.edu > - (615)875-9633 > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Thu Apr 6 02:54:07 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Thu, 6 Apr 2017 01:54:07 +0000 Subject: [gpfsug-discuss] AFM misunderstanding Message-ID: When I set up an AFM relationship (let's just say I'm doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a --metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls -ltrs on the cache it's still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How do I get the bits to flow before I request them, assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I'm more used to mirroring so maybe that's my frame of reference and it's not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Apr 6 09:20:31 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 6 Apr 2017 08:20:31 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption Message-ID: We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present.
Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon From vpuvvada at in.ibm.com Thu Apr 6 11:45:37 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Thu, 6 Apr 2017 16:15:37 +0530 Subject: [gpfsug-discuss] AFM misunderstanding In-Reply-To: References: Message-ID: Could you explain "bits of actual file" mentioned below? Prefetch with --metadata-only pulls everything (xattrs, ACLs etc..) except data. Doing "ls -ltrs" shows file allocation size as zero if data prefetch has not yet completed on them. ~Venkat (vpuvvada at in.ibm.com) From: Mark Bush To: gpfsug main discussion list Date: 04/06/2017 07:24 AM Subject: [gpfsug-discuss] AFM misunderstanding Sent by: gpfsug-discuss-bounces at spectrumscale.org When I set up an AFM relationship (let's just say I'm doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a --metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls -ltrs on the cache it's still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How do I get the bits to flow before I request them, assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I'm more used to mirroring so maybe that's my frame of reference and it's not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you.
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Thu Apr 6 13:28:40 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Thu, 6 Apr 2017 12:28:40 +0000 Subject: [gpfsug-discuss] AFM misunderstanding In-Reply-To: References: Message-ID: <425C32E7-B752-4B61-BDF5-83C219D89ADB@siriuscom.com> I think I was missing a key piece in that I thought that just doing a mmafmctl fs1 prefetch -j cache would start grabbing everything (data and metadata) but it appears that the --list-file myfiles.txt is the trigger for the prefetch to work properly. I mistakenly assumed that omitting the --list-file switch would prefetch all the data in the fileset. From: on behalf of Venkateswara R Puvvada Reply-To: gpfsug main discussion list Date: Thursday, April 6, 2017 at 5:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM misunderstanding Could you explain "bits of actual file" mentioned below? Prefetch with --metadata-only pulls everything (xattrs, ACLs etc..) except data. Doing "ls -ltrs" shows file allocation size as zero if data prefetch has not yet completed on them. ~Venkat (vpuvvada at in.ibm.com) From: Mark Bush To: gpfsug main discussion list Date: 04/06/2017 07:24 AM Subject: [gpfsug-discuss] AFM misunderstanding Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ When I set up an AFM relationship (let's just say I'm doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a --metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls -ltrs on the cache it's still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How do I get the bits to flow before I request them, assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I'm more used to mirroring so maybe that's my frame of reference and it's not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL:
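As a rough sketch of that workflow (the list-file path below is just a placeholder; fs1 and cache are the filesystem and fileset names used in the messages above), the prefetch can be driven from a generated file list and the queue watched with getstate:

    # build a list of files to prefetch, one path per line
    find /gpfs/fs1/cache -type f > /tmp/prefetch.list

    # queue data prefetch for everything in the list, then watch the queue drain
    mmafmctl fs1 prefetch -j cache --list-file /tmp/prefetch.list
    mmafmctl fs1 getstate -j cache

The exact prefetch options vary between Spectrum Scale releases, so the mmafmctl man page for the installed level is the reference to trust.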
From Kevin.Buterbaugh at Vanderbilt.Edu Thu Apr 6 15:33:18 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 6 Apr 2017 14:33:18 +0000 Subject: Re: [gpfsug-discuss] Can't delete filesystem In-Reply-To: References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> Message-ID: Hi JF, I actually tried that - to no effect. Yesterday evening I rebooted the 4 clients and, as expected, the 10 servers released their internal mounts as well - and then I was able to delete the filesystem successfully. Thanks for the suggestions, all... Kevin On Apr 5, 2017, at 4:51 PM, Jan-Frode Myklebust > wrote: Maybe try mmumount -f on the remaining 4 nodes? -jf On Wed, 5 Apr 2017 at 18:54, Buterbaugh, Kevin L > wrote: Hi Simon, No, I do not. Let me also add that this is a filesystem that I migrated users off of and to another GPFS filesystem. I moved the last users this morning and then ran an "mmunmount" across the whole cluster via mmdsh. Therefore, if the simple solution is to use the "-p" option to mmdelfs I'm fine with that. I'm just not sure what the right course of action is at this point. Thanks again... Kevin > On Apr 5, 2017, at 11:47 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Do you have ILM (dsmrecalld and friends) running? > > They can also stop the filesystem being released (e.g. mmshutdown fails if they are up). > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] > Sent: 05 April 2017 17:40 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Can't delete filesystem > > Hi All, > > First off, I can open a PMR on this if I need to... > > I am trying to delete a GPFS filesystem but mmdelfs is telling me that the filesystem is still mounted on 14 nodes and therefore can't be deleted. 10 of those nodes are my 10 GPFS servers and they have an "internal mount" still mounted. IIRC, it's the other 4 (client) nodes I need to concentrate on - i.e. once those other 4 clients no longer have it mounted the internal mounts will resolve themselves. Correct me if I'm wrong on that, please. > > So, I have gone to all of the 4 clients and none of them say they have it mounted according to either "df" or "mount". I've gone ahead and run both "mmunmount" and "umount -l" on the filesystem anyway, but the mmdelfs still fails saying that they have it mounted. > > What do I need to do to resolve this issue on those 4 clients? Thanks... > > Kevin > > -- > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL:
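Before falling back to a reboot, it can also be worth asking GPFS itself which nodes it still counts as having the filesystem mounted, since df and mount on the clients will not show the internal mounts. A minimal check, with gpfs1 and the client node names as placeholders, might look like:

    # show every node that still holds a mount, including internal mounts on the NSD servers
    mmlsmount gpfs1 -L

    # force the unmount on any stragglers, then retry the delete
    mmumount gpfs1 -N client1,client2
    mmdelfs gpfs1

If a mount still refuses to release, an mmshutdown/mmstartup cycle on the offending nodes (as suggested earlier in the thread) or a reboot remains the fallback.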
From ewahl at osc.edu Thu Apr 6 15:54:42 2017 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 6 Apr 2017 14:54:42 +0000 Subject: Re: [gpfsug-discuss] Spectrum Scale Encryption In-Reply-To: References: Message-ID: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> This is rather dependent on SS version. So what used to happen before 4.2.2.* is that a client would be unable to mount the filesystem in question and would give an error in the mmfs.log.latest for an SGPanic. In 4.2.2.* it appears it will now mount the file system and then give errors on file access instead. (just tested this on 4.2.2.3) I'll have to read through the changelogs looking for this one. Depending on your policy for encryption then, this might be exactly what you want, but I REALLY REALLY dislike this behaviour. To me this means clients can now mount an encrypted FS and then fail during operation. If I get a client node that comes up improperly, user work will start, and it will fail with "Operation not permitted" errors on file access. I imagine my batch system could run through a massive amount of jobs on a bad client without anyone noticing immediately. Yet another thing we have to monitor now I guess. *shrug* A couple other gotchas we've seen with Encryption: Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. Logs will fill with 'Key could not be fetched' errors. Ed Wahl OSC ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Thursday, April 06, 2017 4:20 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale Encryption We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf?
If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Thu Apr 6 16:11:38 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 6 Apr 2017 15:11:38 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption Message-ID: Hi Ed, Thanks. We already have several SKLM servers (tape backups). For me, we plan to encrypt specific parts of the FS (probably by file-set), so as long as all that is needed is an empty RKM.conf file, sounds like it will work. I suppose I could have an MEK that is granted to all clients, but then never actually use it for encryption if RKM.conf needs at least one key (hack hack hack). (We are at 4.2.2-2 (mostly) or higher (a few nodes)). I *thought* the FEK was wrapped in the metadata with the MEK (possibly multiple times with different MEKs), so what the docs say about operation continuing with no SKLM server sounds sensible, but of course that might not be what actually happens I guess... Simon On 06/04/2017, 15:54, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Wahl, Edward" wrote: >This is rather dependant on SS version. > >So what used to happen before 4.2.2.* is that a client would be unable to >mount the filesystem in question and would give an error in the >mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount >the file system and then give errors on file access instead. (just >tested this on 4.2.2.3) I'll have to read through the changelogs looking >for this one. > >Depending on your policy for encryption then, this might be exactly what >you want, but I REALLY REALLY dislike this behaviour. > >To me this means clients can now mount an encrypted FS now and then fail >during operation. If I get a client node that comes up improperly, user >work will start, and it will fail with "Operation not permitted" errors >on file access. I imagine my batch system could run through a massive >amount of jobs on a bad client without anyone noticing immeadiately. Yet >another thing we now have to monitor now I guess. *shrug* > >A couple other gotcha's we've seen with Encryption: > >Encrypted file systems do not store data in large MD blocks. Makes >sense. This means large MD blocks aren't as useful as they are in >unencrypted FS, if you are using this. > >Having at least one backup SKLM server is a good idea. >"kmipServerUri[N+1]" in the conf. > >While the documentation claims the FS can continue operation once it >caches the MEK if an SKLM server goes away, in operation this does NOT >work as you may expect. Your users still need access to the FEKs for the >files your clients work on. Logs will fill with Key could not be >fetched. errors. 
> >Ed Wahl >OSC > >________________________________________ >From: gpfsug-discuss-bounces at spectrumscale.org >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson >(Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] >Sent: Thursday, April 06, 2017 4:20 AM >To: gpfsug-discuss at spectrumscale.org >Subject: [gpfsug-discuss] Spectrum Scale Encryption > >We are currently looking at adding encryption to our deployment for some >of our data sets and for some of our nodes. Apologies in advance if some >of this is a bit vague, we're not yet at the point where we can test this >stuff out, so maybe some of it will become clear when we try it out. > > >For a node that we don't want to have access to any encrypted data, what >do we need to set up? > >According to the docs: >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >s >cale.v4r22.doc/bl1adv_encryption_prep.htm > > >"After the file system is configured with encryption policy rules, the >file system is considered encrypted. From that point on, each node that >has access to that file system must have an RKM.conf file present. >Otherwise, the file system might not be mounted or might become >unmounted." > >So on a node which I don't want to have access to any encrypted files, do >I just need to have an empty RKM.conf file? > >(If this is the case, would be good to have this added to the docs) > > >Secondly ... (and maybe I'm misunderstanding the docs here) > >For the Policy >https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectr >u >m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm > > >KEYS ('Keyname'[, 'Keyname', ... ]) > > >KeyId:RkmId > > >RkmId should match the stanza name in RKM.conf? > >If so, it would be useful if the docs used the same names in the examples >(RKMKMIP3 vs rkmname3) > >And KeyId should match a "Key UUID" in SKLM? > > >Third. My understanding from talking to various IBM people is that we need >ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways >(probably), do we have to do any kind of node registration in ISKLM? Or is >this purely based on the certificates being distributed to clients and >keys are mapped in ISKLM to the client cert to determine if the node is >able to request the key? > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Jon.Edwards at newbase.com.au Fri Apr 7 05:56:33 2017 From: Jon.Edwards at newbase.com.au (Jon Edwards) Date: Fri, 7 Apr 2017 04:56:33 +0000 Subject: [gpfsug-discuss] Spectrum scale sending cluster traffic across the management network Message-ID: <7929c064d6df4d7b88065b4d882daa98@newbase.com.au> Hi All, Just getting started with spectrum scale, Just wondering if anyone has come across the issue where when doing a mmcrfs or mmdelfs you get the error Failed to connect to file system daemon: Connection timed out mmdelfs: tsdelfs failed. mmdelfs: Command failed. Examine previous error messages to determine cause. 
When viewing the logs in /var/mmfs/gen/mmfslog on a node other than the one I am running the command on, I get: 2017-04-07_14:03:13.354+1000: [N] Filtered log entry: 'connect to node 192.168.0.1:1191' occurred 10 times between 2017-04-07_11:38:19.058+1000 and 2017-04-07_11:54:58.649+1000 192.168.0.0/24 in this case is the management network configured on eth0 of all the nodes. It is failing because port 1191 is not allowed on this interface. The DNS and hostname for each node resolves to a dedicated cluster network, let's say 10.0.0.0/24 (ETH1). For some reason when I run the mmcrfs or mmdelfs it tries to talk back over the management network instead of the cluster network, which fails to connect due to the firewall blocking cluster traffic over management. Anyone seen this before? Kind Regards, Jon Edwards Senior Systems Engineer NewBase Email: jon.edwards at newbase.com.au Ph: + 61 7 3216 0776 Fax: + 61 7 3216 0779 http://www.newbase.com.au Opinions contained in this e-mail do not necessarily reflect the opinions of NewBase Computer Services Pty Ltd. This e-mail is for the exclusive use of the addressee and should not be disseminated further or copied without permission of the sender. If you have received this message in error, please immediately notify the sender and delete the message from your computer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jon.Edwards at newbase.com.au Fri Apr 7 06:26:56 2017 From: Jon.Edwards at newbase.com.au (Jon Edwards) Date: Fri, 7 Apr 2017 05:26:56 +0000 Subject: [gpfsug-discuss] Spectrum scale sending cluster traffic across the management network Message-ID: <6e02ed91cb404d46b7b5cd3515ad8fe9@newbase.com.au> Please disregard, found the solution. Found the subnets= parameter for the cluster config mmchconfig subnets="192.168.0.0/24 192.168.1.0/24" Which forces it to use this subnet. Kind Regards, Jon Edwards | Senior Systems Engineer NewBase Ph: + 61 7 3216 0776 | Email: jon.edwards at newbase.com.au http://www.newbase.com.au From: Jon Edwards Sent: Friday, 7 April 2017 2:56 PM To: 'gpfsug-discuss at spectrumscale.org' Cc: 'Andrew Beattie' Subject: Spectrum scale sending cluster traffic across the management network Hi All, Just getting started with spectrum scale, Just wondering if anyone has come across the issue where when doing a mmcrfs or mmdelfs you get the error Failed to connect to file system daemon: Connection timed out mmdelfs: tsdelfs failed. mmdelfs: Command failed. Examine previous error messages to determine cause. When viewing the logs in /var/mmfs/gen/mmfslog on a node other than the one I am running the command on, I get: 2017-04-07_14:03:13.354+1000: [N] Filtered log entry: 'connect to node 192.168.0.1:1191' occurred 10 times between 2017-04-07_11:38:19.058+1000 and 2017-04-07_11:54:58.649+1000 192.168.0.0/24 in this case is the management network configured on eth0 of all the nodes. It is failing because port 1191 is not allowed on this interface. The DNS and hostname for each node resolves to a dedicated cluster network, let's say 10.0.0.0/24 (ETH1). For some reason when I run the mmcrfs or mmdelfs it tries to talk back over the management network instead of the cluster network, which fails to connect due to the firewall blocking cluster traffic over management. Anyone seen this before? Kind Regards, Jon Edwards Senior Systems Engineer NewBase Email: jon.edwards at newbase.com.au Ph: + 61 7 3216 0776 Fax: + 61 7 3216 0779 http://www.newbase.com.au Opinions contained in this e-mail do not necessarily reflect the opinions of NewBase Computer Services Pty Ltd. This e-mail is for the exclusive use of the addressee and should not be disseminated further or copied without permission of the sender. If you have received this message in error, please immediately notify the sender and delete the message from your computer. -------------- next part -------------- An HTML attachment was scrubbed... URL:
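For anyone applying the same fix, a quick sanity check after the mmchconfig (the node and network names are whatever your cluster uses) is to confirm the setting and then look at which addresses the daemon connections actually use:

    # confirm the subnets setting is in place
    mmlsconfig subnets

    # after restarting GPFS on the affected nodes, list active daemon connections and their IP addresses
    mmdiag --network

The subnets setting only influences connections established after the daemon has been restarted, so connections opened before the restart may still show the old addresses.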
From knop at us.ibm.com Fri Apr 7 15:00:09 2017 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 7 Apr 2017 10:00:09 -0400 Subject: Re: [gpfsug-discuss] Spectrum Scale Encryption In-Reply-To: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> References: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> Message-ID: All, A few comments on the topics raised below. 1) All nodes that mount an encrypted file system, and also the nodes with management roles on the file system, will need access to the keys and must have the proper setup (RKM.conf, etc). Edward is correct that there was some change in behavior, introduced in 4.2.1. Before the change, a mount would fail unless RKM.conf is present on the node. In addition, once a policy with encryption rules was applied, nodes without the proper encryption setup would unmount the file system. With the change, the error gets delayed to when encrypted files are accessed. The change in behavior was introduced based on feedback that unmounting the file system was too drastic in that scenario. >> So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? All nodes which mount an encrypted file system should have the proper setup for encryption, even a node from which only unencrypted files are being accessed. 2) >> Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Correct. Data is not stored in the inode for encrypted files. On the other hand, since encryption metadata is stored as an extended attribute in the inode, 4K inodes are still recommended -- especially in cases where a more complicated encryption policy is used. 3) >> Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. Logs will fill with 'Key could not be fetched' errors. Using a backup key server is strongly recommended. While it's true that the files may still be accessed for a while if the key server becomes unreachable, this was not something to be counted on. First, because keys (MEKs) may expire at any time, requiring the key to be retrieved from the key server again. Second, because a key may be needed for a file that has not been cached before. 4) >> Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? Correct.
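To make that naming concrete, a hypothetical RKM.conf stanza and matching policy rules (the stanza name, key UUID, server name, paths and fileset pattern below are all invented for illustration) could look something like:

    rkmname3 {
        type = ISKLM
        kmipServerUri = tls://sklm01.example.com:5696
        keyStore = /var/mmfs/etc/RKMcerts/ISKLM.proj3.p12
        passphrase = a_password
        clientCertLabel = a_label
        tenantName = GPFS_Tenant_Proj3
    }

    RULE 'encRule1' ENCRYPTION 'E1' IS
        ALGO 'DEFAULTNISTSP800131A'
        KEYS('KEY-326a1906-be46-4983-a63e-29f005fb3a15:rkmname3')
    RULE 'encryptProjects' SET ENCRYPTION 'E1'
        WHERE FILESET_NAME LIKE 'proj%'

Here 'rkmname3' is the RkmId (the RKM.conf stanza name) and the KEY-... string stands in for the key UUID as shown in SKLM, so the KEYS('KeyId:RkmId') pair in the rule is what ties the two together.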
>> If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Correct. We'll review the documentation to ensure that the meaning of the RkmId in the examples is clear. 5) >> Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? I'll work on getting clarifications from the ISKLM folks on this aspect. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Wahl, Edward" To: gpfsug main discussion list Date: 04/06/2017 10:55 AM Subject: Re: [gpfsug-discuss] Spectrum Scale Encryption Sent by: gpfsug-discuss-bounces at spectrumscale.org This is rather dependant on SS version. So what used to happen before 4.2.2.* is that a client would be unable to mount the filesystem in question and would give an error in the mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount the file system and then give errors on file access instead. (just tested this on 4.2.2.3) I'll have to read through the changelogs looking for this one. Depending on your policy for encryption then, this might be exactly what you want, but I REALLY REALLY dislike this behaviour. To me this means clients can now mount an encrypted FS now and then fail during operation. If I get a client node that comes up improperly, user work will start, and it will fail with "Operation not permitted" errors on file access. I imagine my batch system could run through a massive amount of jobs on a bad client without anyone noticing immeadiately. Yet another thing we now have to monitor now I guess. *shrug* A couple other gotcha's we've seen with Encryption: Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. Logs will fill with Key could not be fetched. errors. Ed Wahl OSC ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Thursday, April 06, 2017 4:20 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale Encryption We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? 
According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.s cale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Fri Apr 7 15:58:29 2017 From: mweil at wustl.edu (Matt Weil) Date: Fri, 7 Apr 2017 09:58:29 -0500 Subject: [gpfsug-discuss] AFM gateways Message-ID: Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From vpuvvada at in.ibm.com Mon Apr 10 11:56:16 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Mon, 10 Apr 2017 16:26:16 +0530 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: gpfsug main discussion list Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? 
Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Sandra.McLaughlin at astrazeneca.com Mon Apr 10 12:20:53 2017 From: Sandra.McLaughlin at astrazeneca.com (McLaughlin, Sandra M) Date: Mon, 10 Apr 2017 11:20:53 +0000 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn't do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. 
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Christian.Fey at sva.de Mon Apr 10 17:04:31 2017 From: Christian.Fey at sva.de (Fey, Christian) Date: Mon, 10 Apr 2017 16:04:31 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: Message-ID: <455e54150cd04cd8808619acbf7d8d2b@sva.de> Hi, I'm just dealing with a maybe similar issue that also seems to be related to the output of "tsctl shownodes up" (before CES I actually never had to deal with this command). In my case the output of "mmlscluster", for example, shows the nodes like "node1.acme.local", but in "tsctl shownodes up" they are displayed as "node1.acme.local.acme.local", for example. This may be what causes a fresh CES implementation in an existing GPFS cluster to also not spread IP addresses. It instead loops in the same way as it did in your case @jonathon. I think it tries to search for "node1.acme.local" but doesn't find it since tsctl shows it with a doubled suffix. Can anyone explain where "tsctl shownodes up" reads its data from? Additionally, does anyone have an idea why the DNS suffix is doubled? Kind regards Christian -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathon A Anderson Sent: Thursday, 23 March 2017 16:02 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Attention! The sender address may be forged. Please check the plausibility of the email and be especially careful with any attachments and links it contains. If in doubt, contact the CIT at cit at sva.de or 06122 536 350. (Keyword: DKIM test failed) ---------------------------------------------------------------------------------------------------------------- Thanks! I'm looking forward to upgrading our CES nodes and resuming work on the project. ~jonathon On 3/23/17, 8:24 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Olaf Weiser" wrote: the issue is fixed, an APAR will be released soon - IV93100 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________________ Yeah... depending on the #nodes you're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ...
you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von:"Simon Thompson (Research Computing - IT Services)" An:"gpfsug main discussion list" Datum:Di. 31.01.2017 21:07Betreff:Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. 
[root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? 
by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
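[Aside: a minimal diagnostic sketch along the lines discussed in this thread, not taken from the original posts. It compares the CES view of the cluster with the full member list to spot the truncated "tsctl shownodes up" symptom; the commands are standard GPFS/CES utilities, but the awk pattern for counting mmlscluster members is an assumption and may need adjusting to your output format.]
---
#!/bin/bash
# Sketch: quick checks when CES addresses stay unassigned.
PATH=$PATH:/usr/lpp/mmfs/bin

mmces node list        # which nodes are CES-enabled, and their state flags
mmces address list     # which CES addresses exist, and where they are assigned

# Compare the daemon's view of "up" nodes against the full member list; a count
# far below the cluster size, or a last hostname that is cut off mid-name, is
# the truncation symptom investigated earlier in this thread.
tsctl shownodes up | tr ',' '\n' | wc -l
tsctl shownodes up | tr ',' '\n' | tail -n 1
mmlscluster | awk '/^ +[0-9]+ /{n++} END{print n, "nodes listed by mmlscluster"}'
---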
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From service at metamodul.com Mon Apr 10 17:47:41 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 10 Apr 2017 18:47:41 +0200 (CEST) Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network Message-ID: <788130355.197989.1491842861235@email.1und1.de> An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Apr 10 17:58:36 2017 From: eric.wonderley at vt.edu (J. 
Eric Wonderley) Date: Mon, 10 Apr 2017 12:58:36 -0400 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: <788130355.197989.1491842861235@email.1und1.de> References: <788130355.197989.1491842861235@email.1und1.de> Message-ID: 1) You want more than one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster, as do the quorum nodes. 2) No. The admin network is for intra-cluster communications... not inter-cluster (between clusters). The daemon interface (port 1191) is used for communications between clusters. I think there is little benefit gained by having a designated admin network... maybe someone can point out the benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers wrote: > My understanding of the GPFS networks is not quite clear. > > For a GPFS setup I would like to use 2 networks: > > 1 Daemon (data) network using port 1191, using for example 10.1.1.0/24 > > 2 Admin network using for example the 192.168.1.0/24 network > > Questions > > 1) Thus in a 2+1 cluster ( 2 GPFS servers + 1 quorum server ) config - > does the tiebreaker node need to have access to the daemon (data) 10.1.1. > network, or is it sufficient for the tiebreaker node to be configured as > part of the admin 192.168.1 network ? > > 2) Does a remote cluster need access to the GPFS admin 192.168.1 > network, or is it sufficient for the remote cluster to access the 10.1.1 > network ? If so I assume that remote cluster commands and ping to/from the > remote cluster go via the daemon network ? > > Note: > > I am aware of and have read https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview > > -- > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > Süderstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 > Mail: service at metamodul.com From laurence at qsplace.co.uk Mon Apr 10 18:13:08 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 10 Apr 2017 18:13:08 +0100 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de> Message-ID: <3a8f72c6-407a-0f4d-cf3c-f4698ca7b8e5@qsplace.co.uk> All nodes in a GPFS cluster need to be able to communicate over the data and admin networks, with the exception of remote clusters, which can have their own separate admin network (for the cluster they are a member of) but still require communications over the daemon network. The networks can be routed and on different subnets; however, each member of the cluster will need to be able to communicate with every other member. With this in mind: 1) The quorum node will need to be accessible on both the 10.1.1.0/24 and 192.168.1.0/24 networks; however, again, the network that the quorum node is on could be routed. 2) Remote clusters don't need access to the home cluster's admin network, as they will use their own cluster's admin network.
As Eric has mentioned I would double check your 2+1 cluster suggestion, do you mean 2 x Servers with NSD's (with a quorum role) and 1 quorum node without NSD's? which gives you 3 quorum, or are you only going to have 1 quorum? If the latter that I would suggest using all 3 servers for quorum as they should be licensed as GPFS servers anyway due to their roles. -- Lauz On 10/04/2017 17:58, J. Eric Wonderley wrote: > 1) You want more that one quorum node on your server cluster. The > non-quorum node does need a daemon network interface exposed to the > client cluster as does the quorum nodes. > > 2) No. Admin network is for intra cluster communications...not inter > cluster(between clusters). Daemon interface(port 1191) is used for > communications between clusters. I think there is little benefit > gained by having designated an admin network...maybe someone can point > out benefits of an admin network. > > > > Eric Wonderley > > On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > > wrote: > > My understanding of the GPFS networks is not quite clear. > > For an GPFS setup i would like to use 2 Networks > > 1 Daemon (data) network using port 1191 using for example. > 10.1.1.0/24 > > 2 Admin Network using for example: 192.168.1.0/24 > network > > Questions > > 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) > Config - Does the Tiebreaker Node needs to have access to the > daemon(data) 10.1.1. network or is it sufficient for the > tiebreaker node to be configured as part of the admin 192.168.1 > network ? > > 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 > network or is it sufficient for the remote cluster to access the > 10.1.1 network ? If so i assume that remotecluster commands and > ping to/from remote cluster are going via the Daemon network ? > > Note: > > I am aware and read > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview > > > -- > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 > Mail: service at metamodul.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Apr 10 18:26:42 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 10 Apr 2017 17:26:42 +0000 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de>, Message-ID: If you have network congestion, then a separate admin network is of benefit. Maybe less important if you have 10GbE networks, but if (for example), you normally rely on IB to talk data, and gpfs fails back to the Ethernet (which may be only 1GbE), then you may have cluster issues, for example missing gpfs pings. Having a separate physical admin network can protect you from this. Having been bitten by this several years back, it's a good idea IMHO to have a separate admin network. 
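[Aside: a minimal sketch, not from the original posts, of how a separate admin network can be designated per node with mmchnode. It assumes each node has a second hostname such as nsd1-admin resolving onto the admin subnet; those names are illustrative only.]
---
# Sketch: point admin (command) traffic at a dedicated interface, leaving the
# daemon (data) interface untouched. Depending on the release, mmchnode may
# require GPFS to be down on the nodes being changed -- check the man page.
mmchnode --admin-interface=nsd1-admin -N nsd1
mmchnode --admin-interface=nsd2-admin -N nsd2
mmchnode --admin-interface=quorum1-admin -N quorum1

# Verify: mmlscluster should now show different "Admin node name" and
# "Daemon node name" values for each node.
mmlscluster
---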
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of J. Eric Wonderley [eric.wonderley at vt.edu] Sent: 10 April 2017 17:58 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > wrote: My understanding of the GPFS networks is not quite clear. For an GPFS setup i would like to use 2 Networks 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24 2 Admin Network using for example: 192.168.1.0/24 network Questions 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. network or is it sufficient for the tiebreaker node to be configured as part of the admin 192.168.1 network ? 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 network or is it sufficient for the remote cluster to access the 10.1.1 network ? If so i assume that remotecluster commands and ping to/from remote cluster are going via the Daemon network ? Note: I am aware and read https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From service at metamodul.com Mon Apr 10 18:44:47 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 10 Apr 2017 19:44:47 +0200 (CEST) Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de>, Message-ID: <795203366.199195.1491846287405@email.1und1.de> An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Apr 10 19:02:30 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 10 Apr 2017 21:02:30 +0300 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: <795203366.199195.1491846287405@email.1und1.de> References: <788130355.197989.1491842861235@email.1und1.de>, <795203366.199195.1491846287405@email.1und1.de> Message-ID: Hi Out of curiosity. Are you using Failure groups and doing replication of data/metadata too? If you you do need to deal with the file system descriptors as well on the 3rd node. Thanks From: Hans-Joachim Ehlers To: gpfsug main discussion list Date: 10/04/2017 20:44 Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry for not being clear. 
The setup is of course a 3 Node Cluster where each node is a quorum node - 2 NSD Server and 1 TieBreaker/Quorum Buster node. For me it was not clear if the Tiebreaker/Quorum Buster node - which does nothing in terms of data serving - must be part of the daemon/data network or not. So i get the understanding that a Tiebreaker Node must be also part of the Daemon network. Thx a lot to all Hajo "Simon Thompson (IT Research Support)" hat am 10. April 2017 um 19:26 geschrieben: If you have network congestion, then a separate admin network is of benefit. Maybe less important if you have 10GbE networks, but if (for example), you normally rely on IB to talk data, and gpfs fails back to the Ethernet (which may be only 1GbE), then you may have cluster issues, for example missing gpfs pings. Having a separate physical admin network can protect you from this. Having been bitten by this several years back, it's a good idea IMHO to have a separate admin network. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of J. Eric Wonderley [eric.wonderley at vt.edu] Sent: 10 April 2017 17:58 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > wrote: My understanding of the GPFS networks is not quite clear. For an GPFS setup i would like to use 2 Networks 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24< http://10.1.1.0/24> 2 Admin Network using for example: 192.168.1.0/24 network Questions 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. network or is it sufficient for the tiebreaker node to be configured as part of the admin 192.168.1 network ? 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 network or is it sufficient for the remote cluster to access the 10.1.1 network ? If so i assume that remotecluster commands and ping to/from remote cluster are going via the Daemon network ? Note: I am aware and read https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Apr 10 21:15:38 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 10 Apr 2017 15:15:38 -0500 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. 
From mitsugi at linux.vnet.ibm.com Tue Apr 11 05:29:16 2017 From: mitsugi at linux.vnet.ibm.com (Masanori Mitsugi) Date: Tue, 11 Apr 2017 13:29:16 +0900 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Message-ID: Hello, Does anyone have experience running mmapplypolicy against a billion files for ILM/HSM? Currently I'm planning/designing * 1 Scale filesystem (5-10 PB) * 10-20 filesets which include 1 billion files each And our biggest concern is "How long does it take for an mmapplypolicy policy scan against a billion files?" I know it depends on how the policy is written, but I have no experience with policy scans at the billion-file scale, so I'd like to know the order of time (minutes/hours/days...). It would be helpful if anyone with experience scanning such a large number of files could let me know any considerations or points for policy design. -- Masanori Mitsugi mitsugi at linux.vnet.ibm.com From zgiles at gmail.com Tue Apr 11 05:49:10 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 11 Apr 2017 00:49:10 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash (or more!) set up in a good design: lots of SAS, well-balanced RAID that can consume the flash fully, tuned for IOPS, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n set to reasonable values (number of cores on the servers); -A to ~1000. * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well-tuned Infiniband-attached nodes at it using -N. If they're the same as the NSD servers serving the flash, even better. You should be able to do 1B in 5-30 minutes depending on the idiosyncrasies of the above choices. Even 60 minutes isn't bad and quite respectable if less gear is used or if the system is busy while the policy is running. Parallel metadata, it's a beautiful thing.
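[Aside: not part of the original posts -- pulling Zachary's flag list into one command line gives something like the sketch below. The file system path, policy file, work directory and thread counts are placeholders to be tuned for the actual hardware; -N nsdNodes assumes the scan should run on the NSD servers that see the metadata flash directly.]
---
# Sketch: parallel policy scan in test mode, following the tuning hints above.
mmapplypolicy /gpfs/fs1 \
    -P /gpfs/fs1/policies/list_rules.pol \
    -I test \
    -N nsdNodes \
    -g /gpfs/fs1/tmp/policy-workdir \
    --choice-algorithm fast \
    -a 16 -m 16 -n 24 \
    -A 1000 \
    -L 1
---
Once -I test shows an acceptable scan time, the same options carry over to the real run.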
On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com From olaf.weiser at de.ibm.com Tue Apr 11 07:51:48 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 11 Apr 2017 08:51:48 +0200 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <455e54150cd04cd8808619acbf7d8d2b@sva.de> References: <455e54150cd04cd8808619acbf7d8d2b@sva.de> Message-ID: An HTML attachment was scrubbed... URL: From ckrafft at de.ibm.com Tue Apr 11 09:24:35 2017 From: ckrafft at de.ibm.com (Christoph Krafft) Date: Tue, 11 Apr 2017 10:24:35 +0200 Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? Message-ID: Hi folks, there is a list of storage devices that support SCSI-3 PR in the GPFS FAQ Doc (see Answer 4.5). https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html#scsi3 Since this list contains IBM V-model storage subsystems that include Storage Virtualization - I was wondering if SVC / Spectrum Virtualize can also support SCSI-3 PR (although not explicitly on the list)? Any hints and help is warmla welcome - thank you in advance. Mit freundlichen Gr??en / Sincerely Christoph Krafft Client Technical Specialist - Power Systems, IBM Systems Certified IT Specialist @ The Open Group Phone: +49 (0) 7034 643 2171 IBM Deutschland GmbH Mobile: +49 (0) 160 97 81 86 12 Am Weiher 24 Email: ckrafft at de.ibm.com 65451 Kelsterbach Germany IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Nicole Reimer, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Stefan Lutz Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1A788784.gif Type: image/gif Size: 1851 bytes Desc: not available URL: From p.childs at qmul.ac.uk Tue Apr 11 09:57:44 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Tue, 11 Apr 2017 08:57:44 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. 
The older one, which was upgraded from GPFS 3.5, works fine: creating a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster, can take up to 30 seconds to create a directory but usually takes less than a second. The longer directory creates usually happen on busy nodes that have not used the new storage in a while (it's new, so we've not moved much of the data over yet), but it can also happen randomly anywhere, including from the NSD servers themselves (times of 3-4 seconds from the NSD servers have been seen on a single directory create). We've been pointed at the network and asked to check all network settings, and it's been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. It's a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However, as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one (although the delay is worst on the old gpfs cluster), so I'm really playing spot the difference, and the network is not really an obvious difference. It's been suggested to look at a trace when it occurs, but as it's difficult to recreate, collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London From jonathan at buzzard.me.uk Tue Apr 11 11:21:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Tue, 11 Apr 2017 11:21:05 +0100 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <1491906065.4102.87.camel@buzzard.me.uk> On Tue, 2017-04-11 at 00:49 -0400, Zachary Giles wrote: [SNIP] > * Then throw ~8 well tuned Infiniband attached nodes at it using -N, > If they're the same as the NSD servers serving the flash, even better. > Exactly how much are you going to gain from Infiniband over 40Gbps or even 100Gbps Ethernet? Not a lot I would have thought. Even with flash, all your latency is going to be in the flash, not the Ethernet. Unless you have a compute cluster and need Infiniband for the MPI traffic, it is surely better to stick to Ethernet. Infiniband is rather esoteric, what I call a minority sport best avoided if at all possible. Even if you have an Infiniband fabric, I would argue that given current core counts and price points for 10Gbps Ethernet, you are actually better off keeping your storage traffic on the Ethernet, and reserving the Infiniband for MPI duties. That is 10Gbps Ethernet to the compute nodes and 40/100Gbps Ethernet on the storage nodes. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From zgiles at gmail.com Tue Apr 11 12:50:26 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 11 Apr 2017 07:50:26 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: <1491906065.4102.87.camel@buzzard.me.uk> References: <1491906065.4102.87.camel@buzzard.me.uk> Message-ID: Yeah, that can be true. I was just trying to show the size/shape that can achieve this. There's a good chance 10G or 40G ethernet would yield similar results, especially if you're running the policy on the NSD servers.
On Tue, Apr 11, 2017 at 6:21 AM, Jonathan Buzzard wrote: > On Tue, 2017-04-11 at 00:49 -0400, Zachary Giles wrote: > > [SNIP] > >> * Then throw ~8 well tuned Infiniband attached nodes at it using -N, >> If they're the same as the NSD servers serving the flash, even better. >> > > Exactly how much are you going to gain from Infiniband over 40Gbps or > even 100Gbps Ethernet? Not a lot I would have thought. Even with flash > all your latency is going to be in the flash not the Ethernet. > > Unless you have a compute cluster and need Infiniband for the MPI > traffic, it is surely better to stick to Ethernet. Infiniband is rather > esoteric, what I call a minority sport best avoided if at all possible. > > Even if you have an Infiniband fabric, I would argue that give current > core counts and price points for 10Gbps Ethernet, that actually you are > better off keeping your storage traffic on the Ethernet, and reserving > the Infiniband for MPI duties. That is 10Gbps Ethernet to the compute > nodes and 40/100Gbps Ethernet on the storage nodes. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com From stockf at us.ibm.com Tue Apr 11 12:53:33 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 11 Apr 2017 07:53:33 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: As Zachary noted the location of your metadata is the key and for the scanning you have planned flash is necessary. If you have the resources you may consider setting up your flash in a mirrored RAID configuration (RAID1/RAID10) and have GPFS only keep one copy of metadata since the underlying storage is replicating it via the RAID. This should improve metadata write performance but likely has little impact on your scanning, assuming you are just reading through the metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Zachary Giles To: gpfsug main discussion list Date: 04/11/2017 12:49 AM Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Sent by: gpfsug-discuss-bounces at spectrumscale.org It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. 
On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Apr 11 16:18:01 2017 From: chair at spectrumscale.org (Spectrum Scale UG Chair (Simon Thompson)) Date: Tue, 11 Apr 2017 16:18:01 +0100 Subject: [gpfsug-discuss] May Meeting Registration Message-ID: Hi all, Just a reminder that the next UK user group meeting is taking place on 9th/10th May. If you are planning on attending, please do register at: https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi stration-32113696932 (or try https://goo.gl/tRptru ) As last year, this is a 2 day event and we're planning a fun evening event on the Tuesday night at Manchester Museum of Science. Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, OCF and Seagate for helping make this happen! We also still have some customer talk slots to fill, so please let me know if you are interested in speaking. Thanks Simon From bbanister at jumptrading.com Tue Apr 11 16:29:25 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:29:25 +0000 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <1e86aa0c2e4344f19cb5eedf8f03efa9@jumptrading.com> A word of caution, be careful about where you run this kind of policy scan as the sort process can consume all memory on your hosts and that could lead to issues with the OS deciding to kill off GPFS or other similar bad things can occur. I recommend restricting the ILM policy scan to a subset of servers, no quorum nodes, and ensuring at least one NSD server is available for all NSDs in the file system(s). Watch the memory consumption on your nodes during the sort operations to see if you need to tune that down in the mmapplypolicy options. Hope that helps, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: Tuesday, April 11, 2017 6:54 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM As Zachary noted the location of your metadata is the key and for the scanning you have planned flash is necessary. 
If you have the resources you may consider setting up your flash in a mirrored RAID configuration (RAID1/RAID10) and have GPFS only keep one copy of metadata since the underlying storage is replicating it via the RAID. This should improve metadata write performance but likely has little impact on your scanning, assuming you are just reading through the metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Zachary Giles > To: gpfsug main discussion list > Date: 04/11/2017 12:49 AM Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi > wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From k.leach at ed.ac.uk Tue Apr 11 16:32:41 2017 From: k.leach at ed.ac.uk (Kieran Leach) Date: Tue, 11 Apr 2017 16:32:41 +0100 Subject: [gpfsug-discuss] May Meeting Registration In-Reply-To: References: Message-ID: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> Hi Simon, would you be interested in a customer talk about the RDF (http://rdf.ac.uk/). We manage the RDF at EPCC, providing a 23PB filestore to complement ARCHER (the national research HPC service) and other UK Research HPC services. This is of course a GPFS system. If you've any questions or want more info please let me know but I thought I'd get an email off to you while I remember. Cheers Kieran On 11/04/17 16:18, Spectrum Scale UG Chair (Simon Thompson) wrote: > Hi all, > > Just a reminder that the next UK user group meeting is taking place on > 9th/10th May. If you are planning on attending, please do register at: > > https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi > stration-32113696932 > > > (or try https://goo.gl/tRptru ) > > As last year, this is a 2 day event and we're planning a fun evening event > on the Tuesday night at Manchester Museum of Science. > > Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, > OCF and Seagate for helping make this happen! > > We also still have some customer talk slots to fill, so please let me know > if you are interested in speaking. > > Thanks > > Simon > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From k.leach at ed.ac.uk Tue Apr 11 16:33:29 2017 From: k.leach at ed.ac.uk (Kieran Leach) Date: Tue, 11 Apr 2017 16:33:29 +0100 Subject: [gpfsug-discuss] May Meeting Registration In-Reply-To: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> References: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> Message-ID: Apologies all, wrong reply button. Cheers Kieran On 11/04/17 16:32, Kieran Leach wrote: > Hi Simon, > would you be interested in a customer talk about the RDF > (http://rdf.ac.uk/). We manage the RDF at EPCC, providing a 23PB > filestore to complement ARCHER (the national research HPC service) and > other UK Research HPC services. This is of course a GPFS system. If > you've any questions or want more info please let me know but I > thought I'd get an email off to you while I remember. > > Cheers > > Kieran > > On 11/04/17 16:18, Spectrum Scale UG Chair (Simon Thompson) wrote: >> Hi all, >> >> Just a reminder that the next UK user group meeting is taking place on >> 9th/10th May. If you are planning on attending, please do register at: >> >> https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi >> >> stration-32113696932 >> >> >> (or try https://goo.gl/tRptru ) >> >> As last year, this is a 2 day event and we're planning a fun evening >> event >> on the Tuesday night at Manchester Museum of Science. >> >> Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, >> OCF and Seagate for helping make this happen! 
>> >> We also still have some customer talk slots to fill, so please let me >> know >> if you are interested in speaking. >> >> Thanks >> >> Simon >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From makaplan at us.ibm.com Tue Apr 11 16:36:47 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Apr 2017 11:36:47 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: As primary developer of mmapplypolicy, please allow me to comment: 1) Fast access to metadata in system pool is most important, as several have commented on. These days SSD is the favorite, but you can still go with "spinning" media. If you do go with disks, it's extremely important to spread your metadata over independent disk "arms" -- so you can have many concurrent seeks in progress at the same time. IOW, if there is a virtualization/mapping layer, watchout that your logical disks don't get mapped to the same physical disk. 2) Crucial to use both -g and -N :: -g /gpfs-not-necessarily-the-same-fs-as-Im-scanning/tempdir and -N several-nodes-that-will-be-accessing-the-system-pool 3a) If at all possible, encourage your data and application designers to "pack" their directories with lots of files. Keep in mind that, mmapplypolicy will read every directory. The more directories, the more seeks, more time spent waiting for IO. OTOH, in more typical Unix/Linux usage, we tend to low average number of files per directory. 3b) As admin, you may not be able to change your data design to pack hundreds of files per directory, BUT you can make sure you are running a sufficiently modern release of Spectrum Scale that supports "data in inode" -- "Data in inode" also means "directory entries in inode" -- which means practically any small directory, up to a few hundred files, will fit in an an inode -- which means mmapplypolicy can read small directories with one seek, instead of two. (Someone will please remind us of the release number that first supported "directories in inode".) 4) Sorry, Fred, but the recommendation to use RAID mirroring of metadata on SSD, is not necessarily, important for metadata scanning. In fact it may work against you. If you use GPFS replication of metadata - that can work for you -- since then GPFS can direct read operations to either copy, preferring a locally attached copy, depending on how storage is attached to node, etc, etc. Choice of how to replicate metadata - either using GPFS replication or the RAID controller - is probably best made based on reliability and recoverability requirements. 5) YMMV - We'd love to hear/see your performance results for mmapplypolicy, especially if they're good. Even if they're bad, come back here for more tuning tips! -- marc of Spectrum Scale (ne GPFS) -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Tue Apr 11 16:51:56 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:51:56 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). 
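[Aside: a rough sketch, not from the original posts, of one way to apply those tools to the intermittent slow creates -- time mkdir in a loop and dump the GPFS waiters whenever a create crosses a threshold. The paths and the 2-second threshold are arbitrary placeholders.]
---
#!/bin/bash
# Sketch: catch slow directory creates and record what GPFS was waiting on.
target=/gpfs/newfs/tmp/mkdirtest   # placeholder path on the affected filesystem
threshold=2                        # seconds; placeholder
log=/tmp/slow-mkdir.log
mkdir -p "$target"

for i in $(seq 1 1000); do
    start=$(date +%s.%N)
    mkdir "$target/dir.$i"
    elapsed=$(echo "$(date +%s.%N) - $start" | bc)
    if (( $(echo "$elapsed > $threshold" | bc -l) )); then
        echo "$(date): dir.$i took ${elapsed}s" >> "$log"
        # Long waiters at this moment usually point at the node, disk or RPC involved.
        /usr/lpp/mmfs/bin/mmdiag --waiters >> "$log"
        /usr/lpp/mmfs/bin/mmdiag --network >> "$log"
    fi
    sleep 1
done
---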
I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From S.J.Thompson at bham.ac.uk Tue Apr 11 16:55:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 11 Apr 2017 15:55:35 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. 
We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. 
This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jonathon.anderson at colorado.edu Tue Apr 11 16:56:56 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Tue, 11 Apr 2017 15:56:56 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: Bryan, That looks like a really useful set of presentation slides! Thanks for sharing! Which one in particular is the one Yuri gave that you?re referring to? ~jonathon On 4/11/17, 9:51 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bbanister at jumptrading.com Tue Apr 11 16:59:51 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:59:51 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Problem Determination and GPFS Internals. My security group won't let me go to the google docs site from my work compute... I'm sure there is malicious malware on that site!! j/k, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathon A Anderson Sent: Tuesday, April 11, 2017 10:57 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories Bryan, That looks like a really useful set of presentation slides! Thanks for sharing! Which one in particular is the one Yuri gave that you?re referring to? ~jonathon On 4/11/17, 9:51 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. 
However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From p.childs at qmul.ac.uk Tue Apr 11 20:35:40 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Tue, 11 Apr 2017 19:35:40 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Can you remember what version you were running? Don't worry if you can't remember. It looks like ibm may have withdrawn 4.2.1 from fix central and wish to forget its existences. Never a good sign, 4.2.0, 4.2.2, 4.2.3 and even 3.5, so maybe upgrading is worth a try. I've looked at all the standard trouble shouting guides and got nowhere hence why I asked. But another set of slides always helps. Thank-you for the help, still head scratching.... Which only makes the issue more random. 
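(For anyone else following along: a quick way to double-check what each node is actually running before and after a version bump - assuming an RPM-based install and that mmdsh is usable in your cluster - is something like:

mmdsh -N all 'rpm -q gpfs.base'   # installed package level on every node
mmdiag --version                  # build level of the running daemon on one node

so a node still running old code after a rolling upgrade stands out straight away.)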
Peter Childs Research Storage ITS Research and Teaching Support Queen Mary, University of London ---- Simon Thompson (IT Research Support) wrote ---- We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. 
Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mitsugi at linux.vnet.ibm.com Wed Apr 12 02:51:03 2017 From: mitsugi at linux.vnet.ibm.com (Masanori Mitsugi) Date: Wed, 12 Apr 2017 10:51:03 +0900 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <0851d194-088e-d93a-303d-ceb0de3dbaa8@linux.vnet.ibm.com> Marc, Zachary, Fred, Bryan, Thank you for providing great advice! It's pretty useful for me to tune our policy with best performance. As for "directories in inode", we plan to use latest version, so I believe we can leverage this function. -- Masanori Mitsugi mitsugi at linux.vnet.ibm.com From vpuvvada at in.ibm.com Wed Apr 12 10:53:25 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Wed, 12 Apr 2017 15:23:25 +0530 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> References: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Message-ID: Gateway node requires server license. ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: Date: 04/11/2017 01:46 AM Subject: Re: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. 
Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: gpfsug main discussion list Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Wed Apr 12 15:52:48 2017 From: mweil at wustl.edu (Matt Weil) Date: Wed, 12 Apr 2017 09:52:48 -0500 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Message-ID: yes it tells you that when you attempt to make the node a gateway and is does not have a server license designation. On 4/12/17 4:53 AM, Venkateswara R Puvvada wrote: Gateway node requires server license. ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: Date: 04/11/2017 01:46 AM Subject: Re: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? 
On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. 
For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Apr 12 22:01:45 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 12 Apr 2017 14:01:45 -0700 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> On 4/11/17 8:36 AM, Marc A Kaplan wrote: > > 5) YMMV - We'd love to hear/see your performance results for > mmapplypolicy, especially if they're good. Even if they're bad, come > back here for more tuning tips! I have a filesystem that currently has 267919775 (roughly quarter billion, 250 million) used inodes. The metadata is on SSD behind a DDN 12K. We do use 4K inodes, and files smaller than 4K fit into the inodes. Here is the command I use to apply a policy: mmapplypolicy gsfs0 -P policy.txt -N scg-gs0,scg-gs1,scg-gs2,scg-gs3,scg-gs4,scg-gs5,scg-gs6,scg-gs7 -g /srv/gsfs0/admin_stuff/ -I test -B 500 -A 61 -a 4 That takes approximately 10 minutes to do the whole scan. The "-B 500 -A 61 -a 4" numbers we determined just by trying different values with the same policy file and seeing the resulting scan duration. 10mins is short enough to do almost "interactive" type of file list policies and look at the results. E.g. list all files over 1TB in size. This was a couple of years ago, probably on a different GPFS version, but on same storage and NSD hardware, so now I just copy those parameters. You should probably not just copy them but try some other values yourself. 
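(For those "interactive" list-type runs, a minimal policy file can be as small as the sketch below - just the general shape, with made-up rule and list names, and note that KB_ALLOCATED is in KB, so 1 TB is roughly 2^30 KB:

RULE EXTERNAL LIST 'over1TB' EXEC ''
RULE 'bigfiles' LIST 'over1TB' WHERE KB_ALLOCATED > 1073741824

Run with "-I defer -f /some/prefix" and the matching pathnames are simply written out to list files instead of being handed to a script.)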
Regards, Alex From makaplan at us.ibm.com Wed Apr 12 23:43:20 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 12 Apr 2017 18:43:20 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> References: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> Message-ID: >>>Here is the command I use to apply a policy: mmapplypolicy gsfs0 -P policy.txt -N scg-gs0,scg-gs1,scg-gs2,scg-gs3,scg-gs4,scg-gs5,scg-gs6,scg-gs7 -g /srv/gsfs0/admin_stuff/ -I test -B 500 -A 61 -a 4 That takes approximately 10 minutes to do the whole scan. The "-B 500 -A 61 -a 4" numbers we determined just by trying different values with the same policy file and seeing the resulting scan duration. <<< That's pretty good. BUT, FYI, the -A number-of-buckets parameter should be scaled with the total number of files you expect to find in the argument filesystem or directory. If you don't set it the command will default to number-of-inodes-allocated / million, but capped at a minimum of 7 and a maximum of 4096. -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.childs at qmul.ac.uk Thu Apr 13 11:35:19 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 13 Apr 2017 10:35:19 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: , Message-ID: After a load more debugging, and switching off the quota's the issue looks to be quota related. in that the issue has gone away since I switched quota's off. I will need to switch them back on, but at least we know the issue is not the network and is likely to be fixed by upgrading..... Peter Childs ITS Research Infrastructure Queen Mary, University of London ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Peter Childs Sent: Tuesday, April 11, 2017 8:35:40 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories Can you remember what version you were running? Don't worry if you can't remember. It looks like ibm may have withdrawn 4.2.1 from fix central and wish to forget its existences. Never a good sign, 4.2.0, 4.2.2, 4.2.3 and even 3.5, so maybe upgrading is worth a try. I've looked at all the standard trouble shouting guides and got nowhere hence why I asked. But another set of slides always helps. Thank-you for the help, still head scratching.... Which only makes the issue more random. Peter Childs Research Storage ITS Research and Teaching Support Queen Mary, University of London ---- Simon Thompson (IT Research Support) wrote ---- We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). 
I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. 
>_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Fri Apr 14 08:34:06 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 14 Apr 2017 15:34:06 +0800 Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? In-Reply-To: References: Message-ID: If you can run "mmchconfig usePersistentReserve=yes" successfully, then it is supported; the command checks the compatibility for you. You can also use "tsprinquiry device" (no /dev prefix) to check the vendor output. Thanks. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWorks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Christoph Krafft" To: "gpfsug main discussion list" Cc: Achim Christ , Petra Christ Date: 04/11/2017 04:25 PM Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi folks, there is a list of storage devices that support SCSI-3 PR in the GPFS FAQ Doc (see Answer 4.5). https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html#scsi3 Since this list contains IBM V-model storage subsystems that include Storage Virtualization - I was wondering if SVC / Spectrum Virtualize can also support SCSI-3 PR (although not explicitly on the list)? Any hints and help are warmly welcome - thank you in advance. Mit freundlichen Grüßen / Sincerely Christoph Krafft Client Technical Specialist - Power Systems, IBM Systems Certified IT Specialist @ The Open Group Phone: +49 (0) 7034 643 2171 IBM Deutschland GmbH Mobile: +49 (0) 160 97 81 86 12 Am Weiher 24 Email: ckrafft at de.ibm.com 65451 Kelsterbach Germany IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Geschäftsführung: Martina Koederitz (Vorsitzende), Nicole Reimer, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Stefan Lutz Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: 1A696179.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: 1A223532.gif Type: image/gif Size: 1851 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sun Apr 16 14:47:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sun, 16 Apr 2017 13:47:20 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Message-ID: Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ... I have an mmapplypolicy job that didn't migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke:
[I] Qos 'maintenance' configured as inf
[I] GPFS Current Data Pool Utilization in KB and %
Pool_Name KB_Occupied KB_Total Percent_Occupied
gpfs23capacity 55365193728 124983549952 44.297984614%
gpfs23data 166747037696 343753326592 48.507759721%
system 0 0 0.000000000% (no user data)
[I] 75142046 of 209715200 inodes used: 35.830520%.
[I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules.
RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584))
RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14)
And then the log shows it scanning all the directories and then says, "OK, here's what I'm going to do":
[I] Summary of Rule Applicability and File Choices:
Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule
0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.)
1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.)
[I] Filesystem objects with no applicable rules: 414911602.
[I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates;
Predicted Data Pool Utilization in KB and %:
Pool_Name KB_Occupied KB_Total Percent_Occupied
gpfs23capacity 122483878944 124983549952 97.999999993%
gpfs23data 104742360032 343753326592 30.470209865%
system 0 0 0.000000000% (no user data)
Notice that it says it's only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that's all it did:
[I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors.
And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full:
Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB)
eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%)
eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%)
------------- -------------------- -------------------
(pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%)
I don't understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance...
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Sun Apr 16 17:20:15 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Sun, 16 Apr 2017 16:20:15 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Message-ID: <252ABBB2-7E94-41F6-AD76-B6D836E5C916@nuance.com> I think the first thing I would do is turn up the ?-L? level to a large value (like ?6?) and see what it tells you about files that are being chosen and which ones aren?t being migrated and why. You could run it in test mode, write the output to a file and see what it says. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of "Buterbaugh, Kevin L" Reply-To: gpfsug main discussion list Date: Sunday, April 16, 2017 at 8:47 AM To: gpfsug main discussion list Subject: [EXTERNAL] [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Sun Apr 16 20:15:40 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sun, 16 Apr 2017 15:15:40 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! 
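(On the illplaced/illreplicated question, a quick spot check on a few of the files you expected to move is something like

mmlsattr -L /gpfs23/path/to/some/file

with the path obviously a placeholder - then look at the "storage pool name:" and "flags:" fields; "illplaced" there means the file's data has not yet been physically moved to the pool its placement or migration rule assigned.)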
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From makaplan at us.ibm.com Sun Apr 16 20:39:21 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sun, 16 Apr 2017 15:39:21 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Correction: So that's why it chooses to migrate "only" 67TB.... (67000 GB) -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Apr 17 16:24:02 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 17 Apr 2017 15:24:02 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? 
The one thing that has changed is that formerly I only ran the migration in one direction at a time ... i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that?

I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it's done the other filesystem migration kicks off. I don't like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-(

And yes, there is something else going on ... well, was going on - the network switch crash killed this too ... I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn't know about those files and that's fine ... I just don't understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn't do it ... wait, mmapplypolicy hasn't gone into politics, has it?!? ;-)

Thanks - and again, if I should open a PMR for this please let me know...

Kevin

On Apr 16, 2017, at 2:15 PM, Marc A Kaplan <makaplan at us.ibm.com> wrote:

Let's look at how mmapplypolicy does the reckoning. Before it starts, it sees your pools as:

[I] GPFS Current Data Pool Utilization in KB and %
Pool_Name           KB_Occupied      KB_Total     Percent_Occupied
gpfs23capacity      55365193728   124983549952    44.297984614%
gpfs23data         166747037696   343753326592    48.507759721%
system                        0              0     0.000000000% (no user data)
[I] 75142046 of 209715200 inodes used: 35.830520%.

Your rule says you want to migrate data to gpfs23capacity, up to 98% full:

RULE 'OldStuff'
  MIGRATE FROM POOL 'gpfs23data'
  TO POOL 'gpfs23capacity'
  LIMIT(98) WHERE ...

We scan your files and find and reckon...

[I] Summary of Rule Applicability and File Choices:
 Rule#   Hit_Cnt        KB_Hit     Chosen      KB_Chosen   KB_Ill  Rule
     0   5255960  237675081344    1868858    67355430720       0  RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.)

So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule) then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message.

Predicted Data Pool Utilization in KB and %:
Pool_Name           KB_Occupied      KB_Total     Percent_Occupied
gpfs23capacity     122483878944   124983549952    97.999999993%
gpfs23data         104742360032   343753326592    30.470209865%

So that's why it chooses to migrate "only" 67GB....

See?  Makes sense to me.

Questions:
Did you run with -I yes or -I defer ?

Were some of the files illreplicated or illplaced?

Did you give the cluster-wide space reckoning protocols time to see the changes?  mmdf is usually "behind" by some non-negligible amount of time.

What else is going on?
If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that!

Run it again!
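(A follow-on to the illreplicated/illplaced question: the KB_Ill column in the [I] Summary lines is where ill-placed data shows up in the policy report, and mmlsattr -L on an individual file lists it in the flags line. If a significant amount of data does turn out to be sitting in the wrong pool, one way to re-drive it outside of mmapplypolicy is a placement repair, sketched here and best run only when the extra I/O load is acceptable:

    mmrestripefs gpfs23 -p

which migrates ill-placed file data back to its assigned storage pool.)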
From: "Buterbaugh, Kevin L"
To: gpfsug main discussion list
Date: 04/16/2017 09:47 AM
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
Sent by: gpfsug-discuss-bounces at spectrumscale.org
________________________________

Hi All,

First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ... I have an mmapplypolicy job that didn't migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke:

[I] Qos 'maintenance' configured as inf
[I] GPFS Current Data Pool Utilization in KB and %
Pool_Name           KB_Occupied      KB_Total     Percent_Occupied
gpfs23capacity      55365193728   124983549952    44.297984614%
gpfs23data         166747037696   343753326592    48.507759721%
system                        0              0     0.000000000% (no user data)
[I] 75142046 of 209715200 inodes used: 35.830520%.
[I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy.
Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC
Parsed 2 policy rules.

RULE 'OldStuff'
  MIGRATE FROM POOL 'gpfs23data'
  TO POOL 'gpfs23capacity'
  LIMIT(98)
  WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584))

RULE 'INeedThatAfterAll'
  MIGRATE FROM POOL 'gpfs23capacity'
  TO POOL 'gpfs23data'
  LIMIT(75)
  WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14)

And then the log shows it scanning all the directories and then says, "OK, here's what I'm going to do":

[I] Summary of Rule Applicability and File Choices:
 Rule#   Hit_Cnt        KB_Hit     Chosen      KB_Chosen   KB_Ill  Rule
     0   5255960  237675081344    1868858    67355430720       0  RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.)
     1       611     236745504        611      236745504       0  RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.)

[I] Filesystem objects with no applicable rules: 414911602.

[I] GPFS Policy Decisions and File Choice Totals:
 Chose to migrate 67592176224KB: 1869469 of 5256571 candidates;
Predicted Data Pool Utilization in KB and %:
Pool_Name           KB_Occupied      KB_Total     Percent_Occupied
gpfs23capacity     122483878944   124983549952    97.999999993%
gpfs23data         104742360032   343753326592    30.470209865%
system                        0              0     0.000000000% (no user data)

Notice that it says it's only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that's all it did:

[I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;
        0 'skipped' files and/or errors.

And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full:

Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB)
eon35Ansd       58.2T   35  No  Yes    29.54T ( 51%)    63.93G ( 0%)
eon35Dnsd       58.2T   35  No  Yes    29.54T ( 51%)    64.39G ( 0%)
                -------------          --------------------  -------------------
(pool total)    116.4T                 59.08T ( 51%)         128.3G ( 0%)

I don't understand why it only migrated a small subset of what it could / should have?

We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From chekh at stanford.edu  Mon Apr 17 19:49:12 2017
From: chekh at stanford.edu (Alex Chekholko)
Date: Mon, 17 Apr 2017 11:49:12 -0700
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To:
References:
Message-ID: <09e154ef-15ed-3217-db65-51e693e28faa@stanford.edu>

Hi Kevin,

IMHO, safe to just run it again. You can also run it with '-I test -L 6' again and look through the output. But I don't think you can "break" anything by having it scan and/or move data.

Can you post the full command line that you use to run it?

The behavior you describe is odd; you say it prints out the "files migrated successfully" message, but the files didn't actually get migrated?

Turn up the debug param and have it print every file as it is moving it or something.
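For example, a dry run along those lines might look something like the following (the node list and global work directory are placeholders for whatever the real invocation uses):

    mmapplypolicy gpfs23 -P ~/gpfs/gpfs23_migration.policy -I test -L 3 \
        -N nsd01,nsd02 -g /gpfs23/policytmp 2>&1 | tee /tmp/gpfs23.policy.test.out

With -I test nothing is actually migrated, and at -L 3 each candidate file is listed along with the rule it matched, so the resulting log can be searched afterwards to see exactly which files were and were not chosen.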
Regards, Alex On 4/17/17 8:24 AM, Buterbaugh, Kevin L wrote: > Hi Marc, > > I do understand what you?re saying about mmapplypolicy deciding it only > needed to move ~1.8 million files to fill the capacity pool to ~98% > full. However, it is now more than 24 hours since the mmapplypolicy > finished ?successfully? and: > > Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) > eon35Ansd 58.2T 35 No Yes 29.66T ( > 51%) 64.16G ( 0%) > eon35Dnsd 58.2T 35 No Yes 29.66T ( > 51%) 64.61G ( 0%) > ------------- > -------------------- ------------------- > (pool total) 116.4T 59.33T ( > 51%) 128.8G ( 0%) > > And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the > partially redacted command line: > > /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g another gpfs filesystem> -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy > -N some,list,of,NSD,server,nodes > > And here?s that policy file: > > define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) > define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) > > RULE 'OldStuff' > MIGRATE FROM POOL 'gpfs23data' > TO POOL 'gpfs23capacity' > LIMIT(98) > WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) > > RULE 'INeedThatAfterAll' > MIGRATE FROM POOL 'gpfs23capacity' > TO POOL 'gpfs23data' > LIMIT(75) > WHERE (access_age < 14) > > The one thing that has changed is that formerly I only ran the migration > in one direction at a time ? i.e. I used to have those two rules in two > separate files and would run an mmapplypolicy using the OldStuff rule > the 1st weekend of the month and run the other rule the other weekends > of the month. This is the 1st weekend that I attempted to run an > mmapplypolicy that did both at the same time. Did I mess something up > with that? > > I have not run it again yet because we also run migrations on the other > filesystem that we are still in the process of migrating off of. So > gpfs23 goes 1st and as soon as it?s done the other filesystem migration > kicks off. I don?t like to run two migrations simultaneously if at all > possible. The 2nd migration ran until this morning, when it was > unfortunately terminated by a network switch crash that has also had me > tied up all morning until now. :-( > > And yes, there is something else going on ? well, was going on - the > network switch crash killed this too ? I have been running an rsync on > one particular ~80TB directory tree from the old filesystem to gpfs23. > I understand that the migration wouldn?t know about those files and > that?s fine ? I just don?t understand why mmapplypolicy said it was > going to fill the capacity pool to 98% but didn?t do it ? wait, > mmapplypolicy hasn?t gone into politics, has it?!? ;-) > > Thanks - and again, if I should open a PMR for this please let me know... > > Kevin > >> On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > > wrote: >> >> Let's look at how mmapplypolicy does the reckoning. >> Before it starts, it see your pools as: >> >> [I] GPFS Current Data Pool Utilization in KB and % >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 55365193728 124983549952 44.297984614% >> gpfs23data 166747037696 343753326592 48.507759721% >> system 0 0 >> 0.000000000% (no user data) >> [I] 75142046 of 209715200 inodes used: 35.830520%. >> >> Your rule says you want to migrate data to gpfs23capacity, up to 98% full: >> >> RULE 'OldStuff' >> MIGRATE FROM POOL 'gpfs23data' >> TO POOL 'gpfs23capacity' >> LIMIT(98) WHERE ... >> >> We scan your files and find and reckon... 
>> [I] Summary of Rule Applicability and File Choices: >> Rule# Hit_Cnt KB_Hit Chosen KB_Chosen >> KB_Ill Rule >> 0 5255960 237675081344 1868858 67355430720 >> 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO >> POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) >> >> So yes, 5.25Million files match the rule, but the utility chooses >> 1.868Million files that add up to 67,355GB and figures that if it >> migrates those to gpfs23capacity, >> (and also figuring the other migrations by your second rule)then >> gpfs23 will end up 97.9999% full. >> We show you that with our "predictions" message. >> >> Predicted Data Pool Utilization in KB and %: >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 122483878944 124983549952 97.999999993% >> gpfs23data 104742360032 343753326592 30.470209865% >> >> So that's why it chooses to migrate "only" 67GB.... >> >> See? Makes sense to me. >> >> Questions: >> Did you run with -I yes or -I defer ? >> >> Were some of the files illreplicated or illplaced? >> >> Did you give the cluster-wide space reckoning protocols time to see >> the changes? mmdf is usually "behind" by some non-neglible amount of >> time. >> >> What else is going on? >> If you're moving or deleting or creating data by other means while >> mmapplypolicy is running -- it doesn't "know" about that! >> >> Run it again! >> >> >> >> >> >> From: "Buterbaugh, Kevin L" > > >> To: gpfsug main discussion list >> > > >> Date: 04/16/2017 09:47 AM >> Subject: [gpfsug-discuss] mmapplypolicy didn't migrate >> everything it should have - why not? >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Hi All, >> >> First off, I can open a PMR for this if I need to. Second, I am far >> from an mmapplypolicy guru. With that out of the way ? I have an >> mmapplypolicy job that didn?t migrate anywhere close to what it could >> / should have. From the log file I have it create, here is the part >> where it shows the policies I told it to invoke: >> >> [I] Qos 'maintenance' configured as inf >> [I] GPFS Current Data Pool Utilization in KB and % >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 55365193728 124983549952 44.297984614% >> gpfs23data 166747037696 343753326592 48.507759721% >> system 0 0 >> 0.000000000% (no user data) >> [I] 75142046 of 209715200 inodes used: 35.830520%. >> [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. >> Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC >> Parsed 2 policy rules. >> >> RULE 'OldStuff' >> MIGRATE FROM POOL 'gpfs23data' >> TO POOL 'gpfs23capacity' >> LIMIT(98) >> WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND >> (KB_ALLOCATED > 3584)) >> >> RULE 'INeedThatAfterAll' >> MIGRATE FROM POOL 'gpfs23capacity' >> TO POOL 'gpfs23data' >> LIMIT(75) >> WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) >> >> And then the log shows it scanning all the directories and then says, >> "OK, here?s what I?m going to do": >> >> [I] Summary of Rule Applicability and File Choices: >> Rule# Hit_Cnt KB_Hit Chosen KB_Chosen >> KB_Ill Rule >> 0 5255960 237675081344 1868858 67355430720 >> 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO >> POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) >> 1 611 236745504 611 236745504 >> 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL >> 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) >> >> [I] Filesystem objects with no applicable rules: 414911602. 
>> >> [I] GPFS Policy Decisions and File Choice Totals: >> Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; >> Predicted Data Pool Utilization in KB and %: >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 122483878944 124983549952 97.999999993% >> gpfs23data 104742360032 343753326592 30.470209865% >> system 0 0 >> 0.000000000% (no user data) >> >> Notice that it says it?s only going to migrate less than 2 million of >> the 5.25 million candidate files!! And sure enough, that?s all it did: >> >> [I] A total of 1869469 files have been migrated, deleted or processed >> by an EXTERNAL EXEC/script; >> 0 'skipped' files and/or errors. >> >> And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere >> near 98% full: >> >> Disks in storage pool: gpfs23capacity (Maximum disk size allowed is >> 519 TB) >> eon35Ansd 58.2T 35 No Yes 29.54T ( >> 51%) 63.93G ( 0%) >> eon35Dnsd 58.2T 35 No Yes 29.54T ( >> 51%) 64.39G ( 0%) >> ------------- >> -------------------- ------------------- >> (pool total) 116.4T 59.08T ( >> 51%) 128.3G ( 0%) >> >> I don?t understand why it only migrated a small subset of what it >> could / should have? >> >> We are doing a migration from one filesystem (gpfs21) to gpfs23 and I >> really need to stuff my gpfs23capacity pool as full of data as I can >> to keep the migration going. Any ideas anyone? Thanks in advance? >> >> ? >> Kevin Buterbaugh - Senior System Administrator >> Vanderbilt University - Advanced Computing Center for Research and >> Education >> _Kevin.Buterbaugh at vanderbilt.edu_ >> - (615)875-9633 From makaplan at us.ibm.com Mon Apr 17 21:11:18 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 17 Apr 2017 16:11:18 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? 
GPFS) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... 
[I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. 
[I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Mon Apr 17 21:18:42 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 17 Apr 2017 16:18:42 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Oops... If you want to see the list of what would be migrated '-I test -L 2' If you want to migrate and see each file migrated '-I yes -L 2' I don't recommend -L 4 or higher, unless you want to see the files that do not match your rules. -L 3 will show you all the files that match the rules, including those that are NOT chosen for migration. See the command gu -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Apr 17 22:16:57 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 17 Apr 2017 21:16:57 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? 
the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan > wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? 
and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. 
Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. 
And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full:

Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB)
eon35Ansd       58.2T   35  No  Yes    29.54T ( 51%)    63.93G ( 0%)
eon35Dnsd       58.2T   35  No  Yes    29.54T ( 51%)    64.39G ( 0%)
                -------------          --------------------  -------------------
(pool total)    116.4T                 59.08T ( 51%)         128.3G ( 0%)

I don't understand why it only migrated a small subset of what it could / should have?

We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

From Kevin.Buterbaugh at Vanderbilt.Edu  Tue Apr 18 14:31:20 2017
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Tue, 18 Apr 2017 13:31:20 +0000
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu>
References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu>
Message-ID: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>

Hi All, but especially Marc,

I ran the mmapplypolicy again last night and, unfortunately, it again did not fill the capacity pool like it said it would. From the log file:

[I] Summary of Rule Applicability and File Choices:
 Rule#   Hit_Cnt        KB_Hit     Chosen      KB_Chosen   KB_Ill  Rule
     0   3632859  181380873184    1620175    61434283936       0  RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.)
     1        88      99230048         88       99230048       0  RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.)

[I] Filesystem objects with no applicable rules: 442962867.

[I] GPFS Policy Decisions and File Choice Totals:
 Chose to migrate 61533513984KB: 1620263 of 3632947 candidates;
Predicted Data Pool Utilization in KB and %:
Pool_Name           KB_Occupied      KB_Total     Percent_Occupied
gpfs23capacity     122483878464   124983549952    97.999999609%
gpfs23data         128885076416   343753326592    37.493477574%
system                        0              0     0.000000000% (no user data)
[I] 2017-04-18 at 02:52:48.402 Policy execution. 0 files dispatched.

And the tail end of the log file says that it moved those files:

[I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched.
[I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;
        0 'skipped' files and/or errors.
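(A quick way to watch whether the target pool is really filling while a run like this is in flight is to poll just its section of the mmdf report, for example:

    mmdf gpfs23 | grep -A 5 'gpfs23capacity'

bearing in mind, as noted earlier in the thread, that mmdf's numbers can lag the true allocation by a non-trivial amount of time.)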
But mmdf (and how quickly the mmapplypolicy itself ran) say otherwise: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.73T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.73T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.45T ( 51%) 128.8G ( 0%) Ideas? Or is it time for me to open a PMR? Thanks? Kevin On Apr 17, 2017, at 4:16 PM, Buterbaugh, Kevin L > wrote: Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan > wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... 
[I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. 
[I] GPFS Policy Decisions and File Choice Totals:
 Chose to migrate 67592176224KB: 1869469 of 5256571 candidates;
Predicted Data Pool Utilization in KB and %:
Pool_Name           KB_Occupied      KB_Total     Percent_Occupied
gpfs23capacity     122483878944   124983549952    97.999999993%
gpfs23data         104742360032   343753326592    30.470209865%
system                        0              0     0.000000000% (no user data)

Notice that it says it's only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that's all it did:

[I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;
        0 'skipped' files and/or errors.

And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full:

Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB)
eon35Ansd       58.2T   35  No  Yes    29.54T ( 51%)    63.93G ( 0%)
eon35Dnsd       58.2T   35  No  Yes    29.54T ( 51%)    64.39G ( 0%)
                -------------          --------------------  -------------------
(pool total)    116.4T                 59.08T ( 51%)         128.3G ( 0%)

I don't understand why it only migrated a small subset of what it could / should have?

We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance...

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From zgiles at gmail.com  Tue Apr 18 14:56:43 2017
From: zgiles at gmail.com (Zachary Giles)
Date: Tue, 18 Apr 2017 09:56:43 -0400
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>
References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>
Message-ID:

Kevin,

Here's a silly theory: Have you tried putting a weight value in? I wonder if during migration it hits some large file that would go over the threshold and stops. With a weight flag you could move all small files in first or by lack of heat etc to pack the tier more tightly. Just something else to try before the PMR process.

Zach

On Apr 18, 2017 9:32 AM, "Buterbaugh, Kevin L" <Kevin.Buterbaugh at vanderbilt.edu> wrote:

Hi All, but especially Marc,

I ran the mmapplypolicy again last night and, unfortunately, it again did not fill the capacity pool like it said it would.
From the log file: [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 3632859 181380873184 1620175 61434283936 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 88 99230048 88 99230048 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 442962867. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 61533513984KB: 1620263 of 3632947 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878464 124983549952 97.999999609% gpfs23data 128885076416 343753326592 37.493477574% system 0 0 0.000000000% (no user data) [I] 2017-04-18 at 02:52:48.402 Policy execution. 0 files dispatched. And the tail end of the log file says that it moved those files: [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. But mmdf (and how quickly the mmapplypolicy itself ran) say otherwise: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.73T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.73T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.45T ( 51%) 128.8G ( 0%) Ideas? Or is it time for me to open a PMR? Thanks? Kevin On Apr 17, 2017, at 4:16 PM, Buterbaugh, Kevin L < Kevin.Buterbaugh at Vanderbilt.Edu> wrote: Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. 
They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ------------------------------ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? 
I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan <*makaplan at us.ibm.com* > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" <*Kevin.Buterbaugh at Vanderbilt.Edu* > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: *gpfsug-discuss-bounces at spectrumscale.org* ------------------------------ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. >From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. 
RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education *Kevin.Buterbaugh at vanderbilt.edu* - (615)875-9633 <(615)%20875-9633> _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at *spectrumscale.org* *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at *spectrumscale.org* *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 <(615)%20875-9633> _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Apr 18 16:11:19 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 18 Apr 2017 11:11:19 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... 
So sorry, something unusual about your installation or usage... -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Tue Apr 18 16:31:12 2017 From: david_johnson at brown.edu (David D. Johnson) Date: Tue, 18 Apr 2017 11:31:12 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> I have an observation, which may merely serve to show my ignorance: Is it significant that the words "EXTERNAL EXEC/script? are seen below? If migrating between storage pools within the cluster, I would expect the PIT engine to do the migration. When doing HSM (off cluster, tape libraries, etc) is where I would expect to need a script to actually do the work. > [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. > [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; > 0 'skipped' files and/or errors. ? ddj Dave Johnson Brown University > On Apr 18, 2017, at 11:11 AM, Marc A Kaplan wrote: > > ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? > > ------ > > Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. > > So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? > > Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... > > While we're waiting for that... Here's what I suggest next. > > Add a clause ... > > SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) > > before the WHERE clause to each of your rules. > > Re-run the command with options '-I test -L 2' and collect the output. > > We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... > > You should see 1.6 million lines that look kind of like this: > > /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) > > Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed > add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). > > That sanity checks the policy arithmetic. Let's assume that's okay. > > Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as > find some of the biggest of those files and check that they really are that big.... > > At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... > and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... > > HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are > not recognized by mmapplypolicy as sharing storage... 
> This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? > > The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... > Optimistically that means it works fine for most customers... > > So sorry, something unusual about your installation or usage... > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Apr 18 17:06:16 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 18 Apr 2017 12:06:16 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> Message-ID: That is a summary message. It says one way or another, the command has dealt with 1.6 million files. For the case under discussion there are no EXTERNAL pools, nor any DELETions, just intra-GPFS MIGRATions. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Apr 18 17:32:24 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 18 Apr 2017 16:32:24 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu> Hi Marc, Two things: 1. I have a PMR open now. 2. You *may* have identified the problem ? I?m still checking ? but files with hard links may be our problem. I wrote a simple Perl script to interate over the log file I had mmapplypolicy create. Here?s the code (don?t laugh, I?m a SysAdmin, not a programmer, and I whipped this out in < 5 minutes ? and yes, I realize the fact that I used Perl instead of Python shows my age as well ): #!/usr/bin/perl # use strict; use warnings; my $InputFile = "/tmp/mmapplypolicy.gpfs23.log"; my $TotalFiles = 0; my $TotalLinks = 0; my $TotalSize = 0; open INPUT, $InputFile or die "Couldn\'t open $InputFile for read: $!\n"; while () { next unless /MIGRATED/; $TotalFiles++; my $FileName = (split / /)[3]; if ( -f $FileName ) { # some files may have been deleted since mmapplypolicy ran my ($NumLinks, $FileSize) = (stat($FileName))[3,7]; $TotalLinks += $NumLinks; $TotalSize += $FileSize; } } close INPUT; print "Number of files / links = $TotalFiles / $TotalLinks, Total size = $TotalSize\n"; exit 0; And here?s what it kicked out: Number of files / links = 1620263 / 80818483, Total size = 53966202814094 1.6 million files but 80 million hard links!!! I?m doing some checking right now, but it appears that it is one particular group - and therefore one particular fileset - that is responsible for this ? they?ve got thousands of files with 50 or more hard links each ? and they?re not inconsequential in size. 
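Marc's SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) clause makes the same tally possible straight from the '-I test -L 2' output, without re-stat()ing every file. A rough sketch, assuming the chosen-file lines look like Marc's /yy/dat/bigC example and treating the log path as a placeholder:

# Sum the SHOW( <kb> n=<links> ) fields on the lines headed for the capacity pool.
grep "TO POOL 'gpfs23capacity'" /tmp/mmapplypolicy.gpfs23.list | \
awk '{ for (i = 1; i <= NF - 2; i++)
         if ($i == "SHOW(") {
           kb += $(i + 1)                        # KB_ALLOCATED
           n = $(i + 2); gsub(/[^0-9]/, "", n)   # strip the "n=" and ")"
           links += n
         }
     }
     END { printf "KB chosen: %.0f   hard links counted: %.0f\n", kb, links }'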
IIRC (and keep in mind I?m far from a GPFS policy guru), there is a way to say something to the effect of ?and the path does not contain /gpfs23/fileset/path? ? may need a little help getting that right. I?ll post this information to the ticket as well but wanted to update the list. This wouldn?t be the first time we were an ?edge case? for something in GPFS? ;-) Thanks... Kevin On Apr 18, 2017, at 10:11 AM, Marc A Kaplan > wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Apr 18 17:56:11 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 18 Apr 2017 12:56:11 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? hard links! 
A workaround In-Reply-To: <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu> Message-ID: Kevin, Wow. Never underestimate the power of ... Anyhow try this as a fix. Add the clause SIZE(KB_ALLOCATED/NLINK) to your MIGRATE rules. This spreads the total actual size over each hardlink... From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/18/2017 12:33 PM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Two things: 1. I have a PMR open now. 2. You *may* have identified the problem ? I?m still checking ? but files with hard links may be our problem. I wrote a simple Perl script to interate over the log file I had mmapplypolicy create. Here?s the code (don?t laugh, I?m a SysAdmin, not a programmer, and I whipped this out in < 5 minutes ? and yes, I realize the fact that I used Perl instead of Python shows my age as well ): #!/usr/bin/perl # use strict; use warnings; my $InputFile = "/tmp/mmapplypolicy.gpfs23.log"; my $TotalFiles = 0; my $TotalLinks = 0; my $TotalSize = 0; open INPUT, $InputFile or die "Couldn\'t open $InputFile for read: $!\n"; while () { next unless /MIGRATED/; $TotalFiles++; my $FileName = (split / /)[3]; if ( -f $FileName ) { # some files may have been deleted since mmapplypolicy ran my ($NumLinks, $FileSize) = (stat($FileName))[3,7]; $TotalLinks += $NumLinks; $TotalSize += $FileSize; } } close INPUT; print "Number of files / links = $TotalFiles / $TotalLinks, Total size = $TotalSize\n"; exit 0; And here?s what it kicked out: Number of files / links = 1620263 / 80818483, Total size = 53966202814094 1.6 million files but 80 million hard links!!! I?m doing some checking right now, but it appears that it is one particular group - and therefore one particular fileset - that is responsible for this ? they?ve got thousands of files with 50 or more hard links each ? and they?re not inconsequential in size. IIRC (and keep in mind I?m far from a GPFS policy guru), there is a way to say something to the effect of ?and the path does not contain /gpfs23/fileset/path? ? may need a little help getting that right. I?ll post this information to the ticket as well but wanted to update the list. This wouldn?t be the first time we were an ?edge case? for something in GPFS? ;-) Thanks... Kevin On Apr 18, 2017, at 10:11 AM, Marc A Kaplan wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. 
We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 14:12:16 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 13:12:16 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> Message-ID: <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Hi All, I think we *may* be able to wrap this saga up? ;-) Dave - in regards to your question, all I know is that the tail end of the log file is ?normal? for all the successful pool migrations I?ve done in the past few years. It looks like the hard links were the problem. We have one group with a fileset on our filesystem that they use for backing up Linux boxes in their lab. That one fileset has thousands and thousands (I haven?t counted, but based on the output of that Perl script I wrote it could well be millions) of files with anywhere from 50 to 128 hard links each ? those files ranged from a few KB to a few MB in size. From what Marc said, my understanding is that with the way I had my policy rule written mmapplypolicy was seeing each of those as separate files and therefore thinking it was moving 50 to 128 times as much space to the gpfs23capacity pool as it really was for those files. Marc can correct me or clarify further if necessary. 
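A back-of-the-envelope illustration of that inflation, using made-up numbers rather than anything from the actual logs:

  one 10 MB file with 100 hard links
    what mmapplypolicy books:  100 candidate paths x 10 MB = 1000 MB
    what migrating the file actually moves                 =   10 MB

So the planner believes the capacity pool will fill roughly 100 times faster than it really does, hits its 98% prediction early, and stops choosing candidates while the pool is still half empty.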
He directed me to add: SIZE(KB_ALLOCATED/NLINK) to both of my migrate rules in my policy file. I did so and kicked off another mmapplypolicy last night, which is still running. However, the prediction section now says: [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 40050141920KB: 2051495 of 2051495 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 104098980256 124983549952 83.290145220% gpfs23data 168478368352 343753326592 49.011414674% system 0 0 0.000000000% (no user data) So now it?s going to move every file it can that matches my policies because it?s figured out that a lot of those are hard links ? and I don?t have enough files matching the criteria to fill the gpfs23capacity pool to the 98% limit like mmapplypolicy thought I did before. According to the log file, it?s happily chugging along migrating files, and mmdf agrees that my gpfs23capacity pool is gradually getting more full (I have it QOSed, of course): Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 25.33T ( 44%) 68.13G ( 0%) eon35Dnsd 58.2T 35 No Yes 25.33T ( 44%) 68.49G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 50.66T ( 44%) 136.6G ( 0%) My sincere thanks to all who took the time to respond to my questions. Of course, that goes double for Marc. We (Vanderbilt) seem to have a long tradition of finding some edge cases in GPFS going all the way back to when we originally moved off of an NFS server to GPFS (2.2, 2.3?) back in 2005. I was creating individual tarballs of each users? home directory on the NFS server, copying the tarball to one of the NSD servers, and untarring it there (don?t remember why we weren?t rsync?ing, but there was a reason). Everything was working just fine except for one user. Every time I tried to untar her home directory on GPFS it barfed part of the way thru ? turns out that until then IBM hadn?t considered that someone would want to put 6 million files in one directory. Gotta love those users! ;-) Kevin On Apr 18, 2017, at 10:31 AM, David D. Johnson > wrote: I have an observation, which may merely serve to show my ignorance: Is it significant that the words "EXTERNAL EXEC/script? are seen below? If migrating between storage pools within the cluster, I would expect the PIT engine to do the migration. When doing HSM (off cluster, tape libraries, etc) is where I would expect to need a script to actually do the work. [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. ? ddj Dave Johnson Brown University On Apr 18, 2017, at 11:11 AM, Marc A Kaplan > wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. 
Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 15:37:29 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 10:37:29 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. 
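For anyone who wants to try what Marc describes, a minimal mmclone sketch (the file names are made up; check the mmclone man page on your release for exact syntax and restrictions):

# Freeze an existing image as a read-only clone parent, then cut
# copy-on-write clones from it; unmodified blocks are shared.
mmclone snap /gpfs23/images/rhel7-master.img
mmclone copy /gpfs23/images/rhel7-master.img /gpfs23/images/vm001.img
mmclone copy /gpfs23/images/rhel7-master.img /gpfs23/images/vm002.img
mmclone show /gpfs23/images/vm001.img    # report the clone parent/child relationship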
-------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Apr 19 17:18:50 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 19 Apr 2017 16:18:50 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Hey Marc, I'm having some issues where a simple ILM list policy never completes, but I have yet to open a PMR or enable additional logging. But I was wondering if there are known reasons that this would not complete, such as when there is a symbolic link that creates a loop within the directory structure or something simple like that. Do you know of any cases like this, Marc, that I should try to find in my file systems? Thanks in advance! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, April 19, 2017 9:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Wed Apr 19 17:23:12 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Wed, 19 Apr 2017 19:23:12 +0300 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu><458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Hi Maybe the temp list file - fill the FS that they build on. Try to monitor the FS where the temp filelist is created. 
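If that turns out to be it, mmapplypolicy can be pointed at roomier work space. A sketch, with the device, policy file and paths as placeholders:

# -s sets the local work directory (default /tmp); -g sets a global work
# directory shared by the helper nodes. Both need room for the file lists.
mmapplypolicy gpfs23 -P /root/gpfs/list.policy \
    -s /scratch/policytmp -g /gpfs23/.policytmp
# Watch them while the scan runs:
df -h /scratch/policytmp /gpfs23/.policytmp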
Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Bryan Banister To: gpfsug main discussion list Date: 04/19/2017 07:19 PM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hey Marc, I?m having some issues where a simple ILM list policy never completes, but I have yet to open a PMR or enable additional logging. But I was wondering if there are known reasons that this would not complete, such as when there is a symbolic link that creates a loop within the directory structure or something simple like that. Do you know of any cases like this, Marc, that I should try to find in my file systems? Thanks in advance! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, April 19, 2017 9:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From makaplan at us.ibm.com Wed Apr 19 18:10:28 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 13:10:28 -0400 Subject: [gpfsug-discuss] mmapplypolicy not terminating properly? 
In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu><458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: (Bryan B asked...) Open a PMR. The first response from me will be ... Run the mmapplypolicy command again, except with additional option `-d 017` and collect output with something equivalent to `2>&1 | tee /tmp/save-all-command-output-here-to-be-passed-along-to-IBM-service ` If you are convinced that mmapplypolicy is "looping" or "hung" - wait another 2 minutes, terminate, and then pass along the saved-all-command-output. -d 017 will dump a lot of additional diagnostics -- If you want to narrow it by baby steps we could try `-d 03` first and see if there are enough clues in that. To answer two of your questions: 1. mmapplypolicy does not follow symlinks, so no "infinite loop" possible with symlinks. 2a. loops in directory are file system bugs in GPFS, (in fact in any posixish file system), (mm)fsck! 2b. mmapplypolicy does impose a limit on total length of pathnames, so even if there is a loop in the directory, mmapplypolicy will "trim" the directory walk. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 20:53:42 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 19:53:42 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data Message-ID: Hi All, We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. Now lets just say that you have a little bit of money to spend. Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Wed Apr 19 20:59:18 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 19 Apr 2017 19:59:18 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: Message-ID: Hi I'll give my opinion. Worth what you pay for. 
Do as many as you can, six in this case for the good reason you mentioned. But play with the callbacks so the migration happens on watermarks when it happens. Otherwise you might hit no space till your next policy run. The second is well documented on the redbook AFAIK Cheers -- Cheers > On 19 Apr 2017, at 22.54, Buterbaugh, Kevin L wrote: > > Hi All, > > We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. > > Now lets just say that you have a little bit of money to spend. Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. > > So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. > > Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Apr 19 21:05:49 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 19 Apr 2017 20:05:49 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] Sent: 19 April 2017 20:53 To: gpfsug main discussion list Subject: [gpfsug-discuss] RAID config for SSD's used for data Hi All, We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. Now lets just say that you have a little bit of money to spend. 
Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 From aaron.s.knister at nasa.gov Wed Apr 19 21:13:14 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 19 Apr 2017 16:13:14 -0400 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: On 4/19/17 4:05 PM, Simon Thompson (IT Research Support) wrote: > By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... > > And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) You mean like HAWC but for writes larger than 64K? ;-) Or I guess "HARC" as it might be called for a read cache... -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From luis.bolinches at fi.ibm.com Wed Apr 19 21:20:20 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 19 Apr 2017 20:20:20 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: Message-ID: I assume you are making the joke of external LROC. But not sure I would use external storage for LROC, as the whole point is to have really fast storage as close to the node (L for local) as possible. Maybe those SSD that will get replaced with the fancy external storage? -- Cheers > On 19 Apr 2017, at 23.13, Aaron Knister wrote: > > > >> On 4/19/17 4:05 PM, Simon Thompson (IT Research Support) wrote: >> By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... >> >> And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) > > You mean like HAWC but for writes larger than 64K? ;-) > > Or I guess "HARC" as it might be called for a read cache... > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 21:49:56 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 16:49:56 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 22:12:35 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 21:12:35 +0000 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Hi Marc, But the limitation on GPFS replication is that I can set replication separately for metadata and data, but no matter whether I have one data pool or ten data pools they all must have the same replication, correct? And believe me I *love* GPFS replication ? I would hope / imagine that I am one of the few people on this mailing list who has actually gotten to experience a ?fire scenario? ? electrical fire, chemical suppressant did it?s thing, and everything in the data center had a nice layer of soot, ash, and chemical suppressant on and in it and therefore had to be professionally cleaned. Insurance bought us enough disk space that we could (temporarily) turn on GPFS data replication and clean storage arrays one at a time! But in my current hypothetical scenario I?m stretching the budget just to get that one storage array with 12 x 1.8 TB SSD?s in it. Two are out of the question. My current metadata that I?ve got on SSDs is on RAID 1 mirrors and has GPFS replication set to 2. I thought the multiple RAID 1 mirrors approach was the way to go for SSDs for data as well, as opposed to one big RAID 6 LUN, but wanted to get the advice of those more knowledgeable than me. Thanks! Kevin On Apr 19, 2017, at 3:49 PM, Marc A Kaplan > wrote: As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. 
And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: * Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. * GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Apr 19 22:23:15 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 19 Apr 2017 14:23:15 -0700 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> On 04/19/2017 12:53 PM, Buterbaugh, Kevin L wrote: > > So you?re considering the purchase of a dual-controller FC storage array > with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage > would be in its? own storage pool and that pool would be the default > location for I/O for your main filesystem ? at least for smaller files. > You intend to use mmapplypolicy nightly to move data to / from this > pool and the spinning disk pools. We did this and failed in interesting (but in retrospect obvious) ways. You will want to ensure that your users cannot fill your write target pool within a day. The faster the storage, the more likely that is to happen. Or else your users will get ENOSPC. You will want to ensure that your pools can handle the additional I/O from the migration in aggregate with all the user I/O. Or else your users will see worse performance from the fast pool than the slow pool while the migration is running. You will want to make sure that the write throughput of your slow pool is faster than the read throughput of your fast pool. In our case, the fast pool was undersized in capacity, and oversized in terms of performance. And overall the filesystem was oversubscribed (~100 10GbE clients, 8 x 10GbE NSD servers) So the fast pool would fill very quickly. Then I would switch the placement policy to the big slow pool and performance would drop dramatically, and then if I ran a migration it would either (depending on parameters) take up all the I/O to the slow pool (leaving none for the users), or else take forever (weeks) because the user I/O was maxing out the slow pool. 
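As a minimal sketch of the kind of nightly flush being discussed here, assuming placeholder pool names 'fast' and 'slow', a file system called 'gpfs0', and a made-up policy file path (none of these are taken from the thread):

RULE 'flush' MIGRATE FROM POOL 'fast' THRESHOLD(90,70) WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'slow'

# trial run first to see what would be moved, then the real thing
mmapplypolicy gpfs0 -P /var/mmfs/etc/flush.pol -I test
mmapplypolicy gpfs0 -P /var/mmfs/etc/flush.pol -I yes

Here THRESHOLD(90,70) only starts migrating once the fast pool is more than 90% full and stops once it is back down to 70%, and the WEIGHT clause pushes the least recently accessed files out first.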
Things should be better today with the QoS stuff, but your relative pool capacities (in our case it was like 1% fast, 99% slow) and your relative pool performance (in our case, slow pool had fewer IOPS than fast pool) are still going to matter a lot. -- Alex Chekholko chekh at stanford.edu From makaplan at us.ibm.com Wed Apr 19 22:58:24 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 17:58:24 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Kevin asked: " ... data pools they all must have the same replication, correct?" Actually no! You can use policy RULE ... SET POOL 'x' REPLICATE(2) to set the replication factor when a file is created. Use mmchattr or mmapplypolicy to change the replication factor after creation. You specify the maximum data replication factor when you create the file system (1, 2, or 3), but any given file can have its replication factor set to 1, 2, or 3. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From kums at us.ibm.com Wed Apr 19 23:03:33 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Wed, 19 Apr 2017 18:03:33 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Hi, >> As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: >>Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. >>This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. As you pointed out, the RAID choices for GPFS may not be simple, and we need to take into consideration factors such as the storage subsystem's configuration and capabilities and whether all drives are homogeneous or there is a mix of drive types. If all the drives are homogeneous, then create dataAndMetadata NSDs across RAID-6, and if the storage controller supports write cache plus write-cache mirroring (WC + WM), enable it; WC + WM can alleviate read-modify-write for small writes (typical of metadata). If there is a mix of SSD and HDD (e.g. 15K RPM), then we need to take into consideration the aggregate IOPS of the RAID-1 SSD volumes vs. the RAID-6 HDD volumes before separating data and metadata onto separate media. For example, if the storage subsystem has 2 x SSDs and ~300 x 15K RPM or NL_SAS HDDs, then most likely the aggregate IOPS of the RAID-6 HDD volumes will be higher than that of the RAID-1 SSD volumes. It would also be recommended to assess the I/O performance of the different configurations (dataAndMetadata vs dataOnly/metadataOnly NSDs) with some application workload and production scenarios before deploying the final solution. >> GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1).
GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more >>robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. For high-resiliency (for e.g. metadataOnly) and if there are multiple storage across different failure domains (different racks/rooms/DC etc), it will be good to enable BOTH hardware RAID-1 as well as GPFS metadata replication enabled (at the minimum, -m 2). If there is single shared storage for GPFS file-system storage and metadata is separated from data, then RAID-1 would minimize administrative overhead compared to GPFS replication in the event of drive failure (since with GPFS replication across single SSD would require mmdeldisk/mmdelnsd/mmcrnsd/mmadddisk every time disk goes faulty and needs to be replaced). Best, -Kums From: Marc A Kaplan/Watson/IBM at IBMUS To: gpfsug main discussion list Date: 04/19/2017 04:50 PM Subject: Re: [gpfsug-discuss] RAID config for SSD's - potential pitfalls Sent by: gpfsug-discuss-bounces at spectrumscale.org As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 23:41:19 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 18:41:19 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Kums is our performance guru, so weigh that appropriately and relative to my own remarks... Nevertheless, I still think RAID-5or6 is a poor choice for GPFS metadata. The write cache will NOT mitigate the read-modify-write problem of a workload that has a random or hop-scotch access pattern of small writes. In the end you've still got to read and write several times more disk blocks than you actually set out to modify. 
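As a rough worked example of that penalty (generic RAID-6 behaviour, not specific to any particular controller): rewriting a single 4 KiB block that sits inside a wide parity stripe typically means reading the old data block plus the old P and Q parity blocks, then writing the new data block plus the new P and Q, i.e. roughly six disk I/Os for one small logical write, whereas a full-stripe write needs no reads at all, just the data strips plus one P and one Q per stripe.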
Same goes for any large amount of data that will be written in a pattern of non-sequential small writes. (Define a small write as less than a full RAID stripe). For sure, non-volatile write caches are a good thing - but not a be all end all solution. Relying on RAID-1 to protect your metadata may well be easier to administer, but still GPFS replication can be more robust. Doing both - belt and suspenders is fine -- if you can afford it. Either is buying 2x storage, both is 4x. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Apr 20 00:16:08 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 19 Apr 2017 23:16:08 +0000 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Message-ID: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Thu Apr 20 01:10:51 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 19 Apr 2017 20:10:51 -0400 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values In-Reply-To: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> References: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> Message-ID: Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. 
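A quick arithmetic check is consistent with that: a read counter displayed as -412143552 corresponds to the unsigned 32-bit value 4294967296 - 412143552 = 3882823744, which lines up with the _read_ count of 3883163461 shown by the mmpmon-based output further down in the thread, the small difference being calls made between the two snapshots.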
To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Thu Apr 20 01:21:04 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 20 Apr 2017 00:21:04 +0000 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Message-ID: Hi Eric Looks like your assumption is correct - no negative values from ?mmfsadm eventsExporter mmpmon vfss?. I don?t normally view these via ?mmfsadm?, I use the zimon stats. But, It?s a bug that should be fixed. What?s the best way to get this fixed? root at cnt-r01r07u15 ~]# mmfsadm eventsExporter mmpmon vfss _response_ begin mmpmon vfss _mmpmon::vfss_ _n_ 10.30.100.193 _nn_ cnt-r01r07u15 _rc_ 0 _t_ 1492647309 _tu_ 311964 _access_ 8472897 56529.874886 _close_ 1460223848 49854.938090 _create_ 2101927 155055.515041 _fclear_ 0 0.000000 _fsync_ 20 0.024288 _fsync_range_ 0 0.000000 _ftrunc_ 0 0.000000 _getattr_ 859626332 101183.720281 _link_ 2175473 625.343799 _lockctl_ 17326 5.229828 _lookup_ 200378610 1201985.264220 _map_lloff_ 220854519 8561.860515 _mkdir_ 817943 217390.170859 _mknod_ 3 0.001422 _open_ 1460217712 134812.649162 _read_ 3883163461 3971457.463527 _write_ 186078410 137927.496812 _mmapRead_ 17108947 10665.929860 _mmapWrite_ 0 0.000000 _aioRead_ 0 0.000000 _aioWrite_ 0 0.000000 _readdir_ 142262897 6999.189450 _readlink_ 485337171 2111.634286 _readpage_ 3646233600 14346.331414 _remove_ 4241324 93277.463798 _rename_ 350679 19334.235924 _rmdir_ 342042 2736.048976 _setacl_ 0 0.000000 _setattr_ 3709289 16963.901179 _symlink_ 161336 8522.670079 _unmap_ 3929805828 1735.740690 _writepage_ 0 0.000000 _tsfattr_ 0 0.000000 _tsfsattr_ 0 0.000000 _flock_ 0 0.000000 _setxattr_ 119 0.001042 _getxattr_ 4077218348 628418.213008 _listxattr_ 0 0.000000 _removexattr_ 15 0.000042 _encode_fh_ 0 0.000000 _decode_fh_ 0 0.000000 _get_dentry_ 0 0.000000 _get_parent_ 0 0.000000 _mount_ 0 0.000000 _statfs_ 2625497 214.309671 _sync_ 0 0.000000 _vget_ 0 0.000000 _response_ end Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Wednesday, April 19, 2017 at 7:10 PM To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. 
If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Thu Apr 20 02:03:16 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 19 Apr 2017 21:03:16 -0400 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values In-Reply-To: References: Message-ID: Thanks Bob. Yes, it looks good for the hypothesis. ZIMon gets its VFSS stats from the mmpmon code that we just exercised with "mmfsadm eventsExporter mmpmon vfss"; so the ZIMon stats are also probably correct. Having said that, I agree with you that the "mmfsadm vfsstats" problem is a bug that should be fixed. If you would like to open a PMR so an APAR gets generated, it might help speed the routing of the PMR if you include in the PMR text our email exchange, and highlight Eric Agar is the GPFS developer with whom you've already discussed this issue. You could also mention that I believe I have no need for a gpfs snap. Having an APAR will help ensure the fix makes it into a PTF for the release you are using. 
If you do not want to open a PMR, I still intend to fix the problem in the development stream. Thanks again. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Cc: IBM Spectrum Scale/Poughkeepsie/IBM at IBMUS Date: 04/19/2017 08:21 PM Subject: Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Hi Eric Looks like your assumption is correct - no negative values from ?mmfsadm eventsExporter mmpmon vfss?. I don?t normally view these via ?mmfsadm?, I use the zimon stats. But, It?s a bug that should be fixed. What?s the best way to get this fixed? root at cnt-r01r07u15 ~]# mmfsadm eventsExporter mmpmon vfss _response_ begin mmpmon vfss _mmpmon::vfss_ _n_ 10.30.100.193 _nn_ cnt-r01r07u15 _rc_ 0 _t_ 1492647309 _tu_ 311964 _access_ 8472897 56529.874886 _close_ 1460223848 49854.938090 _create_ 2101927 155055.515041 _fclear_ 0 0.000000 _fsync_ 20 0.024288 _fsync_range_ 0 0.000000 _ftrunc_ 0 0.000000 _getattr_ 859626332 101183.720281 _link_ 2175473 625.343799 _lockctl_ 17326 5.229828 _lookup_ 200378610 1201985.264220 _map_lloff_ 220854519 8561.860515 _mkdir_ 817943 217390.170859 _mknod_ 3 0.001422 _open_ 1460217712 134812.649162 _read_ 3883163461 3971457.463527 _write_ 186078410 137927.496812 _mmapRead_ 17108947 10665.929860 _mmapWrite_ 0 0.000000 _aioRead_ 0 0.000000 _aioWrite_ 0 0.000000 _readdir_ 142262897 6999.189450 _readlink_ 485337171 2111.634286 _readpage_ 3646233600 14346.331414 _remove_ 4241324 93277.463798 _rename_ 350679 19334.235924 _rmdir_ 342042 2736.048976 _setacl_ 0 0.000000 _setattr_ 3709289 16963.901179 _symlink_ 161336 8522.670079 _unmap_ 3929805828 1735.740690 _writepage_ 0 0.000000 _tsfattr_ 0 0.000000 _tsfsattr_ 0 0.000000 _flock_ 0 0.000000 _setxattr_ 119 0.001042 _getxattr_ 4077218348 628418.213008 _listxattr_ 0 0.000000 _removexattr_ 15 0.000042 _encode_fh_ 0 0.000000 _decode_fh_ 0 0.000000 _get_dentry_ 0 0.000000 _get_parent_ 0 0.000000 _mount_ 0 0.000000 _statfs_ 2625497 214.309671 _sync_ 0 0.000000 _vget_ 0 0.000000 _response_ end Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Wednesday, April 19, 2017 at 7:10 PM To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. 
To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Thu Apr 20 09:11:15 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 20 Apr 2017 10:11:15 +0200 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: Some thoughts: you give typical cumulative usage values. However, a fast pool might matter most for spikes of the traffic. Do you have spikes driving your current system to the edge? Then: using the SSD pool for writes is straightforward (placement), using it for reads will only pay off if data are either pre-fetched to the pool somehow, or read more than once before getting migrated back to the HDD pool(s). Write traffic is less than read as you wrote. RAID1 vs RAID6: RMW penalty of parity-based RAIDs was mentioned, which strikes at writes smaller than the full stripe width of your RAID - what type of write I/O do you have (or expect)? (This may also be important for choosing the quality of SSDs, with RMW in mind you will have a comparably huge amount of data written on the SSD devices if your I/O traffic consists of myriads of small IOs and you organized the SSDs in a RAID5 or RAID6) I suppose your current system is well set to provide the required aggregate throughput. Now, what kind of improvement do you expect? How are the clients connected? Would they have sufficient network bandwidth to see improvements at all? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 gpfsug-discuss-bounces at spectrumscale.org wrote on 04/19/2017 09:53:42 PM: > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/19/2017 09:54 PM > Subject: [gpfsug-discuss] RAID config for SSD's used for data > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > Hi All, > > We currently have what I believe is a fairly typical setup ? > metadata for our GPFS filesystems is the only thing in the system > pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). > Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB > usable space. > > Now lets just say that you have a little bit of money to spend. > Your I/O demands aren?t great - in fact, they?re way on the low end > ? typical (cumulative) usage is 200 - 600 MB/sec read, less than > that for writes. But while GPFS has always been great and therefore > you don?t need to Make GPFS Great Again, you do want to provide your > users with the best possible environment. > > So you?re considering the purchase of a dual-controller FC storage > array with 12 or so 1.8 TB SSD?s in it, with the idea being that > that storage would be in its? own storage pool and that pool would > be the default location for I/O for your main filesystem ? at least > for smaller files. You intend to use mmapplypolicy nightly to move > data to / from this pool and the spinning disk pools. > > Given all that ? 
would you configure those disks as 6 RAID 1 mirrors > and have 6 different primary NSD servers, or would it be feasible to > configure one big RAID 6 LUN? I'm thinking the latter is not a good > idea as there could only be one primary NSD server for that one LUN, > but given that: 1) I have no experience with this, and 2) I have > been wrong once or twice before (), I'm looking for advice. Thanks! > > - > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jonathan at buzzard.me.uk Thu Apr 20 10:25:40 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 20 Apr 2017 10:25:40 +0100 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> References: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> Message-ID: <1492680340.4102.120.camel@buzzard.me.uk> On Wed, 2017-04-19 at 14:23 -0700, Alex Chekholko wrote: > On 04/19/2017 12:53 PM, Buterbaugh, Kevin L wrote: > > > > So you're considering the purchase of a dual-controller FC storage array > > with 12 or so 1.8 TB SSDs in it, with the idea being that that storage > > would be in its own storage pool and that pool would be the default > > location for I/O for your main filesystem - at least for smaller files. > > You intend to use mmapplypolicy nightly to move data to / from this > > pool and the spinning disk pools. > > We did this and failed in interesting (but in retrospect obvious) ways. > You will want to ensure that your users cannot fill your write target > pool within a day. The faster the storage, the more likely that is to > happen. Or else your users will get ENOSPC. Eh? Seriously, you should have a failover rule so that when your "fast" pool fills up, allocation spills over into the "slow" pool (nice, descriptive pool names that are under 8 characters including the terminating character). There are issues when a pool gets close to completely full, so you need to set the failover point a sizeable bit below the full size; 95% is a good starting point. The length of the pool names matters because if the fast pool name is under eight characters and the slow one is longer, say because you called it "nearline" (which is 9 including the terminating character), then once the files get moved they get backed up again by TSM, yeah!!! The 95% figure comes about from this. Imagine you had 12KB left in the fast pool and you go to write a file. You open the file at 0B in size and then start writing. At 12KB you run out of space in the fast pool, and as a file can only be in one pool you get ENOSPC and the file gets canned. This then keeps repeating on a regular basis. If instead you stop allocating at significantly less than 100%, say 95%, where that 5% is larger than the largest file you expect, that file still works, but all subsequent files get allocated in the slow pool until you flush the fast pool. Something like this as the last two rules in your policy should do the trick. /* by default new files to the fast disk unless full, then to slow */ RULE 'new' SET POOL 'fast' LIMIT(95) RULE 'spillover' SET POOL 'slow' However in general your fast pool needs to have sufficient capacity to take your daily churn and then some. JAB. -- Jonathan A.
Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From jonathan at buzzard.me.uk Thu Apr 20 10:32:20 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 20 Apr 2017 10:32:20 +0100 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: <1492680740.4102.126.camel@buzzard.me.uk> On Wed, 2017-04-19 at 20:05 +0000, Simon Thompson (IT Research Support) wrote: > By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... > > And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) > If you have sized the "fast" pool correctly then the "slow" pool will be spending most of its time doing diddly squat, i.e. under 10 IOPS, unless you are flushing the pool of old files to make space. I have graphs that show this. Then one of two things happens. If you are just reading the file, fine: it is probably coming from the cache, or the disks are not very busy anyway, so you won't notice. If you happen to *change* the file and start doing things actively with it again, the changed version ends up on the fast disk anyway, because most programs handle this by creating an entirely new file with a temporary name and then doing a rename and delete shuffle (so that a crash will leave you with a valid file somewhere), and the new file gets placed on the fast pool by virtue of being a new file. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From p.childs at qmul.ac.uk Thu Apr 20 12:38:09 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 20 Apr 2017 11:38:09 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Simon, We've managed to resolve this issue by switching off quotas, switching them back on again, and rebuilding the quota file. Can I check whether you run quotas on your cluster? See you in 2 weeks in Manchester. Thanks in advance. Peter Childs Research Storage Expert ITS Research Infrastructure Queen Mary, University of London Phone: 020 7882 8393 ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support) Sent: Tuesday, April 11, 2017 4:55:35 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would
I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. 
>_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From kenneth.waegeman at ugent.be Thu Apr 20 15:53:29 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Thu, 20 Apr 2017 16:53:29 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Message-ID: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for > writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on > the order of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that > reducing maxMBpS from 10000 to 100 fixed the problem! Digging further > I found that reducing prefetchThreads from default=72 to 32 also fixed > it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > >: > > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the > load up on one socket, you push all the interrupt handling to the > other socket where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org > > [gpfsug-discuss-bounces at spectrumscale.org > ] on behalf of > Aaron Knister [aaron.s.knister at nasa.gov > ] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going > out to > > the clients. I was having a really hard time getting anything > resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do > better than > > that. 
> > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load > I saw > > an almost 4x performance jump which is pretty much goes against > every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated > crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling > shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 > processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I > still have > > to run something to drive up the CPU load and then performance > improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm > curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Apr 20 16:04:20 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Thu, 20 Apr 2017 15:04:20 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> , <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). 
We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Thu Apr 20 16:07:32 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 20 Apr 2017 17:07:32 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: Hi Kennmeth, is prefetching off or on at your storage backend? Raw sequential is very different from GPFS sequential at the storage device ! GPFS does its own prefetching, the storage would never know what sectors sequential read at GPFS level maps to at storage level! Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/20/2017 04:53 PM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. 
After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss
From marcusk at nz1.ibm.com Fri Apr 21 02:21:51 2017 From: marcusk at nz1.ibm.com (Marcus Koenig1) Date: Fri, 21 Apr 2017 14:21:51 +1300 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID:
Hi Kennmeth,
we also had similar performance numbers in our tests. Native was far quicker than through GPFS. When we learned though that the client tested the performance on the FS at a big blocksize (512k) with small files - we were able to speed it up significantly using a smaller FS blocksize (obviously we had to recreate the FS).
So really depends on how you do your tests.
Cheers,
Marcus Koenig
Lab Services Storage & Power Specialist
IBM Australia & New Zealand Advanced Technical Skills
IBM Systems-Hardware
Mobile: +64 21 67 34 27 | E-mail: marcusk at nz1.ibm.com | 82 Wyndham Street, Auckland, AUK 1010, New Zealand
From: "Uwe Falke" To: gpfsug main discussion list Date:
04/21/2017 03:07 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Kennmeth, is prefetching off or on at your storage backend? Raw sequential is very different from GPFS sequential at the storage device ! GPFS does its own prefetching, the storage would never know what sectors sequential read at GPFS level maps to at storage level! Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/20/2017 04:53 PM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) 
> > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 17773863.gif Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 17405449.jpg Type: image/jpeg Size: 2741 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 17997200.gif Type: image/gif Size: 13421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Fri Apr 21 08:25:22 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 21 Apr 2017 09:25:22 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 2741 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 13421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From kenneth.waegeman at ugent.be Fri Apr 21 10:43:25 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:43:25 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> Message-ID: <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! 
K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Interesting. Could you share a little more about your architecture? Is > it possible to mount the fs on an NSD server and do some dd's from the > fs on the NSD server? If that gives you decent performance perhaps try > NSDPERF next > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf > > > -Aaron > > > > > On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman > wrote: >> >> Hi, >> >> >> Having an issue that looks the same as this one: >> >> We can do sequential writes to the filesystem at 7,8 GB/s total , >> which is the expected speed for our current storage >> backend. While we have even better performance with sequential reads >> on raw storage LUNS, using GPFS we can only reach 1GB/s in total >> (each nsd server seems limited by 0,5GB/s) independent of the number >> of clients >> (1,2,4,..) or ways we tested (fio,dd). We played with blockdev >> params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as >> discussed in this thread, but nothing seems to impact this read >> performance. >> >> Any ideas? >> >> Thanks! >> >> Kenneth >> >> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>> I just had a similar experience from a sandisk infiniflash system >>> SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for >>> writes. and 250-300 Mbyte/s on sequential reads!! Random reads were >>> on the order of 2 Gbyte/s. >>> >>> After a bit head scratching snd fumbling around I found out that >>> reducing maxMBpS from 10000 to 100 fixed the problem! Digging >>> further I found that reducing prefetchThreads from default=72 to 32 >>> also fixed it, while leaving maxMBpS at 10000. Can now also read at >>> 3,2 GByte/s. >>> >>> Could something like this be the problem on your box as well? >>> >>> >>> >>> -jf >>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >>> >: >>> >>> Well, I'm somewhat scrounging for hardware. This is in our test >>> environment :) And yep, it's got the 2U gpu-tray in it although even >>> without the riser it has 2 PCIe slots onboard (excluding the >>> on-board >>> dual-port mezz card) so I think it would make a fine NSD server even >>> without the riser. >>> >>> -Aaron >>> >>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT >>> Services) >>> wrote: >>> > Maybe its related to interrupt handlers somehow? You drive the >>> load up on one socket, you push all the interrupt handling to >>> the other socket where the fabric card is attached? >>> > >>> > Dunno ... (Though I am intrigued you use idataplex nodes as >>> NSD servers, I assume its some 2U gpu-tray riser one or something !) >>> > >>> > Simon >>> > ________________________________________ >>> > From: gpfsug-discuss-bounces at spectrumscale.org >>> >>> [gpfsug-discuss-bounces at spectrumscale.org >>> ] on behalf of >>> Aaron Knister [aaron.s.knister at nasa.gov >>> ] >>> > Sent: 17 February 2017 15:52 >>> > To: gpfsug main discussion list >>> > Subject: [gpfsug-discuss] bizarre performance behavior >>> > >>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>> > connections coming in and 1x FDR10 and 1x QDR connection going >>> out to >>> > the clients. I was having a really hard time getting anything >>> resembling >>> > sensible performance out of it (4-5Gb/s writes but maybe >>> 1.2Gb/s for >>> > reads). The back-end is a DDN SFA12K and I *know* it can do >>> better than >>> > that. 
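Since nsdperf keeps coming up, the workflow is roughly the one sketched below, from memory, so treat the exact build line and command names as approximate. It ships as source under /usr/lpp/mmfs/samples/net, it measures only the network path (it never touches the filesystem), and client01/client02 are placeholder names for whatever test clients are used.

    # build it once on each node involved in the test
    cd /usr/lpp/mmfs/samples/net
    g++ -O2 -o nsdperf nsdperf.C -lpthread -lrt   # extra defines/libs are needed for verbs RDMA support

    # run it in server mode on the nodes whose network you want to measure
    ./nsdperf -s

    # from a control node, drive it interactively, e.g.:
    ./nsdperf
      server nsd00 nsd02
      client client01 client02
      connect
      test        # runs the write and read throughput tests
      quit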
>>> > >>> > I don't remember quite how I figured this out but simply by >>> running >>> > "openssl speed -multi 16" on the nsd server to drive up the >>> load I saw >>> > an almost 4x performance jump which is pretty much goes >>> against every >>> > sysadmin fiber in me (i.e. "drive up the cpu load with >>> unrelated crap to >>> > quadruple your i/o performance"). >>> > >>> > This feels like some type of C-states frequency scaling >>> shenanigans that >>> > I haven't quite ironed down yet. I booted the box with the >>> following >>> > kernel parameters "intel_idle.max_cstate=0 >>> processor.max_cstate=0" which >>> > didn't seem to make much of a difference. I also tried setting the >>> > frequency governer to userspace and setting the minimum >>> frequency to >>> > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I >>> still have >>> > to run something to drive up the CPU load and then performance >>> improves. >>> > >>> > I'm wondering if this could be an issue with the C1E state? >>> I'm curious >>> > if anyone has seen anything like this. The node is a dx360 M4 >>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>> > >>> > -Aaron >>> > >>> > -- >>> > Aaron Knister >>> > NASA Center for Climate Simulation (Code 606.2) >>> > Goddard Space Flight Center >>> > (301) 286-2776 >>> > _______________________________________________ >>> > gpfsug-discuss mailing list >>> > gpfsug-discuss at spectrumscale.org >>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > _______________________________________________ >>> > gpfsug-discuss mailing list >>> > gpfsug-discuss at spectrumscale.org >>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 10:50:55 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:50:55 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <2b0824a1-e1a2-8dd8-4a55-a57d7b00e09f@ugent.be> Hi, prefetching was already disabled at our storage backend, but a good thing to recheck :) thanks! On 20/04/17 17:07, Uwe Falke wrote: > Hi Kennmeth, > > is prefetching off or on at your storage backend? > Raw sequential is very different from GPFS sequential at the storage > device ! > GPFS does its own prefetching, the storage would never know what sectors > sequential read at GPFS level maps to at storage level! > > > Mit freundlichen Gr??en / Kind regards > > > Dr. 
Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Andreas Hasse, Thorsten Moehring > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Kenneth Waegeman > To: gpfsug main discussion list > Date: 04/20/2017 04:53 PM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > > Having an issue that looks the same as this one: > We can do sequential writes to the filesystem at 7,8 GB/s total , which is > the expected speed for our current storage > backend. While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd > server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in > this thread, but nothing seems to impact this read performance. > Any ideas? > Thanks! > > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. > and 250-300 Mbyte/s on sequential reads!! Random reads were on the order > of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that reducing > maxMBpS from 10000 to 100 fixed the problem! Digging further I found that > reducing prefetchThreads from default=72 to 32 also fixed it, while > leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > : > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: >> Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? >> Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, > I assume its some 2U gpu-tray riser one or something !) >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] >> Sent: 17 February 2017 15:52 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] bizarre performance behavior >> >> This is a good one. I've got an NSD server with 4x 16GB fibre >> connections coming in and 1x FDR10 and 1x QDR connection going out to >> the clients. 
I was having a really hard time getting anything resembling >> sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for >> reads). The back-end is a DDN SFA12K and I *know* it can do better than >> that. >> >> I don't remember quite how I figured this out but simply by running >> "openssl speed -multi 16" on the nsd server to drive up the load I saw >> an almost 4x performance jump which is pretty much goes against every >> sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to >> quadruple your i/o performance"). >> >> This feels like some type of C-states frequency scaling shenanigans that >> I haven't quite ironed down yet. I booted the box with the following >> kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which >> didn't seem to make much of a difference. I also tried setting the >> frequency governer to userspace and setting the minimum frequency to >> 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have >> to run something to drive up the CPU load and then performance improves. >> >> I'm wondering if this could be an issue with the C1E state? I'm curious >> if anyone has seen anything like this. The node is a dx360 M4 >> (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From kenneth.waegeman at ugent.be Fri Apr 21 10:52:58 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:52:58 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be> Hi, Tried these settings, but sadly I'm not seeing any changes. Thanks, Kenneth On 21/04/17 09:25, Olaf Weiser wrote: > pls check > workerThreads (assuming you 're > 4.2.2) start with 128 .. increase > iteratively > pagepool at least 8 G > ignorePrefetchLunCount=yes (1) > > then you won't see a difference and GPFS is as fast or even faster .. 
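For reference, Olaf's three suggestions map onto mmchconfig like the sketch below. The values are just his starting points and the -N scope (the built-in nsdnodes class) is an assumption; plan to recycle the daemon on the targeted nodes so workerThreads and the larger pagepool fully take effect.

    mmchconfig workerThreads=128 -N nsdnodes        # 4.2.2 and later; raise iteratively as suggested
    mmchconfig pagepool=8G -N nsdnodes
    mmchconfig ignorePrefetchLunCount=yes -N nsdnodes
    # recycle the daemon on the changed nodes, staggered if the filesystem must stay mounted
    mmshutdown -N nsdnodes && mmstartup -N nsdnodes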
> > > > From: "Marcus Koenig1" > To: gpfsug main discussion list > Date: 04/21/2017 03:24 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi Kennmeth, > > we also had similar performance numbers in our tests. Native was far > quicker than through GPFS. When we learned though that the client > tested the performance on the FS at a big blocksize (512k) with small > files - we were able to speed it up significantly using a smaller FS > blocksize (obviously we had to recreate the FS). > > So really depends on how you do your tests. > > *Cheers,* > * > Marcus Koenig* > Lab Services Storage & Power Specialist/ > IBM Australia & New Zealand Advanced Technical Skills/ > IBM Systems-Hardware > ------------------------------------------------------------------------ > > *Mobile:*+64 21 67 34 27* > E-mail:*_marcusk at nz1.ibm.com_ > > 82 Wyndham Street > Auckland, AUK 1010 > New Zealand > > > > > > > > > > Inactive hide details for "Uwe Falke" ---04/21/2017 03:07:48 AM---Hi > Kennmeth, is prefetching off or on at your storage backe"Uwe Falke" > ---04/21/2017 03:07:48 AM---Hi Kennmeth, is prefetching off or on at > your storage backend? > > From: "Uwe Falke" > To: gpfsug main discussion list > Date: 04/21/2017 03:07 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Hi Kennmeth, > > is prefetching off or on at your storage backend? > Raw sequential is very different from GPFS sequential at the storage > device ! > GPFS does its own prefetching, the storage would never know what sectors > sequential read at GPFS level maps to at storage level! > > > Mit freundlichen Gr??en / Kind regards > > > Dr. Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Andreas Hasse, Thorsten Moehring > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Kenneth Waegeman > To: gpfsug main discussion list > Date: 04/20/2017 04:53 PM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > > Having an issue that looks the same as this one: > We can do sequential writes to the filesystem at 7,8 GB/s total , > which is > the expected speed for our current storage > backend. While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd > server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in > this thread, but nothing seems to impact this read performance. > Any ideas? > Thanks! 
> > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. > and 250-300 Mbyte/s on sequential reads!! Random reads were on the order > of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that reducing > maxMBpS from 10000 to 100 fixed the problem! Digging further I found that > reducing prefetchThreads from default=72 to 32 also fixed it, while > leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, > I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 2741 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 13421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 105 bytes Desc: not available URL:
From makaplan at us.ibm.com Fri Apr 21 13:58:26 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 21 Apr 2017 08:58:26 -0400 Subject: [gpfsug-discuss] bizarre performance behavior - prefetchThreads In-Reply-To: <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be> Message-ID:
Seems counter-logical, but we have testimony that you may need to reduce the prefetchThreads parameter. Of all the parameters, that's the one that directly affects prefetching, so worth trying.
Jan-Frode Myklebust wrote: ...Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s....
I can speculate that having prefetchThreads too high may create a contention situation where more threads causes overall degradation in system performance.
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL:
From aaron.s.knister at nasa.gov Fri Apr 21 14:10:49 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Fri, 21 Apr 2017 13:10:49 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID:
Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU frequency governor to up the frequency (which can affect throughput). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing.
-Aaron
On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server?
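A local read test of that sort can be as simple as the sketch below; /gpfs/fs0 and the sizes are placeholders. The main point is to read back a file comfortably larger than the pagepool (or written from a different node) so the reads actually hit the NSDs instead of being served from cache.

    # on the NSD server, with the filesystem mounted locally
    dd if=/dev/zero of=/gpfs/fs0/ddtest.bin bs=16M count=4096    # ~64 GiB test file
    dd if=/gpfs/fs0/ddtest.bin of=/dev/null bs=16M               # sequential read back
    rm /gpfs/fs0/ddtest.bin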
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister <aaron.s.knister at nasa.gov>: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. 
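One quick way to confirm or rule out the frequency-scaling theory is to watch what the cores are actually doing while a read test runs, and to pin the governor for the duration of a test. A sketch, assuming the stock cpupower and turbostat tools are installed on the NSD servers:

    # watch core frequencies and C-state residency while dd/gpfsperf runs on the clients
    cpupower monitor -i 5
    turbostat

    # check which governor is active and force it to performance for a test
    cpupower frequency-info
    cpupower frequency-set -g performance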
I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Fri Apr 21 14:18:34 2017 From: david_johnson at brown.edu (David D Johnson) Date: Fri, 21 Apr 2017 09:18:34 -0400 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <02C0BD31-E743-4F1C-91E7-20555099CBF5@brown.edu> We had some luck making the client and server IB performance consistently decent by configuring tuned with the profile "latency-performance". The key is the line /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=1 which prevents cpu from going to sleep just when the next burst of IB traffic is about to arrive. -- ddj Dave Johnson Brown University CCV On Apr 21, 2017, at 9:10 AM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > > Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. > > A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: >> Hi, >> We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. 
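Applying that tuned suggestion on a RHEL/CentOS 7 style system looks roughly like the sketch below; as far as I can tell, the stock latency-performance profile on that generation keeps /dev/cpu_dma_latency pinned low, which is the same effect as the pmqos-static.py line quoted above.

    yum install -y tuned
    systemctl enable tuned && systemctl start tuned
    tuned-adm profile latency-performance
    tuned-adm active
    # fuser should show the tuned process holding /dev/cpu_dma_latency open
    fuser -v /dev/cpu_dma_latency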
>> We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. >> When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. >> >> I'll use the nsdperf tool to see if we can find anything, >> >> thanks! >> >> K >> >> On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: >>> Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf >>> >>> -Aaron >>> >>> >>> >>> >>> On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: >>>> Hi, >>>> >>>> >>>> Having an issue that looks the same as this one: >>>> >>>> We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage >>>> backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>>> (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. >>>> Any ideas? >>>> >>>> Thanks! >>>> >>>> Kenneth >>>> >>>> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>>>> I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. >>>>> >>>>> After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. >>>>> >>>>> Could something like this be the problem on your box as well? >>>>> >>>>> >>>>> >>>>> -jf >>>>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister < aaron.s.knister at nasa.gov >: >>>>> Well, I'm somewhat scrounging for hardware. This is in our test >>>>> environment :) And yep, it's got the 2U gpu-tray in it although even >>>>> without the riser it has 2 PCIe slots onboard (excluding the on-board >>>>> dual-port mezz card) so I think it would make a fine NSD server even >>>>> without the riser. >>>>> >>>>> -Aaron >>>>> >>>>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) >>>>> wrote: >>>>> > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? >>>>> > >>>>> > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) 
>>>>> > >>>>> > Simon >>>>> > ________________________________________ >>>>> > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org ] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov ] >>>>> > Sent: 17 February 2017 15:52 >>>>> > To: gpfsug main discussion list >>>>> > Subject: [gpfsug-discuss] bizarre performance behavior >>>>> > >>>>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>>>> > connections coming in and 1x FDR10 and 1x QDR connection going out to >>>>> > the clients. I was having a really hard time getting anything resembling >>>>> > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for >>>>> > reads). The back-end is a DDN SFA12K and I *know* it can do better than >>>>> > that. >>>>> > >>>>> > I don't remember quite how I figured this out but simply by running >>>>> > "openssl speed -multi 16" on the nsd server to drive up the load I saw >>>>> > an almost 4x performance jump which is pretty much goes against every >>>>> > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to >>>>> > quadruple your i/o performance"). >>>>> > >>>>> > This feels like some type of C-states frequency scaling shenanigans that >>>>> > I haven't quite ironed down yet. I booted the box with the following >>>>> > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which >>>>> > didn't seem to make much of a difference. I also tried setting the >>>>> > frequency governer to userspace and setting the minimum frequency to >>>>> > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have >>>>> > to run something to drive up the CPU load and then performance improves. >>>>> > >>>>> > I'm wondering if this could be an issue with the C1E state? I'm curious >>>>> > if anyone has seen anything like this. The node is a dx360 M4 >>>>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>>>> > >>>>> > -Aaron >>>>> > >>>>> > -- >>>>> > Aaron Knister >>>>> > NASA Center for Climate Simulation (Code 606.2) >>>>> > Goddard Space Flight Center >>>>> > (301) 286-2776 >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kums at us.ibm.com Fri Apr 21 15:01:33 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 21 Apr 2017 14:01:33 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) Turbo Mode - Enable QPI Link Frequency - Max Performance Operating Mode - Maximum Performance >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" To: gpfsug main discussion list Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? 
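As a concrete way to run the verbs checks suggested above, something along these lines should do (the mmdsh fan-out is a sketch, and the output wording differs between releases, so read the output rather than grepping for fixed strings):

# confirm RDMA is started on every node in the cluster
mmdsh -N all '/usr/lpp/mmfs/bin/mmfsadm test verbs status'

# while the dd is running, check on the client that its NSD traffic uses verbs connections
/usr/lpp/mmfs/bin/mmfsadm test verbs conn

# and look for RDMA-related errors in the daemon log on both clients and servers
grep -i verbs /var/adm/ras/mmfs.log.latest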
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. 
I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From bbanister at jumptrading.com Fri Apr 21 16:01:54 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 21 Apr 2017 15:01:54 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <7dcbac92e19043faa7968702d852668f@jumptrading.com> I think we have a new topic and new speaker for the next UG meeting at SC! Kums presenting "Performance considerations for Spectrum Scale"!! Kums, I have to say you do have a lot to offer here... 
;o) -Bryan Disclaimer: There are some selfish reasons of me wanting to hang out with you again involved in this suggestion From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kumaran Rajaram Sent: Friday, April 21, 2017 9:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] bizarre performance behavior Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) * Turbo Mode - Enable * QPI Link Frequency - Max Performance * Operating Mode - Maximum Performance * >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). * [cid:image001.gif at 01D2BA86.4D4B4C10] [cid:image002.gif at 01D2BA86.4D4B4C10] [cid:image003.gif at 01D2BA86.4D4B4C10] Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" > To: gpfsug main discussion list > Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? 
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. 
I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 61023 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 85131 bytes Desc: image002.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 84819 bytes Desc: image003.gif URL: From g.mangeot at gmail.com Fri Apr 21 16:04:58 2017 From: g.mangeot at gmail.com (Guillaume Mangeot) Date: Fri, 21 Apr 2017 17:04:58 +0200 Subject: [gpfsug-discuss] HA on snapshot scheduling in GPFS GUI Message-ID: Hi, I'm looking for a way to get the GUI working in HA to schedule snapshots. I have 2 servers with gpfs.gui service running on them. I checked a bit with lssnaprule in /usr/lpp/mmfs/gui/cli and the file /var/lib/mmfs/gui/snapshots.json But it doesn't look to be shared between all the GUI servers. 
Is there a way to get GPFS GUI working in HA to schedule snapshots? (keeping the coherency: avoiding to trigger snapshots on both servers in the same time) Regards, Guillaume Mangeot DDN Storage -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 16:33:16 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 17:33:16 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <41475044-c195-5561-c94a-b54ee30c7e68@ugent.be> On 21/04/17 15:10, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Fantastic news! It might also be worth running "cpupower monitor" or > "turbostat" on your NSD servers while you're running dd tests from the > clients to see what CPU frequency your cores are actually running at. Thanks! I verified with turbostat and cpuinfo, our cpus are running in high performance mode and frequency is always at highest level. > > A typical NSD server workload (especially with IB verbs and for reads) > can be pretty light on CPU which might not prompt your CPU crew > governor to up the frequency (which can affect throughout). If your > frequency scaling governor isn't kicking up the frequency of your CPUs > I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: >> >> Hi, >> >> We are running a test setup with 2 NSD Servers backed by 4 Dell >> Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of >> the 4 powervaults, nsd02 is primary serving LUNS of controller B. >> >> We are testing from 2 testing machines connected to the nsds with >> infiniband, verbs enabled. >> >> When we do dd from the NSD servers, we see indeed performance going >> to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is >> able to get the data at a decent speed. Since we can write from the >> clients at a good speed, I didn't suspect the communication between >> clients and nsds being the issue, especially since total performance >> stays the same using 1 or multiple clients. >> >> I'll use the nsdperf tool to see if we can find anything, >> >> thanks! >> >> K >> >> On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE >> CORP] wrote: >>> Interesting. Could you share a little more about your architecture? >>> Is it possible to mount the fs on an NSD server and do some dd's >>> from the fs on the NSD server? If that gives you decent performance >>> perhaps try NSDPERF next >>> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf >>> >>> >>> >>> -Aaron >>> >>> >>> >>> >>> On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman >>> wrote: >>>> >>>> Hi, >>>> >>>> >>>> Having an issue that looks the same as this one: >>>> >>>> We can do sequential writes to the filesystem at 7,8 GB/s total , >>>> which is the expected speed for our current storage >>>> backend. While we have even better performance with sequential >>>> reads on raw storage LUNS, using GPFS we can only reach 1GB/s in >>>> total (each nsd server seems limited by 0,5GB/s) independent of the >>>> number of clients >>>> (1,2,4,..) or ways we tested (fio,dd). 
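For reference, the frequency check Aaron suggests can be done on an NSD server while the dd is running with something like the following (flag spellings vary between turbostat/cpupower versions):

# sample the actual core frequencies during the read test
turbostat --interval 5        # older versions use: turbostat -i 5
cpupower monitor
cpupower frequency-info       # shows the active governor and frequency limits

# a cruder check that works everywhere
grep "^cpu MHz" /proc/cpuinfo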
We played with blockdev >>>> params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. >>>> as discussed in this thread, but nothing seems to impact this read >>>> performance. >>>> >>>> Any ideas? >>>> >>>> Thanks! >>>> >>>> Kenneth >>>> >>>> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>>>> I just had a similar experience from a sandisk infiniflash system >>>>> SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for >>>>> writes. and 250-300 Mbyte/s on sequential reads!! Random reads >>>>> were on the order of 2 Gbyte/s. >>>>> >>>>> After a bit head scratching snd fumbling around I found out that >>>>> reducing maxMBpS from 10000 to 100 fixed the problem! Digging >>>>> further I found that reducing prefetchThreads from default=72 to >>>>> 32 also fixed it, while leaving maxMBpS at 10000. Can now also >>>>> read at 3,2 GByte/s. >>>>> >>>>> Could something like this be the problem on your box as well? >>>>> >>>>> >>>>> >>>>> -jf >>>>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >>>>> >: >>>>> >>>>> Well, I'm somewhat scrounging for hardware. This is in our test >>>>> environment :) And yep, it's got the 2U gpu-tray in it >>>>> although even >>>>> without the riser it has 2 PCIe slots onboard (excluding the >>>>> on-board >>>>> dual-port mezz card) so I think it would make a fine NSD >>>>> server even >>>>> without the riser. >>>>> >>>>> -Aaron >>>>> >>>>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT >>>>> Services) >>>>> wrote: >>>>> > Maybe its related to interrupt handlers somehow? You drive >>>>> the load up on one socket, you push all the interrupt handling >>>>> to the other socket where the fabric card is attached? >>>>> > >>>>> > Dunno ... (Though I am intrigued you use idataplex nodes as >>>>> NSD servers, I assume its some 2U gpu-tray riser one or >>>>> something !) >>>>> > >>>>> > Simon >>>>> > ________________________________________ >>>>> > From: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> [gpfsug-discuss-bounces at spectrumscale.org >>>>> ] on behalf >>>>> of Aaron Knister [aaron.s.knister at nasa.gov >>>>> ] >>>>> > Sent: 17 February 2017 15:52 >>>>> > To: gpfsug main discussion list >>>>> > Subject: [gpfsug-discuss] bizarre performance behavior >>>>> > >>>>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>>>> > connections coming in and 1x FDR10 and 1x QDR connection >>>>> going out to >>>>> > the clients. I was having a really hard time getting >>>>> anything resembling >>>>> > sensible performance out of it (4-5Gb/s writes but maybe >>>>> 1.2Gb/s for >>>>> > reads). The back-end is a DDN SFA12K and I *know* it can do >>>>> better than >>>>> > that. >>>>> > >>>>> > I don't remember quite how I figured this out but simply by >>>>> running >>>>> > "openssl speed -multi 16" on the nsd server to drive up the >>>>> load I saw >>>>> > an almost 4x performance jump which is pretty much goes >>>>> against every >>>>> > sysadmin fiber in me (i.e. "drive up the cpu load with >>>>> unrelated crap to >>>>> > quadruple your i/o performance"). >>>>> > >>>>> > This feels like some type of C-states frequency scaling >>>>> shenanigans that >>>>> > I haven't quite ironed down yet. I booted the box with the >>>>> following >>>>> > kernel parameters "intel_idle.max_cstate=0 >>>>> processor.max_cstate=0" which >>>>> > didn't seem to make much of a difference. I also tried >>>>> setting the >>>>> > frequency governer to userspace and setting the minimum >>>>> frequency to >>>>> > 2.6ghz (it's a 2.6ghz cpu). 
None of that really matters-- I >>>>> still have >>>>> > to run something to drive up the CPU load and then >>>>> performance improves. >>>>> > >>>>> > I'm wondering if this could be an issue with the C1E state? >>>>> I'm curious >>>>> > if anyone has seen anything like this. The node is a dx360 M4 >>>>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>>>> > >>>>> > -Aaron >>>>> > >>>>> > -- >>>>> > Aaron Knister >>>>> > NASA Center for Climate Simulation (Code 606.2) >>>>> > Goddard Space Flight Center >>>>> > (301) 286-2776 >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 16:42:34 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 17:42:34 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> Hi, We already verified this on our nsds: [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --QpiSpeed QpiSpeed=maxdatarate [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --turbomode turbomode=enable [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg ?-SysProfile SysProfile=perfoptimized so sadly this is not the issue. Also the output of the verbs commands look ok, there are connections from the client to the nsds are there is data being read and writen. Thanks again! Kenneth On 21/04/17 16:01, Kumaran Rajaram wrote: > Hi, > > Try enabling the following in the BIOS of the NSD servers (screen > shots below) > > * Turbo Mode - Enable > * QPI Link Frequency - Max Performance > * Operating Mode - Maximum Performance > * > > >>>>While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total > (each nsd server seems limited by 0,5GB/s) independent of the > number of clients > > >>We are testing from 2 testing machines connected to the nsds > with infiniband, verbs enabled. 
> > > Also, It will be good to verify that all the GPFS nodes have Verbs > RDMA started using "mmfsadm test verbs status" and that the NSD > client-server communication from client to server during "dd" is > actually using Verbs RDMA using "mmfsadm test verbs conn" command (on > NSD client doing dd). If not, then GPFS might be using TCP/IP network > over which the cluster is configured impacting performance (If this is > the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and > resolve). > > * > > > > > > > Regards, > -Kums > > > > > > > From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" > > To: gpfsug main discussion list > Date: 04/21/2017 09:11 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Fantastic news! It might also be worth running "cpupower monitor" or > "turbostat" on your NSD servers while you're running dd tests from the > clients to see what CPU frequency your cores are actually running at. > > A typical NSD server workload (especially with IB verbs and for reads) > can be pretty light on CPU which might not prompt your CPU crew > governor to up the frequency (which can affect throughout). If your > frequency scaling governor isn't kicking up the frequency of your CPUs > I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: > > Hi, > > We are running a test setup with 2 NSD Servers backed by 4 Dell > Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of > the 4 powervaults, nsd02 is primary serving LUNS of controller B. > > We are testing from 2 testing machines connected to the nsds with > infiniband, verbs enabled. > > When we do dd from the NSD servers, we see indeed performance going to > 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is > able to get the data at a decent speed. Since we can write from the > clients at a good speed, I didn't suspect the communication between > clients and nsds being the issue, especially since total performance > stays the same using 1 or multiple clients. > > I'll use the nsdperf tool to see if we can find anything, > > thanks! > > K > > On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: > Interesting. Could you share a little more about your architecture? Is > it possible to mount the fs on an NSD server and do some dd's from the > fs on the NSD server? If that gives you decent performance perhaps try > NSDPERF next > _https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf_ > > > -Aaron > > > > > On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman > __ wrote: > > Hi, > > Having an issue that looks the same as this one: > > We can do sequential writes to the filesystem at 7,8 GB/s total , > which is the expected speed for our current storage > backend. While we have even better performance with sequential reads > on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each > nsd server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed > in this thread, but nothing seems to impact this read performance. > > Any ideas? > > Thanks! 
> > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for > writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on > the order of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that > reducing maxMBpS from 10000 to 100 fixed the problem! Digging further > I found that reducing prefetchThreads from default=72 to 32 also fixed > it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > <_aaron.s.knister at nasa.gov_ >: > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load > up on one socket, you push all the interrupt handling to the other > socket where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: _gpfsug-discuss-bounces at spectrumscale.org_ > [_gpfsug-discuss-bounces at spectrumscale.org_ > ] on behalf of Aaron > Knister [_aaron.s.knister at nasa.gov_ ] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at _spectrumscale.org_ > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at _spectrumscale.org_ > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at _spectrumscale.org_ _ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From kums at us.ibm.com Fri Apr 21 21:27:49 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 21 Apr 2017 20:27:49 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov><9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> Message-ID: Hi Kenneth, As it was mentioned earlier, it will be good to first verify the raw network performance between the NSD client and NSD server using the nsdperf tool that is built with RDMA support. g++ -O2 -DRDMA -o nsdperf -lpthread -lrt -libverbs -lrdmacm nsdperf.C In addition, since you have 2 x NSD servers it will be good to perform NSD client file-system performance test with just single NSD server (mmshutdown the other server, assuming all the NSDs have primary, server NSD server configured + Quorum will be intact when a NSD server is brought down) to see if it helps to improve the read performance + if there are variations in the file-system read bandwidth results between NSD_server#1 'active' vs. NSD_server #2 'active' (with other NSD server in GPFS "down" state). If there is significant variation, it can help to isolate the issue to particular NSD server (HW or IB issue?). 
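To complement the nsdperf build line above, here is a rough outline of how the tool is usually driven (the interactive sub-commands are written from memory and may differ slightly by version, so check 'help' at the prompt):

# start the server process on every node taking part in the test
./nsdperf -s &        # run on both NSD servers and on the test clients

# then, from a control node, describe and run the test; the indented lines
# are typed at the nsdperf prompt
./nsdperf
  server nsd00 nsd02
  client client01 client02
  rdma on
  connect
  test          # runs the write and read network tests
  killall
  quit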
You can issue "mmdiag --waiters" on NSD client as well as NSD servers during your dd test, to verify if there are unsual long GPFS waiters. In addition, you may issue Linux "perf top -z" command on the GPFS node to see if there is high CPU usage by any particular call/event (for e.g., If GPFS config parameter verbsRdmaMaxSendBytes has been set to low value from the default 16M, then it can cause RDMA completion threads to go CPU bound ). Please verify some performance scenarios detailed in Chapter 22 in Spectrum Scale Problem Determination Guide (link below). https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/pdf/scale_pdg.pdf?view=kc Thanks, -Kums From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/21/2017 11:43 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, We already verified this on our nsds: [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --QpiSpeed QpiSpeed=maxdatarate [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --turbomode turbomode=enable [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg ?-SysProfile SysProfile=perfoptimized so sadly this is not the issue. Also the output of the verbs commands look ok, there are connections from the client to the nsds are there is data being read and writen. Thanks again! Kenneth On 21/04/17 16:01, Kumaran Rajaram wrote: Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) Turbo Mode - Enable QPI Link Frequency - Max Performance Operating Mode - Maximum Performance >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" To: gpfsug main discussion list Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. 
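A compact way to gather the data points listed above while a dd is running from the client (nsdNodes is the stock node class; adjust names to your cluster):

# long waiters on the NSD servers, and on the client doing the dd
mmdsh -N nsdNodes '/usr/lpp/mmfs/bin/mmdiag --waiters'
mmdiag --waiters                # on the client itself

# current in-memory value of the RDMA send size, and where CPU time is going
mmdiag --config | grep -i verbsRdmaMaxSendBytes
perf top -z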
When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org[ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. 
I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From frank.tower at outlook.com Thu Apr 20 13:27:13 2017 From: frank.tower at outlook.com (Frank Tower) Date: Thu, 20 Apr 2017 12:27:13 +0000 Subject: [gpfsug-discuss] Protocol node recommendations Message-ID: Hi, We have here around 2PB GPFS where users access oney through GPFS client (used by an HPC cluster), but we will have to setup protocols nodes. We will have to share GPFS data to ~ 1000 users, where each users will have different access usage, meaning: - some will do large I/O (e.g: store 1TB files) - some will read/write more than 10k files in a raw - other will do only sequential read I already read the following wiki page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node IBM Spectrum Scale Wiki - Sizing Guidance for Protocol Node www.ibm.com developerWorks wikis allow groups of people to jointly create and maintain content through contribution and collaboration. Wikis apply the wisdom of crowds to ... But I wondering if some people have recommendations regarding hardware sizing and software tuning for such situation ? Or better, if someone already such setup ? Thank you by advance, Frank. -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Apr 22 05:30:29 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Sat, 22 Apr 2017 00:30:29 -0400 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: <52354.1492835429@turing-police.cc.vt.edu> On Thu, 20 Apr 2017 12:27:13 -0000, Frank Tower said: > - some will do large I/O (e.g: store 1TB files) > - some will read/write more than 10k files in a raw > - other will do only sequential read > But I wondering if some people have recommendations regarding hardware sizing > and software tuning for such situation ? The three most critical pieces of info are missing here: 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? 2) How many of the users are likely to be active at the same time? 1,000 users, each of whom are active an hour a week is entirely different from 200 users that are each active 140 hours a week. 3) What SLA/performance target are they expecting? If they want large 1TB I/O and 100MB/sec is acceptable, that's different than if they have a business need to go at 1.2GB/sec.... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From frank.tower at outlook.com Sat Apr 22 07:34:44 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 06:34:44 +0000 Subject: [gpfsug-discuss] Protocol node recommendations Message-ID: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. 
Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sat Apr 22 09:50:11 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sat, 22 Apr 2017 08:50:11 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower : > Hi, > > We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with > GPFS client on each node. > > We will have to open GPFS to all our users over CIFS and kerberized NFS > with ACL support for both protocol for around +1000 users > > All users have different use case and needs: > - some will do random I/O through a large set of opened files (~5k files) > - some will do large write with 500GB-1TB files > - other will arrange sequential I/O with ~10k opened files > > NFS and CIFS will share the same server, so I through to use SSD drive, at > least 128GB memory with 2 sockets. > > Regarding tuning parameters, I thought at: > > maxFilesToCache 10000 > syncIntervalStrict yes > workerThreads (8*core) > prefetchPct 40 (for now and update if needed) > > I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering > if someone could share his experience/best practice regarding hardware > sizing and/or tuning parameters. > > Thank by advance, > Frank > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sat Apr 22 19:47:59 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 18:47:59 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: <52354.1492835429@turing-police.cc.vt.edu> References: , <52354.1492835429@turing-police.cc.vt.edu> Message-ID: Hi, Thank for your answer. > 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? True, here the list: - 800 users that have 1 workstation through 1Gb/s ethernet and will use NFS/CIFS - 200 users that have 2 workstation through 1Gb/s ethernet, few have 10Gb/s ethernet and will use NFS/CIFS > 2) How many of the users are likely to be active at the same time? 1,000 > users, each of whom are active an hour a week is entirely different from > 200 users that are each active 140 hours a week. 
True again, around 200 users will actively use GPFS through NFS/CIFS during night and day, but we cannot control if people will use 2 workstations or more :( We will have peak during day with an average of 700 'workstations' > 3) What SLA/performance target are they expecting? If they want > large 1TB I/O and 100MB/sec is acceptable, that's different than if they > have a business need to go at 1.2GB/sec.... We just want to provide at normal throughput through an 1GB/s network. Users are aware of such situation and will mainly use HPC cluster for high speed and heavy computation. But they would like to do 'light' computation on their desktop. The main topic here is to sustain 'normal' throughput for all users during peak. Thank for your help. ________________________________ From: valdis.kletnieks at vt.edu Sent: Saturday, April 22, 2017 6:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Protocol node recommendations On Thu, 20 Apr 2017 12:27:13 -0000, Frank Tower said: > - some will do large I/O (e.g: store 1TB files) > - some will read/write more than 10k files in a raw > - other will do only sequential read > But I wondering if some people have recommendations regarding hardware sizing > and software tuning for such situation ? The three most critical pieces of info are missing here: 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? 2) How many of the users are likely to be active at the same time? 1,000 users, each of whom are active an hour a week is entirely different from 200 users that are each active 140 hours a week. 3) What SLA/performance target are they expecting? If they want large 1TB I/O and 100MB/sec is acceptable, that's different than if they have a business need to go at 1.2GB/sec.... -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sat Apr 22 20:22:23 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 19:22:23 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. 
We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sun Apr 23 11:07:38 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sun, 23 Apr 2017 10:07:38 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower : > Hi, > > > Thank for the recommendations. > > Now we deal with the situation of: > > > - take 3 nodes with round robin DNS that handle both protocols > > - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and > NFS services. > > > Regarding your recommendations, 256GB memory node could be a plus if we > mix both protocols for such case. > > > Is the spreadsheet publicly available or do we need to ask IBM ? > > > Thank for your help, > > Frank. > > > ------------------------------ > *From:* Jan-Frode Myklebust > *Sent:* Saturday, April 22, 2017 10:50 AM > *To:* gpfsug-discuss at spectrumscale.org > *Subject:* Re: [gpfsug-discuss] Protocol node recommendations > > That's a tiny maxFilesToCache... > > I would start by implementing the settings from > /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your > protocoll nodes, and leave further tuning to when you see you have issues. > > Regarding sizing, we have a spreadsheet somewhere where you can input some > workload parameters and get an idea for how many nodes you'll need. Your > node config seems fine, but one node seems too few to serve 1000+ users. We > support max 3000 SMB connections/node, and I believe the recommendation is > 4000 NFS connections/node. > > > -jf > l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower : > >> Hi, >> >> We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with >> GPFS client on each node. 
>> >> We will have to open GPFS to all our users over CIFS and kerberized NFS >> with ACL support for both protocol for around +1000 users >> >> All users have different use case and needs: >> - some will do random I/O through a large set of opened files (~5k files) >> - some will do large write with 500GB-1TB files >> - other will arrange sequential I/O with ~10k opened files >> >> NFS and CIFS will share the same server, so I through to use SSD >> drive, at least 128GB memory with 2 sockets. >> >> Regarding tuning parameters, I thought at: >> >> maxFilesToCache 10000 >> syncIntervalStrict yes >> workerThreads (8*core) >> prefetchPct 40 (for now and update if needed) >> >> I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering >> if someone could share his experience/best practice regarding hardware >> sizing and/or tuning parameters. >> >> Thank by advance, >> Frank >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rreuscher at verizon.net Sun Apr 23 17:43:44 2017 From: rreuscher at verizon.net (Robert Reuscher) Date: Sun, 23 Apr 2017 11:43:44 -0500 Subject: [gpfsug-discuss] LUN expansion Message-ID: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> We run GPFS on z/Linux and have been using ECKD devices for disks. We are looking at implementing some new filesystems on FCP LUNS. One of the features of a LUN is we can expand a LUN instead of adding new LUNS, where as with ECKD devices. From what I?ve found searching to see if GPFS filesystem can be expanding to see the expanded LUN, it doesn?t seem that this will work, you have to add new LUNS (or new disks) and then add them to the filesystem. Everything I?ve found is at least 2-3 old (most of it much older), and just want to check that this is still is true before we make finalize our LUN/GPFS procedures. Robert Reuscher NR5AR -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sun Apr 23 22:27:50 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sun, 23 Apr 2017 21:27:50 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. 
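On the separate LUN-expansion question from Robert earlier in this digest: the conventional route he describes (presenting additional LUNs and growing the filesystem with new NSDs, rather than resizing an existing NSD) would look roughly like the sketch below; the device path, NSD name, server list and filesystem name are invented for illustration:

  # nsd_new.stanza
  %nsd: device=/dev/mapper/newlun01
    nsd=fs1_nsd09
    servers=nsdserver1,nsdserver2
    usage=dataAndMetadata
    failureGroup=1
    pool=system

  mmcrnsd -F nsd_new.stanza
  mmadddisk fs1 -F nsd_new.stanza
  mmdf fs1    # confirm the new capacity is visible

Rebalancing existing data onto the new disk (mmrestripefs fs1 -b) is optional and I/O-intensive, so it is often left for a quiet period.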
Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sfadden at us.ibm.com Sun Apr 23 23:44:56 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Sun, 23 Apr 2017 22:44:56 +0000 Subject: [gpfsug-discuss] LUN expansion In-Reply-To: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> References: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> Message-ID: An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Mon Apr 24 10:11:25 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 24 Apr 2017 09:11:25 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: What's your SSD going to help with... will you implement it as a LROC device? Otherwise I can't see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust ; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. 
If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Mon Apr 24 11:28:08 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 24 Apr 2017 12:28:08 +0200 (CEST) Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Message-ID: <416417651.114582.1493029688959@email.1und1.de> An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Apr 24 12:14:17 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 12:14:17 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <416417651.114582.1493029688959@email.1und1.de> References: <416417651.114582.1493029688959@email.1und1.de> Message-ID: <1493032457.11896.20.camel@buzzard.me.uk> On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From service at metamodul.com Mon Apr 24 13:21:09 2017 From: service at metamodul.com (service at metamodul.com) Date: Mon, 24 Apr 2017 14:21:09 +0200 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Message-ID: Hi Jonathan todays hardware is so powerful that imho it might make sense to split a CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). I think that such a server is a little bit to big ?just to be a single NSD server. Note that i use for each GPFS service a dedicated node. So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. Inhm 4xS822L could handle this and a little bit more quite well. Of course blade technology could be used or 1U server. With kind regards Hajo --? Unix Systems Engineer MetaModul GmbH +49 177 4393994
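A concrete version of the KVM test-rig approach Jonathan describes above might look like the following sketch; the volume group (vg0), guest names and sizes are made up, and each guest simply gets its own logical volume to serve as an NSD:

  # on the KVM host: one LV per test NSD
  lvcreate -L 20G -n gpfs_nsd1 vg0
  lvcreate -L 20G -n gpfs_nsd2 vg0
  # hand the LVs to the guests as extra virtio disks
  virsh attach-disk gpfs-node1 /dev/vg0/gpfs_nsd1 vdb --sourcetype block --persistent
  virsh attach-disk gpfs-node2 /dev/vg0/gpfs_nsd2 vdb --sourcetype block --persistent

Inside the guests the devices appear as /dev/vdb and can be fed to mmcrnsd exactly as a physical LUN would be, which is plenty for functional testing even though it says nothing about production performance.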
-------- Original Message --------
From: Jonathan Buzzard
Date: 2017.04.24 13:14 (GMT+01:00)
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale
On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Apr 24 13:42:51 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 24 Apr 2017 15:42:51 +0300 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: Hi As tastes vary, I would not partition it so much for the backend. Assuming there is little to nothing overhead on the CPU at PHYP level, which it depends. On the protocols nodes, due the CTDB keeping locks together across all nodes (SMB), you would get more performance on bigger & less number of CES nodes than more and smaller. Certainly a 822 is quite a server if we go back to previous generations but I would still keep a simple backend (NSd servers), simple CES (less number of nodes the merrier) & then on the client part go as micro partitions as you like/can as the effect on the cluster is less relevant in the case of resources starvation. But, it depends on workloads, SLA and money so I say try, establish a baseline and it fills the requirements, go for it. If not change till does. Have fun From: "service at metamodul.com" To: gpfsug main discussion list Date: 24/04/2017 15:21 Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Jonathan todays hardware is so powerful that imho it might make sense to split a CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). I think that such a server is a little bit to big just to be a single NSD server. Note that i use for each GPFS service a dedicated node. So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. Inhm 4xS822L could handle this and a little bit more quite well. Of course blade technology could be used or 1U server. 
With kind regards Hajo -- Unix Systems Engineer MetaModul GmbH +49 177 4393994 -------- Urspr?ngliche Nachricht -------- Von: Jonathan Buzzard Datum:2017.04.24 13:14 (GMT+01:00) An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Apr 24 14:04:26 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 14:04:26 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <1493039066.11896.30.camel@buzzard.me.uk> On Mon, 2017-04-24 at 14:21 +0200, service at metamodul.com wrote: > Hi Jonathan > todays hardware is so powerful that imho it might make sense to split > a CEC into more "piece". For example the IBM S822L has up to 2x12 > cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). > I think that such a server is a little bit to big just to be a single > NSD server. So don't buy it for an NSD server then :-) > Note that i use for each GPFS service a dedicated node. > So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup > nodes and at least 3 test server a total of 11 server is needed. > Inhm 4xS822L could handle this and a little bit more quite well. > I think you are missing the point somewhat. Well by several country miles and quite possibly an ocean or two to be honest. Spectrum scale is supposed to be a "scale out" solution. More storage required add more arrays. More bandwidth add more servers etc. 
If you are just going to scale it all up on a *single* server then you might as well forget GPFS and do an old school standard scale up solution. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From janfrode at tanso.net Mon Apr 24 14:14:20 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 24 Apr 2017 15:14:20 +0200 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: I agree with Luis -- why so many nodes? """ So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. """ If this is your whole cluster, why not just 3x P822L/P812L running single partition per node, hosting a cluster of 3x protocol-nodes that does both direct FC for disk access, and also run backups on same nodes ? No complications, full hw performance. Then separate node for test, or separate partition on same nodes with dedicated adapters. But back to your original question. My experience is that LPAR/NPIV works great, but it's a bit annoying having to also have VIOs. Hope we'll get FC SR-IOV eventually.. Also LPAR/Dedicated-adapters naturally works fine. VMWare/RDM can be a challenge in some failure situations. It likes to pause VMs in APD or PDL situations, which will affect all VMs with access to it :-o VMs without direct disk access is trivial. -jf On Mon, Apr 24, 2017 at 2:42 PM, Luis Bolinches wrote: > Hi > > As tastes vary, I would not partition it so much for the backend. Assuming > there is little to nothing overhead on the CPU at PHYP level, which it > depends. On the protocols nodes, due the CTDB keeping locks together across > all nodes (SMB), you would get more performance on bigger & less number of > CES nodes than more and smaller. > > Certainly a 822 is quite a server if we go back to previous generations > but I would still keep a simple backend (NSd servers), simple CES (less > number of nodes the merrier) & then on the client part go as micro > partitions as you like/can as the effect on the cluster is less relevant in > the case of resources starvation. > > But, it depends on workloads, SLA and money so I say try, establish a > baseline and it fills the requirements, go for it. If not change till does. > Have fun > > > > From: "service at metamodul.com" > To: gpfsug main discussion list > Date: 24/04/2017 15:21 > Subject: Re: [gpfsug-discuss] Used virtualization technologies for > GPFS/Spectrum Scale > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Hi Jonathan > todays hardware is so powerful that imho it might make sense to split a > CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 > PCI3 slots ( 4?16 lans & 5?8 lan ). > I think that such a server is a little bit to big just to be a single NSD > server. > Note that i use for each GPFS service a dedicated node. > So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes > and at least 3 test server a total of 11 server is needed. > Inhm 4xS822L could handle this and a little bit more quite well. > > Of course blade technology could be used or 1U server. 
> > With kind regards > Hajo > > -- > Unix Systems Engineer > MetaModul GmbH > +49 177 4393994 <+49%20177%204393994> > > > -------- Urspr?ngliche Nachricht -------- > Von: Jonathan Buzzard > Datum:2017.04.24 13:14 (GMT+01:00) > An: gpfsug main discussion list > Betreff: Re: [gpfsug-discuss] Used virtualization technologies for > GPFS/Spectrum Scale > > On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > > @All > > > > > > does anybody uses virtualization technologies for GPFS Server ? If yes > > what kind and why have you selected your soulution. > > > > I think currently about using Linux on Power using 40G SR-IOV for > > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > > also assign only a certain amount of CPUs to GPFS. ( Lower license > > cost / You pay for what you use) > > > > > > I must admit that i am not familar how "good" KVM/ESX in respect to > > direct assignment of hardware is. Thus the question to the group > > > > For the most part GPFS is used at scale and in general all the > components are redundant. As such why you would want to allocate less > than a whole server into a production GPFS system in somewhat beyond me. > > That is you will have a bunch of NSD servers in the system and if one > crashes, well the other NSD's take over. Similar for protocol nodes, and > in general the total file system size is going to hundreds of TB > otherwise why bother with GPFS. > > I guess there is currently potential value at sticking the GUI into a > virtual machine to get redundancy. > > On the other hand if you want a test rig, then virtualization works > wonders. I have put GPFS on a single Linux box, using LV's for the disks > and mapping them into virtual machines under KVM. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_______ > ________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Mon Apr 24 16:29:56 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 24 Apr 2017 11:29:56 -0400 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <131241.1493047796@turing-police.cc.vt.edu> On Mon, 24 Apr 2017 14:21:09 +0200, "service at metamodul.com" said: > todays hardware is so powerful that imho it might make sense to split a CEC > into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots > ( 4?16 lans & 5?8 lan ). We look at it the other way around: Today's hardware is so powerful that you can build a cluster out of a stack of fairly low-end 1U servers (we have one cluster that's built out of Dell r630s). 
And it's more robust against hardware failures than a VM based solution - although the 822 seems to allow hot-swap of PCI cards, a dead socket or DIMM will still kill all the VMs when you go to replace it. If one 1U out of 4 goes down due to a bad DIMM (which has happened to us more often than a bad PCI card) you can just power it down and replace it.... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From service at metamodul.com Mon Apr 24 17:11:25 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 24 Apr 2017 18:11:25 +0200 (CEST) Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <1961501377.286669.1493050285874@email.1und1.de> > Jan-Frode Myklebust hat am 24. April 2017 um 15:14 geschrieben: > I agree with Luis -- why so many nodes? Many ? IMHO it is not that much. I do not like to have one server doing more than one task. Thus a NSD Server does only serves GPFS. A Protocol server serves either NFS or SMB but not both except IBM says it would be better to run NFS/SMB on the same node. A backup server runs also on its "own" hardware. So i would need at least 4 NSD Server since if 1 fails i am losing only 25% of my "performance" and still having a 4/5 quorum. Nice in case an Update of a NSD failed. Each protocol service requires at least 2 nodes and the backup service as well. I can only say that with that approach i never had problems. I have be running into problems each time i did not followed that apporach. But of course YMMV But keep in mind that each service might requires different GPFS configuration or even slightly different hardware. Saying so i am a fan of having many GPFS Server ( NSD, Protocol , Backup a.s.o ) and i do not understand why not to use many nodes ^_^ Cheers Hajo From jonathan at buzzard.me.uk Mon Apr 24 17:24:29 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 17:24:29 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <131241.1493047796@turing-police.cc.vt.edu> References: <131241.1493047796@turing-police.cc.vt.edu> Message-ID: <1493051069.11896.39.camel@buzzard.me.uk> On Mon, 2017-04-24 at 11:29 -0400, valdis.kletnieks at vt.edu wrote: > On Mon, 24 Apr 2017 14:21:09 +0200, "service at metamodul.com" said: > > > todays hardware is so powerful that imho it might make sense to split a CEC > > into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots > > ( 4?16 lans & 5?8 lan ). > > We look at it the other way around: Today's hardware is so powerful that > you can build a cluster out of a stack of fairly low-end 1U servers (we > have one cluster that's built out of Dell r630s). And it's more robust > against hardware failures than a VM based solution - although the 822 seems > to allow hot-swap of PCI cards, a dead socket or DIMM will still kill all > the VMs when you go to replace it. If one 1U out of 4 goes down due to > a bad DIMM (which has happened to us more often than a bad PCI card) you > can just power it down and replace it.... Hate to say but the 822 will happily keep trucking when the CPU (assuming it has more than one) fails and similar with the DIMM's. In fact mirrored DIMM's is reasonably common on x86 machines these days, though very few people ever use it. That said CPU failures are incredibly rare in my experience. 
The only time I have ever come across a failed CPU was on a pSeries machine and then it was only because the backup was running really slow (it was running TSM) that prompted us to look closer and see what had happened. Monitoring (Zenoss) was not setup to register the event because like when does a CPU fail and the machine keep running! I am not 100% sure on the 822 put I suspect that the DIMM's and any socketed CPU's can be hot swapped in addition to the PCI card's which I have personally done on pSeries machines. However it is a stupidly over priced solution to run GPFS, because there are better or at the very least vastly cheaper ways to get the same level of reliability. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From valdis.kletnieks at vt.edu Mon Apr 24 18:58:17 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 24 Apr 2017 13:58:17 -0400 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <1493051069.11896.39.camel@buzzard.me.uk> References: <131241.1493047796@turing-police.cc.vt.edu> <1493051069.11896.39.camel@buzzard.me.uk> Message-ID: <7337.1493056697@turing-police.cc.vt.edu> On Mon, 24 Apr 2017 17:24:29 +0100, Jonathan Buzzard said: > Hate to say but the 822 will happily keep trucking when the CPU > (assuming it has more than one) fails and similar with the DIMM's. In How about when you go to replace the DIMM? You able to hot-swap the memory without anything losing its mind? (I know this is possible in the Z/series world, but those usually have at least 2-3 more zeros in the price tag). -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From luis.bolinches at fi.ibm.com Mon Apr 24 19:08:32 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 24 Apr 2017 21:08:32 +0300 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <7337.1493056697@turing-police.cc.vt.edu> References: <131241.1493047796@turing-police.cc.vt.edu> <1493051069.11896.39.camel@buzzard.me.uk> <7337.1493056697@turing-police.cc.vt.edu> Message-ID: Hi 822 is an entry scale out Power machine, it has limited RAS compared with the high end ones (870/880). The 822 needs to be down for CPU / DIMM replacement: https://www.ibm.com/support/knowledgecenter/5148-21L/p8eg3/p8eg3_83x_8rx_kickoff.htm . And it is not a end user task. You can argue that, I owuld but it is the current statement and you pay for support for these kind of stuff. From: valdis.kletnieks at vt.edu To: gpfsug main discussion list Date: 24/04/2017 20:58 Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Sent by: gpfsug-discuss-bounces at spectrumscale.org On Mon, 24 Apr 2017 17:24:29 +0100, Jonathan Buzzard said: > Hate to say but the 822 will happily keep trucking when the CPU > (assuming it has more than one) fails and similar with the DIMM's. In How about when you go to replace the DIMM? You able to hot-swap the memory without anything losing its mind? (I know this is possible in the Z/series world, but those usually have at least 2-3 more zeros in the price tag). [attachment "attqolcz.dat" deleted by Luis Bolinches/Finland/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Mon Apr 24 22:12:14 2017 From: frank.tower at outlook.com (Frank Tower) Date: Mon, 24 Apr 2017 21:12:14 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , , Message-ID: >From what I've read from the Wiki: 'The NFS protocol performance is largely dependent on the base system performance of the protocol node hardware and network. This includes multiple factors the type and number of CPUs, the size of the main memory in the nodes, the type of disk drives used (HDD, SSD, etc.) and the disk configuration (RAID-level, replication etc.). In addition, NFS protocol performance can be impacted by the overall load of the node (such as number of clients accessing, snapshot creation/deletion and more) and administrative tasks (for example filesystem checks or online re-striping of disk arrays).' Nowadays, SSD is worst to invest. LROC could be an option in the future, but we need to quantify NFS/CIFS workload first. Are you using LROC with your GPFS installation ? Best, Frank. ________________________________ From: Sobey, Richard A Sent: Monday, April 24, 2017 11:11 AM To: gpfsug main discussion list; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations What?s your SSD going to help with? will you implement it as a LROC device? Otherwise I can?t see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust ; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. 
________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Apr 25 09:19:10 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 08:19:10 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , , Message-ID: I tried it on one node but investing in what could be up to ?5000 in SSDs when we don't know the gains isn't something I can argue. Not that LROC will hurt the environment but my users may not see any benefit. My cluster is the complete opposite of busy (relative to people saying they're seeing sustained 800MB/sec throughput), I just need it stable. Richard From: Frank Tower [mailto:frank.tower at outlook.com] Sent: 24 April 2017 22:12 To: Sobey, Richard A ; gpfsug main discussion list ; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations >From what I've read from the Wiki: 'The NFS protocol performance is largely dependent on the base system performance of the protocol node hardware and network. This includes multiple factors the type and number of CPUs, the size of the main memory in the nodes, the type of disk drives used (HDD, SSD, etc.) and the disk configuration (RAID-level, replication etc.). In addition, NFS protocol performance can be impacted by the overall load of the node (such as number of clients accessing, snapshot creation/deletion and more) and administrative tasks (for example filesystem checks or online re-striping of disk arrays).' Nowadays, SSD is worst to invest. LROC could be an option in the future, but we need to quantify NFS/CIFS workload first. 
Are you using LROC with your GPFS installation ? Best, Frank. ________________________________ From: Sobey, Richard A > Sent: Monday, April 24, 2017 11:11 AM To: gpfsug main discussion list; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations What's your SSD going to help with... will you implement it as a LROC device? Otherwise I can't see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust >; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. 
We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Apr 25 09:23:32 2017 From: chair at spectrumscale.org (Spectrum Scale UG Chair (Simon Thompson)) Date: Tue, 25 Apr 2017 09:23:32 +0100 Subject: [gpfsug-discuss] User group meeting May 9th/10th 2017 Message-ID: The UK user group is now just 2 weeks away! Its time to register ... https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi stration-32113696932 (or https://goo.gl/tRptru) Remember user group meetings are free to attend, and this year's 2 day meeting is packed full of sessions and several of the breakout sessions are cloud-focussed looking at how Spectrum Scale can be used with cloud deployments. And as usual, we have the ever popular Sven speaking with his views from the Research topics. Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, OCF and Seagate for helping make this happen! We need to finalise numbers for the evening event soon, so make sure you book your place now! Simon From S.J.Thompson at bham.ac.uk Tue Apr 25 12:20:39 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 11:20:39 +0000 Subject: [gpfsug-discuss] NFS issues Message-ID: Hi, We have recently started deploying NFS in addition our existing SMB exports on our protocol nodes. We use a RR DNS name that points to 4 VIPs for SMB services and failover seems to work fine with SMB clients. We figured we could use the same name and IPs and run Ganesha on the protocol servers, however we are seeing issues with NFS clients when IP failover occurs. In normal operation on a client, we might see several mounts from different IPs obviously due to the way the DNS RR is working, but it all works fine. In a failover situation, the IP will move to another node and some clients will carry on, others will hang IO to the mount points referred to by the IP which has moved. We can *sometimes* trigger this by manually suspending a CES node, but not always and some clients mounting from the IP moving will be fine, others won't. If we resume a node an it fails back, the clients that are hanging will usually recover fine. We can reboot a client prior to failback and it will be fine, stopping and starting the ganesha service on a protocol node will also sometimes resolve the issues. So, has anyone seen this sort of issue and any suggestions for how we could either debug more or workaround? 
We are currently running the packages nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). At one point we were seeing it a lot, and could track it back to an underlying GPFS network issue that was causing protocol nodes to be expelled occasionally, we resolved that and the issues became less apparent, but maybe we just fixed one failure mode so see it less often. On the clients, we use -o sync,hard BTW as in the IBM docs. On a client showing the issues, we'll see in dmesg, NFS related messages like: [Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out Which explains the client hang on certain mount points. The symptoms feel very much like those logged in this Gluster/ganesha bug: https://bugzilla.redhat.com/show_bug.cgi?id=1354439 Thanks Simon From Mark.Bush at siriuscom.com Tue Apr 25 14:27:38 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 13:27:38 +0000 Subject: [gpfsug-discuss] Perfmon and GUI Message-ID: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = 
"Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Apr 25 14:44:59 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 13:44:59 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> References: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> Message-ID: I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? 
Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.ouwehand at vumc.nl Tue Apr 25 14:51:22 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Tue, 25 Apr 2017 13:51:22 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: <5594921EA5B3674AB44AD9276126AAF40170DD3159@sp-mx-mbx42> Hello, At first a short introduction. My name is Jaap Jan Ouwehand, I work at a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical (office, research and clinical data) business process. 
We have three large GPFS filesystems for different purposes. We also had such a situation with cNFS. A failover (IPtakeover) was technically good, only clients experienced "stale filehandles". We opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few months later, the solution appeared to be in the fsid option. An NFS filehandle is built by a combination of fsid and a hash function on the inode. After a failover, the fsid value can be different and the client has a "stale filehandle". To avoid this, the fsid value can be statically specified. See: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm Maybe there is also a value in Ganesha that changes after a failover. Certainly since most sessions will be re-established after a failback. Maybe you see more debug information with tcpdump. Kind regards, ? Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT E: jj.ouwehand at vumc.nl W: www.vumc.com -----Oorspronkelijk bericht----- Van: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson (IT Research Support) Verzonden: dinsdag 25 april 2017 13:21 Aan: gpfsug-discuss at spectrumscale.org Onderwerp: [gpfsug-discuss] NFS issues Hi, We have recently started deploying NFS in addition our existing SMB exports on our protocol nodes. We use a RR DNS name that points to 4 VIPs for SMB services and failover seems to work fine with SMB clients. We figured we could use the same name and IPs and run Ganesha on the protocol servers, however we are seeing issues with NFS clients when IP failover occurs. In normal operation on a client, we might see several mounts from different IPs obviously due to the way the DNS RR is working, but it all works fine. In a failover situation, the IP will move to another node and some clients will carry on, others will hang IO to the mount points referred to by the IP which has moved. We can *sometimes* trigger this by manually suspending a CES node, but not always and some clients mounting from the IP moving will be fine, others won't. If we resume a node an it fails back, the clients that are hanging will usually recover fine. We can reboot a client prior to failback and it will be fine, stopping and starting the ganesha service on a protocol node will also sometimes resolve the issues. So, has anyone seen this sort of issue and any suggestions for how we could either debug more or workaround? We are currently running the packages nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). At one point we were seeing it a lot, and could track it back to an underlying GPFS network issue that was causing protocol nodes to be expelled occasionally, we resolved that and the issues became less apparent, but maybe we just fixed one failure mode so see it less often. On the clients, we use -o sync,hard BTW as in the IBM docs. On a client showing the issues, we'll see in dmesg, NFS related messages like: [Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out Which explains the client hang on certain mount points. 
The symptoms feel very much like those logged in this Gluster/ganesha bug: https://bugzilla.redhat.com/show_bug.cgi?id=1354439 Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Tue Apr 25 15:06:04 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 14:06:04 +0000 Subject: [gpfsug-discuss] NFS issues Message-ID: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical >(office, research and clinical data) business process. We have three >large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We opened >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months >later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson >(IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: gpfsug-discuss at spectrumscale.org >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and failover >seems to work fine with SMB clients. We figured we could use the same >name and IPs and run Ganesha on the protocol servers, however we are >seeing issues with NFS clients when IP failover occurs. > >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it all >works fine. 
> >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by manually >suspending a CES node, but not always and some clients mounting from the >IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not >responding, timed out > >Which explains the client hang on certain mount points. > >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Mark.Bush at siriuscom.com Tue Apr 25 15:13:58 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 14:13:58 +0000 Subject: [gpfsug-discuss] Perfmon and GUI Message-ID: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. 
The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. 
This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Tue Apr 25 15:29:07 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 14:29:07 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? 
I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Tue Apr 25 15:31:13 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 14:31:13 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: No worries Mark. We don?t use NFS here (yet) so I can?t help there. Glad I could help. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 15:29 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. 
What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Tue Apr 25 18:04:41 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 25 Apr 2017 17:04:41 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 
16.06 skrev Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk>: > Hi, > > From what I can see, Ganesha uses the Export_Id option in the config file > (which is managed by CES) for this. I did find some reference in the > Ganesha devs list that if its not set, then it would read the FSID from > the GPFS file-system, either way they should surely be consistent across > all the nodes. The posts I found were from someone with an IBM email > address, so I guess someone in the IBM teams. > > I checked a couple of my protocol nodes and they use the same Export_Id > consistently, though I guess that might not be the same as the FSID value. > > Perhaps someone from IBM could comment on if FSID is likely to the cause > of my problems? > > Thanks > > Simon > > On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Ouwehand, JJ" j.ouwehand at vumc.nl> wrote: > > >Hello, > > > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a > >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM > >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical > >(office, research and clinical data) business process. We have three > >large GPFS filesystems for different purposes. > > > >We also had such a situation with cNFS. A failover (IPtakeover) was > >technically good, only clients experienced "stale filehandles". We opened > >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months > >later, the solution appeared to be in the fsid option. > > > >An NFS filehandle is built by a combination of fsid and a hash function > >on the inode. After a failover, the fsid value can be different and the > >client has a "stale filehandle". To avoid this, the fsid value can be > >statically specified. See: > > > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum > . > >scale.v4r22.doc/bl1adm_nfslin.htm > > > >Maybe there is also a value in Ganesha that changes after a failover. > >Certainly since most sessions will be re-established after a failback. > >Maybe you see more debug information with tcpdump. > > > > > >Kind regards, > > > >Jaap Jan Ouwehand > >ICT Specialist (Storage & Linux) > >VUmc - ICT > >E: jj.ouwehand at vumc.nl > >W: www.vumc.com > > > > > > > >-----Oorspronkelijk bericht----- > >Van: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson > >(IT Research Support) > >Verzonden: dinsdag 25 april 2017 13:21 > >Aan: gpfsug-discuss at spectrumscale.org > >Onderwerp: [gpfsug-discuss] NFS issues > > > >Hi, > > > >We have recently started deploying NFS in addition our existing SMB > >exports on our protocol nodes. > > > >We use a RR DNS name that points to 4 VIPs for SMB services and failover > >seems to work fine with SMB clients. We figured we could use the same > >name and IPs and run Ganesha on the protocol servers, however we are > >seeing issues with NFS clients when IP failover occurs. > > > >In normal operation on a client, we might see several mounts from > >different IPs obviously due to the way the DNS RR is working, but it all > >works fine. > > > >In a failover situation, the IP will move to another node and some > >clients will carry on, others will hang IO to the mount points referred > >to by the IP which has moved. We can *sometimes* trigger this by manually > >suspending a CES node, but not always and some clients mounting from the > >IP moving will be fine, others won't. 
> > > >If we resume a node an it fails back, the clients that are hanging will > >usually recover fine. We can reboot a client prior to failback and it > >will be fine, stopping and starting the ganesha service on a protocol > >node will also sometimes resolve the issues. > > > >So, has anyone seen this sort of issue and any suggestions for how we > >could either debug more or workaround? > > > >We are currently running the packages > >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > > > >At one point we were seeing it a lot, and could track it back to an > >underlying GPFS network issue that was causing protocol nodes to be > >expelled occasionally, we resolved that and the issues became less > >apparent, but maybe we just fixed one failure mode so see it less often. > > > >On the clients, we use -o sync,hard BTW as in the IBM docs. > > > >On a client showing the issues, we'll see in dmesg, NFS related messages > >like: > >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not > >responding, timed out > > > >Which explains the client hang on certain mount points. > > > >The symptoms feel very much like those logged in this Gluster/ganesha bug: > >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > > > > >Thanks > > > >Simon > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hoang.nguyen at seagate.com Tue Apr 25 18:12:19 2017 From: hoang.nguyen at seagate.com (Hoang Nguyen) Date: Tue, 25 Apr 2017 10:12:19 -0700 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: I have a customer with a slightly different issue but sounds somewhat related. If you stop and stop the NFS service on a CES node or update an existing export which will restart Ganesha. Some of their NFS clients do not reconnect in a very similar fashion as you described. I haven't been able to reproduce it on my test system repeatedly but using soft NFS mounts seems to help. Seems like it happens more often to clients currently running NFS IO during the outage. But I'm interested to see what you guys uncover. Thanks, Hoang On Tue, Apr 25, 2017 at 7:06 AM, Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk> wrote: > Hi, > > From what I can see, Ganesha uses the Export_Id option in the config file > (which is managed by CES) for this. I did find some reference in the > Ganesha devs list that if its not set, then it would read the FSID from > the GPFS file-system, either way they should surely be consistent across > all the nodes. The posts I found were from someone with an IBM email > address, so I guess someone in the IBM teams. > > I checked a couple of my protocol nodes and they use the same Export_Id > consistently, though I guess that might not be the same as the FSID value. > > Perhaps someone from IBM could comment on if FSID is likely to the cause > of my problems? 
> > Thanks > > Simon > > On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Ouwehand, JJ" j.ouwehand at vumc.nl> wrote: > > >Hello, > > > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a > >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM > >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical > >(office, research and clinical data) business process. We have three > >large GPFS filesystems for different purposes. > > > >We also had such a situation with cNFS. A failover (IPtakeover) was > >technically good, only clients experienced "stale filehandles". We opened > >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months > >later, the solution appeared to be in the fsid option. > > > >An NFS filehandle is built by a combination of fsid and a hash function > >on the inode. After a failover, the fsid value can be different and the > >client has a "stale filehandle". To avoid this, the fsid value can be > >statically specified. See: > > > >https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ibm.com_support_ > knowledgecenter_STXKQY-5F4.2.2_com.ibm.spectrum&d=DwICAg&c= > IGDlg0lD0b-nebmJJ0Kp8A&r=erT0ET1g1dsvTDYndRRTAAZ6Dneebt > G6F47PIUMDXFw&m=K3iXrW2N_HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s= > PIXnA0UQbneTHMRxvUcmsvZK6z5V2XU4jR_GIVaZP5Q&e= . > >scale.v4r22.doc/bl1adm_nfslin.htm > > > >Maybe there is also a value in Ganesha that changes after a failover. > >Certainly since most sessions will be re-established after a failback. > >Maybe you see more debug information with tcpdump. > > > > > >Kind regards, > > > >Jaap Jan Ouwehand > >ICT Specialist (Storage & Linux) > >VUmc - ICT > >E: jj.ouwehand at vumc.nl > >W: www.vumc.com > > > > > > > >-----Oorspronkelijk bericht----- > >Van: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson > >(IT Research Support) > >Verzonden: dinsdag 25 april 2017 13:21 > >Aan: gpfsug-discuss at spectrumscale.org > >Onderwerp: [gpfsug-discuss] NFS issues > > > >Hi, > > > >We have recently started deploying NFS in addition our existing SMB > >exports on our protocol nodes. > > > >We use a RR DNS name that points to 4 VIPs for SMB services and failover > >seems to work fine with SMB clients. We figured we could use the same > >name and IPs and run Ganesha on the protocol servers, however we are > >seeing issues with NFS clients when IP failover occurs. > > > >In normal operation on a client, we might see several mounts from > >different IPs obviously due to the way the DNS RR is working, but it all > >works fine. > > > >In a failover situation, the IP will move to another node and some > >clients will carry on, others will hang IO to the mount points referred > >to by the IP which has moved. We can *sometimes* trigger this by manually > >suspending a CES node, but not always and some clients mounting from the > >IP moving will be fine, others won't. > > > >If we resume a node an it fails back, the clients that are hanging will > >usually recover fine. We can reboot a client prior to failback and it > >will be fine, stopping and starting the ganesha service on a protocol > >node will also sometimes resolve the issues. > > > >So, has anyone seen this sort of issue and any suggestions for how we > >could either debug more or workaround? > > > >We are currently running the packages > >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). 
> > > >At one point we were seeing it a lot, and could track it back to an > >underlying GPFS network issue that was causing protocol nodes to be > >expelled occasionally, we resolved that and the issues became less > >apparent, but maybe we just fixed one failure mode so see it less often. > > > >On the clients, we use -o sync,hard BTW as in the IBM docs. > > > >On a client showing the issues, we'll see in dmesg, NFS related messages > >like: > >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not > >responding, timed out > > > >Which explains the client hang on certain mount points. > > > >The symptoms feel very much like those logged in this Gluster/ganesha bug: > >https://urldefense.proofpoint.com/v2/url?u=https- > 3A__bugzilla.redhat.com_show-5Fbug.cgi-3Fid-3D1354439&d= > DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=erT0ET1g1dsvTDYndRRTAAZ6Dneebt > G6F47PIUMDXFw&m=K3iXrW2N_HcdrGDuKmRWFjypuPLPJDIm9VosFII > sFoI&s=KN5WKk1vLEt0Y_17nVQeDi1lK5mSQUZQ7lPtQK3FBG4&e= > > > > > >Thanks > > > >Simon > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_ > listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_ > listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > -- Hoang Nguyen *? *Sr Staff Engineer Seagate Technology office: +1 (858) 751-4487 mobile: +1 (858) 284-7846 www.seagate.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Apr 25 18:30:40 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 17:30:40 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: , Message-ID: I did some digging in the mmcesfuncs to see what happens server side on fail over. Basically the server losing the IP is supposed to terminate all sessions and the receiver server sends ACK tickles. My current supposition is that for whatever reason, the losing server isn't releasing something and the client still has hold of a connection which is mostly dead. The tickle then fails to the client from the new server. This would explain why failing the IP back to the original server usually brings the client back to life. This is only my working theory at the moment as we can't reliably reproduce this. Next time it happens we plan to grab some netstat from each side. Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the server that received the IP and see if that fixes it (i.e. the receiver server didn't tickle properly). 
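Spelling that out as a rough sketch (the addresses and client port below are made-up examples, and 2049 is just the usual NFSd port):

  # on the client and on the CES node that received the IP, compare TCP state
  netstat -tn | grep 2049

  # if the client still shows an ESTABLISHED connection that the new server
  # has no record of, re-send the ACK tickle from that server, e.g.
  mmcmi tcpack 10.10.0.50:2049 10.20.1.23:876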
(Usage extracted from mmcesfuncs which is ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) for anyone interested. Then try and kill he sessions on the losing server to check if there is stuff still open and re-tickle the client. If we can get steps to workaround, I'll log a PMR. I suppose I could do that now, but given its non deterministic and we want to be 100% sure it's not us doing something wrong, I'm inclined to wait until we do some more testing. I agree with the suggestion that it's probably IO pending nodes that are affected, but don't have any data to back that up yet. We did try with a read workload on a client, but may we need either long IO blocked reads or writes (from the GPFS end). We also originally had soft as the default option, but saw issues then and the docs suggested hard, so we switched and also enabled sync (we figured maybe it was NFS client with uncommited writes), but neither have resolved the issues entirely. Difficult for me to say if they improved the issue though given its sporadic. Appreciate people's suggestions! Thanks Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode Myklebust [janfrode at tanso.net] Sent: 25 April 2017 18:04 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" on behalf of j.ouwehand at vumc.nl> wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical >(office, research and clinical data) business process. We have three >large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We opened >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months >later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. 
>scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson >(IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: gpfsug-discuss at spectrumscale.org >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and failover >seems to work fine with SMB clients. We figured we could use the same >name and IPs and run Ganesha on the protocol servers, however we are >seeing issues with NFS clients when IP failover occurs. > >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it all >works fine. > >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by manually >suspending a CES node, but not always and some clients mounting from the >IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not >responding, timed out > >Which explains the client hang on certain mount points. 
> >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Greg.Lehmann at csiro.au Wed Apr 26 00:46:35 2017 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Tue, 25 Apr 2017 23:46:35 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: , Message-ID: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Are you using infiniband or Ethernet? I'm wondering if IBM have solved the gratuitous arp issue which we see with our non-protocols NFS implementation. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: Wednesday, 26 April 2017 3:31 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I did some digging in the mmcesfuncs to see what happens server side on fail over. Basically the server losing the IP is supposed to terminate all sessions and the receiver server sends ACK tickles. My current supposition is that for whatever reason, the losing server isn't releasing something and the client still has hold of a connection which is mostly dead. The tickle then fails to the client from the new server. This would explain why failing the IP back to the original server usually brings the client back to life. This is only my working theory at the moment as we can't reliably reproduce this. Next time it happens we plan to grab some netstat from each side. Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the server that received the IP and see if that fixes it (i.e. the receiver server didn't tickle properly). (Usage extracted from mmcesfuncs which is ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) for anyone interested. Then try and kill he sessions on the losing server to check if there is stuff still open and re-tickle the client. If we can get steps to workaround, I'll log a PMR. I suppose I could do that now, but given its non deterministic and we want to be 100% sure it's not us doing something wrong, I'm inclined to wait until we do some more testing. I agree with the suggestion that it's probably IO pending nodes that are affected, but don't have any data to back that up yet. We did try with a read workload on a client, but may we need either long IO blocked reads or writes (from the GPFS end). We also originally had soft as the default option, but saw issues then and the docs suggested hard, so we switched and also enabled sync (we figured maybe it was NFS client with uncommited writes), but neither have resolved the issues entirely. Difficult for me to say if they improved the issue though given its sporadic. Appreciate people's suggestions! 
Thanks Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode Myklebust [janfrode at tanso.net] Sent: 25 April 2017 18:04 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" on behalf of j.ouwehand at vumc.nl> wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at >a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >critical (office, research and clinical data) business process. We have >three large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We >opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >months later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: >gpfsug-discuss-bounces at spectrumscale.orgspectrumscale.org> >[mailto:gpfsug-discuss-bounces at spectrumscale.orgbounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: >gpfsug-discuss at spectrumscale.orgg> >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and >failover seems to work fine with SMB clients. We figured we could use >the same name and IPs and run Ganesha on the protocol servers, however >we are seeing issues with NFS clients when IP failover occurs. 
> >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it >all works fine. > >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by >manually suspending a CES node, but not always and some clients >mounting from the IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related >messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server >MYNFSSERVER.bham.ac.uk not responding, >timed out > >Which explains the client hang on certain mount points. > >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Mark.Bush at siriuscom.com Wed Apr 26 14:26:08 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Wed, 26 Apr 2017 13:26:08 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: My saga has come to an end. Turns out to get perf stats for NFS you need the gpfs.pm-ganesha package - duh. I typically do manual installs of scale so I just missed this one as it was buried in /usr/lpp/mmfs/4.2.3.0/zimon_rpms/rhel7. Anyway, package installed and now I get NFS stats in the gui and from cli. From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 9:31 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI No worries Mark. We don?t use NFS here (yet) so I can?t help there. Glad I could help. 
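For anyone else doing manual installs, the two fixes that surface in this thread (the gpfs.pm-ganesha package Mark mentions above, and the manual copy of the protocol sensor definitions from the support note quoted further down) come down to roughly the following on each protocol node. This is only a sketch: the rpm and source paths are the 4.2.3/4.2.1 trees quoted in the thread, and the nfsIOrate query name is an assumption.

   # NFS (Ganesha) perfmon sensor - easy to miss on a manual install
   yum install /usr/lpp/mmfs/4.2.3.0/zimon_rpms/rhel7/gpfs.pm-ganesha*.rpm

   # Protocol sensor definitions the installer would normally copy into place
   cd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default
   cp CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg \
      SMBSensors.cfg SMBStats.cfg /opt/IBM/zimon/
   # (ZIMonCollector.cfg in the same directory is the collector's own config)

   # Restart the sensors, and the collector/GUI on the collector node
   systemctl restart pmsensors
   systemctl restart pmcollector
   systemctl restart gpfsgui

   # Sanity check from the CLI before looking at the GUI
   mmperfmon query smb2
   mmperfmon query nfsIOrate   # predefined query names vary by release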
Richard

From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush
Sent: 25 April 2017 15:29
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Perfmon and GUI

Update: So SMB monitoring is now working after copying all files per Richard's recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsgui. Sadly, NFS monitoring isn't. It doesn't work from the CLI either, though. So clearly, something is up with that part. I continue to troubleshoot.

From: Mark Bush
Reply-To: gpfsug main discussion list
Date: Tuesday, April 25, 2017 at 9:13 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Perfmon and GUI

Interesting. Some files were indeed already there, but it was missing a few, NFSIO.cfg being the most notable to me. I've gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I'm going to restart the GUI service next to see if that makes a difference. Interestingly, I can do things like mmperfmon query smb2 and that tends to work and give me real data, so I'm not sure where the breakdown is in the GUI.

Mark

From: "Sobey, Richard A"
Reply-To: gpfsug main discussion list
Date: Tuesday, April 25, 2017 at 8:44 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Perfmon and GUI

I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?), here's what support said. Can you try? I think you've already got the relevant bits in your .cfg files, so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again, bear in mind this affected me on 4.2.1 and you're using 4.2.3, so ymmv:

"I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer, but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in the pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon:

[root at node1 default]# pwd
/usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default
[root at node1 default]# ls
CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg"

Richard

From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush
Sent: 25 April 2017 14:28
To: gpfsug-discuss at spectrumscale.org
Subject: [gpfsug-discuss] Perfmon and GUI

Anyone know why, in the GUI, when I go to look at things like nodes, select a protocol node and then pick NFS or SMB, the boxes where a graph is supposed to be show a red circled X saying "Performance collector did not return any data"? I've added the things from the link below into my protocol nodes' /opt/IBM/zimon/ZIMonSensors.cfg file:

https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm

Also restarted both pmsensors and pmcollector on the nodes. What am I missing?
Here's my ZIMonSensors.cfg file:

[root at n3 zimon]# cat ZIMonSensors.cfg
cephMon = "/opt/IBM/zimon/CephMonProxy"
cephRados = "/opt/IBM/zimon/CephRadosProxy"
colCandidates = "n1"
colRedundancy = 1
collectors = { host = "n1" port = "4739" }
config = "/opt/IBM/zimon/ZIMonSensors.cfg"
ctdbstat = ""
daemonize = T
hostname = ""
ipfixinterface = "0.0.0.0"
logfile = "/var/log/zimon/ZIMonSensors.log"
loglevel = "info"
mmcmd = "/opt/IBM/zimon/MMCmdProxy"
mmdfcmd = "/opt/IBM/zimon/MMDFProxy"
mmpmon = "/opt/IBM/zimon/MmpmonSockProxy"
piddir = "/var/run"
release = "4.2.3-0"
sensors = { name = "CPU" period = 1 },
{ name = "Load" period = 1 },
{ name = "Memory" period = 1 },
{ name = "Network" period = 1 },
{ name = "Netstat" period = 10 },
{ name = "Diskstat" period = 0 },
{ name = "DiskFree" period = 600 },
{ name = "GPFSDisk" period = 0 },
{ name = "GPFSFilesystem" period = 1 },
{ name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" },
{ name = "GPFSPoolIO" period = 0 },
{ name = "GPFSVFS" period = 1 },
{ name = "GPFSIOC" period = 0 },
{ name = "GPFSVIO" period = 0 },
{ name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" },
{ name = "GPFSvFLUSH" period = 0 },
{ name = "GPFSNode" period = 1 },
{ name = "GPFSNodeAPI" period = 1 },
{ name = "GPFSFilesystemAPI" period = 1 },
{ name = "GPFSLROC" period = 0 },
{ name = "GPFSCHMS" period = 0 },
{ name = "GPFSAFM" period = 0 },
{ name = "GPFSAFMFS" period = 0 },
{ name = "GPFSAFMFSET" period = 0 },
{ name = "GPFSRPCS" period = 10 },
{ name = "GPFSWaiters" period = 10 },
{ name = "GPFSFilesetQuota" period = 3600 },
{ name = "GPFSDiskCap" period = 0 },
{ name = "GPFSFileset" period = 0 restrict = "n1" },
{ name = "GPFSPool" period = 0 restrict = "n1" },
{ name = "Infiniband" period = 0 },
{ name = "CTDBDBStats" period = 1 type = "Generic" },
{ name = "CTDBStats" period = 1 type = "Generic" },
{ name = "NFSIO" period = 1 type = "Generic" },
{ name = "SMBGlobalStats" period = 1 type = "Generic" },
{ name = "SMBStats" period = 1 type = "Generic" }
smbstat = ""

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From S.J.Thompson at bham.ac.uk Wed Apr 26 15:20:30 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support))
Date: Wed, 26 Apr 2017 14:20:30 +0000
Subject: [gpfsug-discuss] NFS issues
In-Reply-To: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au>
References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au>
Message-ID:

Nope, the clients are all L3 connected, so not an arp issue.

Two things we have observed:

1. It triggers when one of the CES IPs moves and quickly moves back again.
The move occurs because the NFS server goes into grace: 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 2 nodeid -1 ip 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 recovery release ip 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE 2017-04-25 20:37:42 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:37:44 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:37:44 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 4 nodeid 2 ip We can't see in any of the logs WHY ganesha is going into grace. Any suggestions on how to debug this further? (I.e. If we can stop the grace issues, we can solve the problem mostly). 2. Our clients are using LDAP which is bound to the CES IPs. If we shutdown nslcd on the client we can get the client to recover once all the TIME_WAIT connections have gone. Maybe this was a bad choice on our side to bind to the CES IPs - we figured it would handily move the IPs for us, but I guess the mmcesfuncs isn't aware of this and so doesn't kill the connections to the IP as it goes away. So two approaches we are going to try. Reconfigure the nslcd on a couple of clients and see if they still show up the issues when fail-over occurs. Second is to work out why the NFS servers are going into grace in the first place. Simon On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Greg.Lehmann at csiro.au" wrote: >Are you using infiniband or Ethernet? I'm wondering if IBM have solved >the gratuitous arp issue which we see with our non-protocols NFS >implementation. > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >Thompson (IT Research Support) >Sent: Wednesday, 26 April 2017 3:31 AM >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] NFS issues > >I did some digging in the mmcesfuncs to see what happens server side on >fail over. > >Basically the server losing the IP is supposed to terminate all sessions >and the receiver server sends ACK tickles. > >My current supposition is that for whatever reason, the losing server >isn't releasing something and the client still has hold of a connection >which is mostly dead. The tickle then fails to the client from the new >server. > >This would explain why failing the IP back to the original server usually >brings the client back to life. > >This is only my working theory at the moment as we can't reliably >reproduce this. Next time it happens we plan to grab some netstat from >each side. > >Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >server that received the IP and see if that fixes it (i.e. the receiver >server didn't tickle properly). (Usage extracted from mmcesfuncs which is >ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) >for anyone interested. > >Then try and kill he sessions on the losing server to check if there is >stuff still open and re-tickle the client. 
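On the open question above of why Ganesha keeps entering grace, one avenue is to raise the NFS log level around a failover and watch the CES state alongside the ganesha log. A rough sketch, assuming LOG_LEVEL is tunable through mmnfs on this release and the default log location; note that mmnfs configuration changes may themselves restart NFS on the CES nodes:

   # CES node state and address assignments around the time of the move
   mmces state show -a
   mmces address list

   # Temporarily raise the Ganesha log level (LOG_LEVEL attribute is an assumption)
   mmnfs configuration change LOG_LEVEL=DEBUG

   # Watch grace/recovery events on a protocol node
   tail -f /var/log/ganesha.log | grep -iE 'grace|recovery'

   # Drop it back afterwards (EVENT is the usual default)
   mmnfs configuration change LOG_LEVEL=EVENT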
> >If we can get steps to workaround, I'll log a PMR. I suppose I could do >that now, but given its non deterministic and we want to be 100% sure >it's not us doing something wrong, I'm inclined to wait until we do some >more testing. > >I agree with the suggestion that it's probably IO pending nodes that are >affected, but don't have any data to back that up yet. We did try with a >read workload on a client, but may we need either long IO blocked reads >or writes (from the GPFS end). > >We also originally had soft as the default option, but saw issues then >and the docs suggested hard, so we switched and also enabled sync (we >figured maybe it was NFS client with uncommited writes), but neither have >resolved the issues entirely. Difficult for me to say if they improved >the issue though given its sporadic. > >Appreciate people's suggestions! > >Thanks > >Simon >________________________________________ >From: gpfsug-discuss-bounces at spectrumscale.org >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >Myklebust [janfrode at tanso.net] >Sent: 25 April 2017 18:04 >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] NFS issues > >I *think* I've seen this, and that we then had open TCP connection from >client to NFS server according to netstat, but these connections were not >visible from netstat on NFS-server side. > >Unfortunately I don't remember what the fix was.. > > > > -jf > >tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >>: >Hi, > >From what I can see, Ganesha uses the Export_Id option in the config file >(which is managed by CES) for this. I did find some reference in the >Ganesha devs list that if its not set, then it would read the FSID from >the GPFS file-system, either way they should surely be consistent across >all the nodes. The posts I found were from someone with an IBM email >address, so I guess someone in the IBM teams. > >I checked a couple of my protocol nodes and they use the same Export_Id >consistently, though I guess that might not be the same as the FSID value. > >Perhaps someone from IBM could comment on if FSID is likely to the cause >of my problems? > >Thanks > >Simon > >On 25/04/2017, 14:51, >"gpfsug-discuss-bounces at spectrumscale.orgectrumscale.org> on behalf of Ouwehand, JJ" >ectrumscale.org> on behalf of >j.ouwehand at vumc.nl> wrote: > >>Hello, >> >>At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>critical (office, research and clinical data) business process. We have >>three large GPFS filesystems for different purposes. >> >>We also had such a situation with cNFS. A failover (IPtakeover) was >>technically good, only clients experienced "stale filehandles". We >>opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>months later, the solution appeared to be in the fsid option. >> >>An NFS filehandle is built by a combination of fsid and a hash function >>on the inode. After a failover, the fsid value can be different and the >>client has a "stale filehandle". To avoid this, the fsid value can be >>statically specified. See: >> >>https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>. >>scale.v4r22.doc/bl1adm_nfslin.htm >> >>Maybe there is also a value in Ganesha that changes after a failover. >>Certainly since most sessions will be re-established after a failback. 
>>Maybe you see more debug information with tcpdump. >> >> >>Kind regards, >> >>Jaap Jan Ouwehand >>ICT Specialist (Storage & Linux) >>VUmc - ICT >>E: jj.ouwehand at vumc.nl >>W: www.vumc.com >> >> >> >>-----Oorspronkelijk bericht----- >>Van: >>gpfsug-discuss-bounces at spectrumscale.org>spectrumscale.org> >>[mailto:gpfsug-discuss-bounces at spectrumscale.org>bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>Verzonden: dinsdag 25 april 2017 13:21 >>Aan: >>gpfsug-discuss at spectrumscale.org>g> >>Onderwerp: [gpfsug-discuss] NFS issues >> >>Hi, >> >>We have recently started deploying NFS in addition our existing SMB >>exports on our protocol nodes. >> >>We use a RR DNS name that points to 4 VIPs for SMB services and >>failover seems to work fine with SMB clients. We figured we could use >>the same name and IPs and run Ganesha on the protocol servers, however >>we are seeing issues with NFS clients when IP failover occurs. >> >>In normal operation on a client, we might see several mounts from >>different IPs obviously due to the way the DNS RR is working, but it >>all works fine. >> >>In a failover situation, the IP will move to another node and some >>clients will carry on, others will hang IO to the mount points referred >>to by the IP which has moved. We can *sometimes* trigger this by >>manually suspending a CES node, but not always and some clients >>mounting from the IP moving will be fine, others won't. >> >>If we resume a node an it fails back, the clients that are hanging will >>usually recover fine. We can reboot a client prior to failback and it >>will be fine, stopping and starting the ganesha service on a protocol >>node will also sometimes resolve the issues. >> >>So, has anyone seen this sort of issue and any suggestions for how we >>could either debug more or workaround? >> >>We are currently running the packages >>nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >> >>At one point we were seeing it a lot, and could track it back to an >>underlying GPFS network issue that was causing protocol nodes to be >>expelled occasionally, we resolved that and the issues became less >>apparent, but maybe we just fixed one failure mode so see it less often. >> >>On the clients, we use -o sync,hard BTW as in the IBM docs. >> >>On a client showing the issues, we'll see in dmesg, NFS related >>messages >>like: >>[Wed Apr 12 16:59:53 2017] nfs: server >>MYNFSSERVER.bham.ac.uk not responding, >>timed out >> >>Which explains the client hang on certain mount points. 
>> >>The symptoms feel very much like those logged in this Gluster/ganesha >>bug: >>https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >> >> >>Thanks >> >>Simon >> >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Wed Apr 26 15:27:03 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 26 Apr 2017 14:27:03 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: Would it help to lower the grace time? mmnfs configuration change LEASE_LIFETIME=10 mmnfs configuration change GRACE_PERIOD=10 -jf ons. 26. apr. 2017 kl. 16.20 skrev Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk>: > Nope, the clients are all L3 connected, so not an arp issue. > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. > The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. 
> Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > > >Are you using infiniband or Ethernet? I'm wondering if IBM have solved > >the gratuitous arp issue which we see with our non-protocols NFS > >implementation. > > > >-----Original Message----- > >From: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon > >Thompson (IT Research Support) > >Sent: Wednesday, 26 April 2017 3:31 AM > >To: gpfsug main discussion list > >Subject: Re: [gpfsug-discuss] NFS issues > > > >I did some digging in the mmcesfuncs to see what happens server side on > >fail over. > > > >Basically the server losing the IP is supposed to terminate all sessions > >and the receiver server sends ACK tickles. > > > >My current supposition is that for whatever reason, the losing server > >isn't releasing something and the client still has hold of a connection > >which is mostly dead. The tickle then fails to the client from the new > >server. > > > >This would explain why failing the IP back to the original server usually > >brings the client back to life. > > > >This is only my working theory at the moment as we can't reliably > >reproduce this. Next time it happens we plan to grab some netstat from > >each side. > > > >Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the > >server that received the IP and see if that fixes it (i.e. the receiver > >server didn't tickle properly). (Usage extracted from mmcesfuncs which is > >ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) > >for anyone interested. > > > >Then try and kill he sessions on the losing server to check if there is > >stuff still open and re-tickle the client. > > > >If we can get steps to workaround, I'll log a PMR. I suppose I could do > >that now, but given its non deterministic and we want to be 100% sure > >it's not us doing something wrong, I'm inclined to wait until we do some > >more testing. > > > >I agree with the suggestion that it's probably IO pending nodes that are > >affected, but don't have any data to back that up yet. We did try with a > >read workload on a client, but may we need either long IO blocked reads > >or writes (from the GPFS end). > > > >We also originally had soft as the default option, but saw issues then > >and the docs suggested hard, so we switched and also enabled sync (we > >figured maybe it was NFS client with uncommited writes), but neither have > >resolved the issues entirely. Difficult for me to say if they improved > >the issue though given its sporadic. > > > >Appreciate people's suggestions! > > > >Thanks > > > >Simon > >________________________________________ > >From: gpfsug-discuss-bounces at spectrumscale.org > >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode > >Myklebust [janfrode at tanso.net] > >Sent: 25 April 2017 18:04 > >To: gpfsug main discussion list > >Subject: Re: [gpfsug-discuss] NFS issues > > > >I *think* I've seen this, and that we then had open TCP connection from > >client to NFS server according to netstat, but these connections were not > >visible from netstat on NFS-server side. > > > >Unfortunately I don't remember what the fix was.. > > > > > > > > -jf > > > >tir. 25. apr. 2017 kl. 
16.06 skrev Simon Thompson (IT Research Support) > >>: > >Hi, > > > >From what I can see, Ganesha uses the Export_Id option in the config file > >(which is managed by CES) for this. I did find some reference in the > >Ganesha devs list that if its not set, then it would read the FSID from > >the GPFS file-system, either way they should surely be consistent across > >all the nodes. The posts I found were from someone with an IBM email > >address, so I guess someone in the IBM teams. > > > >I checked a couple of my protocol nodes and they use the same Export_Id > >consistently, though I guess that might not be the same as the FSID value. > > > >Perhaps someone from IBM could comment on if FSID is likely to the cause > >of my problems? > > > >Thanks > > > >Simon > > > >On 25/04/2017, 14:51, > >"gpfsug-discuss-bounces at spectrumscale.org gpfsug-discuss-bounces at sp > >ectrumscale.org> on behalf of Ouwehand, JJ" > > gpfsug-discuss-bounces at sp > >ectrumscale.org> on behalf of > >j.ouwehand at vumc.nl> wrote: > > > >>Hello, > >> > >>At first a short introduction. My name is Jaap Jan Ouwehand, I work at > >>a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of > >>IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our > >>critical (office, research and clinical data) business process. We have > >>three large GPFS filesystems for different purposes. > >> > >>We also had such a situation with cNFS. A failover (IPtakeover) was > >>technically good, only clients experienced "stale filehandles". We > >>opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few > >>months later, the solution appeared to be in the fsid option. > >> > >>An NFS filehandle is built by a combination of fsid and a hash function > >>on the inode. After a failover, the fsid value can be different and the > >>client has a "stale filehandle". To avoid this, the fsid value can be > >>statically specified. See: > >> > >> > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum > >>. > >>scale.v4r22.doc/bl1adm_nfslin.htm > >> > >>Maybe there is also a value in Ganesha that changes after a failover. > >>Certainly since most sessions will be re-established after a failback. > >>Maybe you see more debug information with tcpdump. > >> > >> > >>Kind regards, > >> > >>Jaap Jan Ouwehand > >>ICT Specialist (Storage & Linux) > >>VUmc - ICT > >>E: jj.ouwehand at vumc.nl > >>W: www.vumc.com > >> > >> > >> > >>-----Oorspronkelijk bericht----- > >>Van: > >>gpfsug-discuss-bounces at spectrumscale.org >>spectrumscale.org> > >>[mailto:gpfsug-discuss-bounces at spectrumscale.org >>bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) > >>Verzonden: dinsdag 25 april 2017 13:21 > >>Aan: > >>gpfsug-discuss at spectrumscale.org >>g> > >>Onderwerp: [gpfsug-discuss] NFS issues > >> > >>Hi, > >> > >>We have recently started deploying NFS in addition our existing SMB > >>exports on our protocol nodes. > >> > >>We use a RR DNS name that points to 4 VIPs for SMB services and > >>failover seems to work fine with SMB clients. We figured we could use > >>the same name and IPs and run Ganesha on the protocol servers, however > >>we are seeing issues with NFS clients when IP failover occurs. > >> > >>In normal operation on a client, we might see several mounts from > >>different IPs obviously due to the way the DNS RR is working, but it > >>all works fine. 
> >> > >>In a failover situation, the IP will move to another node and some > >>clients will carry on, others will hang IO to the mount points referred > >>to by the IP which has moved. We can *sometimes* trigger this by > >>manually suspending a CES node, but not always and some clients > >>mounting from the IP moving will be fine, others won't. > >> > >>If we resume a node an it fails back, the clients that are hanging will > >>usually recover fine. We can reboot a client prior to failback and it > >>will be fine, stopping and starting the ganesha service on a protocol > >>node will also sometimes resolve the issues. > >> > >>So, has anyone seen this sort of issue and any suggestions for how we > >>could either debug more or workaround? > >> > >>We are currently running the packages > >>nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >> > >>At one point we were seeing it a lot, and could track it back to an > >>underlying GPFS network issue that was causing protocol nodes to be > >>expelled occasionally, we resolved that and the issues became less > >>apparent, but maybe we just fixed one failure mode so see it less often. > >> > >>On the clients, we use -o sync,hard BTW as in the IBM docs. > >> > >>On a client showing the issues, we'll see in dmesg, NFS related > >>messages > >>like: > >>[Wed Apr 12 16:59:53 2017] nfs: server > >>MYNFSSERVER.bham.ac.uk not responding, > >>timed out > >> > >>Which explains the client hang on certain mount points. > >> > >>The symptoms feel very much like those logged in this Gluster/ganesha > >>bug: > >>https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > >> > >> > >>Thanks > >> > >>Simon > >> > >>_______________________________________________ > >>gpfsug-discuss mailing list > >>gpfsug-discuss at spectrumscale.org > >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>_______________________________________________ > >>gpfsug-discuss mailing list > >>gpfsug-discuss at spectrumscale.org > >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peserocka at gmail.com Wed Apr 26 18:53:51 2017 From: peserocka at gmail.com (Peter Serocka) Date: Wed, 26 Apr 2017 19:53:51 +0200 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: > On 2017 Apr 26 Wed, at 16:20, Simon Thompson (IT Research Support) wrote: > > Nope, the clients are all L3 connected, so not an arp issue. ...not on the client, but the server-facing L3 switch still need to manage its ARP table, and might miss the IP moving to a new MAC. Cisco switches have a default ARP cache timeout of 4 hours, fwiw. Can your network team provide you the ARP status from the switch when you see a fail-over being stuck? ? 
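A quick way to test the ARP theory during the next failover, assuming a Cisco-style upstream switch and iputils arping on the CES node (the interface name is a placeholder):

   ! On the upstream L3 switch: does the MAC behind the CES IP follow the move?
   show ip arp <CES_IP>

   # On the CES node that just received the IP: push a gratuitous ARP by hand
   arping -c 3 -A -I bond0 <CES_IP>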
Peter > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. > The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. > Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > >> Are you using infiniband or Ethernet? I'm wondering if IBM have solved >> the gratuitous arp issue which we see with our non-protocols NFS >> implementation. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >> Thompson (IT Research Support) >> Sent: Wednesday, 26 April 2017 3:31 AM >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I did some digging in the mmcesfuncs to see what happens server side on >> fail over. >> >> Basically the server losing the IP is supposed to terminate all sessions >> and the receiver server sends ACK tickles. >> >> My current supposition is that for whatever reason, the losing server >> isn't releasing something and the client still has hold of a connection >> which is mostly dead. The tickle then fails to the client from the new >> server. >> >> This would explain why failing the IP back to the original server usually >> brings the client back to life. >> >> This is only my working theory at the moment as we can't reliably >> reproduce this. Next time it happens we plan to grab some netstat from >> each side. >> >> Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >> server that received the IP and see if that fixes it (i.e. 
the receiver >> server didn't tickle properly). (Usage extracted from mmcesfuncs which is >> ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) >> for anyone interested. >> >> Then try and kill he sessions on the losing server to check if there is >> stuff still open and re-tickle the client. >> >> If we can get steps to workaround, I'll log a PMR. I suppose I could do >> that now, but given its non deterministic and we want to be 100% sure >> it's not us doing something wrong, I'm inclined to wait until we do some >> more testing. >> >> I agree with the suggestion that it's probably IO pending nodes that are >> affected, but don't have any data to back that up yet. We did try with a >> read workload on a client, but may we need either long IO blocked reads >> or writes (from the GPFS end). >> >> We also originally had soft as the default option, but saw issues then >> and the docs suggested hard, so we switched and also enabled sync (we >> figured maybe it was NFS client with uncommited writes), but neither have >> resolved the issues entirely. Difficult for me to say if they improved >> the issue though given its sporadic. >> >> Appreciate people's suggestions! >> >> Thanks >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >> Myklebust [janfrode at tanso.net] >> Sent: 25 April 2017 18:04 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I *think* I've seen this, and that we then had open TCP connection from >> client to NFS server according to netstat, but these connections were not >> visible from netstat on NFS-server side. >> >> Unfortunately I don't remember what the fix was.. >> >> >> >> -jf >> >> tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >> >: >> Hi, >> >> From what I can see, Ganesha uses the Export_Id option in the config file >> (which is managed by CES) for this. I did find some reference in the >> Ganesha devs list that if its not set, then it would read the FSID from >> the GPFS file-system, either way they should surely be consistent across >> all the nodes. The posts I found were from someone with an IBM email >> address, so I guess someone in the IBM teams. >> >> I checked a couple of my protocol nodes and they use the same Export_Id >> consistently, though I guess that might not be the same as the FSID value. >> >> Perhaps someone from IBM could comment on if FSID is likely to the cause >> of my problems? >> >> Thanks >> >> Simon >> >> On 25/04/2017, 14:51, >> "gpfsug-discuss-bounces at spectrumscale.org> ectrumscale.org> on behalf of Ouwehand, JJ" >> > ectrumscale.org> on behalf of >> j.ouwehand at vumc.nl> wrote: >> >>> Hello, >>> >>> At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>> a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>> IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>> critical (office, research and clinical data) business process. We have >>> three large GPFS filesystems for different purposes. >>> >>> We also had such a situation with cNFS. A failover (IPtakeover) was >>> technically good, only clients experienced "stale filehandles". We >>> opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>> months later, the solution appeared to be in the fsid option. >>> >>> An NFS filehandle is built by a combination of fsid and a hash function >>> on the inode. 
After a failover, the fsid value can be different and the >>> client has a "stale filehandle". To avoid this, the fsid value can be >>> statically specified. See: >>> >>> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>> . >>> scale.v4r22.doc/bl1adm_nfslin.htm >>> >>> Maybe there is also a value in Ganesha that changes after a failover. >>> Certainly since most sessions will be re-established after a failback. >>> Maybe you see more debug information with tcpdump. >>> >>> >>> Kind regards, >>> >>> Jaap Jan Ouwehand >>> ICT Specialist (Storage & Linux) >>> VUmc - ICT >>> E: jj.ouwehand at vumc.nl >>> W: www.vumc.com >>> >>> >>> >>> -----Oorspronkelijk bericht----- >>> Van: >>> gpfsug-discuss-bounces at spectrumscale.org>> spectrumscale.org> >>> [mailto:gpfsug-discuss-bounces at spectrumscale.org>> bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>> Verzonden: dinsdag 25 april 2017 13:21 >>> Aan: >>> gpfsug-discuss at spectrumscale.org>> g> >>> Onderwerp: [gpfsug-discuss] NFS issues >>> >>> Hi, >>> >>> We have recently started deploying NFS in addition our existing SMB >>> exports on our protocol nodes. >>> >>> We use a RR DNS name that points to 4 VIPs for SMB services and >>> failover seems to work fine with SMB clients. We figured we could use >>> the same name and IPs and run Ganesha on the protocol servers, however >>> we are seeing issues with NFS clients when IP failover occurs. >>> >>> In normal operation on a client, we might see several mounts from >>> different IPs obviously due to the way the DNS RR is working, but it >>> all works fine. >>> >>> In a failover situation, the IP will move to another node and some >>> clients will carry on, others will hang IO to the mount points referred >>> to by the IP which has moved. We can *sometimes* trigger this by >>> manually suspending a CES node, but not always and some clients >>> mounting from the IP moving will be fine, others won't. >>> >>> If we resume a node an it fails back, the clients that are hanging will >>> usually recover fine. We can reboot a client prior to failback and it >>> will be fine, stopping and starting the ganesha service on a protocol >>> node will also sometimes resolve the issues. >>> >>> So, has anyone seen this sort of issue and any suggestions for how we >>> could either debug more or workaround? >>> >>> We are currently running the packages >>> nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >>> >>> At one point we were seeing it a lot, and could track it back to an >>> underlying GPFS network issue that was causing protocol nodes to be >>> expelled occasionally, we resolved that and the issues became less >>> apparent, but maybe we just fixed one failure mode so see it less often. >>> >>> On the clients, we use -o sync,hard BTW as in the IBM docs. >>> >>> On a client showing the issues, we'll see in dmesg, NFS related >>> messages >>> like: >>> [Wed Apr 12 16:59:53 2017] nfs: server >>> MYNFSSERVER.bham.ac.uk not responding, >>> timed out >>> >>> Which explains the client hang on certain mount points. 
>>> >>> The symptoms feel very much like those logged in this Gluster/ganesha >>> bug: >>> https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >>> >>> >>> Thanks >>> >>> Simon >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Apr 26 19:00:06 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 26 Apr 2017 18:00:06 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> , Message-ID: We have no issues with L3 SMB accessing clients, so I'm pretty sure it's not arp. And some of the boxes on the other side of the L3 gateway don't see the issues. We don't use Cisco kit. I posted in a different update that we think it's related to connections to other ports on the same IP which get left open when the IP quickly gets moved away and back again. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Peter Serocka [peserocka at gmail.com] Sent: 26 April 2017 18:53 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues > On 2017 Apr 26 Wed, at 16:20, Simon Thompson (IT Research Support) wrote: > > Nope, the clients are all L3 connected, so not an arp issue. ...not on the client, but the server-facing L3 switch still need to manage its ARP table, and might miss the IP moving to a new MAC. Cisco switches have a default ARP cache timeout of 4 hours, fwiw. Can your network team provide you the ARP status from the switch when you see a fail-over being stuck? ? Peter > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. 
> The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. > Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > >> Are you using infiniband or Ethernet? I'm wondering if IBM have solved >> the gratuitous arp issue which we see with our non-protocols NFS >> implementation. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >> Thompson (IT Research Support) >> Sent: Wednesday, 26 April 2017 3:31 AM >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I did some digging in the mmcesfuncs to see what happens server side on >> fail over. >> >> Basically the server losing the IP is supposed to terminate all sessions >> and the receiver server sends ACK tickles. >> >> My current supposition is that for whatever reason, the losing server >> isn't releasing something and the client still has hold of a connection >> which is mostly dead. The tickle then fails to the client from the new >> server. >> >> This would explain why failing the IP back to the original server usually >> brings the client back to life. >> >> This is only my working theory at the moment as we can't reliably >> reproduce this. Next time it happens we plan to grab some netstat from >> each side. >> >> Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >> server that received the IP and see if that fixes it (i.e. the receiver >> server didn't tickle properly). (Usage extracted from mmcesfuncs which is >> ksh of course). ... 
CesIPPort is colon separated IP:portnumber (of NFSd) >> for anyone interested. >> >> Then try and kill he sessions on the losing server to check if there is >> stuff still open and re-tickle the client. >> >> If we can get steps to workaround, I'll log a PMR. I suppose I could do >> that now, but given its non deterministic and we want to be 100% sure >> it's not us doing something wrong, I'm inclined to wait until we do some >> more testing. >> >> I agree with the suggestion that it's probably IO pending nodes that are >> affected, but don't have any data to back that up yet. We did try with a >> read workload on a client, but may we need either long IO blocked reads >> or writes (from the GPFS end). >> >> We also originally had soft as the default option, but saw issues then >> and the docs suggested hard, so we switched and also enabled sync (we >> figured maybe it was NFS client with uncommited writes), but neither have >> resolved the issues entirely. Difficult for me to say if they improved >> the issue though given its sporadic. >> >> Appreciate people's suggestions! >> >> Thanks >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >> Myklebust [janfrode at tanso.net] >> Sent: 25 April 2017 18:04 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I *think* I've seen this, and that we then had open TCP connection from >> client to NFS server according to netstat, but these connections were not >> visible from netstat on NFS-server side. >> >> Unfortunately I don't remember what the fix was.. >> >> >> >> -jf >> >> tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >> >: >> Hi, >> >> From what I can see, Ganesha uses the Export_Id option in the config file >> (which is managed by CES) for this. I did find some reference in the >> Ganesha devs list that if its not set, then it would read the FSID from >> the GPFS file-system, either way they should surely be consistent across >> all the nodes. The posts I found were from someone with an IBM email >> address, so I guess someone in the IBM teams. >> >> I checked a couple of my protocol nodes and they use the same Export_Id >> consistently, though I guess that might not be the same as the FSID value. >> >> Perhaps someone from IBM could comment on if FSID is likely to the cause >> of my problems? >> >> Thanks >> >> Simon >> >> On 25/04/2017, 14:51, >> "gpfsug-discuss-bounces at spectrumscale.org> ectrumscale.org> on behalf of Ouwehand, JJ" >> > ectrumscale.org> on behalf of >> j.ouwehand at vumc.nl> wrote: >> >>> Hello, >>> >>> At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>> a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>> IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>> critical (office, research and clinical data) business process. We have >>> three large GPFS filesystems for different purposes. >>> >>> We also had such a situation with cNFS. A failover (IPtakeover) was >>> technically good, only clients experienced "stale filehandles". We >>> opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>> months later, the solution appeared to be in the fsid option. >>> >>> An NFS filehandle is built by a combination of fsid and a hash function >>> on the inode. After a failover, the fsid value can be different and the >>> client has a "stale filehandle". 
To avoid this, the fsid value can be >>> statically specified. See: >>> >>> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>> . >>> scale.v4r22.doc/bl1adm_nfslin.htm >>> >>> Maybe there is also a value in Ganesha that changes after a failover. >>> Certainly since most sessions will be re-established after a failback. >>> Maybe you see more debug information with tcpdump. >>> >>> >>> Kind regards, >>> >>> Jaap Jan Ouwehand >>> ICT Specialist (Storage & Linux) >>> VUmc - ICT >>> E: jj.ouwehand at vumc.nl >>> W: www.vumc.com >>> >>> >>> >>> -----Oorspronkelijk bericht----- >>> Van: >>> gpfsug-discuss-bounces at spectrumscale.org>> spectrumscale.org> >>> [mailto:gpfsug-discuss-bounces at spectrumscale.org>> bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>> Verzonden: dinsdag 25 april 2017 13:21 >>> Aan: >>> gpfsug-discuss at spectrumscale.org>> g> >>> Onderwerp: [gpfsug-discuss] NFS issues >>> >>> Hi, >>> >>> We have recently started deploying NFS in addition our existing SMB >>> exports on our protocol nodes. >>> >>> We use a RR DNS name that points to 4 VIPs for SMB services and >>> failover seems to work fine with SMB clients. We figured we could use >>> the same name and IPs and run Ganesha on the protocol servers, however >>> we are seeing issues with NFS clients when IP failover occurs. >>> >>> In normal operation on a client, we might see several mounts from >>> different IPs obviously due to the way the DNS RR is working, but it >>> all works fine. >>> >>> In a failover situation, the IP will move to another node and some >>> clients will carry on, others will hang IO to the mount points referred >>> to by the IP which has moved. We can *sometimes* trigger this by >>> manually suspending a CES node, but not always and some clients >>> mounting from the IP moving will be fine, others won't. >>> >>> If we resume a node an it fails back, the clients that are hanging will >>> usually recover fine. We can reboot a client prior to failback and it >>> will be fine, stopping and starting the ganesha service on a protocol >>> node will also sometimes resolve the issues. >>> >>> So, has anyone seen this sort of issue and any suggestions for how we >>> could either debug more or workaround? >>> >>> We are currently running the packages >>> nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >>> >>> At one point we were seeing it a lot, and could track it back to an >>> underlying GPFS network issue that was causing protocol nodes to be >>> expelled occasionally, we resolved that and the issues became less >>> apparent, but maybe we just fixed one failure mode so see it less often. >>> >>> On the clients, we use -o sync,hard BTW as in the IBM docs. >>> >>> On a client showing the issues, we'll see in dmesg, NFS related >>> messages >>> like: >>> [Wed Apr 12 16:59:53 2017] nfs: server >>> MYNFSSERVER.bham.ac.uk not responding, >>> timed out >>> >>> Which explains the client hang on certain mount points. 
>>> >>> The symptoms feel very much like those logged in this Gluster/ganesha >>> bug: >>> https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >>> >>> >>> Thanks >>> >>> Simon >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From valdis.kletnieks at vt.edu Thu Apr 27 00:44:44 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Wed, 26 Apr 2017 19:44:44 -0400 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: <52226.1493250284@turing-police.cc.vt.edu> On Wed, 26 Apr 2017 14:20:30 -0000, "Simon Thompson (IT Research Support)" said: > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). After over 3 decades of experience with 'exportfs' being totally safe to run in real time with both userspace and kernel NFSD implementations, it came as quite a surprise when we did 'mmnfs eport change --nfsadd='... and it bounced the NFS server on all 4 protocol nodes. At the same time. Fortunately for us, the set of client nodes only changes once every 2-3 months. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From secretary at gpfsug.org Thu Apr 27 09:29:41 2017 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 27 Apr 2017 09:29:41 +0100 Subject: [gpfsug-discuss] Meet other spectrum scale users in May Message-ID: <1f483faa9cb61dcdc80afb187e908745@webmail.gpfsug.org> Dear Members, Please join us and other spectrum scale users for 2 days of great talks and networking! WHEN: 9-10th May 2017 WHERE: Macdonald Manchester Hotel & Spa, Manchester, UK (right by Manchester Piccadilly train station) WHO? The event is free to attend, is open to members from all industries and welcomes users with a little and a lot of experience using Spectrum Scale. The SSUG brings together the Spectrum Scale User Community including Spectrum Scale developers and architects to share knowledge, experiences and future plans. Topics include transparent cloud tiering, AFM, automation and security best practices, Docker and HDFS support, problem determination, and an update on Elastic Storage Server (ESS). 
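For anyone about to make a similar change, a hedged sketch of the sequence (the export path, client address and attribute names below are illustrative only and should be checked against the mmnfs man page for your release):

    # Review the existing exports and their client lists first
    mmnfs export list

    # Adding a client regenerates the Ganesha configuration; in our case this
    # restarted nfs-ganesha on all four protocol nodes at once, so treat it as
    # a scheduled maintenance step rather than a live exportfs-style tweak.
    mmnfs export change /gpfs/fs0 --nfsadd "10.0.0.99(Access_Type=RO,Squash=root_squash)"

The commands themselves are standard CES tooling; only the path, client address and attribute values are made up for the example.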
Our popular forum includes interactive problem solving, a best practices discussion and networking. We're very excited to welcome back Doris Conti the Director for Spectrum Scale (GPFS) and HPC SW Product Development at IBM. The May meeting is sponsored by IBM, DDN, Lenovo, Mellanox, Seagate, Arcastream, Ellexus, and OCF. It is an excellent opportunity to learn more and get your questions answered. Register your place today at the Eventbrite page https://goo.gl/tRptru [1] We hope to see you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://goo.gl/tRptru -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert at strubi.ox.ac.uk Thu Apr 27 12:46:09 2017 From: robert at strubi.ox.ac.uk (Robert Esnouf) Date: Thu, 27 Apr 2017 12:46:09 +0100 (BST) Subject: [gpfsug-discuss] Two high-performance research computing posts in Oxford University Medical Sciences Message-ID: <201704271146.061978@mail.strubi.ox.ac.uk> Dear All, I hope that it is allowed to put job postings on this discussion list... sorry if I've broken a rule but it does mention SpectrumScale! I'd like to advertise the availability two exciting and challenging new opportunities to work in research computing/high-performance computing at Oxford University within the Nuffield Department of Medicine. The first is a Grade 8 position to expand the current Research Computing Core team at the Wellcome Trust Centre for Human Genetics. The Core now runs a cluster of about ~3800 high-memory compute cores, a further ~700 cores outside the cluster, a (growing) smattering of GPU-enabled and KNL nodes, 4PB high-performance SpectrumScale (GPFS) storage and about 4PB of lower grade (mostly XFS) storage. The facility has an FDR InfiniBand fabric providing for access to storage at up to 20GB/s and supporting MPI workloads. We mainly support the statistical genetics work of the Centre and other departments around Oxford, the work of the sequencing and bioinformatics cores and electron microscopy, but the workload is varied and interesting! Further significant update and expansion of this facility will occur during 2017 and beyond and means that we are expanding the team. http://www.well.ox.ac.uk/home http://www.well.ox.ac.uk/research-8 https://www.recruit.ox.ac.uk/pls/hrisliverecruit/erq_jobspec_version_4.display_form?p_company=10&p_internal_external=E&p_display_in_irish=N&p_process_type=&p_applicant_no=&p_form_profile_detail=&p_display_apply_ind=Y&p_refresh_search=Y&p_recruitment_id=126748 The second is a Grade 9 post at the newly opened Big Data Institute next door to the WTCHG - to work with me to establish a brand new Research Computing facility. The Big Data Institute Building has 32 shiny new racks ready to be filled with up to 320kW of IT load - and we won't stop there! The current plans envisage a virtualized infrastructure for secure access, a high-performance cluster supporting traditional workloads and containers, high-performance filesystem storage, a hyperconverged infrastructure supporting (OpenStack, project VMs, containers and distributed computing plaforms such as Apache Spark), a significant GPU-based artificial intelligence/deep learning platform and a large, multisite object store for managing research data in the long term. 
https://www.bdi.ox.ac.uk/ https://www.ndm.ox.ac.uk/current-job-vacancies/vacancy/128486-BDI-Research-Computing-Manager https://www.recruit.ox.ac.uk/pls/hrisliverecruit/erq_jobspec_version_4.display_form?p_company=10&p_internal_external=E&p_display_in_irish=N&p_process_type=&p_applicant_no=&p_form_profile_detail=&p_display_apply_ind=Y&p_refresh_search=Y&p_recruitment_id=128486 It is expected that the Wellcome Trust Centre and Big Data Institute facilities will develop independently for now, but in a complementary and supportive fashion given the overlap in science and technology that is likely to exist. The Research Computing support teams will therefore work extremely closely together to address the challenges facing computing in the medical sciences. If either (or both) of these vacancies seem interesting then please feel free to contact the Head of the Research Computing Core at the WTCHG (me) or the Director of Research Computing at the BDI (me). Deadline for the WTCHG post is 31st May and for the BDI post is 24th May. Please feel free to circulate this email to anyone who might be interested and apologies for any cross postings! Regards, Robert -- Dr Robert Esnouf University Research Lecturer, Director of Research Computing BDI, Head of Research Computing Core WTCHG, NDM Research Computing Strategy Officer Main office: Room 10/028, Wellcome Trust Centre for Human Genetics, Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK Emails: robert at strubi.ox.ac.uk / robert at well.ox.ac.uk / robert.esnouf at bdi.ox.ac.uk Tel: (+44) - 1865 - 287783
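A minimal sketch for the "Can't delete filesystem" discussion that resumes below (the device name gpfs0 and the node names are hypothetical): confirm which nodes the cluster still believes have the file system mounted before forcing anything.

    # List every node -- including NSD servers holding only an internal mount --
    # that still has the file system mounted, as seen by the cluster
    mmlsmount gpfs0 -L

    # Force the unmount on the stragglers, then re-check before running mmdelfs
    mmumount gpfs0 -f -N client01,client02
    mmlsmount gpfs0 -L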
From janfrode at tanso.net Wed Apr 5 22:51:15 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 05 Apr 2017 21:51:15 +0000 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> Message-ID: Maybe try mmumount -f on the remaining 4 nodes? -jf ons. 5. apr. 2017 kl. 18.54 skrev Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu>: > Hi Simon, > > No, I do not. > > Let me also add that this is a filesystem that I migrated users off of and > to another GPFS filesystem. I moved the last users this morning and then > ran an 'mmunmount' across the whole cluster via mmdsh. Therefore, if the > simple solution is to use the '-p' option to mmdelfs I'm fine with that. > I'm just not sure what the right course of action is at this point. > > Thanks again... > > Kevin > > > On Apr 5, 2017, at 11:47 AM, Simon Thompson (Research Computing - IT > Services) wrote: > > > > Do you have ILM (dsmrecalld and friends) running? > > > > They can also stop the filesystem being released (e.g. mmshutdown fails > if they are up). > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin > L [Kevin.Buterbaugh at Vanderbilt.Edu] > > Sent: 05 April 2017 17:40 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] Can't delete filesystem > > > > Hi All, > > > > First off, I can open a PMR on this if I need to... > > > > I am trying to delete a GPFS filesystem but mmdelfs is telling me that > the filesystem is still mounted on 14 nodes and therefore can't be > deleted. 10 of those nodes are my 10 GPFS servers and they have an > 'internal mount' still mounted.
IIRC, it?s the other 4 (client) nodes I > need to concentrate on ? i.e. once those other 4 clients no longer have it > mounted the internal mounts will resolve themselves. Correct me if I?m > wrong on that, please. > > > > So, I have gone to all of the 4 clients and none of them say they have > it mounted according to either ?df? or ?mount?. I?ve gone ahead and run > both ?mmunmount? and ?umount -l? on the filesystem anyway, but the mmdelfs > still fails saying that they have it mounted. > > > > What do I need to do to resolve this issue on those 4 clients? Thanks? > > > > Kevin > > > > ? > > Kevin Buterbaugh - Senior System Administrator > > Vanderbilt University - Advanced Computing Center for Research and > Education > > Kevin.Buterbaugh at vanderbilt.edu > - (615)875-9633 > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Thu Apr 6 02:54:07 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Thu, 6 Apr 2017 01:54:07 +0000 Subject: [gpfsug-discuss] AFM misunderstanding Message-ID: When I setup a AFM relationship (let?s just say I?m doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a ?metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls ?ltrs on the cache it?s still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How to I get the bits to flow before I request them assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I?m more used to mirroring so maybe that?s my frame of reference and it?s not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Apr 6 09:20:31 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 6 Apr 2017 08:20:31 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption Message-ID: We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. 
Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.s cale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon From vpuvvada at in.ibm.com Thu Apr 6 11:45:37 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Thu, 6 Apr 2017 16:15:37 +0530 Subject: [gpfsug-discuss] AFM misunderstanding In-Reply-To: References: Message-ID: Could you explain "bits of actual file" mentioned below ? Prefetch with ?metadata-only pulls everything (xattrs, ACLs etc..) except data. Doing " ls ?ltrs" shows file allocation size as zero if data prefetch not yet completed on them. ~Venkat (vpuvvada at in.ibm.com) From: Mark Bush To: gpfsug main discussion list Date: 04/06/2017 07:24 AM Subject: [gpfsug-discuss] AFM misunderstanding Sent by: gpfsug-discuss-bounces at spectrumscale.org When I setup a AFM relationship (let?s just say I?m doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a ?metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls ?ltrs on the cache it?s still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How to I get the bits to flow before I request them assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I?m more used to mirroring so maybe that?s my frame of reference and it?s not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. 
If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Thu Apr 6 13:28:40 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Thu, 6 Apr 2017 12:28:40 +0000 Subject: [gpfsug-discuss] AFM misunderstanding In-Reply-To: References: Message-ID: <425C32E7-B752-4B61-BDF5-83C219D89ADB@siriuscom.com> I think I was missing a key piece in that I thought that just doing a mmafmctl fs1 prefetch ?j cache would start grabbing everything (data and metadata) but it appears that the ?list-file myfiles.txt is the trigger for the prefetch to work properly. I mistakenly assumed that omitting the ?list-file switch would prefetch all the data in the fileset. From: on behalf of Venkateswara R Puvvada Reply-To: gpfsug main discussion list Date: Thursday, April 6, 2017 at 5:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM misunderstanding Could you explain "bits of actual file" mentioned below ? Prefetch with ?metadata-onlypulls everything (xattrs, ACLs etc..) except data. Doing "ls ?ltrs" shows file allocation size as zero if data prefetch not yet completed on them. ~Venkat (vpuvvada at in.ibm.com) From: Mark Bush To: gpfsug main discussion list Date: 04/06/2017 07:24 AM Subject: [gpfsug-discuss] AFM misunderstanding Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ When I setup a AFM relationship (let?s just say I?m doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a ?metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls ?ltrs on the cache it?s still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How to I get the bits to flow before I request them assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I?m more used to mirroring so maybe that?s my frame of reference and it?s not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. 
If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Apr 6 15:33:18 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 6 Apr 2017 14:33:18 +0000 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> Message-ID: Hi JF, I actually tried that - to no effect. Yesterday evening I rebooted the 4 clients and, as expected, the 10 servers released their internal mounts as well ? and then I was able to delete the filesystem successfully. Thanks for the suggestions, all? Kevin On Apr 5, 2017, at 4:51 PM, Jan-Frode Myklebust > wrote: Maybe try mmumount -f on the remaining 4 nodes? -jf ons. 5. apr. 2017 kl. 18.54 skrev Buterbaugh, Kevin L >: Hi Simon, No, I do not. Let me also add that this is a filesystem that I migrated users off of and to another GPFS filesystem. I moved the last users this morning and then ran an ?mmunmount? across the whole cluster via mmdsh. Therefore, if the simple solution is to use the ?-p? option to mmdelfs I?m fine with that. I?m just not sure what the right course of action is at this point. Thanks again? Kevin > On Apr 5, 2017, at 11:47 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Do you have ILM (dsmrecalld and friends) running? > > They can also stop the filesystem being released (e.g. mmshutdown fails if they are up). > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] > Sent: 05 April 2017 17:40 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Can't delete filesystem > > Hi All, > > First off, I can open a PMR on this if I need to? > > I am trying to delete a GPFS filesystem but mmdelfs is telling me that the filesystem is still mounted on 14 nodes and therefore can?t be deleted. 10 of those nodes are my 10 GPFS servers and they have an ?internal mount? still mounted. IIRC, it?s the other 4 (client) nodes I need to concentrate on ? i.e. once those other 4 clients no longer have it mounted the internal mounts will resolve themselves. Correct me if I?m wrong on that, please. > > So, I have gone to all of the 4 clients and none of them say they have it mounted according to either ?df? or ?mount?. I?ve gone ahead and run both ?mmunmount? and ?umount -l? on the filesystem anyway, but the mmdelfs still fails saying that they have it mounted. > > What do I need to do to resolve this issue on those 4 clients? Thanks? > > Kevin > > ? 
> Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Apr 6 15:54:42 2017 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 6 Apr 2017 14:54:42 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption In-Reply-To: References: Message-ID: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> This is rather dependant on SS version. So what used to happen before 4.2.2.* is that a client would be unable to mount the filesystem in question and would give an error in the mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount the file system and then give errors on file access instead. (just tested this on 4.2.2.3) I'll have to read through the changelogs looking for this one. Depending on your policy for encryption then, this might be exactly what you want, but I REALLY REALLY dislike this behaviour. To me this means clients can now mount an encrypted FS now and then fail during operation. If I get a client node that comes up improperly, user work will start, and it will fail with "Operation not permitted" errors on file access. I imagine my batch system could run through a massive amount of jobs on a bad client without anyone noticing immeadiately. Yet another thing we now have to monitor now I guess. *shrug* A couple other gotcha's we've seen with Encryption: Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. Logs will fill with Key could not be fetched. errors. Ed Wahl OSC ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Thursday, April 06, 2017 4:20 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale Encryption We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? 
According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.s cale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Thu Apr 6 16:11:38 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 6 Apr 2017 15:11:38 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption Message-ID: Hi Ed, Thanks. We already have several SKLM servers (tape backups). For me, we plan to encrypt specific parts of the FS (probably by file-set), so as long as all that is needed is an empty RKM.conf file, sounds like it will work. I suppose I could have an MEK that is granted to all clients, but then never actually use it for encryption if RKM.conf needs at least one key (hack hack hack). (We are at 4.2.2-2 (mostly) or higher (a few nodes)). I *thought* the FEK was wrapped in the metadata with the MEK (possibly multiple times with different MEKs), so what the docs say about operation continuing with no SKLM server sounds sensible, but of course that might not be what actually happens I guess... Simon On 06/04/2017, 15:54, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Wahl, Edward" wrote: >This is rather dependant on SS version. > >So what used to happen before 4.2.2.* is that a client would be unable to >mount the filesystem in question and would give an error in the >mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount >the file system and then give errors on file access instead. (just >tested this on 4.2.2.3) I'll have to read through the changelogs looking >for this one. > >Depending on your policy for encryption then, this might be exactly what >you want, but I REALLY REALLY dislike this behaviour. > >To me this means clients can now mount an encrypted FS now and then fail >during operation. If I get a client node that comes up improperly, user >work will start, and it will fail with "Operation not permitted" errors >on file access. 
I imagine my batch system could run through a massive >amount of jobs on a bad client without anyone noticing immeadiately. Yet >another thing we now have to monitor now I guess. *shrug* > >A couple other gotcha's we've seen with Encryption: > >Encrypted file systems do not store data in large MD blocks. Makes >sense. This means large MD blocks aren't as useful as they are in >unencrypted FS, if you are using this. > >Having at least one backup SKLM server is a good idea. >"kmipServerUri[N+1]" in the conf. > >While the documentation claims the FS can continue operation once it >caches the MEK if an SKLM server goes away, in operation this does NOT >work as you may expect. Your users still need access to the FEKs for the >files your clients work on. Logs will fill with Key could not be >fetched. errors. > >Ed Wahl >OSC > >________________________________________ >From: gpfsug-discuss-bounces at spectrumscale.org >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson >(Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] >Sent: Thursday, April 06, 2017 4:20 AM >To: gpfsug-discuss at spectrumscale.org >Subject: [gpfsug-discuss] Spectrum Scale Encryption > >We are currently looking at adding encryption to our deployment for some >of our data sets and for some of our nodes. Apologies in advance if some >of this is a bit vague, we're not yet at the point where we can test this >stuff out, so maybe some of it will become clear when we try it out. > > >For a node that we don't want to have access to any encrypted data, what >do we need to set up? > >According to the docs: >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >s >cale.v4r22.doc/bl1adv_encryption_prep.htm > > >"After the file system is configured with encryption policy rules, the >file system is considered encrypted. From that point on, each node that >has access to that file system must have an RKM.conf file present. >Otherwise, the file system might not be mounted or might become >unmounted." > >So on a node which I don't want to have access to any encrypted files, do >I just need to have an empty RKM.conf file? > >(If this is the case, would be good to have this added to the docs) > > >Secondly ... (and maybe I'm misunderstanding the docs here) > >For the Policy >https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectr >u >m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm > > >KEYS ('Keyname'[, 'Keyname', ... ]) > > >KeyId:RkmId > > >RkmId should match the stanza name in RKM.conf? > >If so, it would be useful if the docs used the same names in the examples >(RKMKMIP3 vs rkmname3) > >And KeyId should match a "Key UUID" in SKLM? > > >Third. My understanding from talking to various IBM people is that we need >ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways >(probably), do we have to do any kind of node registration in ISKLM? Or is >this purely based on the certificates being distributed to clients and >keys are mapped in ISKLM to the client cert to determine if the node is >able to request the key? 
> >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Jon.Edwards at newbase.com.au Fri Apr 7 05:56:33 2017 From: Jon.Edwards at newbase.com.au (Jon Edwards) Date: Fri, 7 Apr 2017 04:56:33 +0000 Subject: [gpfsug-discuss] Spectrum scale sending cluster traffic across the management network Message-ID: <7929c064d6df4d7b88065b4d882daa98@newbase.com.au> Hi All, Just getting started with spectrum scale, Just wondering if anyone has come across the issue where when doing a mmcrfs or mmdelfs you get the error Failed to connect to file system daemon: Connection timed out mmdelfs: tsdelfs failed. mmdelfs: Command failed. Examine previous error messages to determine cause. When viewing the logs in /var/mmfs/gen/mmfslog on a node other than the one I am running the command on i get: 2017-04-07_14:03:13.354+1000: [N] Filtered log entry: 'connect to node 192.168.0.1:1191' occurred 10 times between 2017-04-07_11:38:19.058+1000 and 2017-04-07_11:54:58.649+1000 192.168.0.0/24 In this case is the management network configured on eth0 of all the nodes. It is failing because port 1191 is not allowed on this interface. The dns and hostname for each node resolves to a dedicated cluster network, lets say 10.0.0.0/24 (ETH1) For some reason when I run the mmcrfs or mmdelfs it tries to talk back over the management network instead of the cluster network which fails to connect due to firewall blocking cluster traffic over management. Anyone seen this before? Kind Regards, Jon Edwards Senior Systems Engineer NewBase Email: jon.edwards at newbase.com.au Ph: + 61 7 3216 0776 Fax: + 61 7 3216 0779 http://www.newbase.com.au Opinions contained in this e-mail do not necessarily reflect the opinions of NewBase Computer Services Pty Ltd. This e-mail is for the exclusive use of the addressee and should not be disseminated further or copied without permission of the sender. If you have received this message in error, please immediately notify the sender and delete the message from your computer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jon.Edwards at newbase.com.au Fri Apr 7 06:26:56 2017 From: Jon.Edwards at newbase.com.au (Jon Edwards) Date: Fri, 7 Apr 2017 05:26:56 +0000 Subject: [gpfsug-discuss] Spectrum scale sending cluster traffic across the management network Message-ID: <6e02ed91cb404d46b7b5cd3515ad8fe9@newbase.com.au> Please disregard, found the solution. Found the subnets= parameter for the cluster config mmchconfig subnets="192.168.0.0/24 192.168.1.0/24" Which forces it to use this subnet. Kind Regards, Jon Edwards | Senior Systems Engineer NewBase Ph: + 61 7 3216 0776 | Email: jon.edwards at newbase.com.au http://www.newbase.com.au From: Jon Edwards Sent: Friday, 7 April 2017 2:56 PM To: 'gpfsug-discuss at spectrumscale.org' Cc: 'Andrew Beattie' Subject: Spectrum scale sending cluster traffic across the management network Hi All, Just getting started with spectrum scale, Just wondering if anyone has come across the issue where when doing a mmcrfs or mmdelfs you get the error Failed to connect to file system daemon: Connection timed out mmdelfs: tsdelfs failed. mmdelfs: Command failed. Examine previous error messages to determine cause. 
When viewing the logs in /var/mmfs/gen/mmfslog on a node other than the one I am running the command on i get: 2017-04-07_14:03:13.354+1000: [N] Filtered log entry: 'connect to node 192.168.0.1:1191' occurred 10 times between 2017-04-07_11:38:19.058+1000 and 2017-04-07_11:54:58.649+1000 192.168.0.0/24 In this case is the management network configured on eth0 of all the nodes. It is failing because port 1191 is not allowed on this interface. The dns and hostname for each node resolves to a dedicated cluster network, lets say 10.0.0.0/24 (ETH1) For some reason when I run the mmcrfs or mmdelfs it tries to talk back over the management network instead of the cluster network which fails to connect due to firewall blocking cluster traffic over management. Anyone seen this before? Kind Regards, Jon Edwards Senior Systems Engineer NewBase Email: jon.edwards at newbase.com.au Ph: + 61 7 3216 0776 Fax: + 61 7 3216 0779 http://www.newbase.com.au Opinions contained in this e-mail do not necessarily reflect the opinions of NewBase Computer Services Pty Ltd. This e-mail is for the exclusive use of the addressee and should not be disseminated further or copied without permission of the sender. If you have received this message in error, please immediately notify the sender and delete the message from your computer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Fri Apr 7 15:00:09 2017 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 7 Apr 2017 10:00:09 -0400 Subject: [gpfsug-discuss] Spectrum Scale Encryption In-Reply-To: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> References: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> Message-ID: All, A few comments on the topics raised below. 1) All nodes that mount an encrypted file system, and also the nodes with management roles on the file system will need access to the keys have the proper setup (RKM.conf, etc). Edward is correct that there was some change in behavior, introduced in 4.2.1 . Before the change, a mount would fail unless RKM.conf is present on the node. In addition, once a policy with encryption rules was applied, nodes without the proper encryption setup would unmount the file system. With the change, the error gets delayed to when encrypted files are accessed. The change in behavior was introduced based on feedback that unmounting the file system in that case was too drastic in that scenario. >> So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? All nodes which mount an encrypted file system should have proper setup for encryption, even for a node from where only unencrypted files are being accessed. 2) >> Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Correct. Data is not stored in the inode for encrypted files. On the other hand, since encryption metadata is stored as an extended attribute in the inode, 4K inodes are still recommended -- especially in cases where a more complicated encryption policy is used. 3) >> Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. 
Logs will fill with Key could not be fetched. errors. Using a backup key server is strongly recommended. While it's true that the files may still be accessed for a while if the key server becomes unreachable, this was not something to be counted on. First because keys (MEK) may expire at any time, requiring the key to be retrieved from the key server again. Second because a file may require a key may be needed that has not been cached before. 4) >> Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? Correct. >> If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Correct. We'll review the documentation to ensure that the meaning of the RkmId in the examples is clear. 5) >> Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? I'll work on getting clarifications from the ISKLM folks on this aspect. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Wahl, Edward" To: gpfsug main discussion list Date: 04/06/2017 10:55 AM Subject: Re: [gpfsug-discuss] Spectrum Scale Encryption Sent by: gpfsug-discuss-bounces at spectrumscale.org This is rather dependant on SS version. So what used to happen before 4.2.2.* is that a client would be unable to mount the filesystem in question and would give an error in the mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount the file system and then give errors on file access instead. (just tested this on 4.2.2.3) I'll have to read through the changelogs looking for this one. Depending on your policy for encryption then, this might be exactly what you want, but I REALLY REALLY dislike this behaviour. To me this means clients can now mount an encrypted FS now and then fail during operation. If I get a client node that comes up improperly, user work will start, and it will fail with "Operation not permitted" errors on file access. I imagine my batch system could run through a massive amount of jobs on a bad client without anyone noticing immeadiately. Yet another thing we now have to monitor now I guess. *shrug* A couple other gotcha's we've seen with Encryption: Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. Logs will fill with Key could not be fetched. errors. 
Ed Wahl OSC ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Thursday, April 06, 2017 4:20 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale Encryption We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.s cale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Fri Apr 7 15:58:29 2017 From: mweil at wustl.edu (Matt Weil) Date: Fri, 7 Apr 2017 09:58:29 -0500 Subject: [gpfsug-discuss] AFM gateways Message-ID: Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. 
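As background for the question above, the gateway role is a per-node designation rather than something inherited from the NSD server role; a short sketch (node names are hypothetical):

    # Designate two dedicated nodes as AFM gateways
    mmchnode --gateway -N afmgw01,afmgw02

    # Drop the gateway role again, e.g. from an NSD server
    mmchnode --nogateway -N nsd01

    # Node designations, which should include the gateway flag, appear in the
    # cluster listing
    mmlscluster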
From vpuvvada at in.ibm.com Mon Apr 10 11:56:16 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Mon, 10 Apr 2017 16:26:16 +0530 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: gpfsug main discussion list Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Sandra.McLaughlin at astrazeneca.com Mon Apr 10 12:20:53 2017 From: Sandra.McLaughlin at astrazeneca.com (McLaughlin, Sandra M) Date: Mon, 10 Apr 2017 11:20:53 +0000 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn't do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. 
More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Christian.Fey at sva.de Mon Apr 10 17:04:31 2017 From: Christian.Fey at sva.de (Fey, Christian) Date: Mon, 10 Apr 2017 16:04:31 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: Message-ID: <455e54150cd04cd8808619acbf7d8d2b@sva.de> Hi, I'm just dealing with a maybe similar issue that also seems to be related to the output of "tsctl shownodes up" (before CES i actually never had to do with this command). In my case the output of a "mmlscluster" for example shows the nodes like "node1.acme.local" but in " tsctl shownodes up" they are displayed as "node1.acme.local.acme.local" for example. This maybe causes a fresh CES implementation in a existing GPFS cluster to also not spread ip-adresses. It instead loops in the same way as it did in your case @jonathon. I think it tries to search for "node1.acme.local" but doesn't find it since tsctl shows it with doubled suffix. Can anyone explain, from where the "tsctl shownodes up" reads the data? Additionally does anyone have an idea why the dns suffix is doubled? Kind regards Christian -----Urspr?ngliche Nachricht----- Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Jonathon A Anderson Gesendet: Donnerstag, 23. M?rz 2017 16:02 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Achtung! Die Absender-Adresse ist m?glicherweise gef?lscht. 
Bitte ?berpr?fen Sie die Plausibilit?t der Email und lassen bei enthaltenen Anh?ngen und Links besondere Vorsicht walten. Wenden Sie sich im Zweifelsfall an das CIT unter cit at sva.de oder 06122 536 350. (Stichwort: DKIM Test Fehlgeschlagen) ---------------------------------------------------------------------------------------------------------------- Thanks! I?m looking forward to upgrading our CES nodes and resuming work on the project. ~jonathon On 3/23/17, 8:24 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Olaf Weiser" wrote: the issue is fixed, an APAR will be released soon - IV93100 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von:"Simon Thompson (Research Computing - IT Services)" An:"gpfsug main discussion list" Datum:Di. 31.01.2017 21:07Betreff:Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. 
~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? 
Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From service at metamodul.com Mon Apr 10 17:47:41 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 10 Apr 2017 18:47:41 +0200 (CEST) Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network Message-ID: <788130355.197989.1491842861235@email.1und1.de> An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Apr 10 17:58:36 2017 From: eric.wonderley at vt.edu (J. 
Eric Wonderley) Date: Mon, 10 Apr 2017 12:58:36 -0400 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: <788130355.197989.1491842861235@email.1und1.de> References: <788130355.197989.1491842861235@email.1und1.de> Message-ID: 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers wrote: > My understanding of the GPFS networks is not quite clear. > > For an GPFS setup i would like to use 2 Networks > > 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24 > > 2 Admin Network using for example: 192.168.1.0/24 network > > Questions > > 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - > Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. > network or is it sufficient for the tiebreaker node to be configured as > part of the admin 192.168.1 network ? > > 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 > network or is it sufficient for the remote cluster to access the 10.1.1 > network ? If so i assume that remotecluster commands and ping to/from > remote cluster are going via the Daemon network ? > > Note: > > I am aware and read https://www.ibm.com/developerworks/community/ > wikis/home?lang=en#!/wiki/General%20Parallel%20File% > 20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview > > -- > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 <+49%20177%204393994> > Mail: service at metamodul.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From laurence at qsplace.co.uk Mon Apr 10 18:13:08 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 10 Apr 2017 18:13:08 +0100 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de> Message-ID: <3a8f72c6-407a-0f4d-cf3c-f4698ca7b8e5@qsplace.co.uk> All nodes in a GPFS cluster need to be able to communicate over the data and admin network with the exception of remote clusters which can have their own separate admin network (for their own cluster that they are a member of) but still require communications over the daemon network. The networks can be routed and on different subnets, however the each member of the cluster will need to be able to communicate with every other member. With this in mind: 1) The quorum node will need to be accessible on both the 10.1.1.0/24 and 192.168.1.0/24 however again the network that the quorum node is on could be routed. 2) Remote clusters don't need access to the home clusters admin network, as they will use their own clusters admin network. 
As Eric has mentioned I would double check your 2+1 cluster suggestion, do you mean 2 x Servers with NSD's (with a quorum role) and 1 quorum node without NSD's? which gives you 3 quorum, or are you only going to have 1 quorum? If the latter that I would suggest using all 3 servers for quorum as they should be licensed as GPFS servers anyway due to their roles. -- Lauz On 10/04/2017 17:58, J. Eric Wonderley wrote: > 1) You want more that one quorum node on your server cluster. The > non-quorum node does need a daemon network interface exposed to the > client cluster as does the quorum nodes. > > 2) No. Admin network is for intra cluster communications...not inter > cluster(between clusters). Daemon interface(port 1191) is used for > communications between clusters. I think there is little benefit > gained by having designated an admin network...maybe someone can point > out benefits of an admin network. > > > > Eric Wonderley > > On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > > wrote: > > My understanding of the GPFS networks is not quite clear. > > For an GPFS setup i would like to use 2 Networks > > 1 Daemon (data) network using port 1191 using for example. > 10.1.1.0/24 > > 2 Admin Network using for example: 192.168.1.0/24 > network > > Questions > > 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) > Config - Does the Tiebreaker Node needs to have access to the > daemon(data) 10.1.1. network or is it sufficient for the > tiebreaker node to be configured as part of the admin 192.168.1 > network ? > > 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 > network or is it sufficient for the remote cluster to access the > 10.1.1 network ? If so i assume that remotecluster commands and > ping to/from remote cluster are going via the Daemon network ? > > Note: > > I am aware and read > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview > > > -- > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 > Mail: service at metamodul.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Apr 10 18:26:42 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 10 Apr 2017 17:26:42 +0000 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de>, Message-ID: If you have network congestion, then a separate admin network is of benefit. Maybe less important if you have 10GbE networks, but if (for example), you normally rely on IB to talk data, and gpfs fails back to the Ethernet (which may be only 1GbE), then you may have cluster issues, for example missing gpfs pings. Having a separate physical admin network can protect you from this. Having been bitten by this several years back, it's a good idea IMHO to have a separate admin network. 
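For anyone wanting to go that route, a minimal sketch of what splitting the two networks looks like on an existing cluster (the host names below are placeholders; the daemon interface carries the port 1191 traffic while the admin interface is used for the remote-shell admin commands -- check mmchnode in your release for the exact option names):

   # give each node a separate admin interface; daemon traffic stays on the existing name
   mmchnode --admin-interface=nsd1-adm.example.com -N nsd1.example.com
   mmchnode --admin-interface=nsd2-adm.example.com -N nsd2.example.com

   # confirm which daemon and admin node names are now in use
   mmlscluster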
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of J. Eric Wonderley [eric.wonderley at vt.edu] Sent: 10 April 2017 17:58 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > wrote: My understanding of the GPFS networks is not quite clear. For an GPFS setup i would like to use 2 Networks 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24 2 Admin Network using for example: 192.168.1.0/24 network Questions 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. network or is it sufficient for the tiebreaker node to be configured as part of the admin 192.168.1 network ? 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 network or is it sufficient for the remote cluster to access the 10.1.1 network ? If so i assume that remotecluster commands and ping to/from remote cluster are going via the Daemon network ? Note: I am aware and read https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From service at metamodul.com Mon Apr 10 18:44:47 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 10 Apr 2017 19:44:47 +0200 (CEST) Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de>, Message-ID: <795203366.199195.1491846287405@email.1und1.de> An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Apr 10 19:02:30 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 10 Apr 2017 21:02:30 +0300 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: <795203366.199195.1491846287405@email.1und1.de> References: <788130355.197989.1491842861235@email.1und1.de>, <795203366.199195.1491846287405@email.1und1.de> Message-ID: Hi Out of curiosity. Are you using Failure groups and doing replication of data/metadata too? If you you do need to deal with the file system descriptors as well on the 3rd node. Thanks From: Hans-Joachim Ehlers To: gpfsug main discussion list Date: 10/04/2017 20:44 Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry for not being clear. 
The setup is of course a 3 Node Cluster where each node is a quorum node - 2 NSD Server and 1 TieBreaker/Quorum Buster node. For me it was not clear if the Tiebreaker/Quorum Buster node - which does nothing in terms of data serving - must be part of the daemon/data network or not. So i get the understanding that a Tiebreaker Node must be also part of the Daemon network. Thx a lot to all Hajo "Simon Thompson (IT Research Support)" hat am 10. April 2017 um 19:26 geschrieben: If you have network congestion, then a separate admin network is of benefit. Maybe less important if you have 10GbE networks, but if (for example), you normally rely on IB to talk data, and gpfs fails back to the Ethernet (which may be only 1GbE), then you may have cluster issues, for example missing gpfs pings. Having a separate physical admin network can protect you from this. Having been bitten by this several years back, it's a good idea IMHO to have a separate admin network. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of J. Eric Wonderley [eric.wonderley at vt.edu] Sent: 10 April 2017 17:58 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > wrote: My understanding of the GPFS networks is not quite clear. For an GPFS setup i would like to use 2 Networks 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24< http://10.1.1.0/24> 2 Admin Network using for example: 192.168.1.0/24 network Questions 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. network or is it sufficient for the tiebreaker node to be configured as part of the admin 192.168.1 network ? 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 network or is it sufficient for the remote cluster to access the 10.1.1 network ? If so i assume that remotecluster commands and ping to/from remote cluster are going via the Daemon network ? Note: I am aware and read https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Apr 10 21:15:38 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 10 Apr 2017 15:15:38 -0500 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. 
This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mitsugi at linux.vnet.ibm.com Tue Apr 11 05:29:16 2017 From: mitsugi at linux.vnet.ibm.com (Masanori Mitsugi) Date: Tue, 11 Apr 2017 13:29:16 +0900 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Message-ID: Hello, Does anyone have experience to do mmapplypolicy against billion files for ILM/HSM? Currently I'm planning/designing * 1 Scale filesystem (5-10 PB) * 10-20 filesets which includes 1 billion files each And our biggest concern is "How log does it take for mmapplypolicy policy scan against billion files?" I know it depends on how to write the policy, but I don't have no billion files policy scan experience, so I'd like to know the order of time (min/hour/day...). It would be helpful if anyone has experience of such large number of files scan and let me know any considerations or points for policy design. -- Masanori Mitsugi mitsugi at linux.vnet.ibm.com From zgiles at gmail.com Tue Apr 11 05:49:10 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 11 Apr 2017 00:49:10 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. 
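To make that concrete, a rough sketch of an invocation along those lines (the file system path, work directory, node list and thread counts are placeholders and need sizing to the actual servers):

   mmapplypolicy /gpfs/fs1 -P policy.rules \
       -N nsd1,nsd2,nsd3,nsd4,nsd5,nsd6,nsd7,nsd8 \
       -g /gpfs/fs1/.policytmp \
       --choice-algorithm fast \
       -a 8 -m 24 -n 24 -A 1000 \
       -I test -L 1

   # '-I test' evaluates the rules without moving or deleting anything; drop it for the real run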
On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com From olaf.weiser at de.ibm.com Tue Apr 11 07:51:48 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 11 Apr 2017 08:51:48 +0200 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <455e54150cd04cd8808619acbf7d8d2b@sva.de> References: <455e54150cd04cd8808619acbf7d8d2b@sva.de> Message-ID: An HTML attachment was scrubbed... URL: From ckrafft at de.ibm.com Tue Apr 11 09:24:35 2017 From: ckrafft at de.ibm.com (Christoph Krafft) Date: Tue, 11 Apr 2017 10:24:35 +0200 Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? Message-ID: Hi folks, there is a list of storage devices that support SCSI-3 PR in the GPFS FAQ Doc (see Answer 4.5). https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html#scsi3 Since this list contains IBM V-model storage subsystems that include Storage Virtualization - I was wondering if SVC / Spectrum Virtualize can also support SCSI-3 PR (although not explicitly on the list)? Any hints and help is warmla welcome - thank you in advance. Mit freundlichen Gr??en / Sincerely Christoph Krafft Client Technical Specialist - Power Systems, IBM Systems Certified IT Specialist @ The Open Group Phone: +49 (0) 7034 643 2171 IBM Deutschland GmbH Mobile: +49 (0) 160 97 81 86 12 Am Weiher 24 Email: ckrafft at de.ibm.com 65451 Kelsterbach Germany IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Nicole Reimer, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Stefan Lutz Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1A788784.gif Type: image/gif Size: 1851 bytes Desc: not available URL: From p.childs at qmul.ac.uk Tue Apr 11 09:57:44 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Tue, 11 Apr 2017 08:57:44 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. 
The older one which was upgraded from GPFS 3.5 works fine: creating a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster, can take up to 30 seconds to create a directory but usually takes less than a second. The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (It's new, so we've not moved much of the data over yet.) But it can also happen randomly anywhere, including from the NSD servers themselves. (Times of 3-4 seconds from the NSD servers have been seen, on a single directory create.) We've been pointed at the network and suggested we check all network settings, and it's been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. It's a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However, as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one (although the delay is worst on the old gpfs cluster), so I'm really playing spot the difference, and the network is not really an obvious difference. It's been suggested to look at a trace when it occurs, but as it's difficult to recreate, collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London 
From jonathan at buzzard.me.uk Tue Apr 11 11:21:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Tue, 11 Apr 2017 11:21:05 +0100 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <1491906065.4102.87.camel@buzzard.me.uk> On Tue, 2017-04-11 at 00:49 -0400, Zachary Giles wrote: [SNIP] > * Then throw ~8 well tuned Infiniband attached nodes at it using -N, > If they're the same as the NSD servers serving the flash, even better. > Exactly how much are you going to gain from Infiniband over 40Gbps or even 100Gbps Ethernet? Not a lot I would have thought. Even with flash all your latency is going to be in the flash, not the Ethernet. Unless you have a compute cluster and need Infiniband for the MPI traffic, it is surely better to stick to Ethernet. Infiniband is rather esoteric, what I call a minority sport best avoided if at all possible. Even if you have an Infiniband fabric, I would argue that given current core counts and price points for 10Gbps Ethernet, you are actually better off keeping your storage traffic on the Ethernet and reserving the Infiniband for MPI duties. That is 10Gbps Ethernet to the compute nodes and 40/100Gbps Ethernet on the storage nodes. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. 
From zgiles at gmail.com Tue Apr 11 12:50:26 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 11 Apr 2017 07:50:26 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: <1491906065.4102.87.camel@buzzard.me.uk> References: <1491906065.4102.87.camel@buzzard.me.uk> Message-ID: Yeah, that can be true. I was just trying to show the size/shape that can achieve this. There's a good chance 10G or 40G ethernet would yield similar results, especially if you're running the policy on the NSD servers. 
On Tue, Apr 11, 2017 at 6:21 AM, Jonathan Buzzard wrote: > On Tue, 2017-04-11 at 00:49 -0400, Zachary Giles wrote: > > [SNIP] > >> * Then throw ~8 well tuned Infiniband attached nodes at it using -N, >> If they're the same as the NSD servers serving the flash, even better. >> > > Exactly how much are you going to gain from Infiniband over 40Gbps or > even 100Gbps Ethernet? Not a lot I would have thought. Even with flash > all your latency is going to be in the flash not the Ethernet. > > Unless you have a compute cluster and need Infiniband for the MPI > traffic, it is surely better to stick to Ethernet. Infiniband is rather > esoteric, what I call a minority sport best avoided if at all possible. > > Even if you have an Infiniband fabric, I would argue that give current > core counts and price points for 10Gbps Ethernet, that actually you are > better off keeping your storage traffic on the Ethernet, and reserving > the Infiniband for MPI duties. That is 10Gbps Ethernet to the compute > nodes and 40/100Gbps Ethernet on the storage nodes. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com From stockf at us.ibm.com Tue Apr 11 12:53:33 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 11 Apr 2017 07:53:33 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: As Zachary noted the location of your metadata is the key and for the scanning you have planned flash is necessary. If you have the resources you may consider setting up your flash in a mirrored RAID configuration (RAID1/RAID10) and have GPFS only keep one copy of metadata since the underlying storage is replicating it via the RAID. This should improve metadata write performance but likely has little impact on your scanning, assuming you are just reading through the metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Zachary Giles To: gpfsug main discussion list Date: 04/11/2017 12:49 AM Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Sent by: gpfsug-discuss-bounces at spectrumscale.org It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. 
On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Apr 11 16:18:01 2017 From: chair at spectrumscale.org (Spectrum Scale UG Chair (Simon Thompson)) Date: Tue, 11 Apr 2017 16:18:01 +0100 Subject: [gpfsug-discuss] May Meeting Registration Message-ID: Hi all, Just a reminder that the next UK user group meeting is taking place on 9th/10th May. If you are planning on attending, please do register at: https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi stration-32113696932 (or try https://goo.gl/tRptru ) As last year, this is a 2 day event and we're planning a fun evening event on the Tuesday night at Manchester Museum of Science. Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, OCF and Seagate for helping make this happen! We also still have some customer talk slots to fill, so please let me know if you are interested in speaking. Thanks Simon From bbanister at jumptrading.com Tue Apr 11 16:29:25 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:29:25 +0000 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <1e86aa0c2e4344f19cb5eedf8f03efa9@jumptrading.com> A word of caution, be careful about where you run this kind of policy scan as the sort process can consume all memory on your hosts and that could lead to issues with the OS deciding to kill off GPFS or other similar bad things can occur. I recommend restricting the ILM policy scan to a subset of servers, no quorum nodes, and ensuring at least one NSD server is available for all NSDs in the file system(s). Watch the memory consumption on your nodes during the sort operations to see if you need to tune that down in the mmapplypolicy options. Hope that helps, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: Tuesday, April 11, 2017 6:54 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM As Zachary noted the location of your metadata is the key and for the scanning you have planned flash is necessary. 
If you have the resources you may consider setting up your flash in a mirrored RAID configuration (RAID1/RAID10) and have GPFS only keep one copy of metadata since the underlying storage is replicating it via the RAID. This should improve metadata write performance but likely has little impact on your scanning, assuming you are just reading through the metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Zachary Giles > To: gpfsug main discussion list > Date: 04/11/2017 12:49 AM Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi > wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From k.leach at ed.ac.uk Tue Apr 11 16:32:41 2017 From: k.leach at ed.ac.uk (Kieran Leach) Date: Tue, 11 Apr 2017 16:32:41 +0100 Subject: [gpfsug-discuss] May Meeting Registration In-Reply-To: References: Message-ID: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> Hi Simon, would you be interested in a customer talk about the RDF (http://rdf.ac.uk/). We manage the RDF at EPCC, providing a 23PB filestore to complement ARCHER (the national research HPC service) and other UK Research HPC services. This is of course a GPFS system. If you've any questions or want more info please let me know but I thought I'd get an email off to you while I remember. Cheers Kieran On 11/04/17 16:18, Spectrum Scale UG Chair (Simon Thompson) wrote: > Hi all, > > Just a reminder that the next UK user group meeting is taking place on > 9th/10th May. If you are planning on attending, please do register at: > > https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi > stration-32113696932 > > > (or try https://goo.gl/tRptru ) > > As last year, this is a 2 day event and we're planning a fun evening event > on the Tuesday night at Manchester Museum of Science. > > Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, > OCF and Seagate for helping make this happen! > > We also still have some customer talk slots to fill, so please let me know > if you are interested in speaking. > > Thanks > > Simon > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From k.leach at ed.ac.uk Tue Apr 11 16:33:29 2017 From: k.leach at ed.ac.uk (Kieran Leach) Date: Tue, 11 Apr 2017 16:33:29 +0100 Subject: [gpfsug-discuss] May Meeting Registration In-Reply-To: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> References: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> Message-ID: Apologies all, wrong reply button. Cheers Kieran On 11/04/17 16:32, Kieran Leach wrote: > Hi Simon, > would you be interested in a customer talk about the RDF > (http://rdf.ac.uk/). We manage the RDF at EPCC, providing a 23PB > filestore to complement ARCHER (the national research HPC service) and > other UK Research HPC services. This is of course a GPFS system. If > you've any questions or want more info please let me know but I > thought I'd get an email off to you while I remember. > > Cheers > > Kieran > > On 11/04/17 16:18, Spectrum Scale UG Chair (Simon Thompson) wrote: >> Hi all, >> >> Just a reminder that the next UK user group meeting is taking place on >> 9th/10th May. If you are planning on attending, please do register at: >> >> https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi >> >> stration-32113696932 >> >> >> (or try https://goo.gl/tRptru ) >> >> As last year, this is a 2 day event and we're planning a fun evening >> event >> on the Tuesday night at Manchester Museum of Science. >> >> Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, >> OCF and Seagate for helping make this happen! 
>> >> We also still have some customer talk slots to fill, so please let me >> know >> if you are interested in speaking. >> >> Thanks >> >> Simon >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From makaplan at us.ibm.com Tue Apr 11 16:36:47 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Apr 2017 11:36:47 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: As primary developer of mmapplypolicy, please allow me to comment: 1) Fast access to metadata in system pool is most important, as several have commented on. These days SSD is the favorite, but you can still go with "spinning" media. If you do go with disks, it's extremely important to spread your metadata over independent disk "arms" -- so you can have many concurrent seeks in progress at the same time. IOW, if there is a virtualization/mapping layer, watchout that your logical disks don't get mapped to the same physical disk. 2) Crucial to use both -g and -N :: -g /gpfs-not-necessarily-the-same-fs-as-Im-scanning/tempdir and -N several-nodes-that-will-be-accessing-the-system-pool 3a) If at all possible, encourage your data and application designers to "pack" their directories with lots of files. Keep in mind that, mmapplypolicy will read every directory. The more directories, the more seeks, more time spent waiting for IO. OTOH, in more typical Unix/Linux usage, we tend to low average number of files per directory. 3b) As admin, you may not be able to change your data design to pack hundreds of files per directory, BUT you can make sure you are running a sufficiently modern release of Spectrum Scale that supports "data in inode" -- "Data in inode" also means "directory entries in inode" -- which means practically any small directory, up to a few hundred files, will fit in an an inode -- which means mmapplypolicy can read small directories with one seek, instead of two. (Someone will please remind us of the release number that first supported "directories in inode".) 4) Sorry, Fred, but the recommendation to use RAID mirroring of metadata on SSD, is not necessarily, important for metadata scanning. In fact it may work against you. If you use GPFS replication of metadata - that can work for you -- since then GPFS can direct read operations to either copy, preferring a locally attached copy, depending on how storage is attached to node, etc, etc. Choice of how to replicate metadata - either using GPFS replication or the RAID controller - is probably best made based on reliability and recoverability requirements. 5) YMMV - We'd love to hear/see your performance results for mmapplypolicy, especially if they're good. Even if they're bad, come back here for more tuning tips! -- marc of Spectrum Scale (ne GPFS) -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Tue Apr 11 16:51:56 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:51:56 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). 
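For example, a quick first pass with those tools while the slowness is actually happening might look like the following (all read-only queries, run on an affected client and on the NSD servers; mmhealth needs 4.2.1 or later):

   mmdiag --waiters      # long waiters usually name the node or resource being waited on
   mmdiag --network      # per-connection RPC state, pending messages, broken sockets
   mmhealth node show    # component health summary, if available at your code level
   mmlsconfig            # diff the tuning between the old and the new cluster
   mmlsfs all            # compare block size, inode size and replication between the file systems
   iostat -x 2 10        # confirm the new metadata SSDs are not saturated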
I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From S.J.Thompson at bham.ac.uk Tue Apr 11 16:55:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 11 Apr 2017 15:55:35 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. 
We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. 
This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jonathon.anderson at colorado.edu Tue Apr 11 16:56:56 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Tue, 11 Apr 2017 15:56:56 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: Bryan, That looks like a really useful set of presentation slides! Thanks for sharing! Which one in particular is the one Yuri gave that you?re referring to? ~jonathon On 4/11/17, 9:51 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bbanister at jumptrading.com Tue Apr 11 16:59:51 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:59:51 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Problem Determination and GPFS Internals. My security group won't let me go to the google docs site from my work compute... I'm sure there is malicious malware on that site!! j/k, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathon A Anderson Sent: Tuesday, April 11, 2017 10:57 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories Bryan, That looks like a really useful set of presentation slides! Thanks for sharing! Which one in particular is the one Yuri gave that you?re referring to? ~jonathon On 4/11/17, 9:51 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. 
However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From p.childs at qmul.ac.uk Tue Apr 11 20:35:40 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Tue, 11 Apr 2017 19:35:40 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Can you remember what version you were running? Don't worry if you can't remember. It looks like ibm may have withdrawn 4.2.1 from fix central and wish to forget its existences. Never a good sign, 4.2.0, 4.2.2, 4.2.3 and even 3.5, so maybe upgrading is worth a try. I've looked at all the standard trouble shouting guides and got nowhere hence why I asked. But another set of slides always helps. Thank-you for the help, still head scratching.... Which only makes the issue more random. 
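If it would help to catch one of the slow creates in the act, a rough loop like the sketch below could be left running on a busy client; the test path and the one-second threshold are arbitrary placeholders, and the waiter snapshot it saves is usually enough to attach to a PMR:

   #!/bin/bash
   # time directory creates and capture the GPFS waiter list whenever one is slow
   testdir=/gpfs/newfs/mkdirtest        # placeholder path on the new file system
   mkdir -p "$testdir"
   while true; do
       d="$testdir/$(date +%s%N)"
       t0=$(date +%s.%N)
       mkdir "$d"
       t1=$(date +%s.%N)
       elapsed=$(echo "$t1 - $t0" | bc)
       if [ "$(echo "$elapsed > 1.0" | bc)" -eq 1 ]; then
           { echo "$(date): mkdir took ${elapsed}s"
             /usr/lpp/mmfs/bin/mmdiag --waiters; } >> /tmp/slow-mkdir.log 2>&1
       fi
       rmdir "$d"
       sleep 5
   done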
Peter Childs Research Storage ITS Research and Teaching Support Queen Mary, University of London ---- Simon Thompson (IT Research Support) wrote ---- We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. 
Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mitsugi at linux.vnet.ibm.com Wed Apr 12 02:51:03 2017 From: mitsugi at linux.vnet.ibm.com (Masanori Mitsugi) Date: Wed, 12 Apr 2017 10:51:03 +0900 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <0851d194-088e-d93a-303d-ceb0de3dbaa8@linux.vnet.ibm.com> Marc, Zachary, Fred, Bryan, Thank you for providing great advice! It's pretty useful for me to tune our policy with best performance. As for "directories in inode", we plan to use latest version, so I believe we can leverage this function. -- Masanori Mitsugi mitsugi at linux.vnet.ibm.com From vpuvvada at in.ibm.com Wed Apr 12 10:53:25 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Wed, 12 Apr 2017 15:23:25 +0530 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> References: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Message-ID: Gateway node requires server license. ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: Date: 04/11/2017 01:46 AM Subject: Re: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. 
Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: gpfsug main discussion list Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Wed Apr 12 15:52:48 2017 From: mweil at wustl.edu (Matt Weil) Date: Wed, 12 Apr 2017 09:52:48 -0500 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Message-ID: yes it tells you that when you attempt to make the node a gateway and is does not have a server license designation. On 4/12/17 4:53 AM, Venkateswara R Puvvada wrote: Gateway node requires server license. ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: Date: 04/11/2017 01:46 AM Subject: Re: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? 
On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. 
For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Apr 12 22:01:45 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 12 Apr 2017 14:01:45 -0700 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> On 4/11/17 8:36 AM, Marc A Kaplan wrote: > > 5) YMMV - We'd love to hear/see your performance results for > mmapplypolicy, especially if they're good. Even if they're bad, come > back here for more tuning tips! I have a filesystem that currently has 267919775 (roughly quarter billion, 250 million) used inodes. The metadata is on SSD behind a DDN 12K. We do use 4K inodes, and files smaller than 4K fit into the inodes. Here is the command I use to apply a policy: mmapplypolicy gsfs0 -P policy.txt -N scg-gs0,scg-gs1,scg-gs2,scg-gs3,scg-gs4,scg-gs5,scg-gs6,scg-gs7 -g /srv/gsfs0/admin_stuff/ -I test -B 500 -A 61 -a 4 That takes approximately 10 minutes to do the whole scan. The "-B 500 -A 61 -a 4" numbers we determined just by trying different values with the same policy file and seeing the resulting scan duration. 10mins is short enough to do almost "interactive" type of file list policies and look at the results. E.g. list all files over 1TB in size. This was a couple of years ago, probably on a different GPFS version, but on same storage and NSD hardware, so now I just copy those parameters. You should probably not just copy them but try some other values yourself. 
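The "interactive" style of query mentioned above can be as small as a two-line policy file; this is only a sketch, with the 1 TiB threshold and the list name chosen arbitrarily:

   /* listbig.pol -- list every file with more than 1 TiB allocated */
   RULE EXTERNAL LIST 'bigfiles' EXEC ''
   RULE 'find1TB' LIST 'bigfiles' WHERE KB_ALLOCATED > 1073741824

Something like mmapplypolicy gsfs0 -P listbig.pol -I defer -f /srv/gsfs0/admin_stuff/bigfiles, reusing the same -N, -g, -B, -A and -a values as above, should then leave the matches in a file list named after the 'bigfiles' rule rather than executing anything.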
Regards, Alex From makaplan at us.ibm.com Wed Apr 12 23:43:20 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 12 Apr 2017 18:43:20 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> References: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> Message-ID: >>>Here is the command I use to apply a policy: mmapplypolicy gsfs0 -P policy.txt -N scg-gs0,scg-gs1,scg-gs2,scg-gs3,scg-gs4,scg-gs5,scg-gs6,scg-gs7 -g /srv/gsfs0/admin_stuff/ -I test -B 500 -A 61 -a 4 That takes approximately 10 minutes to do the whole scan. The "-B 500 -A 61 -a 4" numbers we determined just by trying different values with the same policy file and seeing the resulting scan duration. <<< That's pretty good. BUT, FYI, the -A number-of-buckets parameter should be scaled with the total number of files you expect to find in the argument filesystem or directory. If you don't set it the command will default to number-of-inodes-allocated / million, but capped at a minimum of 7 and a maximum of 4096. -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.childs at qmul.ac.uk Thu Apr 13 11:35:19 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 13 Apr 2017 10:35:19 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: , Message-ID: After a load more debugging, and switching off the quota's the issue looks to be quota related. in that the issue has gone away since I switched quota's off. I will need to switch them back on, but at least we know the issue is not the network and is likely to be fixed by upgrading..... Peter Childs ITS Research Infrastructure Queen Mary, University of London ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Peter Childs Sent: Tuesday, April 11, 2017 8:35:40 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories Can you remember what version you were running? Don't worry if you can't remember. It looks like ibm may have withdrawn 4.2.1 from fix central and wish to forget its existences. Never a good sign, 4.2.0, 4.2.2, 4.2.3 and even 3.5, so maybe upgrading is worth a try. I've looked at all the standard trouble shouting guides and got nowhere hence why I asked. But another set of slides always helps. Thank-you for the help, still head scratching.... Which only makes the issue more random. Peter Childs Research Storage ITS Research and Teaching Support Queen Mary, University of London ---- Simon Thompson (IT Research Support) wrote ---- We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). 
I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. 
>_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Fri Apr 14 08:34:06 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 14 Apr 2017 15:34:06 +0800 Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? In-Reply-To: References: Message-ID: If you can use " mmchconfig usePersistentReserve=yes" successfully, then it is supported, we will check the compatibility during the command, and you can also use "tsprinquiry device(no /dev prefix)" check the vendor output. Thanks. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Christoph Krafft" To: "gpfsug main discussion list" Cc: Achim Christ , Petra Christ Date: 04/11/2017 04:25 PM Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi folks, there is a list of storage devices that support SCSI-3 PR in the GPFS FAQ Doc (see Answer 4.5). https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html#scsi3 Since this list contains IBM V-model storage subsystems that include Storage Virtualization - I was wondering if SVC / Spectrum Virtualize can also support SCSI-3 PR (although not explicitly on the list)? Any hints and help is warmla welcome - thank you in advance. Mit freundlichen Gr??en / Sincerely Christoph Krafft Client Technical Specialist - Power Systems, IBM Systems Certified IT Specialist @ The Open Group Phone: +49 (0) 7034 643 2171 IBM Deutschland GmbH Mobile: +49 (0) 160 97 81 86 12 Am Weiher 24 Email: ckrafft at de.ibm.com 65451 Kelsterbach Germany IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Nicole Reimer, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Stefan Lutz Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 1A223532.gif Type: image/gif Size: 1851 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sun Apr 16 14:47:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sun, 16 Apr 2017 13:47:20 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Message-ID: Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Sun Apr 16 17:20:15 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Sun, 16 Apr 2017 16:20:15 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Message-ID: <252ABBB2-7E94-41F6-AD76-B6D836E5C916@nuance.com> I think the first thing I would do is turn up the ?-L? level to a large value (like ?6?) and see what it tells you about files that are being chosen and which ones aren?t being migrated and why. You could run it in test mode, write the output to a file and see what it says. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of "Buterbaugh, Kevin L" Reply-To: gpfsug main discussion list Date: Sunday, April 16, 2017 at 8:47 AM To: gpfsug main discussion list Subject: [EXTERNAL] [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Sun Apr 16 20:15:40 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sun, 16 Apr 2017 15:15:40 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! 
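In other words, the predicted 98% falls straight out of the candidate totals; checking the reckoning with the numbers shown above:

      55,365,193,728 KB   currently in gpfs23capacity
   +  67,355,430,720 KB   chosen by 'OldStuff' (into the pool)
   -     236,745,504 KB   chosen by 'INeedThatAfterAll' (out of the pool)
   = 122,483,878,944 KB   predicted occupancy
                          122,483,878,944 / 124,983,549,952 = 0.98, the LIMIT(98) ceiling

A dry run at a higher -L level, as Bob suggests, shows the per-file choices behind those totals without moving anything; for example (node list, work directory and output path are placeholders):

   mmapplypolicy gpfs23 -P /path/to/gpfs23_migration.policy -I test -L 4 \
       -N nsdserver1,nsdserver2 -g /gpfs23/policytmp \
       > /tmp/gpfs23_policy_test.out 2>&1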
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From makaplan at us.ibm.com Sun Apr 16 20:39:21 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sun, 16 Apr 2017 15:39:21 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Correction: So that's why it chooses to migrate "only" 67TB.... (67000 GB) -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Apr 17 16:24:02 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 17 Apr 2017 15:24:02 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? 
well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. 
RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Mon Apr 17 19:49:12 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 17 Apr 2017 11:49:12 -0700 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: <09e154ef-15ed-3217-db65-51e693e28faa@stanford.edu> Hi Kevin, IMHO, safe to just run it again. You can also run it with '-I test -L 6' again and look through the output. But I don't think you can "break" anything by having it scan and/or move data. Can you post the full command line that you use to run it? The behavior you describe is odd; you say it prints out the "files migrated successfully" message, but the files didn't actually get migrated? Turn up the debug param and have it print every file as it is moving it or something. 
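Spelled out, that kind of dry run might look something like the following sketch (only an illustration: the policy file path is the one shown in the log excerpt above, the output file name is made up, and any node-list or work-directory options normally used on this filesystem would stay on the command line):

    # evaluate the rules without moving anything (-I test) and log each decision in detail (-L 6)
    /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -P /root/gpfs/gpfs23_migration.policy \
        -I test -L 6 > /tmp/gpfs23_policy_test.out 2>&1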
Regards, Alex On 4/17/17 8:24 AM, Buterbaugh, Kevin L wrote: > Hi Marc, > > I do understand what you?re saying about mmapplypolicy deciding it only > needed to move ~1.8 million files to fill the capacity pool to ~98% > full. However, it is now more than 24 hours since the mmapplypolicy > finished ?successfully? and: > > Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) > eon35Ansd 58.2T 35 No Yes 29.66T ( > 51%) 64.16G ( 0%) > eon35Dnsd 58.2T 35 No Yes 29.66T ( > 51%) 64.61G ( 0%) > ------------- > -------------------- ------------------- > (pool total) 116.4T 59.33T ( > 51%) 128.8G ( 0%) > > And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the > partially redacted command line: > > /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g another gpfs filesystem> -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy > -N some,list,of,NSD,server,nodes > > And here?s that policy file: > > define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) > define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) > > RULE 'OldStuff' > MIGRATE FROM POOL 'gpfs23data' > TO POOL 'gpfs23capacity' > LIMIT(98) > WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) > > RULE 'INeedThatAfterAll' > MIGRATE FROM POOL 'gpfs23capacity' > TO POOL 'gpfs23data' > LIMIT(75) > WHERE (access_age < 14) > > The one thing that has changed is that formerly I only ran the migration > in one direction at a time ? i.e. I used to have those two rules in two > separate files and would run an mmapplypolicy using the OldStuff rule > the 1st weekend of the month and run the other rule the other weekends > of the month. This is the 1st weekend that I attempted to run an > mmapplypolicy that did both at the same time. Did I mess something up > with that? > > I have not run it again yet because we also run migrations on the other > filesystem that we are still in the process of migrating off of. So > gpfs23 goes 1st and as soon as it?s done the other filesystem migration > kicks off. I don?t like to run two migrations simultaneously if at all > possible. The 2nd migration ran until this morning, when it was > unfortunately terminated by a network switch crash that has also had me > tied up all morning until now. :-( > > And yes, there is something else going on ? well, was going on - the > network switch crash killed this too ? I have been running an rsync on > one particular ~80TB directory tree from the old filesystem to gpfs23. > I understand that the migration wouldn?t know about those files and > that?s fine ? I just don?t understand why mmapplypolicy said it was > going to fill the capacity pool to 98% but didn?t do it ? wait, > mmapplypolicy hasn?t gone into politics, has it?!? ;-) > > Thanks - and again, if I should open a PMR for this please let me know... > > Kevin > >> On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > > wrote: >> >> Let's look at how mmapplypolicy does the reckoning. >> Before it starts, it see your pools as: >> >> [I] GPFS Current Data Pool Utilization in KB and % >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 55365193728 124983549952 44.297984614% >> gpfs23data 166747037696 343753326592 48.507759721% >> system 0 0 >> 0.000000000% (no user data) >> [I] 75142046 of 209715200 inodes used: 35.830520%. >> >> Your rule says you want to migrate data to gpfs23capacity, up to 98% full: >> >> RULE 'OldStuff' >> MIGRATE FROM POOL 'gpfs23data' >> TO POOL 'gpfs23capacity' >> LIMIT(98) WHERE ... >> >> We scan your files and find and reckon... 
>> [I] Summary of Rule Applicability and File Choices: >> Rule# Hit_Cnt KB_Hit Chosen KB_Chosen >> KB_Ill Rule >> 0 5255960 237675081344 1868858 67355430720 >> 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO >> POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) >> >> So yes, 5.25Million files match the rule, but the utility chooses >> 1.868Million files that add up to 67,355GB and figures that if it >> migrates those to gpfs23capacity, >> (and also figuring the other migrations by your second rule)then >> gpfs23 will end up 97.9999% full. >> We show you that with our "predictions" message. >> >> Predicted Data Pool Utilization in KB and %: >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 122483878944 124983549952 97.999999993% >> gpfs23data 104742360032 343753326592 30.470209865% >> >> So that's why it chooses to migrate "only" 67GB.... >> >> See? Makes sense to me. >> >> Questions: >> Did you run with -I yes or -I defer ? >> >> Were some of the files illreplicated or illplaced? >> >> Did you give the cluster-wide space reckoning protocols time to see >> the changes? mmdf is usually "behind" by some non-neglible amount of >> time. >> >> What else is going on? >> If you're moving or deleting or creating data by other means while >> mmapplypolicy is running -- it doesn't "know" about that! >> >> Run it again! >> >> >> >> >> >> From: "Buterbaugh, Kevin L" > > >> To: gpfsug main discussion list >> > > >> Date: 04/16/2017 09:47 AM >> Subject: [gpfsug-discuss] mmapplypolicy didn't migrate >> everything it should have - why not? >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Hi All, >> >> First off, I can open a PMR for this if I need to. Second, I am far >> from an mmapplypolicy guru. With that out of the way ? I have an >> mmapplypolicy job that didn?t migrate anywhere close to what it could >> / should have. From the log file I have it create, here is the part >> where it shows the policies I told it to invoke: >> >> [I] Qos 'maintenance' configured as inf >> [I] GPFS Current Data Pool Utilization in KB and % >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 55365193728 124983549952 44.297984614% >> gpfs23data 166747037696 343753326592 48.507759721% >> system 0 0 >> 0.000000000% (no user data) >> [I] 75142046 of 209715200 inodes used: 35.830520%. >> [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. >> Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC >> Parsed 2 policy rules. >> >> RULE 'OldStuff' >> MIGRATE FROM POOL 'gpfs23data' >> TO POOL 'gpfs23capacity' >> LIMIT(98) >> WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND >> (KB_ALLOCATED > 3584)) >> >> RULE 'INeedThatAfterAll' >> MIGRATE FROM POOL 'gpfs23capacity' >> TO POOL 'gpfs23data' >> LIMIT(75) >> WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) >> >> And then the log shows it scanning all the directories and then says, >> "OK, here?s what I?m going to do": >> >> [I] Summary of Rule Applicability and File Choices: >> Rule# Hit_Cnt KB_Hit Chosen KB_Chosen >> KB_Ill Rule >> 0 5255960 237675081344 1868858 67355430720 >> 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO >> POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) >> 1 611 236745504 611 236745504 >> 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL >> 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) >> >> [I] Filesystem objects with no applicable rules: 414911602. 
>> >> [I] GPFS Policy Decisions and File Choice Totals: >> Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; >> Predicted Data Pool Utilization in KB and %: >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 122483878944 124983549952 97.999999993% >> gpfs23data 104742360032 343753326592 30.470209865% >> system 0 0 >> 0.000000000% (no user data) >> >> Notice that it says it?s only going to migrate less than 2 million of >> the 5.25 million candidate files!! And sure enough, that?s all it did: >> >> [I] A total of 1869469 files have been migrated, deleted or processed >> by an EXTERNAL EXEC/script; >> 0 'skipped' files and/or errors. >> >> And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere >> near 98% full: >> >> Disks in storage pool: gpfs23capacity (Maximum disk size allowed is >> 519 TB) >> eon35Ansd 58.2T 35 No Yes 29.54T ( >> 51%) 63.93G ( 0%) >> eon35Dnsd 58.2T 35 No Yes 29.54T ( >> 51%) 64.39G ( 0%) >> ------------- >> -------------------- ------------------- >> (pool total) 116.4T 59.08T ( >> 51%) 128.3G ( 0%) >> >> I don?t understand why it only migrated a small subset of what it >> could / should have? >> >> We are doing a migration from one filesystem (gpfs21) to gpfs23 and I >> really need to stuff my gpfs23capacity pool as full of data as I can >> to keep the migration going. Any ideas anyone? Thanks in advance? >> >> ? >> Kevin Buterbaugh - Senior System Administrator >> Vanderbilt University - Advanced Computing Center for Research and >> Education >> _Kevin.Buterbaugh at vanderbilt.edu_ >> - (615)875-9633 From makaplan at us.ibm.com Mon Apr 17 21:11:18 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 17 Apr 2017 16:11:18 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? 
GPFS) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... 
[I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. 
[I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Mon Apr 17 21:18:42 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 17 Apr 2017 16:18:42 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Oops... If you want to see the list of what would be migrated '-I test -L 2' If you want to migrate and see each file migrated '-I yes -L 2' I don't recommend -L 4 or higher, unless you want to see the files that do not match your rules. -L 3 will show you all the files that match the rules, including those that are NOT chosen for migration. See the command gu -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Apr 17 22:16:57 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 17 Apr 2017 21:16:57 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? 
the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan > wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? 
and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. 
Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. 
And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Apr 18 14:31:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 18 Apr 2017 13:31:20 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> Message-ID: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Hi All, but especially Marc, I ran the mmapplypolicy again last night and, unfortunately, it again did not fill the capacity pool like it said it would. From the log file: [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 3632859 181380873184 1620175 61434283936 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 88 99230048 88 99230048 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 442962867. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 61533513984KB: 1620263 of 3632947 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878464 124983549952 97.999999609% gpfs23data 128885076416 343753326592 37.493477574% system 0 0 0.000000000% (no user data) [I] 2017-04-18 at 02:52:48.402 Policy execution. 0 files dispatched. And the tail end of the log file says that it moved those files: [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. 
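One way to spot-check whether those dispatched files really changed pools is to ask GPFS where a few of them now live; a rough sketch (the file path below is made up for illustration, and this assumes mmlsattr -L is used to report a file's storage pool):

    # the "storage pool name:" line should read gpfs23capacity for a file
    # that the policy run claims to have migrated
    /usr/lpp/mmfs/bin/mmlsattr -L /gpfs23/some/old/file.dat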
But mmdf (and how quickly the mmapplypolicy itself ran) say otherwise: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.73T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.73T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.45T ( 51%) 128.8G ( 0%) Ideas? Or is it time for me to open a PMR? Thanks? Kevin On Apr 17, 2017, at 4:16 PM, Buterbaugh, Kevin L > wrote: Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan > wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... 
[I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. 
[I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Tue Apr 18 14:56:43 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 18 Apr 2017 09:56:43 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: Kevin, Here's a silly theory: Have you tried putting a weight value in? I wonder if during migration it hits some large file that would go over the threshold and stops. With a weight flag you could move all small files in first or by lack of heat etc to pack the tier more tightly. Just something else to try before the PMR process. Zach On Apr 18, 2017 9:32 AM, "Buterbaugh, Kevin L" < Kevin.Buterbaugh at vanderbilt.edu> wrote: Hi All, but especially Marc, I ran the mmapplypolicy again last night and, unfortunately, it again did not fill the capacity pool like it said it would. 
From the log file: [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 3632859 181380873184 1620175 61434283936 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 88 99230048 88 99230048 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 442962867. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 61533513984KB: 1620263 of 3632947 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878464 124983549952 97.999999609% gpfs23data 128885076416 343753326592 37.493477574% system 0 0 0.000000000% (no user data) [I] 2017-04-18 at 02:52:48.402 Policy execution. 0 files dispatched. And the tail end of the log file says that it moved those files: [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. But mmdf (and how quickly the mmapplypolicy itself ran) say otherwise: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.73T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.73T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.45T ( 51%) 128.8G ( 0%) Ideas? Or is it time for me to open a PMR? Thanks? Kevin On Apr 17, 2017, at 4:16 PM, Buterbaugh, Kevin L < Kevin.Buterbaugh at Vanderbilt.Edu> wrote: Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. 
They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ------------------------------ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? 
I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan <*makaplan at us.ibm.com* > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" <*Kevin.Buterbaugh at Vanderbilt.Edu* > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: *gpfsug-discuss-bounces at spectrumscale.org* ------------------------------ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. >From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. 
RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education *Kevin.Buterbaugh at vanderbilt.edu* - (615)875-9633 <(615)%20875-9633> _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at *spectrumscale.org* *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at *spectrumscale.org* *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 <(615)%20875-9633> _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Apr 18 16:11:19 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 18 Apr 2017 11:11:19 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... 
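(For the tally step above, a quick pass over the saved '-I test -L 2' output along these lines should do it - a sketch, not tested, which assumes the chosen-file lines end with something like the SHOW( 1024 n=1) example and that the output was saved to /tmp/mmapplypolicy-L2.out; both the exact line format and the file name are assumptions:)

awk '/TO POOL .gpfs23capacity./ && match($0, /SHOW\( *[0-9]+/) {
        kb = substr($0, RSTART, RLENGTH)   # e.g. "SHOW( 1024"
        gsub(/[^0-9]/, "", kb)             # keep just the KB number
        total += kb
     }
     END { printf "chosen for gpfs23capacity: %.2f TB\n", total / (1024 * 1024 * 1024) }' /tmp/mmapplypolicy-L2.out

(The '.' on either side of gpfs23capacity just stands in for the single quotes in the policy output, so the whole awk program can stay inside one pair of shell quotes.)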
So sorry, something unusual about your installation or usage... -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Tue Apr 18 16:31:12 2017 From: david_johnson at brown.edu (David D. Johnson) Date: Tue, 18 Apr 2017 11:31:12 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> I have an observation, which may merely serve to show my ignorance: Is it significant that the words "EXTERNAL EXEC/script? are seen below? If migrating between storage pools within the cluster, I would expect the PIT engine to do the migration. When doing HSM (off cluster, tape libraries, etc) is where I would expect to need a script to actually do the work. > [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. > [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; > 0 'skipped' files and/or errors. ? ddj Dave Johnson Brown University > On Apr 18, 2017, at 11:11 AM, Marc A Kaplan wrote: > > ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? > > ------ > > Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. > > So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? > > Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... > > While we're waiting for that... Here's what I suggest next. > > Add a clause ... > > SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) > > before the WHERE clause to each of your rules. > > Re-run the command with options '-I test -L 2' and collect the output. > > We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... > > You should see 1.6 million lines that look kind of like this: > > /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) > > Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed > add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). > > That sanity checks the policy arithmetic. Let's assume that's okay. > > Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as > find some of the biggest of those files and check that they really are that big.... > > At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... > and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... > > HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are > not recognized by mmapplypolicy as sharing storage... 
> This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files?
>
> The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ...
> Optimistically that means it works fine for most customers...
>
> So sorry, something unusual about your installation or usage...
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From makaplan at us.ibm.com Tue Apr 18 17:06:16 2017
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Tue, 18 Apr 2017 12:06:16 -0400
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu>
References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu>
Message-ID: 

That is a summary message. It says one way or another, the command has dealt with 1.6 million files. For the case under discussion there are no EXTERNAL pools, nor any DELETions, just intra-GPFS MIGRATions.

[I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;
0 'skipped' files and/or errors.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From Kevin.Buterbaugh at Vanderbilt.Edu Tue Apr 18 17:32:24 2017
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Tue, 18 Apr 2017 16:32:24 +0000
Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To: 
References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>
Message-ID: <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu>

Hi Marc,

Two things:

1. I have a PMR open now.

2. You *may* have identified the problem - I'm still checking - but files with hard links may be our problem.

I wrote a simple Perl script to iterate over the log file I had mmapplypolicy create. Here's the code (don't laugh, I'm a SysAdmin, not a programmer, and I whipped this out in < 5 minutes - and yes, I realize the fact that I used Perl instead of Python shows my age as well):

#!/usr/bin/perl
#
use strict;
use warnings;

my $InputFile = "/tmp/mmapplypolicy.gpfs23.log";
my $TotalFiles = 0;
my $TotalLinks = 0;
my $TotalSize = 0;

open INPUT, $InputFile or die "Couldn\'t open $InputFile for read: $!\n";
while (<INPUT>) {                    # read the mmapplypolicy log line by line
   next unless /MIGRATED/;           # only count lines for files that were migrated
   $TotalFiles++;
   my $FileName = (split / /)[3];
   if ( -f $FileName ) {             # some files may have been deleted since mmapplypolicy ran
      my ($NumLinks, $FileSize) = (stat($FileName))[3,7];
      $TotalLinks += $NumLinks;
      $TotalSize += $FileSize;
   }
}
close INPUT;
print "Number of files / links = $TotalFiles / $TotalLinks, Total size = $TotalSize\n";
exit 0;

And here's what it kicked out:

Number of files / links = 1620263 / 80818483, Total size = 53966202814094

1.6 million files but 80 million hard links!!! I'm doing some checking right now, but it appears that it is one particular group - and therefore one particular fileset - that is responsible for this - they've got thousands of files with 50 or more hard links each - and they're not inconsequential in size. 
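(For that kind of checking, plain GNU find is enough to spot the worst offenders - a sketch with a made-up fileset path; -links +1 matches anything with more than one hard link, %n is the link count, %k the allocated size in 1K blocks:)

find /gpfs23/some_suspect_fileset -type f -links +1 -printf '%n %k %p\n' | sort -rn | head -20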
IIRC (and keep in mind I?m far from a GPFS policy guru), there is a way to say something to the effect of ?and the path does not contain /gpfs23/fileset/path? ? may need a little help getting that right. I?ll post this information to the ticket as well but wanted to update the list. This wouldn?t be the first time we were an ?edge case? for something in GPFS? ;-) Thanks... Kevin On Apr 18, 2017, at 10:11 AM, Marc A Kaplan > wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Apr 18 17:56:11 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 18 Apr 2017 12:56:11 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? hard links! 
A workaround In-Reply-To: <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu> Message-ID: Kevin, Wow. Never underestimate the power of ... Anyhow try this as a fix. Add the clause SIZE(KB_ALLOCATED/NLINK) to your MIGRATE rules. This spreads the total actual size over each hardlink... From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/18/2017 12:33 PM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Two things: 1. I have a PMR open now. 2. You *may* have identified the problem ? I?m still checking ? but files with hard links may be our problem. I wrote a simple Perl script to interate over the log file I had mmapplypolicy create. Here?s the code (don?t laugh, I?m a SysAdmin, not a programmer, and I whipped this out in < 5 minutes ? and yes, I realize the fact that I used Perl instead of Python shows my age as well ): #!/usr/bin/perl # use strict; use warnings; my $InputFile = "/tmp/mmapplypolicy.gpfs23.log"; my $TotalFiles = 0; my $TotalLinks = 0; my $TotalSize = 0; open INPUT, $InputFile or die "Couldn\'t open $InputFile for read: $!\n"; while () { next unless /MIGRATED/; $TotalFiles++; my $FileName = (split / /)[3]; if ( -f $FileName ) { # some files may have been deleted since mmapplypolicy ran my ($NumLinks, $FileSize) = (stat($FileName))[3,7]; $TotalLinks += $NumLinks; $TotalSize += $FileSize; } } close INPUT; print "Number of files / links = $TotalFiles / $TotalLinks, Total size = $TotalSize\n"; exit 0; And here?s what it kicked out: Number of files / links = 1620263 / 80818483, Total size = 53966202814094 1.6 million files but 80 million hard links!!! I?m doing some checking right now, but it appears that it is one particular group - and therefore one particular fileset - that is responsible for this ? they?ve got thousands of files with 50 or more hard links each ? and they?re not inconsequential in size. IIRC (and keep in mind I?m far from a GPFS policy guru), there is a way to say something to the effect of ?and the path does not contain /gpfs23/fileset/path? ? may need a little help getting that right. I?ll post this information to the ticket as well but wanted to update the list. This wouldn?t be the first time we were an ?edge case? for something in GPFS? ;-) Thanks... Kevin On Apr 18, 2017, at 10:11 AM, Marc A Kaplan wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. 
We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 14:12:16 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 13:12:16 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> Message-ID: <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Hi All, I think we *may* be able to wrap this saga up? ;-) Dave - in regards to your question, all I know is that the tail end of the log file is ?normal? for all the successful pool migrations I?ve done in the past few years. It looks like the hard links were the problem. We have one group with a fileset on our filesystem that they use for backing up Linux boxes in their lab. That one fileset has thousands and thousands (I haven?t counted, but based on the output of that Perl script I wrote it could well be millions) of files with anywhere from 50 to 128 hard links each ? those files ranged from a few KB to a few MB in size. From what Marc said, my understanding is that with the way I had my policy rule written mmapplypolicy was seeing each of those as separate files and therefore thinking it was moving 50 to 128 times as much space to the gpfs23capacity pool as it really was for those files. Marc can correct me or clarify further if necessary. 
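(For reference, with the clause Marc suggested the first rule would presumably end up looking something like the sketch below - reconstructed from the policy file quoted earlier, not the exact text of the updated file, and the clause placement should be checked against the mmapplypolicy documentation:)

RULE 'OldStuff'
  MIGRATE FROM POOL 'gpfs23data'
  TO POOL 'gpfs23capacity'
  LIMIT(98)
  SIZE(KB_ALLOCATED/NLINK)
  WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584))

(The same SIZE clause would go on the 'INeedThatAfterAll' rule. If one wanted to skip the hard-link-heavy fileset altogether instead, a FOR FILESET('thatfileset') clause or a preceding EXCLUDE rule would presumably be the way to do it.)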
He directed me to add: SIZE(KB_ALLOCATED/NLINK) to both of my migrate rules in my policy file. I did so and kicked off another mmapplypolicy last night, which is still running. However, the prediction section now says: [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 40050141920KB: 2051495 of 2051495 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 104098980256 124983549952 83.290145220% gpfs23data 168478368352 343753326592 49.011414674% system 0 0 0.000000000% (no user data) So now it?s going to move every file it can that matches my policies because it?s figured out that a lot of those are hard links ? and I don?t have enough files matching the criteria to fill the gpfs23capacity pool to the 98% limit like mmapplypolicy thought I did before. According to the log file, it?s happily chugging along migrating files, and mmdf agrees that my gpfs23capacity pool is gradually getting more full (I have it QOSed, of course): Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 25.33T ( 44%) 68.13G ( 0%) eon35Dnsd 58.2T 35 No Yes 25.33T ( 44%) 68.49G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 50.66T ( 44%) 136.6G ( 0%) My sincere thanks to all who took the time to respond to my questions. Of course, that goes double for Marc. We (Vanderbilt) seem to have a long tradition of finding some edge cases in GPFS going all the way back to when we originally moved off of an NFS server to GPFS (2.2, 2.3?) back in 2005. I was creating individual tarballs of each users? home directory on the NFS server, copying the tarball to one of the NSD servers, and untarring it there (don?t remember why we weren?t rsync?ing, but there was a reason). Everything was working just fine except for one user. Every time I tried to untar her home directory on GPFS it barfed part of the way thru ? turns out that until then IBM hadn?t considered that someone would want to put 6 million files in one directory. Gotta love those users! ;-) Kevin On Apr 18, 2017, at 10:31 AM, David D. Johnson > wrote: I have an observation, which may merely serve to show my ignorance: Is it significant that the words "EXTERNAL EXEC/script? are seen below? If migrating between storage pools within the cluster, I would expect the PIT engine to do the migration. When doing HSM (off cluster, tape libraries, etc) is where I would expect to need a script to actually do the work. [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. ? ddj Dave Johnson Brown University On Apr 18, 2017, at 11:11 AM, Marc A Kaplan > wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. 
Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 15:37:29 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 10:37:29 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. 
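(For anyone curious about the mmclone route mentioned above, the basic flow is roughly: freeze a source file into a read-only clone parent, then cut space-efficient copy-on-write copies from it. This is a sketch from memory with made-up paths, so check the mmclone command reference before relying on it:)

# turn an existing image into a read-only clone parent
mmclone snap /gpfs23/images/base.img /gpfs23/images/base.img.parent

# create writable clones that share unchanged blocks with the parent
mmclone copy /gpfs23/images/base.img.parent /gpfs23/images/vm01.img
mmclone copy /gpfs23/images/base.img.parent /gpfs23/images/vm02.img

# show clone attributes (depth, parent) for a file
mmclone show /gpfs23/images/vm01.img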
-------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Apr 19 17:18:50 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 19 Apr 2017 16:18:50 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Hey Marc, I'm having some issues where a simple ILM list policy never completes, but I have yet to open a PMR or enable additional logging. But I was wondering if there are known reasons that this would not complete, such as when there is a symbolic link that creates a loop within the directory structure or something simple like that. Do you know of any cases like this, Marc, that I should try to find in my file systems? Thanks in advance! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, April 19, 2017 9:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Wed Apr 19 17:23:12 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Wed, 19 Apr 2017 19:23:12 +0300 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu><458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Hi Maybe the temp list file - fill the FS that they build on. Try to monitor the FS where the temp filelist is created. 
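(If that does turn out to be the culprit, mmapplypolicy can be pointed at roomier scratch space - roughly as in the sketch below, where -s sets the local work directory and -g a global work directory shared by the helper nodes. The device name, paths and node list here are made up:)

# watch the default temp area while the scan runs
df -h /tmp

# or run the list policy with its work files on bigger scratch space
mmapplypolicy yourfs -P /some/list.policy -I test \
    -s /scratch/policy-tmp \
    -g /yourfs/.policy-workdir \
    -N some,list,of,nodes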
Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Bryan Banister To: gpfsug main discussion list Date: 04/19/2017 07:19 PM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hey Marc, I?m having some issues where a simple ILM list policy never completes, but I have yet to open a PMR or enable additional logging. But I was wondering if there are known reasons that this would not complete, such as when there is a symbolic link that creates a loop within the directory structure or something simple like that. Do you know of any cases like this, Marc, that I should try to find in my file systems? Thanks in advance! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, April 19, 2017 9:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From makaplan at us.ibm.com Wed Apr 19 18:10:28 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 13:10:28 -0400 Subject: [gpfsug-discuss] mmapplypolicy not terminating properly? 
In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu><458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: (Bryan B asked...) Open a PMR. The first response from me will be ... Run the mmapplypolicy command again, except with additional option `-d 017` and collect output with something equivalent to `2>&1 | tee /tmp/save-all-command-output-here-to-be-passed-along-to-IBM-service ` If you are convinced that mmapplypolicy is "looping" or "hung" - wait another 2 minutes, terminate, and then pass along the saved-all-command-output. -d 017 will dump a lot of additional diagnostics -- If you want to narrow it by baby steps we could try `-d 03` first and see if there are enough clues in that. To answer two of your questions: 1. mmapplypolicy does not follow symlinks, so no "infinite loop" possible with symlinks. 2a. loops in directory are file system bugs in GPFS, (in fact in any posixish file system), (mm)fsck! 2b. mmapplypolicy does impose a limit on total length of pathnames, so even if there is a loop in the directory, mmapplypolicy will "trim" the directory walk. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 20:53:42 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 19:53:42 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data Message-ID: Hi All, We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. Now lets just say that you have a little bit of money to spend. Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Wed Apr 19 20:59:18 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 19 Apr 2017 19:59:18 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: Message-ID: Hi I'll give my opinion. Worth what you pay for. 
Do as many as you can, six in this case for the good reason you mentioned. But play with the callbacks so the migration happens on watermarks when it happens. Otherwise you might hit no space till your next policy run. The second is well documented on the redbook AFAIK Cheers -- Cheers > On 19 Apr 2017, at 22.54, Buterbaugh, Kevin L wrote: > > Hi All, > > We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. > > Now lets just say that you have a little bit of money to spend. Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. > > So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. > > Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Apr 19 21:05:49 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 19 Apr 2017 20:05:49 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] Sent: 19 April 2017 20:53 To: gpfsug main discussion list Subject: [gpfsug-discuss] RAID config for SSD's used for data Hi All, We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. Now lets just say that you have a little bit of money to spend. 
Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 From aaron.s.knister at nasa.gov Wed Apr 19 21:13:14 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 19 Apr 2017 16:13:14 -0400 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: On 4/19/17 4:05 PM, Simon Thompson (IT Research Support) wrote: > By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... > > And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) You mean like HAWC but for writes larger than 64K? ;-) Or I guess "HARC" as it might be called for a read cache... -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From luis.bolinches at fi.ibm.com Wed Apr 19 21:20:20 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 19 Apr 2017 20:20:20 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: Message-ID: I assume you are making the joke of external LROC. But not sure I would use external storage for LROC, as the whole point is to have really fast storage as close to the node (L for local) as possible. Maybe those SSD that will get replaced with the fancy external storage? -- Cheers > On 19 Apr 2017, at 23.13, Aaron Knister wrote: > > > >> On 4/19/17 4:05 PM, Simon Thompson (IT Research Support) wrote: >> By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... >> >> And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) > > You mean like HAWC but for writes larger than 64K? ;-) > > Or I guess "HARC" as it might be called for a read cache... > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 21:49:56 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 16:49:56 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 22:12:35 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 21:12:35 +0000 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Hi Marc, But the limitation on GPFS replication is that I can set replication separately for metadata and data, but no matter whether I have one data pool or ten data pools they all must have the same replication, correct? And believe me I *love* GPFS replication ? I would hope / imagine that I am one of the few people on this mailing list who has actually gotten to experience a ?fire scenario? ? electrical fire, chemical suppressant did it?s thing, and everything in the data center had a nice layer of soot, ash, and chemical suppressant on and in it and therefore had to be professionally cleaned. Insurance bought us enough disk space that we could (temporarily) turn on GPFS data replication and clean storage arrays one at a time! But in my current hypothetical scenario I?m stretching the budget just to get that one storage array with 12 x 1.8 TB SSD?s in it. Two are out of the question. My current metadata that I?ve got on SSDs is on RAID 1 mirrors and has GPFS replication set to 2. I thought the multiple RAID 1 mirrors approach was the way to go for SSDs for data as well, as opposed to one big RAID 6 LUN, but wanted to get the advice of those more knowledgeable than me. Thanks! Kevin On Apr 19, 2017, at 3:49 PM, Marc A Kaplan > wrote: As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. 
And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: * Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. * GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Apr 19 22:23:15 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 19 Apr 2017 14:23:15 -0700 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> On 04/19/2017 12:53 PM, Buterbaugh, Kevin L wrote: > > So you?re considering the purchase of a dual-controller FC storage array > with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage > would be in its? own storage pool and that pool would be the default > location for I/O for your main filesystem ? at least for smaller files. > You intend to use mmapplypolicy nightly to move data to / from this > pool and the spinning disk pools. We did this and failed in interesting (but in retrospect obvious) ways. You will want to ensure that your users cannot fill your write target pool within a day. The faster the storage, the more likely that is to happen. Or else your users will get ENOSPC. You will want to ensure that your pools can handle the additional I/O from the migration in aggregate with all the user I/O. Or else your users will see worse performance from the fast pool than the slow pool while the migration is running. You will want to make sure that the write throughput of your slow pool is faster than the read throughput of your fast pool. In our case, the fast pool was undersized in capacity, and oversized in terms of performance. And overall the filesystem was oversubscribed (~100 10GbE clients, 8 x 10GbE NSD servers) So the fast pool would fill very quickly. Then I would switch the placement policy to the big slow pool and performance would drop dramatically, and then if I ran a migration it would either (depending on parameters) take up all the I/O to the slow pool (leaving none for the users), or else take forever (weeks) because the user I/O was maxing out the slow pool. 
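In hindsight, tying the flush to the low-space callback rather than to a fixed nightly cron run would have helped with the ENOSPC side of it. Something along these lines is the usual pattern (an untested sketch; the pool names, callback name and script path here are made up, so check the ILM chapter of the admin guide / redbook for the authoritative syntax):

/* threshold rule that GPFS monitors to raise the low-space event */
RULE 'flush' MIGRATE FROM POOL 'fast' THRESHOLD(85,70) WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME) TO POOL 'slow'

# install the policy with mmchpolicy, then register a callback for the low-space events
# (names and paths below are illustrative only -- verify the options against the docs)
mmaddcallback FLUSHFAST --command /usr/local/sbin/flush_fast_pool.sh --event lowDiskSpace,noDiskSpace --parms "%eventName %fsName"
# flush_fast_pool.sh would basically run: mmapplypolicy $2 --single-instance

That way the migration kicks in when the fast pool actually crosses 85% and runs it back down to 70%, instead of whenever the nightly job happens to fire.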
Things should be better today with QoS stuff, but your relative pool capacities (in our case it was like 1% fast, 99% slow) and your relative pool performance (in our case, the slow pool had fewer IOPS than the fast pool) are still going to matter a lot. -- Alex Chekholko chekh at stanford.edu From makaplan at us.ibm.com Wed Apr 19 22:58:24 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 17:58:24 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Kevin asked: " ... data pools they all must have the same replication, correct?" Actually no! You can use policy RULE ... SET POOL 'x' REPLICATE(2) to set the replication factor when a file is created. Use mmchattr or mmapplypolicy to change the replication factor after creation. You specify the maximum data replication factor when you create the file system (1,2,3), but any given file can have its replication factor set to 1, 2 or 3, up to that maximum. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From kums at us.ibm.com Wed Apr 19 23:03:33 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Wed, 19 Apr 2017 18:03:33 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Hi, >> As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: >>Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. >>This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. As you pointed out, the RAID choices for GPFS may not be simple; we need to take into consideration storage subsystem configuration and capabilities, such as whether all drives are homogeneous or there is a mix of drive types. If all the drives are homogeneous, then create dataAndMetadata NSDs across RAID-6, and if the storage controller supports write-cache + write-cache mirroring (WC + WM) then enable it; WC + WM can alleviate read-modify-write for small writes (typical of metadata). If there is a mix of SSD and HDD (e.g. 15K RPM), then we need to take into consideration the aggregate IOPS of RAID-1 SSD volumes vs. RAID-6 HDDs before separating data and metadata into separate media. For example, if the storage subsystem has 2 x SSDs and ~300 x 15K RPM or NL_SAS HDDs then most likely the aggregate IOPS of the RAID-6 HDD volumes will be higher than that of the RAID-1 SSD volumes. It would also be recommended to assess the I/O performance of the different configurations (dataAndMetadata vs dataOnly/metadataOnly NSDs) with representative application workloads and production scenarios before deploying the final solution. >> GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1).
GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more >>robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. For high-resiliency (for e.g. metadataOnly) and if there are multiple storage across different failure domains (different racks/rooms/DC etc), it will be good to enable BOTH hardware RAID-1 as well as GPFS metadata replication enabled (at the minimum, -m 2). If there is single shared storage for GPFS file-system storage and metadata is separated from data, then RAID-1 would minimize administrative overhead compared to GPFS replication in the event of drive failure (since with GPFS replication across single SSD would require mmdeldisk/mmdelnsd/mmcrnsd/mmadddisk every time disk goes faulty and needs to be replaced). Best, -Kums From: Marc A Kaplan/Watson/IBM at IBMUS To: gpfsug main discussion list Date: 04/19/2017 04:50 PM Subject: Re: [gpfsug-discuss] RAID config for SSD's - potential pitfalls Sent by: gpfsug-discuss-bounces at spectrumscale.org As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 23:41:19 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 18:41:19 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Kums is our performance guru, so weigh that appropriately and relative to my own remarks... Nevertheless, I still think RAID-5or6 is a poor choice for GPFS metadata. The write cache will NOT mitigate the read-modify-write problem of a workload that has a random or hop-scotch access pattern of small writes. In the end you've still got to read and write several times more disk blocks than you actually set out to modify. 
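To put rough numbers on it (illustrative only, and assuming the array falls back to the classic small-write path): updating a single block that is smaller than the RAID-6 stripe means reading the old data strip plus the old P and Q parity strips, then writing the new data, new P and new Q -- six disk I/Os to change one block, versus two writes on a RAID-1 pair. Once the writes are smaller than a full stripe and effectively random, no amount of controller cache changes that arithmetic; it only delays it.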
Same goes for any large amount of data that will be written in a pattern of non-sequential small writes. (Define a small write as less than a full RAID stripe). For sure, non-volatile write caches are a good thing - but not a be all end all solution. Relying on RAID-1 to protect your metadata may well be easier to administer, but still GPFS replication can be more robust. Doing both - belt and suspenders is fine -- if you can afford it. Either is buying 2x storage, both is 4x. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Apr 20 00:16:08 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 19 Apr 2017 23:16:08 +0000 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Message-ID: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Thu Apr 20 01:10:51 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 19 Apr 2017 20:10:51 -0400 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values In-Reply-To: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> References: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> Message-ID: Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. 
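As a back-of-the-envelope check of that idea: the read counter shows -412143552, and reinterpreting the same 32-bit pattern as an unsigned value gives 4294967296 - 412143552 = 3882823744, i.e. roughly 3.9 billion read calls -- quite plausible for a counter that has been accumulating for the ~82 days (7091405 seconds) shown in the header -- whereas the signed interpretation goes negative as soon as the count passes 2^31 (2147483648). The counters that still look sane (e.g. close at 1460175509) are simply the ones that have not crossed that boundary yet.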
To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Thu Apr 20 01:21:04 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 20 Apr 2017 00:21:04 +0000 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Message-ID: Hi Eric Looks like your assumption is correct - no negative values from ?mmfsadm eventsExporter mmpmon vfss?. I don?t normally view these via ?mmfsadm?, I use the zimon stats. But, It?s a bug that should be fixed. What?s the best way to get this fixed? root at cnt-r01r07u15 ~]# mmfsadm eventsExporter mmpmon vfss _response_ begin mmpmon vfss _mmpmon::vfss_ _n_ 10.30.100.193 _nn_ cnt-r01r07u15 _rc_ 0 _t_ 1492647309 _tu_ 311964 _access_ 8472897 56529.874886 _close_ 1460223848 49854.938090 _create_ 2101927 155055.515041 _fclear_ 0 0.000000 _fsync_ 20 0.024288 _fsync_range_ 0 0.000000 _ftrunc_ 0 0.000000 _getattr_ 859626332 101183.720281 _link_ 2175473 625.343799 _lockctl_ 17326 5.229828 _lookup_ 200378610 1201985.264220 _map_lloff_ 220854519 8561.860515 _mkdir_ 817943 217390.170859 _mknod_ 3 0.001422 _open_ 1460217712 134812.649162 _read_ 3883163461 3971457.463527 _write_ 186078410 137927.496812 _mmapRead_ 17108947 10665.929860 _mmapWrite_ 0 0.000000 _aioRead_ 0 0.000000 _aioWrite_ 0 0.000000 _readdir_ 142262897 6999.189450 _readlink_ 485337171 2111.634286 _readpage_ 3646233600 14346.331414 _remove_ 4241324 93277.463798 _rename_ 350679 19334.235924 _rmdir_ 342042 2736.048976 _setacl_ 0 0.000000 _setattr_ 3709289 16963.901179 _symlink_ 161336 8522.670079 _unmap_ 3929805828 1735.740690 _writepage_ 0 0.000000 _tsfattr_ 0 0.000000 _tsfsattr_ 0 0.000000 _flock_ 0 0.000000 _setxattr_ 119 0.001042 _getxattr_ 4077218348 628418.213008 _listxattr_ 0 0.000000 _removexattr_ 15 0.000042 _encode_fh_ 0 0.000000 _decode_fh_ 0 0.000000 _get_dentry_ 0 0.000000 _get_parent_ 0 0.000000 _mount_ 0 0.000000 _statfs_ 2625497 214.309671 _sync_ 0 0.000000 _vget_ 0 0.000000 _response_ end Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Wednesday, April 19, 2017 at 7:10 PM To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. 
If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Thu Apr 20 02:03:16 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 19 Apr 2017 21:03:16 -0400 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values In-Reply-To: References: Message-ID: Thanks Bob. Yes, it looks good for the hypothesis. ZIMon gets its VFSS stats from the mmpmon code that we just exercised with "mmfsadm eventsExporter mmpmon vfss"; so the ZIMon stats are also probably correct. Having said that, I agree with you that the "mmfsadm vfsstats" problem is a bug that should be fixed. If you would like to open a PMR so an APAR gets generated, it might help speed the routing of the PMR if you include in the PMR text our email exchange, and highlight Eric Agar is the GPFS developer with whom you've already discussed this issue. You could also mention that I believe I have no need for a gpfs snap. Having an APAR will help ensure the fix makes it into a PTF for the release you are using. 
If you do not want to open a PMR, I still intend to fix the problem in the development stream. Thanks again. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Cc: IBM Spectrum Scale/Poughkeepsie/IBM at IBMUS Date: 04/19/2017 08:21 PM Subject: Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Hi Eric Looks like your assumption is correct - no negative values from ?mmfsadm eventsExporter mmpmon vfss?. I don?t normally view these via ?mmfsadm?, I use the zimon stats. But, It?s a bug that should be fixed. What?s the best way to get this fixed? root at cnt-r01r07u15 ~]# mmfsadm eventsExporter mmpmon vfss _response_ begin mmpmon vfss _mmpmon::vfss_ _n_ 10.30.100.193 _nn_ cnt-r01r07u15 _rc_ 0 _t_ 1492647309 _tu_ 311964 _access_ 8472897 56529.874886 _close_ 1460223848 49854.938090 _create_ 2101927 155055.515041 _fclear_ 0 0.000000 _fsync_ 20 0.024288 _fsync_range_ 0 0.000000 _ftrunc_ 0 0.000000 _getattr_ 859626332 101183.720281 _link_ 2175473 625.343799 _lockctl_ 17326 5.229828 _lookup_ 200378610 1201985.264220 _map_lloff_ 220854519 8561.860515 _mkdir_ 817943 217390.170859 _mknod_ 3 0.001422 _open_ 1460217712 134812.649162 _read_ 3883163461 3971457.463527 _write_ 186078410 137927.496812 _mmapRead_ 17108947 10665.929860 _mmapWrite_ 0 0.000000 _aioRead_ 0 0.000000 _aioWrite_ 0 0.000000 _readdir_ 142262897 6999.189450 _readlink_ 485337171 2111.634286 _readpage_ 3646233600 14346.331414 _remove_ 4241324 93277.463798 _rename_ 350679 19334.235924 _rmdir_ 342042 2736.048976 _setacl_ 0 0.000000 _setattr_ 3709289 16963.901179 _symlink_ 161336 8522.670079 _unmap_ 3929805828 1735.740690 _writepage_ 0 0.000000 _tsfattr_ 0 0.000000 _tsfsattr_ 0 0.000000 _flock_ 0 0.000000 _setxattr_ 119 0.001042 _getxattr_ 4077218348 628418.213008 _listxattr_ 0 0.000000 _removexattr_ 15 0.000042 _encode_fh_ 0 0.000000 _decode_fh_ 0 0.000000 _get_dentry_ 0 0.000000 _get_parent_ 0 0.000000 _mount_ 0 0.000000 _statfs_ 2625497 214.309671 _sync_ 0 0.000000 _vget_ 0 0.000000 _response_ end Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Wednesday, April 19, 2017 at 7:10 PM To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. 
To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Thu Apr 20 09:11:15 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 20 Apr 2017 10:11:15 +0200 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: Some thoughts: you give typical cumulative usage values. However, a fast pool might matter most for spikes of the traffic. Do you have spikes driving your current system to the edge? Then: using the SSD pool for writes is straightforward (placement), using it for reads will only pay off if data are either pre-fetched to the pool somehow, or read more than once before getting migrated back to the HDD pool(s). Write traffic is less than read as you wrote. RAID1 vs RAID6: RMW penalty of parity-based RAIDs was mentioned, which strikes at writes smaller than the full stripe width of your RAID - what type of write I/O do you have (or expect)? (This may also be important for choosing the quality of SSDs, with RMW in mind you will have a comparably huge amount of data written on the SSD devices if your I/O traffic consists of myriads of small IOs and you organized the SSDs in a RAID5 or RAID6) I suppose your current system is well set to provide the required aggregate throughput. Now, what kind of improvement do you expect? How are the clients connected? Would they have sufficient network bandwidth to see improvements at all? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 gpfsug-discuss-bounces at spectrumscale.org wrote on 04/19/2017 09:53:42 PM: > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/19/2017 09:54 PM > Subject: [gpfsug-discuss] RAID config for SSD's used for data > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > Hi All, > > We currently have what I believe is a fairly typical setup ? > metadata for our GPFS filesystems is the only thing in the system > pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). > Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB > usable space. > > Now lets just say that you have a little bit of money to spend. > Your I/O demands aren?t great - in fact, they?re way on the low end > ? typical (cumulative) usage is 200 - 600 MB/sec read, less than > that for writes. But while GPFS has always been great and therefore > you don?t need to Make GPFS Great Again, you do want to provide your > users with the best possible environment. > > So you?re considering the purchase of a dual-controller FC storage > array with 12 or so 1.8 TB SSD?s in it, with the idea being that > that storage would be in its? own storage pool and that pool would > be the default location for I/O for your main filesystem ? at least > for smaller files. You intend to use mmapplypolicy nightly to move > data to / from this pool and the spinning disk pools. > > Given all that ? 
would you configure those disks as 6 RAID 1 mirrors > and have 6 different primary NSD servers or would it be feasible to > configure one big RAID 6 LUN? I?m thinking the latter is not a good > idea as there could only be one primary NSD server for that one LUN, > but given that: 1) I have no experience with this, and 2) I have > been wrong once or twice before (), I?m looking for advice. Thanks! > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jonathan at buzzard.me.uk Thu Apr 20 10:25:40 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 20 Apr 2017 10:25:40 +0100 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> References: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> Message-ID: <1492680340.4102.120.camel@buzzard.me.uk> On Wed, 2017-04-19 at 14:23 -0700, Alex Chekholko wrote: > On 04/19/2017 12:53 PM, Buterbaugh, Kevin L wrote: > > > > So you?re considering the purchase of a dual-controller FC storage array > > with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage > > would be in its? own storage pool and that pool would be the default > > location for I/O for your main filesystem ? at least for smaller files. > > You intend to use mmapplypolicy nightly to move data to / from this > > pool and the spinning disk pools. > > We did this and failed in interesting (but in retrospect obvious) ways. > You will want to ensure that your users cannot fill your write target > pool within a day. The faster the storage, the more likely that is to > happen. Or else your users will get ENOSPC. Eh? Seriously you should have a fail over rule so that when your "fast" pool is filled up it starts allocating in the "slow" pool (nice good names that are descriptive and less than 8 characters including termination character). Now there are issues when you get close to very full so you need to set the fail over to as sizeable bit less than the full size, 95% is a good starting point. The pool names size is important because if the fast pool is less than eight characters and the slow is more because you called in "nearline" (which is 9 including termination character) once the files get moved they get backed up again by TSM, yeah!!! The 95% bit comes about from this. Imagine you had 12KB left in the fast pool and you go to write a file. You open the file with 0B in size and then start writing. At 12KB you run out of space in the fast pool and as the file can only be in one pool you get a ENOSPC, and the file gets canned. This then starts repeating on a regular basis. So if you start allocating at significantly less than 100%, say 95% where that 5% is larger than the largest file you expect that file works, but all subsequent files get allocated in the slow pool, till you flush the fast pool. Something like this as the last two rules in your policy should do the trick. /* by default new files to the fast disk unless full, then to slow */ RULE 'new' SET POOL 'fast' LIMIT(95) RULE 'spillover' SET POOL 'slow' However in general your fast pool needs to have sufficient capacity to take your daily churn and then some. JAB. -- Jonathan A. 
Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From jonathan at buzzard.me.uk Thu Apr 20 10:32:20 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 20 Apr 2017 10:32:20 +0100 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: <1492680740.4102.126.camel@buzzard.me.uk> On Wed, 2017-04-19 at 20:05 +0000, Simon Thompson (IT Research Support) wrote: > By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... > > And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) > If you have sized the "fast" pool correctly then the "slow" pool will be spending most of its time doing diddly squat, aka under 10 IOPS, unless you are flushing the pool of old files to make space. I have graphs that show this. Then one of two things happens. If you are just reading the file then fine: it is probably coming from the cache, or the disks are not very busy anyway, so you won't notice. If you happen to *change* the file and start doing things actively with it again, the changed version ends up on the fast disk by virtue of being a new file, because most programs approach this by creating an entirely new file with a temporary name and then doing a rename-and-delete shuffle, so that a crash will leave you with a valid file somewhere. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From p.childs at qmul.ac.uk Thu Apr 20 12:38:09 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 20 Apr 2017 11:38:09 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Simon, We've managed to resolve this issue by switching off quotas, switching them back on again and rebuilding the quota file. Can I check whether you run quotas on your cluster? See you in 2 weeks in Manchester. Thanks in advance. Peter Childs Research Storage Expert ITS Research Infrastructure Queen Mary, University of London Phone: 020 7882 8393 ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support) Sent: Tuesday, April 11, 2017 4:55:35 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone; maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so it might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so it's not some client trying to talk to it; maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc).
I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. 
>_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From kenneth.waegeman at ugent.be Thu Apr 20 15:53:29 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Thu, 20 Apr 2017 16:53:29 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Message-ID: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for > writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on > the order of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that > reducing maxMBpS from 10000 to 100 fixed the problem! Digging further > I found that reducing prefetchThreads from default=72 to 32 also fixed > it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > >: > > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the > load up on one socket, you push all the interrupt handling to the > other socket where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org > > [gpfsug-discuss-bounces at spectrumscale.org > ] on behalf of > Aaron Knister [aaron.s.knister at nasa.gov > ] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going > out to > > the clients. I was having a really hard time getting anything > resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do > better than > > that. 
> > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load > I saw > > an almost 4x performance jump which is pretty much goes against > every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated > crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling > shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 > processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I > still have > > to run something to drive up the CPU load and then performance > improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm > curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Apr 20 16:04:20 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Thu, 20 Apr 2017 15:04:20 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> , <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). 
We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
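Since the C1E / frequency-scaling theory keeps coming up: a minimal sketch of how one might check it on an NSD server (this assumes the cpupower utility from the kernel-tools package is installed; the commands are purely illustrative):

    # current governor, available frequencies and idle (C-)states
    cpupower frequency-info
    cpupower idle-info
    # watch the cores while a sequential read test is running from a client
    cpupower monitor
    # temporarily pin the governor to 'performance' to rule scaling out
    cpupower frequency-set -g performance

If the cores only reach their nominal frequency while something like the "openssl speed" trick above is running, the power-management setup (governor, BIOS profile, C1E) is the likely culprit rather than GPFS itself.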
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Thu Apr 20 16:07:32 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 20 Apr 2017 17:07:32 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: Hi Kennmeth, is prefetching off or on at your storage backend? Raw sequential is very different from GPFS sequential at the storage device ! GPFS does its own prefetching, the storage would never know what sectors sequential read at GPFS level maps to at storage level! Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/20/2017 04:53 PM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. 
After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
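For reference, a minimal sketch of how the two knobs Jan-Frode mentions could be tried with mmchconfig. The values are simply the ones quoted above, not recommendations, the node name is a placeholder, and whether a change needs a GPFS restart depends on the release:

    # show what is currently set
    mmlsconfig maxMBpS
    mmlsconfig prefetchThreads
    # try a much lower maxMBpS on the nodes being tested
    mmchconfig maxMBpS=100 -i -N testclient01
    # or reduce prefetchThreads instead; this one typically needs mmshutdown/mmstartup
    mmchconfig prefetchThreads=32 -N testclient01

Running mmdiag --config on the node afterwards confirms what the daemon is actually running with.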
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From marcusk at nz1.ibm.com Fri Apr 21 02:21:51 2017 From: marcusk at nz1.ibm.com (Marcus Koenig1) Date: Fri, 21 Apr 2017 14:21:51 +1300 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: Hi Kennmeth, we also had similar performance numbers in our tests. Native was far quicker than through GPFS. When we learned though that the client tested the performance on the FS at a big blocksize (512k) with small files - we were able to speed it up significantly using a smaller FS blocksize (obviously we had to recreate the FS). So really depends on how you do your tests. Cheers, Marcus Koenig Lab Services Storage & Power Specialist IBM Australia & New Zealand Advanced Technical Skills IBM Systems-Hardware |---------------+------------------------------------------+--------------------------------------------------------------------------------> | | | | |---------------+------------------------------------------+--------------------------------------------------------------------------------> >--------------------------------------------------------------------------------| | | >--------------------------------------------------------------------------------| |---------------+------------------------------------------+--------------------------------------------------------------------------------> | | | | | |Mobile: +64 21 67 34 27 | | | |E-mail: marcusk at nz1.ibm.com | | | | | | | | | | | | | | | |82 Wyndham Street | | | |Auckland, AUK 1010 | | | |New Zealand | | | | | | | | | | | | | | | | | | | | | | |---------------+------------------------------------------+--------------------------------------------------------------------------------> >--------------------------------------------------------------------------------| | | >--------------------------------------------------------------------------------| |---------------+------------------------------------------+--------------------------------------------------------------------------------> | | | | |---------------+------------------------------------------+--------------------------------------------------------------------------------> >--------------------------------------------------------------------------------| | | >--------------------------------------------------------------------------------| From: "Uwe Falke" To: gpfsug main discussion list Date: 
04/21/2017 03:07 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Kennmeth, is prefetching off or on at your storage backend? Raw sequential is very different from GPFS sequential at the storage device ! GPFS does its own prefetching, the storage would never know what sectors sequential read at GPFS level maps to at storage level! Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/20/2017 04:53 PM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) 
> > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 17773863.gif Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 17405449.jpg Type: image/jpeg Size: 2741 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 17997200.gif Type: image/gif Size: 13421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Fri Apr 21 08:25:22 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 21 Apr 2017 09:25:22 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 2741 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 13421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From kenneth.waegeman at ugent.be Fri Apr 21 10:43:25 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:43:25 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> Message-ID: <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! 
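As a pointer for anyone wanting to run the same test: nsdperf ships as source (nsdperf.C) under /usr/lpp/mmfs/samples/net and has to be compiled first, per the notes in that directory. The flow below is only a rough sketch of the wiki page linked earlier, with testclient01 as a placeholder client name, so check that page for the authoritative syntax:

    ./nsdperf -s          # server mode, started on nsd00 and nsd02
    ./nsdperf             # interactive mode, on a test client
    > server nsd00 nsd02
    > client testclient01
    > test
    > quit

This measures the GPFS network path without touching the disks, which helps separate a network/verbs problem from a storage or prefetch problem.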
K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Interesting. Could you share a little more about your architecture? Is > it possible to mount the fs on an NSD server and do some dd's from the > fs on the NSD server? If that gives you decent performance perhaps try > NSDPERF next > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf > > > -Aaron > > > > > On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman > wrote: >> >> Hi, >> >> >> Having an issue that looks the same as this one: >> >> We can do sequential writes to the filesystem at 7,8 GB/s total , >> which is the expected speed for our current storage >> backend. While we have even better performance with sequential reads >> on raw storage LUNS, using GPFS we can only reach 1GB/s in total >> (each nsd server seems limited by 0,5GB/s) independent of the number >> of clients >> (1,2,4,..) or ways we tested (fio,dd). We played with blockdev >> params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as >> discussed in this thread, but nothing seems to impact this read >> performance. >> >> Any ideas? >> >> Thanks! >> >> Kenneth >> >> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>> I just had a similar experience from a sandisk infiniflash system >>> SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for >>> writes. and 250-300 Mbyte/s on sequential reads!! Random reads were >>> on the order of 2 Gbyte/s. >>> >>> After a bit head scratching snd fumbling around I found out that >>> reducing maxMBpS from 10000 to 100 fixed the problem! Digging >>> further I found that reducing prefetchThreads from default=72 to 32 >>> also fixed it, while leaving maxMBpS at 10000. Can now also read at >>> 3,2 GByte/s. >>> >>> Could something like this be the problem on your box as well? >>> >>> >>> >>> -jf >>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >>> >: >>> >>> Well, I'm somewhat scrounging for hardware. This is in our test >>> environment :) And yep, it's got the 2U gpu-tray in it although even >>> without the riser it has 2 PCIe slots onboard (excluding the >>> on-board >>> dual-port mezz card) so I think it would make a fine NSD server even >>> without the riser. >>> >>> -Aaron >>> >>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT >>> Services) >>> wrote: >>> > Maybe its related to interrupt handlers somehow? You drive the >>> load up on one socket, you push all the interrupt handling to >>> the other socket where the fabric card is attached? >>> > >>> > Dunno ... (Though I am intrigued you use idataplex nodes as >>> NSD servers, I assume its some 2U gpu-tray riser one or something !) >>> > >>> > Simon >>> > ________________________________________ >>> > From: gpfsug-discuss-bounces at spectrumscale.org >>> >>> [gpfsug-discuss-bounces at spectrumscale.org >>> ] on behalf of >>> Aaron Knister [aaron.s.knister at nasa.gov >>> ] >>> > Sent: 17 February 2017 15:52 >>> > To: gpfsug main discussion list >>> > Subject: [gpfsug-discuss] bizarre performance behavior >>> > >>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>> > connections coming in and 1x FDR10 and 1x QDR connection going >>> out to >>> > the clients. I was having a really hard time getting anything >>> resembling >>> > sensible performance out of it (4-5Gb/s writes but maybe >>> 1.2Gb/s for >>> > reads). The back-end is a DDN SFA12K and I *know* it can do >>> better than >>> > that. 
>>> > >>> > I don't remember quite how I figured this out but simply by >>> running >>> > "openssl speed -multi 16" on the nsd server to drive up the >>> load I saw >>> > an almost 4x performance jump which is pretty much goes >>> against every >>> > sysadmin fiber in me (i.e. "drive up the cpu load with >>> unrelated crap to >>> > quadruple your i/o performance"). >>> > >>> > This feels like some type of C-states frequency scaling >>> shenanigans that >>> > I haven't quite ironed down yet. I booted the box with the >>> following >>> > kernel parameters "intel_idle.max_cstate=0 >>> processor.max_cstate=0" which >>> > didn't seem to make much of a difference. I also tried setting the >>> > frequency governer to userspace and setting the minimum >>> frequency to >>> > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I >>> still have >>> > to run something to drive up the CPU load and then performance >>> improves. >>> > >>> > I'm wondering if this could be an issue with the C1E state? >>> I'm curious >>> > if anyone has seen anything like this. The node is a dx360 M4 >>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>> > >>> > -Aaron >>> > >>> > -- >>> > Aaron Knister >>> > NASA Center for Climate Simulation (Code 606.2) >>> > Goddard Space Flight Center >>> > (301) 286-2776 >>> > _______________________________________________ >>> > gpfsug-discuss mailing list >>> > gpfsug-discuss at spectrumscale.org >>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > _______________________________________________ >>> > gpfsug-discuss mailing list >>> > gpfsug-discuss at spectrumscale.org >>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 10:50:55 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:50:55 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <2b0824a1-e1a2-8dd8-4a55-a57d7b00e09f@ugent.be> Hi, prefetching was already disabled at our storage backend, but a good thing to recheck :) thanks! On 20/04/17 17:07, Uwe Falke wrote: > Hi Kennmeth, > > is prefetching off or on at your storage backend? > Raw sequential is very different from GPFS sequential at the storage > device ! > GPFS does its own prefetching, the storage would never know what sectors > sequential read at GPFS level maps to at storage level! > > > Mit freundlichen Gr??en / Kind regards > > > Dr. 
Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Andreas Hasse, Thorsten Moehring > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Kenneth Waegeman > To: gpfsug main discussion list > Date: 04/20/2017 04:53 PM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > > Having an issue that looks the same as this one: > We can do sequential writes to the filesystem at 7,8 GB/s total , which is > the expected speed for our current storage > backend. While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd > server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in > this thread, but nothing seems to impact this read performance. > Any ideas? > Thanks! > > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. > and 250-300 Mbyte/s on sequential reads!! Random reads were on the order > of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that reducing > maxMBpS from 10000 to 100 fixed the problem! Digging further I found that > reducing prefetchThreads from default=72 to 32 also fixed it, while > leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > : > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: >> Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? >> Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, > I assume its some 2U gpu-tray riser one or something !) >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] >> Sent: 17 February 2017 15:52 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] bizarre performance behavior >> >> This is a good one. I've got an NSD server with 4x 16GB fibre >> connections coming in and 1x FDR10 and 1x QDR connection going out to >> the clients. 
I was having a really hard time getting anything resembling >> sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for >> reads). The back-end is a DDN SFA12K and I *know* it can do better than >> that. >> >> I don't remember quite how I figured this out but simply by running >> "openssl speed -multi 16" on the nsd server to drive up the load I saw >> an almost 4x performance jump which is pretty much goes against every >> sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to >> quadruple your i/o performance"). >> >> This feels like some type of C-states frequency scaling shenanigans that >> I haven't quite ironed down yet. I booted the box with the following >> kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which >> didn't seem to make much of a difference. I also tried setting the >> frequency governer to userspace and setting the minimum frequency to >> 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have >> to run something to drive up the CPU load and then performance improves. >> >> I'm wondering if this could be an issue with the C1E state? I'm curious >> if anyone has seen anything like this. The node is a dx360 M4 >> (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From kenneth.waegeman at ugent.be Fri Apr 21 10:52:58 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:52:58 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be> Hi, Tried these settings, but sadly I'm not seeing any changes. Thanks, Kenneth On 21/04/17 09:25, Olaf Weiser wrote: > pls check > workerThreads (assuming you 're > 4.2.2) start with 128 .. increase > iteratively > pagepool at least 8 G > ignorePrefetchLunCount=yes (1) > > then you won't see a difference and GPFS is as fast or even faster .. 
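For anyone who wants to try the same combination, a minimal sketch of Olaf's suggestion in mmchconfig form. The values are the ones he quotes; exact parameter spelling and restart requirements should be checked against the mmchconfig documentation for your release, and 'nsdNodes' / 'testClients' stand in for whatever node classes or node lists apply:

    mmchconfig workerThreads=128 -N nsdNodes,testClients
    mmchconfig pagepool=8G -N nsdNodes,testClients
    mmchconfig ignorePrefetchLUNCount=yes -N nsdNodes
    # then recheck what the daemon actually picked up
    mmdiag --config | egrep -i 'workerThreads|pagepool|ignorePrefetch'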
> > > > From: "Marcus Koenig1" > To: gpfsug main discussion list > Date: 04/21/2017 03:24 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi Kennmeth, > > we also had similar performance numbers in our tests. Native was far > quicker than through GPFS. When we learned though that the client > tested the performance on the FS at a big blocksize (512k) with small > files - we were able to speed it up significantly using a smaller FS > blocksize (obviously we had to recreate the FS). > > So really depends on how you do your tests. > > *Cheers,* > * > Marcus Koenig* > Lab Services Storage & Power Specialist/ > IBM Australia & New Zealand Advanced Technical Skills/ > IBM Systems-Hardware > ------------------------------------------------------------------------ > > *Mobile:*+64 21 67 34 27* > E-mail:*_marcusk at nz1.ibm.com_ > > 82 Wyndham Street > Auckland, AUK 1010 > New Zealand > > > > > > > > > > Inactive hide details for "Uwe Falke" ---04/21/2017 03:07:48 AM---Hi > Kennmeth, is prefetching off or on at your storage backe"Uwe Falke" > ---04/21/2017 03:07:48 AM---Hi Kennmeth, is prefetching off or on at > your storage backend? > > From: "Uwe Falke" > To: gpfsug main discussion list > Date: 04/21/2017 03:07 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Hi Kennmeth, > > is prefetching off or on at your storage backend? > Raw sequential is very different from GPFS sequential at the storage > device ! > GPFS does its own prefetching, the storage would never know what sectors > sequential read at GPFS level maps to at storage level! > > > Mit freundlichen Gr??en / Kind regards > > > Dr. Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Andreas Hasse, Thorsten Moehring > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Kenneth Waegeman > To: gpfsug main discussion list > Date: 04/20/2017 04:53 PM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > > Having an issue that looks the same as this one: > We can do sequential writes to the filesystem at 7,8 GB/s total , > which is > the expected speed for our current storage > backend. While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd > server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in > this thread, but nothing seems to impact this read performance. > Any ideas? > Thanks! 
> > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. > and 250-300 Mbyte/s on sequential reads!! Random reads were on the order > of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that reducing > maxMBpS from 10000 to 100 fixed the problem! Digging further I found that > reducing prefetchThreads from default=72 to 32 also fixed it, while > leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, > I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 2741 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 13421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 105 bytes Desc: not available URL: From makaplan at us.ibm.com Fri Apr 21 13:58:26 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Fri, 21 Apr 2017 08:58:26 -0400 Subject: [gpfsug-discuss] bizarre performance behavior - prefetchThreads In-Reply-To: <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be> Message-ID: Seems counter-logical, but we have testimony that you may need to reduce the prefetchThreads parameter. Of all the parameters, that's the one that directly affects prefetching, so worth trying. Jan-Frode Myklebust wrote: ...Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s.... I can speculate that having prefetchThreads to high may create a contention situation where more threads causes overall degradation in system performance. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From aaron.s.knister at nasa.gov Fri Apr 21 14:10:49 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Fri, 21 Apr 2017 13:10:49 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? 
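A short sketch of the local read test being asked for here, run directly on one of the NSD servers against its own GPFS mount (paths are placeholders; bs should be at least the filesystem block size):

    # write a large test file first, then read it back bypassing the page cache
    dd if=/dev/zero of=/gpfs/fs0/ddtest bs=16M count=2048 oflag=direct
    dd if=/gpfs/fs0/ddtest of=/dev/null bs=16M iflag=direct

If the NSD server can read at close to the raw-LUN speed locally but the clients cannot, the bottleneck is somewhere between the NSD server and the client (network/verbs or client-side prefetch) rather than in the storage backend.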
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister <aaron.s.knister at nasa.gov>: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. 
I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Fri Apr 21 14:18:34 2017 From: david_johnson at brown.edu (David D Johnson) Date: Fri, 21 Apr 2017 09:18:34 -0400 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <02C0BD31-E743-4F1C-91E7-20555099CBF5@brown.edu> We had some luck making the client and server IB performance consistently decent by configuring tuned with the profile "latency-performance". The key is the line /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=1 which prevents cpu from going to sleep just when the next burst of IB traffic is about to arrive. -- ddj Dave Johnson Brown University CCV On Apr 21, 2017, at 9:10 AM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > > Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. > > A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: >> Hi, >> We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. 
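On the tuned "latency-performance" suggestion above, a minimal sketch of applying and verifying it (tuned must be installed and running; the cpu_dma_latency line is the one Dave quotes from the profile, shown here only to illustrate what the profile does):

    tuned-adm profile latency-performance
    tuned-adm active
    # the profile keeps cpu_dma_latency pinned low, e.g. via
    # /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=1

This is worth applying on both the NSD servers and the test clients before re-running the dd/fio/nsdperf tests, since a core dropping into a deep sleep state on either side of a verbs connection can cap the whole stream.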
>> We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. >> When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. >> >> I'll use the nsdperf tool to see if we can find anything, >> >> thanks! >> >> K >> >> On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: >>> Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf >>> >>> -Aaron >>> >>> >>> >>> >>> On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: >>>> Hi, >>>> >>>> >>>> Having an issue that looks the same as this one: >>>> >>>> We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage >>>> backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>>> (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. >>>> Any ideas? >>>> >>>> Thanks! >>>> >>>> Kenneth >>>> >>>> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>>>> I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. >>>>> >>>>> After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. >>>>> >>>>> Could something like this be the problem on your box as well? >>>>> >>>>> >>>>> >>>>> -jf >>>>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister < aaron.s.knister at nasa.gov >: >>>>> Well, I'm somewhat scrounging for hardware. This is in our test >>>>> environment :) And yep, it's got the 2U gpu-tray in it although even >>>>> without the riser it has 2 PCIe slots onboard (excluding the on-board >>>>> dual-port mezz card) so I think it would make a fine NSD server even >>>>> without the riser. >>>>> >>>>> -Aaron >>>>> >>>>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) >>>>> wrote: >>>>> > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? >>>>> > >>>>> > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) 
>>>>> > >>>>> > Simon >>>>> > ________________________________________ >>>>> > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org ] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov ] >>>>> > Sent: 17 February 2017 15:52 >>>>> > To: gpfsug main discussion list >>>>> > Subject: [gpfsug-discuss] bizarre performance behavior >>>>> > >>>>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>>>> > connections coming in and 1x FDR10 and 1x QDR connection going out to >>>>> > the clients. I was having a really hard time getting anything resembling >>>>> > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for >>>>> > reads). The back-end is a DDN SFA12K and I *know* it can do better than >>>>> > that. >>>>> > >>>>> > I don't remember quite how I figured this out but simply by running >>>>> > "openssl speed -multi 16" on the nsd server to drive up the load I saw >>>>> > an almost 4x performance jump which is pretty much goes against every >>>>> > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to >>>>> > quadruple your i/o performance"). >>>>> > >>>>> > This feels like some type of C-states frequency scaling shenanigans that >>>>> > I haven't quite ironed down yet. I booted the box with the following >>>>> > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which >>>>> > didn't seem to make much of a difference. I also tried setting the >>>>> > frequency governer to userspace and setting the minimum frequency to >>>>> > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have >>>>> > to run something to drive up the CPU load and then performance improves. >>>>> > >>>>> > I'm wondering if this could be an issue with the C1E state? I'm curious >>>>> > if anyone has seen anything like this. The node is a dx360 M4 >>>>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>>>> > >>>>> > -Aaron >>>>> > >>>>> > -- >>>>> > Aaron Knister >>>>> > NASA Center for Climate Simulation (Code 606.2) >>>>> > Goddard Space Flight Center >>>>> > (301) 286-2776 >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kums at us.ibm.com Fri Apr 21 15:01:33 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 21 Apr 2017 14:01:33 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) Turbo Mode - Enable QPI Link Frequency - Max Performance Operating Mode - Maximum Performance >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" To: gpfsug main discussion list Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? 
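To make the suggestions above concrete, a sketch of the checks in command form; the file name and node list are placeholders, and the mmchconfig values are simply the ones Jan-Frode reported earlier in the thread, not a recommendation:

    # Confirm verbs RDMA is up on the client and the NSD servers, and that the dd traffic really uses it
    mmfsadm test verbs status
    mmfsadm test verbs conn
    # Local sequential read of the mounted filesystem on an NSD server
    # (use a file larger than RAM, or drop caches first, so you measure disk rather than pagecache)
    echo 3 > /proc/sys/vm/drop_caches
    dd if=/gpfs/fs0/bigfile of=/dev/null bs=16M
    # The maxMBpS / prefetchThreads experiment from earlier in the thread
    # (prefetchThreads only takes effect after restarting GPFS on those nodes)
    mmchconfig maxMBpS=100 -N <client_nodes>
    mmchconfig prefetchThreads=32 -N <client_nodes>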
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. 
I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From bbanister at jumptrading.com Fri Apr 21 16:01:54 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 21 Apr 2017 15:01:54 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <7dcbac92e19043faa7968702d852668f@jumptrading.com> I think we have a new topic and new speaker for the next UG meeting at SC! Kums presenting "Performance considerations for Spectrum Scale"!! Kums, I have to say you do have a lot to offer here... 
;o) -Bryan Disclaimer: There are some selfish reasons of me wanting to hang out with you again involved in this suggestion From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kumaran Rajaram Sent: Friday, April 21, 2017 9:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] bizarre performance behavior Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) * Turbo Mode - Enable * QPI Link Frequency - Max Performance * Operating Mode - Maximum Performance * >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). * [cid:image001.gif at 01D2BA86.4D4B4C10] [cid:image002.gif at 01D2BA86.4D4B4C10] [cid:image003.gif at 01D2BA86.4D4B4C10] Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" > To: gpfsug main discussion list > Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? 
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. 
I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 61023 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 85131 bytes Desc: image002.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 84819 bytes Desc: image003.gif URL: From g.mangeot at gmail.com Fri Apr 21 16:04:58 2017 From: g.mangeot at gmail.com (Guillaume Mangeot) Date: Fri, 21 Apr 2017 17:04:58 +0200 Subject: [gpfsug-discuss] HA on snapshot scheduling in GPFS GUI Message-ID: Hi, I'm looking for a way to get the GUI working in HA to schedule snapshots. I have 2 servers with gpfs.gui service running on them. I checked a bit with lssnaprule in /usr/lpp/mmfs/gui/cli and the file /var/lib/mmfs/gui/snapshots.json But it doesn't look to be shared between all the GUI servers. 
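A quick way to confirm that observation across both GUI nodes, sketched with placeholder host names; the paths are the ones named above, and calling lssnaprule with no arguments to list the rules is an assumption:

    for n in gui-node1 gui-node2; do
        echo "== $n =="
        ssh $n /usr/lpp/mmfs/gui/cli/lssnaprule        # which snapshot rules this GUI instance knows about
        ssh $n md5sum /var/lib/mmfs/gui/snapshots.json # does the local schedule state differ between nodes?
    done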
Is there a way to get GPFS GUI working in HA to schedule snapshots? (keeping the coherency: avoiding to trigger snapshots on both servers in the same time) Regards, Guillaume Mangeot DDN Storage -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 16:33:16 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 17:33:16 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <41475044-c195-5561-c94a-b54ee30c7e68@ugent.be> On 21/04/17 15:10, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Fantastic news! It might also be worth running "cpupower monitor" or > "turbostat" on your NSD servers while you're running dd tests from the > clients to see what CPU frequency your cores are actually running at. Thanks! I verified with turbostat and cpuinfo, our cpus are running in high performance mode and frequency is always at highest level. > > A typical NSD server workload (especially with IB verbs and for reads) > can be pretty light on CPU which might not prompt your CPU crew > governor to up the frequency (which can affect throughout). If your > frequency scaling governor isn't kicking up the frequency of your CPUs > I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: >> >> Hi, >> >> We are running a test setup with 2 NSD Servers backed by 4 Dell >> Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of >> the 4 powervaults, nsd02 is primary serving LUNS of controller B. >> >> We are testing from 2 testing machines connected to the nsds with >> infiniband, verbs enabled. >> >> When we do dd from the NSD servers, we see indeed performance going >> to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is >> able to get the data at a decent speed. Since we can write from the >> clients at a good speed, I didn't suspect the communication between >> clients and nsds being the issue, especially since total performance >> stays the same using 1 or multiple clients. >> >> I'll use the nsdperf tool to see if we can find anything, >> >> thanks! >> >> K >> >> On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE >> CORP] wrote: >>> Interesting. Could you share a little more about your architecture? >>> Is it possible to mount the fs on an NSD server and do some dd's >>> from the fs on the NSD server? If that gives you decent performance >>> perhaps try NSDPERF next >>> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf >>> >>> >>> >>> -Aaron >>> >>> >>> >>> >>> On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman >>> wrote: >>>> >>>> Hi, >>>> >>>> >>>> Having an issue that looks the same as this one: >>>> >>>> We can do sequential writes to the filesystem at 7,8 GB/s total , >>>> which is the expected speed for our current storage >>>> backend. While we have even better performance with sequential >>>> reads on raw storage LUNS, using GPFS we can only reach 1GB/s in >>>> total (each nsd server seems limited by 0,5GB/s) independent of the >>>> number of clients >>>> (1,2,4,..) or ways we tested (fio,dd). 
We played with blockdev >>>> params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. >>>> as discussed in this thread, but nothing seems to impact this read >>>> performance. >>>> >>>> Any ideas? >>>> >>>> Thanks! >>>> >>>> Kenneth >>>> >>>> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>>>> I just had a similar experience from a sandisk infiniflash system >>>>> SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for >>>>> writes. and 250-300 Mbyte/s on sequential reads!! Random reads >>>>> were on the order of 2 Gbyte/s. >>>>> >>>>> After a bit head scratching snd fumbling around I found out that >>>>> reducing maxMBpS from 10000 to 100 fixed the problem! Digging >>>>> further I found that reducing prefetchThreads from default=72 to >>>>> 32 also fixed it, while leaving maxMBpS at 10000. Can now also >>>>> read at 3,2 GByte/s. >>>>> >>>>> Could something like this be the problem on your box as well? >>>>> >>>>> >>>>> >>>>> -jf >>>>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >>>>> >: >>>>> >>>>> Well, I'm somewhat scrounging for hardware. This is in our test >>>>> environment :) And yep, it's got the 2U gpu-tray in it >>>>> although even >>>>> without the riser it has 2 PCIe slots onboard (excluding the >>>>> on-board >>>>> dual-port mezz card) so I think it would make a fine NSD >>>>> server even >>>>> without the riser. >>>>> >>>>> -Aaron >>>>> >>>>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT >>>>> Services) >>>>> wrote: >>>>> > Maybe its related to interrupt handlers somehow? You drive >>>>> the load up on one socket, you push all the interrupt handling >>>>> to the other socket where the fabric card is attached? >>>>> > >>>>> > Dunno ... (Though I am intrigued you use idataplex nodes as >>>>> NSD servers, I assume its some 2U gpu-tray riser one or >>>>> something !) >>>>> > >>>>> > Simon >>>>> > ________________________________________ >>>>> > From: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> [gpfsug-discuss-bounces at spectrumscale.org >>>>> ] on behalf >>>>> of Aaron Knister [aaron.s.knister at nasa.gov >>>>> ] >>>>> > Sent: 17 February 2017 15:52 >>>>> > To: gpfsug main discussion list >>>>> > Subject: [gpfsug-discuss] bizarre performance behavior >>>>> > >>>>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>>>> > connections coming in and 1x FDR10 and 1x QDR connection >>>>> going out to >>>>> > the clients. I was having a really hard time getting >>>>> anything resembling >>>>> > sensible performance out of it (4-5Gb/s writes but maybe >>>>> 1.2Gb/s for >>>>> > reads). The back-end is a DDN SFA12K and I *know* it can do >>>>> better than >>>>> > that. >>>>> > >>>>> > I don't remember quite how I figured this out but simply by >>>>> running >>>>> > "openssl speed -multi 16" on the nsd server to drive up the >>>>> load I saw >>>>> > an almost 4x performance jump which is pretty much goes >>>>> against every >>>>> > sysadmin fiber in me (i.e. "drive up the cpu load with >>>>> unrelated crap to >>>>> > quadruple your i/o performance"). >>>>> > >>>>> > This feels like some type of C-states frequency scaling >>>>> shenanigans that >>>>> > I haven't quite ironed down yet. I booted the box with the >>>>> following >>>>> > kernel parameters "intel_idle.max_cstate=0 >>>>> processor.max_cstate=0" which >>>>> > didn't seem to make much of a difference. I also tried >>>>> setting the >>>>> > frequency governer to userspace and setting the minimum >>>>> frequency to >>>>> > 2.6ghz (it's a 2.6ghz cpu). 
None of that really matters-- I >>>>> still have >>>>> > to run something to drive up the CPU load and then >>>>> performance improves. >>>>> > >>>>> > I'm wondering if this could be an issue with the C1E state? >>>>> I'm curious >>>>> > if anyone has seen anything like this. The node is a dx360 M4 >>>>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>>>> > >>>>> > -Aaron >>>>> > >>>>> > -- >>>>> > Aaron Knister >>>>> > NASA Center for Climate Simulation (Code 606.2) >>>>> > Goddard Space Flight Center >>>>> > (301) 286-2776 >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 16:42:34 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 17:42:34 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> Hi, We already verified this on our nsds: [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --QpiSpeed QpiSpeed=maxdatarate [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --turbomode turbomode=enable [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg ?-SysProfile SysProfile=perfoptimized so sadly this is not the issue. Also the output of the verbs commands look ok, there are connections from the client to the nsds are there is data being read and writen. Thanks again! Kenneth On 21/04/17 16:01, Kumaran Rajaram wrote: > Hi, > > Try enabling the following in the BIOS of the NSD servers (screen > shots below) > > * Turbo Mode - Enable > * QPI Link Frequency - Max Performance > * Operating Mode - Maximum Performance > * > > >>>>While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total > (each nsd server seems limited by 0,5GB/s) independent of the > number of clients > > >>We are testing from 2 testing machines connected to the nsds > with infiniband, verbs enabled. 
> > > Also, It will be good to verify that all the GPFS nodes have Verbs > RDMA started using "mmfsadm test verbs status" and that the NSD > client-server communication from client to server during "dd" is > actually using Verbs RDMA using "mmfsadm test verbs conn" command (on > NSD client doing dd). If not, then GPFS might be using TCP/IP network > over which the cluster is configured impacting performance (If this is > the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and > resolve). > > * > > > > > > > Regards, > -Kums > > > > > > > From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" > > To: gpfsug main discussion list > Date: 04/21/2017 09:11 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Fantastic news! It might also be worth running "cpupower monitor" or > "turbostat" on your NSD servers while you're running dd tests from the > clients to see what CPU frequency your cores are actually running at. > > A typical NSD server workload (especially with IB verbs and for reads) > can be pretty light on CPU which might not prompt your CPU crew > governor to up the frequency (which can affect throughout). If your > frequency scaling governor isn't kicking up the frequency of your CPUs > I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: > > Hi, > > We are running a test setup with 2 NSD Servers backed by 4 Dell > Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of > the 4 powervaults, nsd02 is primary serving LUNS of controller B. > > We are testing from 2 testing machines connected to the nsds with > infiniband, verbs enabled. > > When we do dd from the NSD servers, we see indeed performance going to > 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is > able to get the data at a decent speed. Since we can write from the > clients at a good speed, I didn't suspect the communication between > clients and nsds being the issue, especially since total performance > stays the same using 1 or multiple clients. > > I'll use the nsdperf tool to see if we can find anything, > > thanks! > > K > > On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: > Interesting. Could you share a little more about your architecture? Is > it possible to mount the fs on an NSD server and do some dd's from the > fs on the NSD server? If that gives you decent performance perhaps try > NSDPERF next > _https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf_ > > > -Aaron > > > > > On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman > __ wrote: > > Hi, > > Having an issue that looks the same as this one: > > We can do sequential writes to the filesystem at 7,8 GB/s total , > which is the expected speed for our current storage > backend. While we have even better performance with sequential reads > on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each > nsd server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed > in this thread, but nothing seems to impact this read performance. > > Any ideas? > > Thanks! 
> > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for > writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on > the order of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that > reducing maxMBpS from 10000 to 100 fixed the problem! Digging further > I found that reducing prefetchThreads from default=72 to 32 also fixed > it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > <_aaron.s.knister at nasa.gov_ >: > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load > up on one socket, you push all the interrupt handling to the other > socket where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: _gpfsug-discuss-bounces at spectrumscale.org_ > [_gpfsug-discuss-bounces at spectrumscale.org_ > ] on behalf of Aaron > Knister [_aaron.s.knister at nasa.gov_ ] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at _spectrumscale.org_ > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at _spectrumscale.org_ > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at _spectrumscale.org_ _ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From kums at us.ibm.com Fri Apr 21 21:27:49 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 21 Apr 2017 20:27:49 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov><9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> Message-ID: Hi Kenneth, As it was mentioned earlier, it will be good to first verify the raw network performance between the NSD client and NSD server using the nsdperf tool that is built with RDMA support. g++ -O2 -DRDMA -o nsdperf -lpthread -lrt -libverbs -lrdmacm nsdperf.C In addition, since you have 2 x NSD servers it will be good to perform NSD client file-system performance test with just single NSD server (mmshutdown the other server, assuming all the NSDs have primary, server NSD server configured + Quorum will be intact when a NSD server is brought down) to see if it helps to improve the read performance + if there are variations in the file-system read bandwidth results between NSD_server#1 'active' vs. NSD_server #2 'active' (with other NSD server in GPFS "down" state). If there is significant variation, it can help to isolate the issue to particular NSD server (HW or IB issue?). 
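For reference, a rough sketch of how such an nsdperf run is usually driven once the binary above is built; host names are placeholders and the interactive command names should be checked against the usage text in nsdperf.C:

    # Start the listener on the nodes being measured (NSD server and client)
    ./nsdperf -s
    # From a driver node, describe the two sides and run timed write/read tests over RDMA
    # (the lines below are typed at the nsdperf prompt, not at the shell)
    ./nsdperf
    server nsd01
    client client01
    rdma on
    ttime 30
    test write read
    quit
    # The single-server isolation test described above: stop GPFS on one NSD server
    # (only if quorum and the NSD server definitions allow it), rerun the client dd,
    # then bring the server back
    mmshutdown -N nsd02
    # ... rerun the dd test from the client here ...
    mmstartup -N nsd02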
You can issue "mmdiag --waiters" on NSD client as well as NSD servers during your dd test, to verify if there are unsual long GPFS waiters. In addition, you may issue Linux "perf top -z" command on the GPFS node to see if there is high CPU usage by any particular call/event (for e.g., If GPFS config parameter verbsRdmaMaxSendBytes has been set to low value from the default 16M, then it can cause RDMA completion threads to go CPU bound ). Please verify some performance scenarios detailed in Chapter 22 in Spectrum Scale Problem Determination Guide (link below). https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/pdf/scale_pdg.pdf?view=kc Thanks, -Kums From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/21/2017 11:43 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, We already verified this on our nsds: [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --QpiSpeed QpiSpeed=maxdatarate [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --turbomode turbomode=enable [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg ?-SysProfile SysProfile=perfoptimized so sadly this is not the issue. Also the output of the verbs commands look ok, there are connections from the client to the nsds are there is data being read and writen. Thanks again! Kenneth On 21/04/17 16:01, Kumaran Rajaram wrote: Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) Turbo Mode - Enable QPI Link Frequency - Max Performance Operating Mode - Maximum Performance >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" To: gpfsug main discussion list Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. 
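A small helper along those lines, purely illustrative, meant to be run on the client and on each NSD server while the read test is going:

    # Watch for unusually long waiters during the dd run (Ctrl-C to stop)
    while true; do
        date
        mmdiag --waiters | head -25
        sleep 2
    done
    # In another terminal, check whether any mmfsd threads (e.g. RDMA completion threads) are CPU bound
    perf top -z
    # And confirm verbsRdmaMaxSendBytes has not been lowered from its 16M default
    mmlsconfig verbsRdmaMaxSendBytes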
When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org[ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. 
I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From frank.tower at outlook.com Thu Apr 20 13:27:13 2017 From: frank.tower at outlook.com (Frank Tower) Date: Thu, 20 Apr 2017 12:27:13 +0000 Subject: [gpfsug-discuss] Protocol node recommendations Message-ID: Hi, We have here around 2PB GPFS where users access oney through GPFS client (used by an HPC cluster), but we will have to setup protocols nodes. We will have to share GPFS data to ~ 1000 users, where each users will have different access usage, meaning: - some will do large I/O (e.g: store 1TB files) - some will read/write more than 10k files in a raw - other will do only sequential read I already read the following wiki page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node IBM Spectrum Scale Wiki - Sizing Guidance for Protocol Node www.ibm.com developerWorks wikis allow groups of people to jointly create and maintain content through contribution and collaboration. Wikis apply the wisdom of crowds to ... But I wondering if some people have recommendations regarding hardware sizing and software tuning for such situation ? Or better, if someone already such setup ? Thank you by advance, Frank. -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Apr 22 05:30:29 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Sat, 22 Apr 2017 00:30:29 -0400 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: <52354.1492835429@turing-police.cc.vt.edu> On Thu, 20 Apr 2017 12:27:13 -0000, Frank Tower said: > - some will do large I/O (e.g: store 1TB files) > - some will read/write more than 10k files in a raw > - other will do only sequential read > But I wondering if some people have recommendations regarding hardware sizing > and software tuning for such situation ? The three most critical pieces of info are missing here: 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? 2) How many of the users are likely to be active at the same time? 1,000 users, each of whom are active an hour a week is entirely different from 200 users that are each active 140 hours a week. 3) What SLA/performance target are they expecting? If they want large 1TB I/O and 100MB/sec is acceptable, that's different than if they have a business need to go at 1.2GB/sec.... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From frank.tower at outlook.com Sat Apr 22 07:34:44 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 06:34:44 +0000 Subject: [gpfsug-discuss] Protocol node recommendations Message-ID: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. 
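For reference, parameters like the ones listed just below are normally applied cluster-side with mmchconfig; a minimal, hedged sketch, assuming the protocol nodes sit in the automatically created cesNodes node class and treating the values as placeholders only:

  mmchconfig pagepool=64G,maxFilesToCache=1000000,workerThreads=512 -N cesNodes
  # most of these only take effect after restarting GPFS on those nodes
  mmdiag --config | grep -iE 'pagepool|maxfilestocache|workerthreads'   # verify on one node
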
Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sat Apr 22 09:50:11 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sat, 22 Apr 2017 08:50:11 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower : > Hi, > > We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with > GPFS client on each node. > > We will have to open GPFS to all our users over CIFS and kerberized NFS > with ACL support for both protocol for around +1000 users > > All users have different use case and needs: > - some will do random I/O through a large set of opened files (~5k files) > - some will do large write with 500GB-1TB files > - other will arrange sequential I/O with ~10k opened files > > NFS and CIFS will share the same server, so I through to use SSD drive, at > least 128GB memory with 2 sockets. > > Regarding tuning parameters, I thought at: > > maxFilesToCache 10000 > syncIntervalStrict yes > workerThreads (8*core) > prefetchPct 40 (for now and update if needed) > > I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering > if someone could share his experience/best practice regarding hardware > sizing and/or tuning parameters. > > Thank by advance, > Frank > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sat Apr 22 19:47:59 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 18:47:59 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: <52354.1492835429@turing-police.cc.vt.edu> References: , <52354.1492835429@turing-police.cc.vt.edu> Message-ID: Hi, Thank for your answer. > 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? True, here the list: - 800 users that have 1 workstation through 1Gb/s ethernet and will use NFS/CIFS - 200 users that have 2 workstation through 1Gb/s ethernet, few have 10Gb/s ethernet and will use NFS/CIFS > 2) How many of the users are likely to be active at the same time? 1,000 > users, each of whom are active an hour a week is entirely different from > 200 users that are each active 140 hours a week. 
True again, around 200 users will actively use GPFS through NFS/CIFS during night and day, but we cannot control if people will use 2 workstations or more :( We will have peak during day with an average of 700 'workstations' > 3) What SLA/performance target are they expecting? If they want > large 1TB I/O and 100MB/sec is acceptable, that's different than if they > have a business need to go at 1.2GB/sec.... We just want to provide at normal throughput through an 1GB/s network. Users are aware of such situation and will mainly use HPC cluster for high speed and heavy computation. But they would like to do 'light' computation on their desktop. The main topic here is to sustain 'normal' throughput for all users during peak. Thank for your help. ________________________________ From: valdis.kletnieks at vt.edu Sent: Saturday, April 22, 2017 6:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Protocol node recommendations On Thu, 20 Apr 2017 12:27:13 -0000, Frank Tower said: > - some will do large I/O (e.g: store 1TB files) > - some will read/write more than 10k files in a raw > - other will do only sequential read > But I wondering if some people have recommendations regarding hardware sizing > and software tuning for such situation ? The three most critical pieces of info are missing here: 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? 2) How many of the users are likely to be active at the same time? 1,000 users, each of whom are active an hour a week is entirely different from 200 users that are each active 140 hours a week. 3) What SLA/performance target are they expecting? If they want large 1TB I/O and 100MB/sec is acceptable, that's different than if they have a business need to go at 1.2GB/sec.... -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sat Apr 22 20:22:23 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 19:22:23 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. 
We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sun Apr 23 11:07:38 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sun, 23 Apr 2017 10:07:38 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower : > Hi, > > > Thank for the recommendations. > > Now we deal with the situation of: > > > - take 3 nodes with round robin DNS that handle both protocols > > - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and > NFS services. > > > Regarding your recommendations, 256GB memory node could be a plus if we > mix both protocols for such case. > > > Is the spreadsheet publicly available or do we need to ask IBM ? > > > Thank for your help, > > Frank. > > > ------------------------------ > *From:* Jan-Frode Myklebust > *Sent:* Saturday, April 22, 2017 10:50 AM > *To:* gpfsug-discuss at spectrumscale.org > *Subject:* Re: [gpfsug-discuss] Protocol node recommendations > > That's a tiny maxFilesToCache... > > I would start by implementing the settings from > /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your > protocoll nodes, and leave further tuning to when you see you have issues. > > Regarding sizing, we have a spreadsheet somewhere where you can input some > workload parameters and get an idea for how many nodes you'll need. Your > node config seems fine, but one node seems too few to serve 1000+ users. We > support max 3000 SMB connections/node, and I believe the recommendation is > 4000 NFS connections/node. > > > -jf > l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower : > >> Hi, >> >> We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with >> GPFS client on each node. 
>> >> We will have to open GPFS to all our users over CIFS and kerberized NFS >> with ACL support for both protocol for around +1000 users >> >> All users have different use case and needs: >> - some will do random I/O through a large set of opened files (~5k files) >> - some will do large write with 500GB-1TB files >> - other will arrange sequential I/O with ~10k opened files >> >> NFS and CIFS will share the same server, so I through to use SSD >> drive, at least 128GB memory with 2 sockets. >> >> Regarding tuning parameters, I thought at: >> >> maxFilesToCache 10000 >> syncIntervalStrict yes >> workerThreads (8*core) >> prefetchPct 40 (for now and update if needed) >> >> I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering >> if someone could share his experience/best practice regarding hardware >> sizing and/or tuning parameters. >> >> Thank by advance, >> Frank >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rreuscher at verizon.net Sun Apr 23 17:43:44 2017 From: rreuscher at verizon.net (Robert Reuscher) Date: Sun, 23 Apr 2017 11:43:44 -0500 Subject: [gpfsug-discuss] LUN expansion Message-ID: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> We run GPFS on z/Linux and have been using ECKD devices for disks. We are looking at implementing some new filesystems on FCP LUNS. One of the features of a LUN is we can expand a LUN instead of adding new LUNS, where as with ECKD devices. From what I?ve found searching to see if GPFS filesystem can be expanding to see the expanded LUN, it doesn?t seem that this will work, you have to add new LUNS (or new disks) and then add them to the filesystem. Everything I?ve found is at least 2-3 old (most of it much older), and just want to check that this is still is true before we make finalize our LUN/GPFS procedures. Robert Reuscher NR5AR -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sun Apr 23 22:27:50 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sun, 23 Apr 2017 21:27:50 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. 
Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sfadden at us.ibm.com Sun Apr 23 23:44:56 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Sun, 23 Apr 2017 22:44:56 +0000 Subject: [gpfsug-discuss] LUN expansion In-Reply-To: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> References: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> Message-ID: An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Mon Apr 24 10:11:25 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 24 Apr 2017 09:11:25 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: What's your SSD going to help with... will you implement it as a LROC device? Otherwise I can't see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust ; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. 
If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Mon Apr 24 11:28:08 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 24 Apr 2017 12:28:08 +0200 (CEST) Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Message-ID: <416417651.114582.1493029688959@email.1und1.de> An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Apr 24 12:14:17 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 12:14:17 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <416417651.114582.1493029688959@email.1und1.de> References: <416417651.114582.1493029688959@email.1und1.de> Message-ID: <1493032457.11896.20.camel@buzzard.me.uk> On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From service at metamodul.com Mon Apr 24 13:21:09 2017 From: service at metamodul.com (service at metamodul.com) Date: Mon, 24 Apr 2017 14:21:09 +0200 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Message-ID: Hi Jonathan todays hardware is so powerful that imho it might make sense to split a CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). I think that such a server is a little bit to big ?just to be a single NSD server. Note that i use for each GPFS service a dedicated node. So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. Inhm 4xS822L could handle this and a little bit more quite well. Of course blade technology could be used or 1U server. With kind regards Hajo --? Unix Systems Engineer MetaModul GmbH +49 177 4393994
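As a concrete illustration of the single-box KVM test rig Jonathan describes in the message quoted below, a rough sketch (the volume group, guest and device names are made up):

  # on the KVM host: carve one LV per simulated disk and hand it to a guest
  lvcreate -L 50G -n gpfs_nsd1 vg_data
  virsh attach-disk gpfs-test1 /dev/vg_data/gpfs_nsd1 vdb --persistent --subdriver raw
  # repeat per guest/LV, then build the test cluster inside the guests
  # with mmcrcluster / mmcrnsd / mmcrfs as usual
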
-------- Original message --------
From: Jonathan Buzzard
Date: 2017.04.24 13:14 (GMT+01:00)
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale
On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Apr 24 13:42:51 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 24 Apr 2017 15:42:51 +0300 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: Hi As tastes vary, I would not partition it so much for the backend. Assuming there is little to nothing overhead on the CPU at PHYP level, which it depends. On the protocols nodes, due the CTDB keeping locks together across all nodes (SMB), you would get more performance on bigger & less number of CES nodes than more and smaller. Certainly a 822 is quite a server if we go back to previous generations but I would still keep a simple backend (NSd servers), simple CES (less number of nodes the merrier) & then on the client part go as micro partitions as you like/can as the effect on the cluster is less relevant in the case of resources starvation. But, it depends on workloads, SLA and money so I say try, establish a baseline and it fills the requirements, go for it. If not change till does. Have fun From: "service at metamodul.com" To: gpfsug main discussion list Date: 24/04/2017 15:21 Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Jonathan todays hardware is so powerful that imho it might make sense to split a CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). I think that such a server is a little bit to big just to be a single NSD server. Note that i use for each GPFS service a dedicated node. So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. Inhm 4xS822L could handle this and a little bit more quite well. Of course blade technology could be used or 1U server. 
With kind regards Hajo -- Unix Systems Engineer MetaModul GmbH +49 177 4393994 -------- Urspr?ngliche Nachricht -------- Von: Jonathan Buzzard Datum:2017.04.24 13:14 (GMT+01:00) An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Apr 24 14:04:26 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 14:04:26 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <1493039066.11896.30.camel@buzzard.me.uk> On Mon, 2017-04-24 at 14:21 +0200, service at metamodul.com wrote: > Hi Jonathan > todays hardware is so powerful that imho it might make sense to split > a CEC into more "piece". For example the IBM S822L has up to 2x12 > cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). > I think that such a server is a little bit to big just to be a single > NSD server. So don't buy it for an NSD server then :-) > Note that i use for each GPFS service a dedicated node. > So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup > nodes and at least 3 test server a total of 11 server is needed. > Inhm 4xS822L could handle this and a little bit more quite well. > I think you are missing the point somewhat. Well by several country miles and quite possibly an ocean or two to be honest. Spectrum scale is supposed to be a "scale out" solution. More storage required add more arrays. More bandwidth add more servers etc. 
If you are just going to scale it all up on a *single* server then you might as well forget GPFS and do an old school standard scale up solution. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From janfrode at tanso.net Mon Apr 24 14:14:20 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 24 Apr 2017 15:14:20 +0200 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: I agree with Luis -- why so many nodes? """ So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. """ If this is your whole cluster, why not just 3x P822L/P812L running single partition per node, hosting a cluster of 3x protocol-nodes that does both direct FC for disk access, and also run backups on same nodes ? No complications, full hw performance. Then separate node for test, or separate partition on same nodes with dedicated adapters. But back to your original question. My experience is that LPAR/NPIV works great, but it's a bit annoying having to also have VIOs. Hope we'll get FC SR-IOV eventually.. Also LPAR/Dedicated-adapters naturally works fine. VMWare/RDM can be a challenge in some failure situations. It likes to pause VMs in APD or PDL situations, which will affect all VMs with access to it :-o VMs without direct disk access is trivial. -jf On Mon, Apr 24, 2017 at 2:42 PM, Luis Bolinches wrote: > Hi > > As tastes vary, I would not partition it so much for the backend. Assuming > there is little to nothing overhead on the CPU at PHYP level, which it > depends. On the protocols nodes, due the CTDB keeping locks together across > all nodes (SMB), you would get more performance on bigger & less number of > CES nodes than more and smaller. > > Certainly a 822 is quite a server if we go back to previous generations > but I would still keep a simple backend (NSd servers), simple CES (less > number of nodes the merrier) & then on the client part go as micro > partitions as you like/can as the effect on the cluster is less relevant in > the case of resources starvation. > > But, it depends on workloads, SLA and money so I say try, establish a > baseline and it fills the requirements, go for it. If not change till does. > Have fun > > > > From: "service at metamodul.com" > To: gpfsug main discussion list > Date: 24/04/2017 15:21 > Subject: Re: [gpfsug-discuss] Used virtualization technologies for > GPFS/Spectrum Scale > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Hi Jonathan > todays hardware is so powerful that imho it might make sense to split a > CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 > PCI3 slots ( 4?16 lans & 5?8 lan ). > I think that such a server is a little bit to big just to be a single NSD > server. > Note that i use for each GPFS service a dedicated node. > So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes > and at least 3 test server a total of 11 server is needed. > Inhm 4xS822L could handle this and a little bit more quite well. > > Of course blade technology could be used or 1U server. 
> > With kind regards > Hajo > > -- > Unix Systems Engineer > MetaModul GmbH > +49 177 4393994 <+49%20177%204393994> > > > -------- Urspr?ngliche Nachricht -------- > Von: Jonathan Buzzard > Datum:2017.04.24 13:14 (GMT+01:00) > An: gpfsug main discussion list > Betreff: Re: [gpfsug-discuss] Used virtualization technologies for > GPFS/Spectrum Scale > > On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > > @All > > > > > > does anybody uses virtualization technologies for GPFS Server ? If yes > > what kind and why have you selected your soulution. > > > > I think currently about using Linux on Power using 40G SR-IOV for > > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > > also assign only a certain amount of CPUs to GPFS. ( Lower license > > cost / You pay for what you use) > > > > > > I must admit that i am not familar how "good" KVM/ESX in respect to > > direct assignment of hardware is. Thus the question to the group > > > > For the most part GPFS is used at scale and in general all the > components are redundant. As such why you would want to allocate less > than a whole server into a production GPFS system in somewhat beyond me. > > That is you will have a bunch of NSD servers in the system and if one > crashes, well the other NSD's take over. Similar for protocol nodes, and > in general the total file system size is going to hundreds of TB > otherwise why bother with GPFS. > > I guess there is currently potential value at sticking the GUI into a > virtual machine to get redundancy. > > On the other hand if you want a test rig, then virtualization works > wonders. I have put GPFS on a single Linux box, using LV's for the disks > and mapping them into virtual machines under KVM. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_______ > ________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Mon Apr 24 16:29:56 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 24 Apr 2017 11:29:56 -0400 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <131241.1493047796@turing-police.cc.vt.edu> On Mon, 24 Apr 2017 14:21:09 +0200, "service at metamodul.com" said: > todays hardware is so powerful that imho it might make sense to split a CEC > into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots > ( 4?16 lans & 5?8 lan ). We look at it the other way around: Today's hardware is so powerful that you can build a cluster out of a stack of fairly low-end 1U servers (we have one cluster that's built out of Dell r630s). 
And it's more robust against hardware failures than a VM based solution - although the 822 seems to allow hot-swap of PCI cards, a dead socket or DIMM will still kill all the VMs when you go to replace it. If one 1U out of 4 goes down due to a bad DIMM (which has happened to us more often than a bad PCI card) you can just power it down and replace it.... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From service at metamodul.com Mon Apr 24 17:11:25 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 24 Apr 2017 18:11:25 +0200 (CEST) Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <1961501377.286669.1493050285874@email.1und1.de> > Jan-Frode Myklebust hat am 24. April 2017 um 15:14 geschrieben: > I agree with Luis -- why so many nodes? Many ? IMHO it is not that much. I do not like to have one server doing more than one task. Thus a NSD Server does only serves GPFS. A Protocol server serves either NFS or SMB but not both except IBM says it would be better to run NFS/SMB on the same node. A backup server runs also on its "own" hardware. So i would need at least 4 NSD Server since if 1 fails i am losing only 25% of my "performance" and still having a 4/5 quorum. Nice in case an Update of a NSD failed. Each protocol service requires at least 2 nodes and the backup service as well. I can only say that with that approach i never had problems. I have be running into problems each time i did not followed that apporach. But of course YMMV But keep in mind that each service might requires different GPFS configuration or even slightly different hardware. Saying so i am a fan of having many GPFS Server ( NSD, Protocol , Backup a.s.o ) and i do not understand why not to use many nodes ^_^ Cheers Hajo From jonathan at buzzard.me.uk Mon Apr 24 17:24:29 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 17:24:29 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <131241.1493047796@turing-police.cc.vt.edu> References: <131241.1493047796@turing-police.cc.vt.edu> Message-ID: <1493051069.11896.39.camel@buzzard.me.uk> On Mon, 2017-04-24 at 11:29 -0400, valdis.kletnieks at vt.edu wrote: > On Mon, 24 Apr 2017 14:21:09 +0200, "service at metamodul.com" said: > > > todays hardware is so powerful that imho it might make sense to split a CEC > > into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots > > ( 4?16 lans & 5?8 lan ). > > We look at it the other way around: Today's hardware is so powerful that > you can build a cluster out of a stack of fairly low-end 1U servers (we > have one cluster that's built out of Dell r630s). And it's more robust > against hardware failures than a VM based solution - although the 822 seems > to allow hot-swap of PCI cards, a dead socket or DIMM will still kill all > the VMs when you go to replace it. If one 1U out of 4 goes down due to > a bad DIMM (which has happened to us more often than a bad PCI card) you > can just power it down and replace it.... Hate to say but the 822 will happily keep trucking when the CPU (assuming it has more than one) fails and similar with the DIMM's. In fact mirrored DIMM's is reasonably common on x86 machines these days, though very few people ever use it. That said CPU failures are incredibly rare in my experience. 
The only time I have ever come across a failed CPU was on a pSeries machine and then it was only because the backup was running really slow (it was running TSM) that prompted us to look closer and see what had happened. Monitoring (Zenoss) was not setup to register the event because like when does a CPU fail and the machine keep running! I am not 100% sure on the 822 put I suspect that the DIMM's and any socketed CPU's can be hot swapped in addition to the PCI card's which I have personally done on pSeries machines. However it is a stupidly over priced solution to run GPFS, because there are better or at the very least vastly cheaper ways to get the same level of reliability. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From valdis.kletnieks at vt.edu Mon Apr 24 18:58:17 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 24 Apr 2017 13:58:17 -0400 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <1493051069.11896.39.camel@buzzard.me.uk> References: <131241.1493047796@turing-police.cc.vt.edu> <1493051069.11896.39.camel@buzzard.me.uk> Message-ID: <7337.1493056697@turing-police.cc.vt.edu> On Mon, 24 Apr 2017 17:24:29 +0100, Jonathan Buzzard said: > Hate to say but the 822 will happily keep trucking when the CPU > (assuming it has more than one) fails and similar with the DIMM's. In How about when you go to replace the DIMM? You able to hot-swap the memory without anything losing its mind? (I know this is possible in the Z/series world, but those usually have at least 2-3 more zeros in the price tag). -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From luis.bolinches at fi.ibm.com Mon Apr 24 19:08:32 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 24 Apr 2017 21:08:32 +0300 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <7337.1493056697@turing-police.cc.vt.edu> References: <131241.1493047796@turing-police.cc.vt.edu> <1493051069.11896.39.camel@buzzard.me.uk> <7337.1493056697@turing-police.cc.vt.edu> Message-ID: Hi 822 is an entry scale out Power machine, it has limited RAS compared with the high end ones (870/880). The 822 needs to be down for CPU / DIMM replacement: https://www.ibm.com/support/knowledgecenter/5148-21L/p8eg3/p8eg3_83x_8rx_kickoff.htm . And it is not a end user task. You can argue that, I owuld but it is the current statement and you pay for support for these kind of stuff. From: valdis.kletnieks at vt.edu To: gpfsug main discussion list Date: 24/04/2017 20:58 Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Sent by: gpfsug-discuss-bounces at spectrumscale.org On Mon, 24 Apr 2017 17:24:29 +0100, Jonathan Buzzard said: > Hate to say but the 822 will happily keep trucking when the CPU > (assuming it has more than one) fails and similar with the DIMM's. In How about when you go to replace the DIMM? You able to hot-swap the memory without anything losing its mind? (I know this is possible in the Z/series world, but those usually have at least 2-3 more zeros in the price tag). [attachment "attqolcz.dat" deleted by Luis Bolinches/Finland/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Mon Apr 24 22:12:14 2017 From: frank.tower at outlook.com (Frank Tower) Date: Mon, 24 Apr 2017 21:12:14 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , , Message-ID: >From what I've read from the Wiki: 'The NFS protocol performance is largely dependent on the base system performance of the protocol node hardware and network. This includes multiple factors the type and number of CPUs, the size of the main memory in the nodes, the type of disk drives used (HDD, SSD, etc.) and the disk configuration (RAID-level, replication etc.). In addition, NFS protocol performance can be impacted by the overall load of the node (such as number of clients accessing, snapshot creation/deletion and more) and administrative tasks (for example filesystem checks or online re-striping of disk arrays).' Nowadays, SSD is worst to invest. LROC could be an option in the future, but we need to quantify NFS/CIFS workload first. Are you using LROC with your GPFS installation ? Best, Frank. ________________________________ From: Sobey, Richard A Sent: Monday, April 24, 2017 11:11 AM To: gpfsug main discussion list; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations What?s your SSD going to help with? will you implement it as a LROC device? Otherwise I can?t see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust ; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. 
________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Apr 25 09:19:10 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 08:19:10 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , , Message-ID: I tried it on one node but investing in what could be up to ?5000 in SSDs when we don't know the gains isn't something I can argue. Not that LROC will hurt the environment but my users may not see any benefit. My cluster is the complete opposite of busy (relative to people saying they're seeing sustained 800MB/sec throughput), I just need it stable. Richard From: Frank Tower [mailto:frank.tower at outlook.com] Sent: 24 April 2017 22:12 To: Sobey, Richard A ; gpfsug main discussion list ; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations >From what I've read from the Wiki: 'The NFS protocol performance is largely dependent on the base system performance of the protocol node hardware and network. This includes multiple factors the type and number of CPUs, the size of the main memory in the nodes, the type of disk drives used (HDD, SSD, etc.) and the disk configuration (RAID-level, replication etc.). In addition, NFS protocol performance can be impacted by the overall load of the node (such as number of clients accessing, snapshot creation/deletion and more) and administrative tasks (for example filesystem checks or online re-striping of disk arrays).' Nowadays, SSD is worst to invest. LROC could be an option in the future, but we need to quantify NFS/CIFS workload first. 
Are you using LROC with your GPFS installation ? Best, Frank. ________________________________ From: Sobey, Richard A > Sent: Monday, April 24, 2017 11:11 AM To: gpfsug main discussion list; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations What's your SSD going to help with... will you implement it as a LROC device? Otherwise I can't see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust >; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. 
We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Apr 25 09:23:32 2017 From: chair at spectrumscale.org (Spectrum Scale UG Chair (Simon Thompson)) Date: Tue, 25 Apr 2017 09:23:32 +0100 Subject: [gpfsug-discuss] User group meeting May 9th/10th 2017 Message-ID: The UK user group is now just 2 weeks away! Its time to register ... https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi stration-32113696932 (or https://goo.gl/tRptru) Remember user group meetings are free to attend, and this year's 2 day meeting is packed full of sessions and several of the breakout sessions are cloud-focussed looking at how Spectrum Scale can be used with cloud deployments. And as usual, we have the ever popular Sven speaking with his views from the Research topics. Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, OCF and Seagate for helping make this happen! We need to finalise numbers for the evening event soon, so make sure you book your place now! Simon From S.J.Thompson at bham.ac.uk Tue Apr 25 12:20:39 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 11:20:39 +0000 Subject: [gpfsug-discuss] NFS issues Message-ID: Hi, We have recently started deploying NFS in addition our existing SMB exports on our protocol nodes. We use a RR DNS name that points to 4 VIPs for SMB services and failover seems to work fine with SMB clients. We figured we could use the same name and IPs and run Ganesha on the protocol servers, however we are seeing issues with NFS clients when IP failover occurs. In normal operation on a client, we might see several mounts from different IPs obviously due to the way the DNS RR is working, but it all works fine. In a failover situation, the IP will move to another node and some clients will carry on, others will hang IO to the mount points referred to by the IP which has moved. We can *sometimes* trigger this by manually suspending a CES node, but not always and some clients mounting from the IP moving will be fine, others won't. If we resume a node an it fails back, the clients that are hanging will usually recover fine. We can reboot a client prior to failback and it will be fine, stopping and starting the ganesha service on a protocol node will also sometimes resolve the issues. So, has anyone seen this sort of issue and any suggestions for how we could either debug more or workaround? 
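A starting point for that kind of debugging, assuming a standard 4.2.x CES setup, might be the commands below (a sketch, not a definitive procedure):

  # which protocol node currently holds each floating CES address
  mmces address list

  # NFS/SMB service state across the protocol nodes, and CES component health
  mmces service list -a
  mmhealth node show CES

  # on an affected client: RPC retransmissions and per-operation counters
  nfsstat -rc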
We are currently running the packages nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). At one point we were seeing it a lot, and could track it back to an underlying GPFS network issue that was causing protocol nodes to be expelled occasionally, we resolved that and the issues became less apparent, but maybe we just fixed one failure mode so see it less often. On the clients, we use -o sync,hard BTW as in the IBM docs. On a client showing the issues, we'll see in dmesg, NFS related messages like: [Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out Which explains the client hang on certain mount points. The symptoms feel very much like those logged in this Gluster/ganesha bug: https://bugzilla.redhat.com/show_bug.cgi?id=1354439 Thanks Simon From Mark.Bush at siriuscom.com Tue Apr 25 14:27:38 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 13:27:38 +0000 Subject: [gpfsug-discuss] Perfmon and GUI Message-ID: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = 
"Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Apr 25 14:44:59 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 13:44:59 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> References: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> Message-ID: I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? 
Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.ouwehand at vumc.nl Tue Apr 25 14:51:22 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Tue, 25 Apr 2017 13:51:22 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: <5594921EA5B3674AB44AD9276126AAF40170DD3159@sp-mx-mbx42> Hello, At first a short introduction. My name is Jaap Jan Ouwehand, I work at a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical (office, research and clinical data) business process. 
We have three large GPFS filesystems for different purposes. We also had such a situation with cNFS. A failover (IPtakeover) was technically good, only clients experienced "stale filehandles". We opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few months later, the solution appeared to be in the fsid option. An NFS filehandle is built by a combination of fsid and a hash function on the inode. After a failover, the fsid value can be different and the client has a "stale filehandle". To avoid this, the fsid value can be statically specified. See: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm Maybe there is also a value in Ganesha that changes after a failover. Certainly since most sessions will be re-established after a failback. Maybe you see more debug information with tcpdump. Kind regards, ? Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT E: jj.ouwehand at vumc.nl W: www.vumc.com -----Oorspronkelijk bericht----- Van: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson (IT Research Support) Verzonden: dinsdag 25 april 2017 13:21 Aan: gpfsug-discuss at spectrumscale.org Onderwerp: [gpfsug-discuss] NFS issues Hi, We have recently started deploying NFS in addition our existing SMB exports on our protocol nodes. We use a RR DNS name that points to 4 VIPs for SMB services and failover seems to work fine with SMB clients. We figured we could use the same name and IPs and run Ganesha on the protocol servers, however we are seeing issues with NFS clients when IP failover occurs. In normal operation on a client, we might see several mounts from different IPs obviously due to the way the DNS RR is working, but it all works fine. In a failover situation, the IP will move to another node and some clients will carry on, others will hang IO to the mount points referred to by the IP which has moved. We can *sometimes* trigger this by manually suspending a CES node, but not always and some clients mounting from the IP moving will be fine, others won't. If we resume a node an it fails back, the clients that are hanging will usually recover fine. We can reboot a client prior to failback and it will be fine, stopping and starting the ganesha service on a protocol node will also sometimes resolve the issues. So, has anyone seen this sort of issue and any suggestions for how we could either debug more or workaround? We are currently running the packages nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). At one point we were seeing it a lot, and could track it back to an underlying GPFS network issue that was causing protocol nodes to be expelled occasionally, we resolved that and the issues became less apparent, but maybe we just fixed one failure mode so see it less often. On the clients, we use -o sync,hard BTW as in the IBM docs. On a client showing the issues, we'll see in dmesg, NFS related messages like: [Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out Which explains the client hang on certain mount points. 
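Picking up the tcpdump suggestion above, a capture on the protocol node that is about to give up the address (and on a hanging client) during a controlled failover would look roughly like this; the interface name and address are placeholders:

  tcpdump -i bond0 -s 0 -w /tmp/ces-failover.pcap host 10.10.10.10 and port 2049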
The symptoms feel very much like those logged in this Gluster/ganesha bug: https://bugzilla.redhat.com/show_bug.cgi?id=1354439 Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Tue Apr 25 15:06:04 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 14:06:04 +0000 Subject: [gpfsug-discuss] NFS issues Message-ID: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical >(office, research and clinical data) business process. We have three >large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We opened >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months >later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson >(IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: gpfsug-discuss at spectrumscale.org >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and failover >seems to work fine with SMB clients. We figured we could use the same >name and IPs and run Ganesha on the protocol servers, however we are >seeing issues with NFS clients when IP failover occurs. > >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it all >works fine. 
> >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by manually >suspending a CES node, but not always and some clients mounting from the >IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not >responding, timed out > >Which explains the client hang on certain mount points. > >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Mark.Bush at siriuscom.com Tue Apr 25 15:13:58 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 14:13:58 +0000 Subject: [gpfsug-discuss] Perfmon and GUI Message-ID: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. 
The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. 
This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Tue Apr 25 15:29:07 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 14:29:07 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? 
I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Tue Apr 25 15:31:13 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 14:31:13 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: No worries Mark. We don?t use NFS here (yet) so I can?t help there. Glad I could help. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 15:29 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. 
What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Tue Apr 25 18:04:41 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 25 Apr 2017 17:04:41 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 
16.06 skrev Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk>: > Hi, > > From what I can see, Ganesha uses the Export_Id option in the config file > (which is managed by CES) for this. I did find some reference in the > Ganesha devs list that if its not set, then it would read the FSID from > the GPFS file-system, either way they should surely be consistent across > all the nodes. The posts I found were from someone with an IBM email > address, so I guess someone in the IBM teams. > > I checked a couple of my protocol nodes and they use the same Export_Id > consistently, though I guess that might not be the same as the FSID value. > > Perhaps someone from IBM could comment on if FSID is likely to the cause > of my problems? > > Thanks > > Simon > > On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Ouwehand, JJ" j.ouwehand at vumc.nl> wrote: > > >Hello, > > > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a > >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM > >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical > >(office, research and clinical data) business process. We have three > >large GPFS filesystems for different purposes. > > > >We also had such a situation with cNFS. A failover (IPtakeover) was > >technically good, only clients experienced "stale filehandles". We opened > >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months > >later, the solution appeared to be in the fsid option. > > > >An NFS filehandle is built by a combination of fsid and a hash function > >on the inode. After a failover, the fsid value can be different and the > >client has a "stale filehandle". To avoid this, the fsid value can be > >statically specified. See: > > > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum > . > >scale.v4r22.doc/bl1adm_nfslin.htm > > > >Maybe there is also a value in Ganesha that changes after a failover. > >Certainly since most sessions will be re-established after a failback. > >Maybe you see more debug information with tcpdump. > > > > > >Kind regards, > > > >Jaap Jan Ouwehand > >ICT Specialist (Storage & Linux) > >VUmc - ICT > >E: jj.ouwehand at vumc.nl > >W: www.vumc.com > > > > > > > >-----Oorspronkelijk bericht----- > >Van: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson > >(IT Research Support) > >Verzonden: dinsdag 25 april 2017 13:21 > >Aan: gpfsug-discuss at spectrumscale.org > >Onderwerp: [gpfsug-discuss] NFS issues > > > >Hi, > > > >We have recently started deploying NFS in addition our existing SMB > >exports on our protocol nodes. > > > >We use a RR DNS name that points to 4 VIPs for SMB services and failover > >seems to work fine with SMB clients. We figured we could use the same > >name and IPs and run Ganesha on the protocol servers, however we are > >seeing issues with NFS clients when IP failover occurs. > > > >In normal operation on a client, we might see several mounts from > >different IPs obviously due to the way the DNS RR is working, but it all > >works fine. > > > >In a failover situation, the IP will move to another node and some > >clients will carry on, others will hang IO to the mount points referred > >to by the IP which has moved. We can *sometimes* trigger this by manually > >suspending a CES node, but not always and some clients mounting from the > >IP moving will be fine, others won't. 
> > > >If we resume a node an it fails back, the clients that are hanging will > >usually recover fine. We can reboot a client prior to failback and it > >will be fine, stopping and starting the ganesha service on a protocol > >node will also sometimes resolve the issues. > > > >So, has anyone seen this sort of issue and any suggestions for how we > >could either debug more or workaround? > > > >We are currently running the packages > >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > > > >At one point we were seeing it a lot, and could track it back to an > >underlying GPFS network issue that was causing protocol nodes to be > >expelled occasionally, we resolved that and the issues became less > >apparent, but maybe we just fixed one failure mode so see it less often. > > > >On the clients, we use -o sync,hard BTW as in the IBM docs. > > > >On a client showing the issues, we'll see in dmesg, NFS related messages > >like: > >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not > >responding, timed out > > > >Which explains the client hang on certain mount points. > > > >The symptoms feel very much like those logged in this Gluster/ganesha bug: > >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > > > > >Thanks > > > >Simon > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hoang.nguyen at seagate.com Tue Apr 25 18:12:19 2017 From: hoang.nguyen at seagate.com (Hoang Nguyen) Date: Tue, 25 Apr 2017 10:12:19 -0700 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: I have a customer with a slightly different issue but sounds somewhat related. If you stop and stop the NFS service on a CES node or update an existing export which will restart Ganesha. Some of their NFS clients do not reconnect in a very similar fashion as you described. I haven't been able to reproduce it on my test system repeatedly but using soft NFS mounts seems to help. Seems like it happens more often to clients currently running NFS IO during the outage. But I'm interested to see what you guys uncover. Thanks, Hoang On Tue, Apr 25, 2017 at 7:06 AM, Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk> wrote: > Hi, > > From what I can see, Ganesha uses the Export_Id option in the config file > (which is managed by CES) for this. I did find some reference in the > Ganesha devs list that if its not set, then it would read the FSID from > the GPFS file-system, either way they should surely be consistent across > all the nodes. The posts I found were from someone with an IBM email > address, so I guess someone in the IBM teams. > > I checked a couple of my protocol nodes and they use the same Export_Id > consistently, though I guess that might not be the same as the FSID value. > > Perhaps someone from IBM could comment on if FSID is likely to the cause > of my problems? 
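For reference, a Ganesha export stanza of the kind CES manages typically looks something like the sketch below. Export_Id is the field referred to above, and Filesystem_Id is the Ganesha option that plays the role fsid= does for kernel NFS exports. Option names are as I understand them for Ganesha 2.3; the path and values are placeholders:

  EXPORT {
      Export_Id = 1;                # must be identical on every CES node
      Path = /gpfs/gpfs0/data;      # placeholder export path
      Pseudo = /gpfs/gpfs0/data;
      Access_Type = RW;
      Squash = no_root_squash;
      Filesystem_Id = 192.168;      # fixed major.minor so file handles survive an IP move
      FSAL {
          Name = GPFS;
      }
  }

  # kernel NFS / cNFS equivalent, per the fsid advice earlier in the thread (values are examples):
  # /gpfs/gpfs0   *(rw,sync,fsid=745)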
> > Thanks > > Simon > > On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Ouwehand, JJ" j.ouwehand at vumc.nl> wrote: > > >Hello, > > > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a > >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM > >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical > >(office, research and clinical data) business process. We have three > >large GPFS filesystems for different purposes. > > > >We also had such a situation with cNFS. A failover (IPtakeover) was > >technically good, only clients experienced "stale filehandles". We opened > >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months > >later, the solution appeared to be in the fsid option. > > > >An NFS filehandle is built by a combination of fsid and a hash function > >on the inode. After a failover, the fsid value can be different and the > >client has a "stale filehandle". To avoid this, the fsid value can be > >statically specified. See: > > > >https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ibm.com_support_ > knowledgecenter_STXKQY-5F4.2.2_com.ibm.spectrum&d=DwICAg&c= > IGDlg0lD0b-nebmJJ0Kp8A&r=erT0ET1g1dsvTDYndRRTAAZ6Dneebt > G6F47PIUMDXFw&m=K3iXrW2N_HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s= > PIXnA0UQbneTHMRxvUcmsvZK6z5V2XU4jR_GIVaZP5Q&e= . > >scale.v4r22.doc/bl1adm_nfslin.htm > > > >Maybe there is also a value in Ganesha that changes after a failover. > >Certainly since most sessions will be re-established after a failback. > >Maybe you see more debug information with tcpdump. > > > > > >Kind regards, > > > >Jaap Jan Ouwehand > >ICT Specialist (Storage & Linux) > >VUmc - ICT > >E: jj.ouwehand at vumc.nl > >W: www.vumc.com > > > > > > > >-----Oorspronkelijk bericht----- > >Van: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson > >(IT Research Support) > >Verzonden: dinsdag 25 april 2017 13:21 > >Aan: gpfsug-discuss at spectrumscale.org > >Onderwerp: [gpfsug-discuss] NFS issues > > > >Hi, > > > >We have recently started deploying NFS in addition our existing SMB > >exports on our protocol nodes. > > > >We use a RR DNS name that points to 4 VIPs for SMB services and failover > >seems to work fine with SMB clients. We figured we could use the same > >name and IPs and run Ganesha on the protocol servers, however we are > >seeing issues with NFS clients when IP failover occurs. > > > >In normal operation on a client, we might see several mounts from > >different IPs obviously due to the way the DNS RR is working, but it all > >works fine. > > > >In a failover situation, the IP will move to another node and some > >clients will carry on, others will hang IO to the mount points referred > >to by the IP which has moved. We can *sometimes* trigger this by manually > >suspending a CES node, but not always and some clients mounting from the > >IP moving will be fine, others won't. > > > >If we resume a node an it fails back, the clients that are hanging will > >usually recover fine. We can reboot a client prior to failback and it > >will be fine, stopping and starting the ganesha service on a protocol > >node will also sometimes resolve the issues. > > > >So, has anyone seen this sort of issue and any suggestions for how we > >could either debug more or workaround? > > > >We are currently running the packages > >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). 
> > > >At one point we were seeing it a lot, and could track it back to an > >underlying GPFS network issue that was causing protocol nodes to be > >expelled occasionally, we resolved that and the issues became less > >apparent, but maybe we just fixed one failure mode so see it less often. > > > >On the clients, we use -o sync,hard BTW as in the IBM docs. > > > >On a client showing the issues, we'll see in dmesg, NFS related messages > >like: > >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not > >responding, timed out > > > >Which explains the client hang on certain mount points. > > > >The symptoms feel very much like those logged in this Gluster/ganesha bug: > >https://urldefense.proofpoint.com/v2/url?u=https- > 3A__bugzilla.redhat.com_show-5Fbug.cgi-3Fid-3D1354439&d= > DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=erT0ET1g1dsvTDYndRRTAAZ6Dneebt > G6F47PIUMDXFw&m=K3iXrW2N_HcdrGDuKmRWFjypuPLPJDIm9VosFII > sFoI&s=KN5WKk1vLEt0Y_17nVQeDi1lK5mSQUZQ7lPtQK3FBG4&e= > > > > > >Thanks > > > >Simon > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_ > listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_ > listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > -- Hoang Nguyen *? *Sr Staff Engineer Seagate Technology office: +1 (858) 751-4487 mobile: +1 (858) 284-7846 www.seagate.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Apr 25 18:30:40 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 17:30:40 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: , Message-ID: I did some digging in the mmcesfuncs to see what happens server side on fail over. Basically the server losing the IP is supposed to terminate all sessions and the receiver server sends ACK tickles. My current supposition is that for whatever reason, the losing server isn't releasing something and the client still has hold of a connection which is mostly dead. The tickle then fails to the client from the new server. This would explain why failing the IP back to the original server usually brings the client back to life. This is only my working theory at the moment as we can't reliably reproduce this. Next time it happens we plan to grab some netstat from each side. Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the server that received the IP and see if that fixes it (i.e. the receiver server didn't tickle properly). 
(Usage extracted from mmcesfuncs which is ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) for anyone interested. Then try and kill he sessions on the losing server to check if there is stuff still open and re-tickle the client. If we can get steps to workaround, I'll log a PMR. I suppose I could do that now, but given its non deterministic and we want to be 100% sure it's not us doing something wrong, I'm inclined to wait until we do some more testing. I agree with the suggestion that it's probably IO pending nodes that are affected, but don't have any data to back that up yet. We did try with a read workload on a client, but may we need either long IO blocked reads or writes (from the GPFS end). We also originally had soft as the default option, but saw issues then and the docs suggested hard, so we switched and also enabled sync (we figured maybe it was NFS client with uncommited writes), but neither have resolved the issues entirely. Difficult for me to say if they improved the issue though given its sporadic. Appreciate people's suggestions! Thanks Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode Myklebust [janfrode at tanso.net] Sent: 25 April 2017 18:04 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" on behalf of j.ouwehand at vumc.nl> wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical >(office, research and clinical data) business process. We have three >large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We opened >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months >later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. 
>scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson >(IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: gpfsug-discuss at spectrumscale.org >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and failover >seems to work fine with SMB clients. We figured we could use the same >name and IPs and run Ganesha on the protocol servers, however we are >seeing issues with NFS clients when IP failover occurs. > >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it all >works fine. > >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by manually >suspending a CES node, but not always and some clients mounting from the >IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not >responding, timed out > >Which explains the client hang on certain mount points. 
> >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Greg.Lehmann at csiro.au Wed Apr 26 00:46:35 2017 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Tue, 25 Apr 2017 23:46:35 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: , Message-ID: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Are you using infiniband or Ethernet? I'm wondering if IBM have solved the gratuitous arp issue which we see with our non-protocols NFS implementation. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: Wednesday, 26 April 2017 3:31 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I did some digging in the mmcesfuncs to see what happens server side on fail over. Basically the server losing the IP is supposed to terminate all sessions and the receiver server sends ACK tickles. My current supposition is that for whatever reason, the losing server isn't releasing something and the client still has hold of a connection which is mostly dead. The tickle then fails to the client from the new server. This would explain why failing the IP back to the original server usually brings the client back to life. This is only my working theory at the moment as we can't reliably reproduce this. Next time it happens we plan to grab some netstat from each side. Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the server that received the IP and see if that fixes it (i.e. the receiver server didn't tickle properly). (Usage extracted from mmcesfuncs which is ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) for anyone interested. Then try and kill he sessions on the losing server to check if there is stuff still open and re-tickle the client. If we can get steps to workaround, I'll log a PMR. I suppose I could do that now, but given its non deterministic and we want to be 100% sure it's not us doing something wrong, I'm inclined to wait until we do some more testing. I agree with the suggestion that it's probably IO pending nodes that are affected, but don't have any data to back that up yet. We did try with a read workload on a client, but may we need either long IO blocked reads or writes (from the GPFS end). We also originally had soft as the default option, but saw issues then and the docs suggested hard, so we switched and also enabled sync (we figured maybe it was NFS client with uncommited writes), but neither have resolved the issues entirely. Difficult for me to say if they improved the issue though given its sporadic. Appreciate people's suggestions! 
Thanks Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode Myklebust [janfrode at tanso.net] Sent: 25 April 2017 18:04 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" on behalf of j.ouwehand at vumc.nl> wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at >a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >critical (office, research and clinical data) business process. We have >three large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We >opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >months later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: >gpfsug-discuss-bounces at spectrumscale.orgspectrumscale.org> >[mailto:gpfsug-discuss-bounces at spectrumscale.orgbounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: >gpfsug-discuss at spectrumscale.orgg> >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and >failover seems to work fine with SMB clients. We figured we could use >the same name and IPs and run Ganesha on the protocol servers, however >we are seeing issues with NFS clients when IP failover occurs. 
> >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it >all works fine. > >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by >manually suspending a CES node, but not always and some clients >mounting from the IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related >messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server >MYNFSSERVER.bham.ac.uk not responding, >timed out > >Which explains the client hang on certain mount points. > >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Mark.Bush at siriuscom.com Wed Apr 26 14:26:08 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Wed, 26 Apr 2017 13:26:08 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: My saga has come to an end. Turns out to get perf stats for NFS you need the gpfs.pm-ganesha package - duh. I typically do manual installs of scale so I just missed this one as it was buried in /usr/lpp/mmfs/4.2.3.0/zimon_rpms/rhel7. Anyway, package installed and now I get NFS stats in the gui and from cli. From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 9:31 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI No worries Mark. We don?t use NFS here (yet) so I can?t help there. Glad I could help. 
Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 15:29 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? 
Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Apr 26 15:20:30 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 26 Apr 2017 14:20:30 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: Nope, the clients are all L3 connected, so not an arp issue. Two things we have observed: 1. It triggers when one of the CES IPs moves and quickly moves back again. 
The move occurs because the NFS server goes into grace: 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 2 nodeid -1 ip 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 recovery release ip 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE 2017-04-25 20:37:42 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:37:44 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:37:44 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 4 nodeid 2 ip We can't see in any of the logs WHY ganesha is going into grace. Any suggestions on how to debug this further? (I.e. If we can stop the grace issues, we can solve the problem mostly). 2. Our clients are using LDAP which is bound to the CES IPs. If we shutdown nslcd on the client we can get the client to recover once all the TIME_WAIT connections have gone. Maybe this was a bad choice on our side to bind to the CES IPs - we figured it would handily move the IPs for us, but I guess the mmcesfuncs isn't aware of this and so doesn't kill the connections to the IP as it goes away. So two approaches we are going to try. Reconfigure the nslcd on a couple of clients and see if they still show up the issues when fail-over occurs. Second is to work out why the NFS servers are going into grace in the first place. Simon On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Greg.Lehmann at csiro.au" wrote: >Are you using infiniband or Ethernet? I'm wondering if IBM have solved >the gratuitous arp issue which we see with our non-protocols NFS >implementation. > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >Thompson (IT Research Support) >Sent: Wednesday, 26 April 2017 3:31 AM >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] NFS issues > >I did some digging in the mmcesfuncs to see what happens server side on >fail over. > >Basically the server losing the IP is supposed to terminate all sessions >and the receiver server sends ACK tickles. > >My current supposition is that for whatever reason, the losing server >isn't releasing something and the client still has hold of a connection >which is mostly dead. The tickle then fails to the client from the new >server. > >This would explain why failing the IP back to the original server usually >brings the client back to life. > >This is only my working theory at the moment as we can't reliably >reproduce this. Next time it happens we plan to grab some netstat from >each side. > >Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >server that received the IP and see if that fixes it (i.e. the receiver >server didn't tickle properly). (Usage extracted from mmcesfuncs which is >ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) >for anyone interested. > >Then try and kill he sessions on the losing server to check if there is >stuff still open and re-tickle the client. 
> >If we can get steps to workaround, I'll log a PMR. I suppose I could do >that now, but given its non deterministic and we want to be 100% sure >it's not us doing something wrong, I'm inclined to wait until we do some >more testing. > >I agree with the suggestion that it's probably IO pending nodes that are >affected, but don't have any data to back that up yet. We did try with a >read workload on a client, but may we need either long IO blocked reads >or writes (from the GPFS end). > >We also originally had soft as the default option, but saw issues then >and the docs suggested hard, so we switched and also enabled sync (we >figured maybe it was NFS client with uncommited writes), but neither have >resolved the issues entirely. Difficult for me to say if they improved >the issue though given its sporadic. > >Appreciate people's suggestions! > >Thanks > >Simon >________________________________________ >From: gpfsug-discuss-bounces at spectrumscale.org >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >Myklebust [janfrode at tanso.net] >Sent: 25 April 2017 18:04 >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] NFS issues > >I *think* I've seen this, and that we then had open TCP connection from >client to NFS server according to netstat, but these connections were not >visible from netstat on NFS-server side. > >Unfortunately I don't remember what the fix was.. > > > > -jf > >tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >>: >Hi, > >From what I can see, Ganesha uses the Export_Id option in the config file >(which is managed by CES) for this. I did find some reference in the >Ganesha devs list that if its not set, then it would read the FSID from >the GPFS file-system, either way they should surely be consistent across >all the nodes. The posts I found were from someone with an IBM email >address, so I guess someone in the IBM teams. > >I checked a couple of my protocol nodes and they use the same Export_Id >consistently, though I guess that might not be the same as the FSID value. > >Perhaps someone from IBM could comment on if FSID is likely to the cause >of my problems? > >Thanks > >Simon > >On 25/04/2017, 14:51, >"gpfsug-discuss-bounces at spectrumscale.orgectrumscale.org> on behalf of Ouwehand, JJ" >ectrumscale.org> on behalf of >j.ouwehand at vumc.nl> wrote: > >>Hello, >> >>At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>critical (office, research and clinical data) business process. We have >>three large GPFS filesystems for different purposes. >> >>We also had such a situation with cNFS. A failover (IPtakeover) was >>technically good, only clients experienced "stale filehandles". We >>opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>months later, the solution appeared to be in the fsid option. >> >>An NFS filehandle is built by a combination of fsid and a hash function >>on the inode. After a failover, the fsid value can be different and the >>client has a "stale filehandle". To avoid this, the fsid value can be >>statically specified. See: >> >>https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>. >>scale.v4r22.doc/bl1adm_nfslin.htm >> >>Maybe there is also a value in Ganesha that changes after a failover. >>Certainly since most sessions will be re-established after a failback. 
>>Maybe you see more debug information with tcpdump. >> >> >>Kind regards, >> >>Jaap Jan Ouwehand >>ICT Specialist (Storage & Linux) >>VUmc - ICT >>E: jj.ouwehand at vumc.nl >>W: www.vumc.com >> >> >> >>-----Oorspronkelijk bericht----- >>Van: >>gpfsug-discuss-bounces at spectrumscale.org>spectrumscale.org> >>[mailto:gpfsug-discuss-bounces at spectrumscale.org>bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>Verzonden: dinsdag 25 april 2017 13:21 >>Aan: >>gpfsug-discuss at spectrumscale.org>g> >>Onderwerp: [gpfsug-discuss] NFS issues >> >>Hi, >> >>We have recently started deploying NFS in addition our existing SMB >>exports on our protocol nodes. >> >>We use a RR DNS name that points to 4 VIPs for SMB services and >>failover seems to work fine with SMB clients. We figured we could use >>the same name and IPs and run Ganesha on the protocol servers, however >>we are seeing issues with NFS clients when IP failover occurs. >> >>In normal operation on a client, we might see several mounts from >>different IPs obviously due to the way the DNS RR is working, but it >>all works fine. >> >>In a failover situation, the IP will move to another node and some >>clients will carry on, others will hang IO to the mount points referred >>to by the IP which has moved. We can *sometimes* trigger this by >>manually suspending a CES node, but not always and some clients >>mounting from the IP moving will be fine, others won't. >> >>If we resume a node an it fails back, the clients that are hanging will >>usually recover fine. We can reboot a client prior to failback and it >>will be fine, stopping and starting the ganesha service on a protocol >>node will also sometimes resolve the issues. >> >>So, has anyone seen this sort of issue and any suggestions for how we >>could either debug more or workaround? >> >>We are currently running the packages >>nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >> >>At one point we were seeing it a lot, and could track it back to an >>underlying GPFS network issue that was causing protocol nodes to be >>expelled occasionally, we resolved that and the issues became less >>apparent, but maybe we just fixed one failure mode so see it less often. >> >>On the clients, we use -o sync,hard BTW as in the IBM docs. >> >>On a client showing the issues, we'll see in dmesg, NFS related >>messages >>like: >>[Wed Apr 12 16:59:53 2017] nfs: server >>MYNFSSERVER.bham.ac.uk not responding, >>timed out >> >>Which explains the client hang on certain mount points. 
>> >>The symptoms feel very much like those logged in this Gluster/ganesha >>bug: >>https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >> >> >>Thanks >> >>Simon >> >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Wed Apr 26 15:27:03 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 26 Apr 2017 14:27:03 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: Would it help to lower the grace time? mmnfs configuration change LEASE_LIFETIME=10 mmnfs configuration change GRACE_PERIOD=10 -jf ons. 26. apr. 2017 kl. 16.20 skrev Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk>: > Nope, the clients are all L3 connected, so not an arp issue. > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. > The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. 
> Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > > >Are you using infiniband or Ethernet? I'm wondering if IBM have solved > >the gratuitous arp issue which we see with our non-protocols NFS > >implementation. > > > >-----Original Message----- > >From: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon > >Thompson (IT Research Support) > >Sent: Wednesday, 26 April 2017 3:31 AM > >To: gpfsug main discussion list > >Subject: Re: [gpfsug-discuss] NFS issues > > > >I did some digging in the mmcesfuncs to see what happens server side on > >fail over. > > > >Basically the server losing the IP is supposed to terminate all sessions > >and the receiver server sends ACK tickles. > > > >My current supposition is that for whatever reason, the losing server > >isn't releasing something and the client still has hold of a connection > >which is mostly dead. The tickle then fails to the client from the new > >server. > > > >This would explain why failing the IP back to the original server usually > >brings the client back to life. > > > >This is only my working theory at the moment as we can't reliably > >reproduce this. Next time it happens we plan to grab some netstat from > >each side. > > > >Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the > >server that received the IP and see if that fixes it (i.e. the receiver > >server didn't tickle properly). (Usage extracted from mmcesfuncs which is > >ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) > >for anyone interested. > > > >Then try and kill he sessions on the losing server to check if there is > >stuff still open and re-tickle the client. > > > >If we can get steps to workaround, I'll log a PMR. I suppose I could do > >that now, but given its non deterministic and we want to be 100% sure > >it's not us doing something wrong, I'm inclined to wait until we do some > >more testing. > > > >I agree with the suggestion that it's probably IO pending nodes that are > >affected, but don't have any data to back that up yet. We did try with a > >read workload on a client, but may we need either long IO blocked reads > >or writes (from the GPFS end). > > > >We also originally had soft as the default option, but saw issues then > >and the docs suggested hard, so we switched and also enabled sync (we > >figured maybe it was NFS client with uncommited writes), but neither have > >resolved the issues entirely. Difficult for me to say if they improved > >the issue though given its sporadic. > > > >Appreciate people's suggestions! > > > >Thanks > > > >Simon > >________________________________________ > >From: gpfsug-discuss-bounces at spectrumscale.org > >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode > >Myklebust [janfrode at tanso.net] > >Sent: 25 April 2017 18:04 > >To: gpfsug main discussion list > >Subject: Re: [gpfsug-discuss] NFS issues > > > >I *think* I've seen this, and that we then had open TCP connection from > >client to NFS server according to netstat, but these connections were not > >visible from netstat on NFS-server side. > > > >Unfortunately I don't remember what the fix was.. > > > > > > > > -jf > > > >tir. 25. apr. 2017 kl. 
16.06 skrev Simon Thompson (IT Research Support) > >>: > >Hi, > > > >From what I can see, Ganesha uses the Export_Id option in the config file > >(which is managed by CES) for this. I did find some reference in the > >Ganesha devs list that if its not set, then it would read the FSID from > >the GPFS file-system, either way they should surely be consistent across > >all the nodes. The posts I found were from someone with an IBM email > >address, so I guess someone in the IBM teams. > > > >I checked a couple of my protocol nodes and they use the same Export_Id > >consistently, though I guess that might not be the same as the FSID value. > > > >Perhaps someone from IBM could comment on if FSID is likely to the cause > >of my problems? > > > >Thanks > > > >Simon > > > >On 25/04/2017, 14:51, > >"gpfsug-discuss-bounces at spectrumscale.org gpfsug-discuss-bounces at sp > >ectrumscale.org> on behalf of Ouwehand, JJ" > > gpfsug-discuss-bounces at sp > >ectrumscale.org> on behalf of > >j.ouwehand at vumc.nl> wrote: > > > >>Hello, > >> > >>At first a short introduction. My name is Jaap Jan Ouwehand, I work at > >>a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of > >>IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our > >>critical (office, research and clinical data) business process. We have > >>three large GPFS filesystems for different purposes. > >> > >>We also had such a situation with cNFS. A failover (IPtakeover) was > >>technically good, only clients experienced "stale filehandles". We > >>opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few > >>months later, the solution appeared to be in the fsid option. > >> > >>An NFS filehandle is built by a combination of fsid and a hash function > >>on the inode. After a failover, the fsid value can be different and the > >>client has a "stale filehandle". To avoid this, the fsid value can be > >>statically specified. See: > >> > >> > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum > >>. > >>scale.v4r22.doc/bl1adm_nfslin.htm > >> > >>Maybe there is also a value in Ganesha that changes after a failover. > >>Certainly since most sessions will be re-established after a failback. > >>Maybe you see more debug information with tcpdump. > >> > >> > >>Kind regards, > >> > >>Jaap Jan Ouwehand > >>ICT Specialist (Storage & Linux) > >>VUmc - ICT > >>E: jj.ouwehand at vumc.nl > >>W: www.vumc.com > >> > >> > >> > >>-----Oorspronkelijk bericht----- > >>Van: > >>gpfsug-discuss-bounces at spectrumscale.org >>spectrumscale.org> > >>[mailto:gpfsug-discuss-bounces at spectrumscale.org >>bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) > >>Verzonden: dinsdag 25 april 2017 13:21 > >>Aan: > >>gpfsug-discuss at spectrumscale.org >>g> > >>Onderwerp: [gpfsug-discuss] NFS issues > >> > >>Hi, > >> > >>We have recently started deploying NFS in addition our existing SMB > >>exports on our protocol nodes. > >> > >>We use a RR DNS name that points to 4 VIPs for SMB services and > >>failover seems to work fine with SMB clients. We figured we could use > >>the same name and IPs and run Ganesha on the protocol servers, however > >>we are seeing issues with NFS clients when IP failover occurs. > >> > >>In normal operation on a client, we might see several mounts from > >>different IPs obviously due to the way the DNS RR is working, but it > >>all works fine. 
> >> > >>In a failover situation, the IP will move to another node and some > >>clients will carry on, others will hang IO to the mount points referred > >>to by the IP which has moved. We can *sometimes* trigger this by > >>manually suspending a CES node, but not always and some clients > >>mounting from the IP moving will be fine, others won't. > >> > >>If we resume a node an it fails back, the clients that are hanging will > >>usually recover fine. We can reboot a client prior to failback and it > >>will be fine, stopping and starting the ganesha service on a protocol > >>node will also sometimes resolve the issues. > >> > >>So, has anyone seen this sort of issue and any suggestions for how we > >>could either debug more or workaround? > >> > >>We are currently running the packages > >>nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >> > >>At one point we were seeing it a lot, and could track it back to an > >>underlying GPFS network issue that was causing protocol nodes to be > >>expelled occasionally, we resolved that and the issues became less > >>apparent, but maybe we just fixed one failure mode so see it less often. > >> > >>On the clients, we use -o sync,hard BTW as in the IBM docs. > >> > >>On a client showing the issues, we'll see in dmesg, NFS related > >>messages > >>like: > >>[Wed Apr 12 16:59:53 2017] nfs: server > >>MYNFSSERVER.bham.ac.uk not responding, > >>timed out > >> > >>Which explains the client hang on certain mount points. > >> > >>The symptoms feel very much like those logged in this Gluster/ganesha > >>bug: > >>https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > >> > >> > >>Thanks > >> > >>Simon > >> > >>_______________________________________________ > >>gpfsug-discuss mailing list > >>gpfsug-discuss at spectrumscale.org > >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>_______________________________________________ > >>gpfsug-discuss mailing list > >>gpfsug-discuss at spectrumscale.org > >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peserocka at gmail.com Wed Apr 26 18:53:51 2017 From: peserocka at gmail.com (Peter Serocka) Date: Wed, 26 Apr 2017 19:53:51 +0200 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: > On 2017 Apr 26 Wed, at 16:20, Simon Thompson (IT Research Support) wrote: > > Nope, the clients are all L3 connected, so not an arp issue. ...not on the client, but the server-facing L3 switch still need to manage its ARP table, and might miss the IP moving to a new MAC. Cisco switches have a default ARP cache timeout of 4 hours, fwiw. Can your network team provide you the ARP status from the switch when you see a fail-over being stuck? ? 
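For example (interface name and address below are just placeholders for one of your CES IPs), something like this run from another host on the same subnet as the CES IPs would show whether that IP is still resolving to the old node's MAC while the hang is happening:

arping -c 3 -I eth0 10.0.0.50     # which MAC currently answers ARP for the CES IP
ip neigh | grep 10.0.0.50         # what is cached locally for that IP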
Peter > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. > The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. > Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > >> Are you using infiniband or Ethernet? I'm wondering if IBM have solved >> the gratuitous arp issue which we see with our non-protocols NFS >> implementation. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >> Thompson (IT Research Support) >> Sent: Wednesday, 26 April 2017 3:31 AM >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I did some digging in the mmcesfuncs to see what happens server side on >> fail over. >> >> Basically the server losing the IP is supposed to terminate all sessions >> and the receiver server sends ACK tickles. >> >> My current supposition is that for whatever reason, the losing server >> isn't releasing something and the client still has hold of a connection >> which is mostly dead. The tickle then fails to the client from the new >> server. >> >> This would explain why failing the IP back to the original server usually >> brings the client back to life. >> >> This is only my working theory at the moment as we can't reliably >> reproduce this. Next time it happens we plan to grab some netstat from >> each side. >> >> Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >> server that received the IP and see if that fixes it (i.e. 
the receiver >> server didn't tickle properly). (Usage extracted from mmcesfuncs which is >> ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) >> for anyone interested. >> >> Then try and kill he sessions on the losing server to check if there is >> stuff still open and re-tickle the client. >> >> If we can get steps to workaround, I'll log a PMR. I suppose I could do >> that now, but given its non deterministic and we want to be 100% sure >> it's not us doing something wrong, I'm inclined to wait until we do some >> more testing. >> >> I agree with the suggestion that it's probably IO pending nodes that are >> affected, but don't have any data to back that up yet. We did try with a >> read workload on a client, but may we need either long IO blocked reads >> or writes (from the GPFS end). >> >> We also originally had soft as the default option, but saw issues then >> and the docs suggested hard, so we switched and also enabled sync (we >> figured maybe it was NFS client with uncommited writes), but neither have >> resolved the issues entirely. Difficult for me to say if they improved >> the issue though given its sporadic. >> >> Appreciate people's suggestions! >> >> Thanks >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >> Myklebust [janfrode at tanso.net] >> Sent: 25 April 2017 18:04 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I *think* I've seen this, and that we then had open TCP connection from >> client to NFS server according to netstat, but these connections were not >> visible from netstat on NFS-server side. >> >> Unfortunately I don't remember what the fix was.. >> >> >> >> -jf >> >> tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >> >: >> Hi, >> >> From what I can see, Ganesha uses the Export_Id option in the config file >> (which is managed by CES) for this. I did find some reference in the >> Ganesha devs list that if its not set, then it would read the FSID from >> the GPFS file-system, either way they should surely be consistent across >> all the nodes. The posts I found were from someone with an IBM email >> address, so I guess someone in the IBM teams. >> >> I checked a couple of my protocol nodes and they use the same Export_Id >> consistently, though I guess that might not be the same as the FSID value. >> >> Perhaps someone from IBM could comment on if FSID is likely to the cause >> of my problems? >> >> Thanks >> >> Simon >> >> On 25/04/2017, 14:51, >> "gpfsug-discuss-bounces at spectrumscale.org> ectrumscale.org> on behalf of Ouwehand, JJ" >> > ectrumscale.org> on behalf of >> j.ouwehand at vumc.nl> wrote: >> >>> Hello, >>> >>> At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>> a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>> IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>> critical (office, research and clinical data) business process. We have >>> three large GPFS filesystems for different purposes. >>> >>> We also had such a situation with cNFS. A failover (IPtakeover) was >>> technically good, only clients experienced "stale filehandles". We >>> opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>> months later, the solution appeared to be in the fsid option. >>> >>> An NFS filehandle is built by a combination of fsid and a hash function >>> on the inode. 
After a failover, the fsid value can be different and the >>> client has a "stale filehandle". To avoid this, the fsid value can be >>> statically specified. See: >>> >>> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>> . >>> scale.v4r22.doc/bl1adm_nfslin.htm >>> >>> Maybe there is also a value in Ganesha that changes after a failover. >>> Certainly since most sessions will be re-established after a failback. >>> Maybe you see more debug information with tcpdump. >>> >>> >>> Kind regards, >>> >>> Jaap Jan Ouwehand >>> ICT Specialist (Storage & Linux) >>> VUmc - ICT >>> E: jj.ouwehand at vumc.nl >>> W: www.vumc.com >>> >>> >>> >>> -----Oorspronkelijk bericht----- >>> Van: >>> gpfsug-discuss-bounces at spectrumscale.org>> spectrumscale.org> >>> [mailto:gpfsug-discuss-bounces at spectrumscale.org>> bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>> Verzonden: dinsdag 25 april 2017 13:21 >>> Aan: >>> gpfsug-discuss at spectrumscale.org>> g> >>> Onderwerp: [gpfsug-discuss] NFS issues >>> >>> Hi, >>> >>> We have recently started deploying NFS in addition our existing SMB >>> exports on our protocol nodes. >>> >>> We use a RR DNS name that points to 4 VIPs for SMB services and >>> failover seems to work fine with SMB clients. We figured we could use >>> the same name and IPs and run Ganesha on the protocol servers, however >>> we are seeing issues with NFS clients when IP failover occurs. >>> >>> In normal operation on a client, we might see several mounts from >>> different IPs obviously due to the way the DNS RR is working, but it >>> all works fine. >>> >>> In a failover situation, the IP will move to another node and some >>> clients will carry on, others will hang IO to the mount points referred >>> to by the IP which has moved. We can *sometimes* trigger this by >>> manually suspending a CES node, but not always and some clients >>> mounting from the IP moving will be fine, others won't. >>> >>> If we resume a node an it fails back, the clients that are hanging will >>> usually recover fine. We can reboot a client prior to failback and it >>> will be fine, stopping and starting the ganesha service on a protocol >>> node will also sometimes resolve the issues. >>> >>> So, has anyone seen this sort of issue and any suggestions for how we >>> could either debug more or workaround? >>> >>> We are currently running the packages >>> nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >>> >>> At one point we were seeing it a lot, and could track it back to an >>> underlying GPFS network issue that was causing protocol nodes to be >>> expelled occasionally, we resolved that and the issues became less >>> apparent, but maybe we just fixed one failure mode so see it less often. >>> >>> On the clients, we use -o sync,hard BTW as in the IBM docs. >>> >>> On a client showing the issues, we'll see in dmesg, NFS related >>> messages >>> like: >>> [Wed Apr 12 16:59:53 2017] nfs: server >>> MYNFSSERVER.bham.ac.uk not responding, >>> timed out >>> >>> Which explains the client hang on certain mount points. 
>>> >>> The symptoms feel very much like those logged in this Gluster/ganesha >>> bug: >>> https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >>> >>> >>> Thanks >>> >>> Simon >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Wed Apr 26 19:00:06 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 26 Apr 2017 18:00:06 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> , Message-ID: We have no issues with L3 SMB accessing clients, so I'm pretty sure it's not arp. And some of the boxes on the other side of the L3 gateway don't see the issues. We don't use Cisco kit. I posted in a different update that we think it's related to connections to other ports on the same IP which get left open when the IP quickly gets moved away and back again. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Peter Serocka [peserocka at gmail.com] Sent: 26 April 2017 18:53 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues > On 2017 Apr 26 Wed, at 16:20, Simon Thompson (IT Research Support) wrote: > > Nope, the clients are all L3 connected, so not an arp issue. ...not on the client, but the server-facing L3 switch still need to manage its ARP table, and might miss the IP moving to a new MAC. Cisco switches have a default ARP cache timeout of 4 hours, fwiw. Can your network team provide you the ARP status from the switch when you see a fail-over being stuck? ? Peter > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. 
> The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. > Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > >> Are you using infiniband or Ethernet? I'm wondering if IBM have solved >> the gratuitous arp issue which we see with our non-protocols NFS >> implementation. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >> Thompson (IT Research Support) >> Sent: Wednesday, 26 April 2017 3:31 AM >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I did some digging in the mmcesfuncs to see what happens server side on >> fail over. >> >> Basically the server losing the IP is supposed to terminate all sessions >> and the receiver server sends ACK tickles. >> >> My current supposition is that for whatever reason, the losing server >> isn't releasing something and the client still has hold of a connection >> which is mostly dead. The tickle then fails to the client from the new >> server. >> >> This would explain why failing the IP back to the original server usually >> brings the client back to life. >> >> This is only my working theory at the moment as we can't reliably >> reproduce this. Next time it happens we plan to grab some netstat from >> each side. >> >> Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >> server that received the IP and see if that fixes it (i.e. the receiver >> server didn't tickle properly). (Usage extracted from mmcesfuncs which is >> ksh of course). ... 
CesIPPort is colon separated IP:portnumber (of NFSd) >> for anyone interested. >> >> Then try and kill he sessions on the losing server to check if there is >> stuff still open and re-tickle the client. >> >> If we can get steps to workaround, I'll log a PMR. I suppose I could do >> that now, but given its non deterministic and we want to be 100% sure >> it's not us doing something wrong, I'm inclined to wait until we do some >> more testing. >> >> I agree with the suggestion that it's probably IO pending nodes that are >> affected, but don't have any data to back that up yet. We did try with a >> read workload on a client, but may we need either long IO blocked reads >> or writes (from the GPFS end). >> >> We also originally had soft as the default option, but saw issues then >> and the docs suggested hard, so we switched and also enabled sync (we >> figured maybe it was NFS client with uncommited writes), but neither have >> resolved the issues entirely. Difficult for me to say if they improved >> the issue though given its sporadic. >> >> Appreciate people's suggestions! >> >> Thanks >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >> Myklebust [janfrode at tanso.net] >> Sent: 25 April 2017 18:04 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I *think* I've seen this, and that we then had open TCP connection from >> client to NFS server according to netstat, but these connections were not >> visible from netstat on NFS-server side. >> >> Unfortunately I don't remember what the fix was.. >> >> >> >> -jf >> >> tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >> >: >> Hi, >> >> From what I can see, Ganesha uses the Export_Id option in the config file >> (which is managed by CES) for this. I did find some reference in the >> Ganesha devs list that if its not set, then it would read the FSID from >> the GPFS file-system, either way they should surely be consistent across >> all the nodes. The posts I found were from someone with an IBM email >> address, so I guess someone in the IBM teams. >> >> I checked a couple of my protocol nodes and they use the same Export_Id >> consistently, though I guess that might not be the same as the FSID value. >> >> Perhaps someone from IBM could comment on if FSID is likely to the cause >> of my problems? >> >> Thanks >> >> Simon >> >> On 25/04/2017, 14:51, >> "gpfsug-discuss-bounces at spectrumscale.org> ectrumscale.org> on behalf of Ouwehand, JJ" >> > ectrumscale.org> on behalf of >> j.ouwehand at vumc.nl> wrote: >> >>> Hello, >>> >>> At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>> a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>> IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>> critical (office, research and clinical data) business process. We have >>> three large GPFS filesystems for different purposes. >>> >>> We also had such a situation with cNFS. A failover (IPtakeover) was >>> technically good, only clients experienced "stale filehandles". We >>> opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>> months later, the solution appeared to be in the fsid option. >>> >>> An NFS filehandle is built by a combination of fsid and a hash function >>> on the inode. After a failover, the fsid value can be different and the >>> client has a "stale filehandle". 
To avoid this, the fsid value can be >>> statically specified. See: >>> >>> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>> . >>> scale.v4r22.doc/bl1adm_nfslin.htm >>> >>> Maybe there is also a value in Ganesha that changes after a failover. >>> Certainly since most sessions will be re-established after a failback. >>> Maybe you see more debug information with tcpdump. >>> >>> >>> Kind regards, >>> >>> Jaap Jan Ouwehand >>> ICT Specialist (Storage & Linux) >>> VUmc - ICT >>> E: jj.ouwehand at vumc.nl >>> W: www.vumc.com >>> >>> >>> >>> -----Oorspronkelijk bericht----- >>> Van: >>> gpfsug-discuss-bounces at spectrumscale.org>> spectrumscale.org> >>> [mailto:gpfsug-discuss-bounces at spectrumscale.org>> bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>> Verzonden: dinsdag 25 april 2017 13:21 >>> Aan: >>> gpfsug-discuss at spectrumscale.org>> g> >>> Onderwerp: [gpfsug-discuss] NFS issues >>> >>> Hi, >>> >>> We have recently started deploying NFS in addition our existing SMB >>> exports on our protocol nodes. >>> >>> We use a RR DNS name that points to 4 VIPs for SMB services and >>> failover seems to work fine with SMB clients. We figured we could use >>> the same name and IPs and run Ganesha on the protocol servers, however >>> we are seeing issues with NFS clients when IP failover occurs. >>> >>> In normal operation on a client, we might see several mounts from >>> different IPs obviously due to the way the DNS RR is working, but it >>> all works fine. >>> >>> In a failover situation, the IP will move to another node and some >>> clients will carry on, others will hang IO to the mount points referred >>> to by the IP which has moved. We can *sometimes* trigger this by >>> manually suspending a CES node, but not always and some clients >>> mounting from the IP moving will be fine, others won't. >>> >>> If we resume a node an it fails back, the clients that are hanging will >>> usually recover fine. We can reboot a client prior to failback and it >>> will be fine, stopping and starting the ganesha service on a protocol >>> node will also sometimes resolve the issues. >>> >>> So, has anyone seen this sort of issue and any suggestions for how we >>> could either debug more or workaround? >>> >>> We are currently running the packages >>> nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >>> >>> At one point we were seeing it a lot, and could track it back to an >>> underlying GPFS network issue that was causing protocol nodes to be >>> expelled occasionally, we resolved that and the issues became less >>> apparent, but maybe we just fixed one failure mode so see it less often. >>> >>> On the clients, we use -o sync,hard BTW as in the IBM docs. >>> >>> On a client showing the issues, we'll see in dmesg, NFS related >>> messages >>> like: >>> [Wed Apr 12 16:59:53 2017] nfs: server >>> MYNFSSERVER.bham.ac.uk not responding, >>> timed out >>> >>> Which explains the client hang on certain mount points. 
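For anyone wanting to capture the state described above the next time a client hangs, a minimal sketch of the "netstat from each side" comparison mentioned earlier in this thread (the CES address below is a placeholder, and 2049 assumes NFS over TCP on the standard port):

  CESIP=10.10.10.21                      # example CES address that just moved
  # On the hanging client: sessions to the NFS port on that address.
  netstat -tn | grep "$CESIP:2049"
  # On the protocol node that now owns the address: the matching sessions.
  netstat -tn | grep "$CESIP:2049"
  # A connection that shows ESTABLISHED on the client but has no counterpart
  # on the server is the half-dead session suspected above, and is the case
  # where the "mmcmi tcpack" re-tickle quoted above is meant to help.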
>>> >>> The symptoms feel very much like those logged in this Gluster/ganesha >>> bug: >>> https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >>> >>> >>> Thanks >>> >>> Simon >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From valdis.kletnieks at vt.edu Thu Apr 27 00:44:44 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Wed, 26 Apr 2017 19:44:44 -0400 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: <52226.1493250284@turing-police.cc.vt.edu> On Wed, 26 Apr 2017 14:20:30 -0000, "Simon Thompson (IT Research Support)" said: > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). After over 3 decades of experience with 'exportfs' being totally safe to run in real time with both userspace and kernel NFSD implementations, it came as quite a surprise when we did 'mmnfs eport change --nfsadd='... and it bounced the NFS server on all 4 protocol nodes. At the same time. Fortunately for us, the set of client nodes only changes once every 2-3 months. -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From secretary at gpfsug.org Thu Apr 27 09:29:41 2017 From: secretary at gpfsug.org (Secretary GPFS UG) Date: Thu, 27 Apr 2017 09:29:41 +0100 Subject: [gpfsug-discuss] Meet other spectrum scale users in May Message-ID: <1f483faa9cb61dcdc80afb187e908745@webmail.gpfsug.org> Dear Members, Please join us and other spectrum scale users for 2 days of great talks and networking! WHEN: 9-10th May 2017 WHERE: Macdonald Manchester Hotel & Spa, Manchester, UK (right by Manchester Piccadilly train station) WHO? The event is free to attend, is open to members from all industries and welcomes users with a little and a lot of experience using Spectrum Scale. The SSUG brings together the Spectrum Scale User Community including Spectrum Scale developers and architects to share knowledge, experiences and future plans. Topics include transparent cloud tiering, AFM, automation and security best practices, Docker and HDFS support, problem determination, and an update on Elastic Storage Server (ESS). 
Our popular forum includes interactive problem solving, a best practices discussion and networking. We're very excited to welcome back Doris Conti the Director for Spectrum Scale (GPFS) and HPC SW Product Development at IBM. The May meeting is sponsored by IBM, DDN, Lenovo, Mellanox, Seagate, Arcastream, Ellexus, and OCF. It is an excellent opportunity to learn more and get your questions answered. Register your place today at the Eventbrite page https://goo.gl/tRptru [1] We hope to see you there! -- Claire O'Toole Spectrum Scale/GPFS User Group Secretary +44 (0)7508 033896 www.spectrumscaleug.org Links: ------ [1] https://goo.gl/tRptru -------------- next part -------------- An HTML attachment was scrubbed... URL: From robert at strubi.ox.ac.uk Thu Apr 27 12:46:09 2017 From: robert at strubi.ox.ac.uk (Robert Esnouf) Date: Thu, 27 Apr 2017 12:46:09 +0100 (BST) Subject: [gpfsug-discuss] Two high-performance research computing posts in Oxford University Medical Sciences Message-ID: <201704271146.061978@mail.strubi.ox.ac.uk> Dear All, I hope that it is allowed to put job postings on this discussion list... sorry if I've broken a rule but it does mention SpectrumScale! I'd like to advertise the availability two exciting and challenging new opportunities to work in research computing/high-performance computing at Oxford University within the Nuffield Department of Medicine. The first is a Grade 8 position to expand the current Research Computing Core team at the Wellcome Trust Centre for Human Genetics. The Core now runs a cluster of about ~3800 high-memory compute cores, a further ~700 cores outside the cluster, a (growing) smattering of GPU-enabled and KNL nodes, 4PB high-performance SpectrumScale (GPFS) storage and about 4PB of lower grade (mostly XFS) storage. The facility has an FDR InfiniBand fabric providing for access to storage at up to 20GB/s and supporting MPI workloads. We mainly support the statistical genetics work of the Centre and other departments around Oxford, the work of the sequencing and bioinformatics cores and electron microscopy, but the workload is varied and interesting! Further significant update and expansion of this facility will occur during 2017 and beyond and means that we are expanding the team. http://www.well.ox.ac.uk/home http://www.well.ox.ac.uk/research-8 https://www.recruit.ox.ac.uk/pls/hrisliverecruit/erq_jobspec_version_4.display_form?p_company=10&p_internal_external=E&p_display_in_irish=N&p_process_type=&p_applicant_no=&p_form_profile_detail=&p_display_apply_ind=Y&p_refresh_search=Y&p_recruitment_id=126748 The second is a Grade 9 post at the newly opened Big Data Institute next door to the WTCHG - to work with me to establish a brand new Research Computing facility. The Big Data Institute Building has 32 shiny new racks ready to be filled with up to 320kW of IT load - and we won't stop there! The current plans envisage a virtualized infrastructure for secure access, a high-performance cluster supporting traditional workloads and containers, high-performance filesystem storage, a hyperconverged infrastructure supporting (OpenStack, project VMs, containers and distributed computing plaforms such as Apache Spark), a significant GPU-based artificial intelligence/deep learning platform and a large, multisite object store for managing research data in the long term. 
https://www.bdi.ox.ac.uk/ https://www.ndm.ox.ac.uk/current-job-vacancies/vacancy/128486-BDI-Research-Computing-Manager https://www.recruit.ox.ac.uk/pls/hrisliverecruit/erq_jobspec_version_4.display_form?p_company=10&p_internal_external=E&p_display_in_irish=N&p_process_type=&p_applicant_no=&p_form_profile_detail=&p_display_apply_ind=Y&p_refresh_search=Y&p_recruitment_id=128486 It is expected that the Wellcome Trust Centre and Big Data Institute facilities will develop independently for now, but in a complementary and supportive fashion given the overlap in science and technology that is likely to exist. The Research Computing support teams will therefore work extremely closely together to address the challenges facing computing in the medical sciences. If either (or both) of these vacancies seem interesting then please feel free to contact the Head of the Research Computing Core at the WTCHG (me) or the Director of Research Computing at the BDI (me). Deadline for the WTCHG post is 31st May and for the BDI post is 24th May. Please feel free to circulate this email to anyone who might be interested and apologies for any cross postings! Regards, Robert -- Dr Robert Esnouf University Research Lecturer, Director of Research Computing BDI, Head of Research Computing Core WTCHG, NDM Research Computing Strategy Officer Main office: Room 10/028, Wellcome Trust Centre for Human Genetics, Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK Emails: robert at strubi.ox.ac.uk / robert at well.ox.ac.uk / robert.esnouf at bdi.ox.ac.uk Tel: (+44) - 1865 - 287783
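On the 'Can't delete filesystem' question that the next reply returns to: a cross-check that does not depend on df or mount output on the clients is to ask GPFS itself which nodes it still counts as having the filesystem mounted; a short sketch, with a made-up filesystem name:

  # List, per node, where GPFS still records the filesystem as mounted
  # (closer to what mmdelfs is checking than the clients' df output).
  mmlsmount gpfs23 -L
  # If HSM is involved, dsmrecalld can also keep the filesystem busy:
  mmdsh -N all "pidof dsmrecalld"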
From janfrode at tanso.net Wed Apr 5 22:51:15 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 05 Apr 2017 21:51:15 +0000 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> Message-ID: Maybe try mmumount -f on the remaining 4 nodes? -jf ons. 5. apr. 2017 kl. 18.54 skrev Buterbaugh, Kevin L < Kevin.Buterbaugh at vanderbilt.edu>: > Hi Simon, > > No, I do not. > > Let me also add that this is a filesystem that I migrated users off of and > to another GPFS filesystem. I moved the last users this morning and then > ran an ?mmunmount? across the whole cluster via mmdsh. Therefore, if the > simple solution is to use the ?-p? option to mmdelfs I?m fine with that. > I?m just not sure what the right course of action is at this point. > > Thanks again? > > Kevin > > > On Apr 5, 2017, at 11:47 AM, Simon Thompson (Research Computing - IT > Services) wrote: > > > > Do you have ILM (dsmrecalld and friends) running? > > > > They can also stop the filesystem being released (e.g. mmshutdown fails > if they are up). > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin > L [Kevin.Buterbaugh at Vanderbilt.Edu] > > Sent: 05 April 2017 17:40 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] Can't delete filesystem > > > > Hi All, > > > > First off, I can open a PMR on this if I need to? > > > > I am trying to delete a GPFS filesystem but mmdelfs is telling me that > the filesystem is still mounted on 14 nodes and therefore can?t be > deleted. 10 of those nodes are my 10 GPFS servers and they have an > ?internal mount? still mounted.
IIRC, it?s the other 4 (client) nodes I > need to concentrate on ? i.e. once those other 4 clients no longer have it > mounted the internal mounts will resolve themselves. Correct me if I?m > wrong on that, please. > > > > So, I have gone to all of the 4 clients and none of them say they have > it mounted according to either ?df? or ?mount?. I?ve gone ahead and run > both ?mmunmount? and ?umount -l? on the filesystem anyway, but the mmdelfs > still fails saying that they have it mounted. > > > > What do I need to do to resolve this issue on those 4 clients? Thanks? > > > > Kevin > > > > ? > > Kevin Buterbaugh - Senior System Administrator > > Vanderbilt University - Advanced Computing Center for Research and > Education > > Kevin.Buterbaugh at vanderbilt.edu > - (615)875-9633 > > > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Thu Apr 6 02:54:07 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Thu, 6 Apr 2017 01:54:07 +0000 Subject: [gpfsug-discuss] AFM misunderstanding Message-ID: When I setup a AFM relationship (let?s just say I?m doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a ?metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls ?ltrs on the cache it?s still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How to I get the bits to flow before I request them assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I?m more used to mirroring so maybe that?s my frame of reference and it?s not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Thu Apr 6 09:20:31 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 6 Apr 2017 08:20:31 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption Message-ID: We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. 
Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.s cale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon From vpuvvada at in.ibm.com Thu Apr 6 11:45:37 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Thu, 6 Apr 2017 16:15:37 +0530 Subject: [gpfsug-discuss] AFM misunderstanding In-Reply-To: References: Message-ID: Could you explain "bits of actual file" mentioned below ? Prefetch with ?metadata-only pulls everything (xattrs, ACLs etc..) except data. Doing " ls ?ltrs" shows file allocation size as zero if data prefetch not yet completed on them. ~Venkat (vpuvvada at in.ibm.com) From: Mark Bush To: gpfsug main discussion list Date: 04/06/2017 07:24 AM Subject: [gpfsug-discuss] AFM misunderstanding Sent by: gpfsug-discuss-bounces at spectrumscale.org When I setup a AFM relationship (let?s just say I?m doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a ?metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls ?ltrs on the cache it?s still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How to I get the bits to flow before I request them assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I?m more used to mirroring so maybe that?s my frame of reference and it?s not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. 
If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Thu Apr 6 13:28:40 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Thu, 6 Apr 2017 12:28:40 +0000 Subject: [gpfsug-discuss] AFM misunderstanding In-Reply-To: References: Message-ID: <425C32E7-B752-4B61-BDF5-83C219D89ADB@siriuscom.com> I think I was missing a key piece in that I thought that just doing a mmafmctl fs1 prefetch ?j cache would start grabbing everything (data and metadata) but it appears that the ?list-file myfiles.txt is the trigger for the prefetch to work properly. I mistakenly assumed that omitting the ?list-file switch would prefetch all the data in the fileset. From: on behalf of Venkateswara R Puvvada Reply-To: gpfsug main discussion list Date: Thursday, April 6, 2017 at 5:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM misunderstanding Could you explain "bits of actual file" mentioned below ? Prefetch with ?metadata-onlypulls everything (xattrs, ACLs etc..) except data. Doing "ls ?ltrs" shows file allocation size as zero if data prefetch not yet completed on them. ~Venkat (vpuvvada at in.ibm.com) From: Mark Bush To: gpfsug main discussion list Date: 04/06/2017 07:24 AM Subject: [gpfsug-discuss] AFM misunderstanding Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ When I setup a AFM relationship (let?s just say I?m doing RO), does prefetch bring bits of the actual file over to the cache or is it only ever metadata? I know there is a ?metadata-only switch but it appears that if I try a mmafmctl prefetch operation and then I do a ls ?ltrs on the cache it?s still 0 bytes. I do see the queue increasing when I do a mmafmctl getstate. I realize that the data truly only flows once the file is requested (I just do a dd if=mycachedfile of=/dev/null). But this is just my test env. How to I get the bits to flow before I request them assuming that I will at some point need them? Or do I just misunderstand AFM altogether? I?m more used to mirroring so maybe that?s my frame of reference and it?s not the AFM architecture. Mark This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. 
If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Thu Apr 6 15:33:18 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Thu, 6 Apr 2017 14:33:18 +0000 Subject: [gpfsug-discuss] Can't delete filesystem In-Reply-To: References: <20E4B082-2BBB-478B-B1E1-2BC8125FE50F@vanderbilt.edu> <0F877E25-6C58-4790-86CD-7E2108EC8EB5@vanderbilt.edu> Message-ID: Hi JF, I actually tried that - to no effect. Yesterday evening I rebooted the 4 clients and, as expected, the 10 servers released their internal mounts as well ? and then I was able to delete the filesystem successfully. Thanks for the suggestions, all? Kevin On Apr 5, 2017, at 4:51 PM, Jan-Frode Myklebust > wrote: Maybe try mmumount -f on the remaining 4 nodes? -jf ons. 5. apr. 2017 kl. 18.54 skrev Buterbaugh, Kevin L >: Hi Simon, No, I do not. Let me also add that this is a filesystem that I migrated users off of and to another GPFS filesystem. I moved the last users this morning and then ran an ?mmunmount? across the whole cluster via mmdsh. Therefore, if the simple solution is to use the ?-p? option to mmdelfs I?m fine with that. I?m just not sure what the right course of action is at this point. Thanks again? Kevin > On Apr 5, 2017, at 11:47 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Do you have ILM (dsmrecalld and friends) running? > > They can also stop the filesystem being released (e.g. mmshutdown fails if they are up). > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] > Sent: 05 April 2017 17:40 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] Can't delete filesystem > > Hi All, > > First off, I can open a PMR on this if I need to? > > I am trying to delete a GPFS filesystem but mmdelfs is telling me that the filesystem is still mounted on 14 nodes and therefore can?t be deleted. 10 of those nodes are my 10 GPFS servers and they have an ?internal mount? still mounted. IIRC, it?s the other 4 (client) nodes I need to concentrate on ? i.e. once those other 4 clients no longer have it mounted the internal mounts will resolve themselves. Correct me if I?m wrong on that, please. > > So, I have gone to all of the 4 clients and none of them say they have it mounted according to either ?df? or ?mount?. I?ve gone ahead and run both ?mmunmount? and ?umount -l? on the filesystem anyway, but the mmdelfs still fails saying that they have it mounted. > > What do I need to do to resolve this issue on those 4 clients? Thanks? > > Kevin > > ? 
> Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu> - (615)875-9633 > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ewahl at osc.edu Thu Apr 6 15:54:42 2017 From: ewahl at osc.edu (Wahl, Edward) Date: Thu, 6 Apr 2017 14:54:42 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption In-Reply-To: References: Message-ID: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> This is rather dependant on SS version. So what used to happen before 4.2.2.* is that a client would be unable to mount the filesystem in question and would give an error in the mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount the file system and then give errors on file access instead. (just tested this on 4.2.2.3) I'll have to read through the changelogs looking for this one. Depending on your policy for encryption then, this might be exactly what you want, but I REALLY REALLY dislike this behaviour. To me this means clients can now mount an encrypted FS now and then fail during operation. If I get a client node that comes up improperly, user work will start, and it will fail with "Operation not permitted" errors on file access. I imagine my batch system could run through a massive amount of jobs on a bad client without anyone noticing immeadiately. Yet another thing we now have to monitor now I guess. *shrug* A couple other gotcha's we've seen with Encryption: Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. Logs will fill with Key could not be fetched. errors. Ed Wahl OSC ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Thursday, April 06, 2017 4:20 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale Encryption We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? 
According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.s cale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Thu Apr 6 16:11:38 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Thu, 6 Apr 2017 15:11:38 +0000 Subject: [gpfsug-discuss] Spectrum Scale Encryption Message-ID: Hi Ed, Thanks. We already have several SKLM servers (tape backups). For me, we plan to encrypt specific parts of the FS (probably by file-set), so as long as all that is needed is an empty RKM.conf file, sounds like it will work. I suppose I could have an MEK that is granted to all clients, but then never actually use it for encryption if RKM.conf needs at least one key (hack hack hack). (We are at 4.2.2-2 (mostly) or higher (a few nodes)). I *thought* the FEK was wrapped in the metadata with the MEK (possibly multiple times with different MEKs), so what the docs say about operation continuing with no SKLM server sounds sensible, but of course that might not be what actually happens I guess... Simon On 06/04/2017, 15:54, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Wahl, Edward" wrote: >This is rather dependant on SS version. > >So what used to happen before 4.2.2.* is that a client would be unable to >mount the filesystem in question and would give an error in the >mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount >the file system and then give errors on file access instead. (just >tested this on 4.2.2.3) I'll have to read through the changelogs looking >for this one. > >Depending on your policy for encryption then, this might be exactly what >you want, but I REALLY REALLY dislike this behaviour. > >To me this means clients can now mount an encrypted FS now and then fail >during operation. If I get a client node that comes up improperly, user >work will start, and it will fail with "Operation not permitted" errors >on file access. 
I imagine my batch system could run through a massive >amount of jobs on a bad client without anyone noticing immeadiately. Yet >another thing we now have to monitor now I guess. *shrug* > >A couple other gotcha's we've seen with Encryption: > >Encrypted file systems do not store data in large MD blocks. Makes >sense. This means large MD blocks aren't as useful as they are in >unencrypted FS, if you are using this. > >Having at least one backup SKLM server is a good idea. >"kmipServerUri[N+1]" in the conf. > >While the documentation claims the FS can continue operation once it >caches the MEK if an SKLM server goes away, in operation this does NOT >work as you may expect. Your users still need access to the FEKs for the >files your clients work on. Logs will fill with Key could not be >fetched. errors. > >Ed Wahl >OSC > >________________________________________ >From: gpfsug-discuss-bounces at spectrumscale.org >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson >(Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] >Sent: Thursday, April 06, 2017 4:20 AM >To: gpfsug-discuss at spectrumscale.org >Subject: [gpfsug-discuss] Spectrum Scale Encryption > >We are currently looking at adding encryption to our deployment for some >of our data sets and for some of our nodes. Apologies in advance if some >of this is a bit vague, we're not yet at the point where we can test this >stuff out, so maybe some of it will become clear when we try it out. > > >For a node that we don't want to have access to any encrypted data, what >do we need to set up? > >According to the docs: >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >s >cale.v4r22.doc/bl1adv_encryption_prep.htm > > >"After the file system is configured with encryption policy rules, the >file system is considered encrypted. From that point on, each node that >has access to that file system must have an RKM.conf file present. >Otherwise, the file system might not be mounted or might become >unmounted." > >So on a node which I don't want to have access to any encrypted files, do >I just need to have an empty RKM.conf file? > >(If this is the case, would be good to have this added to the docs) > > >Secondly ... (and maybe I'm misunderstanding the docs here) > >For the Policy >https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectr >u >m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm > > >KEYS ('Keyname'[, 'Keyname', ... ]) > > >KeyId:RkmId > > >RkmId should match the stanza name in RKM.conf? > >If so, it would be useful if the docs used the same names in the examples >(RKMKMIP3 vs rkmname3) > >And KeyId should match a "Key UUID" in SKLM? > > >Third. My understanding from talking to various IBM people is that we need >ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways >(probably), do we have to do any kind of node registration in ISKLM? Or is >this purely based on the certificates being distributed to clients and >keys are mapped in ISKLM to the client cert to determine if the node is >able to request the key? 
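To make the Keyname format quoted above concrete, a sketch of what a minimal pair of encryption policy rules could look like -- the key UUID, specification name and fileset pattern are all invented for illustration, and the exact syntax should be checked against the KC page linked above:

  /* Encryption specification: which algorithm set to use and which MEK to
     wrap the FEK with. 'KEY-...' stands in for a Key UUID from SKLM, and
     'rkmname3' must match a stanza name in RKM.conf on the nodes. */
  RULE 'encSpec' ENCRYPTION 'E1' IS
       ALGO 'DEFAULTNISTSP800131A'
       KEYS('KEY-0c12f48a-93b2-4d31-9a67-2e3f5c1a7b10:rkmname3')

  /* Apply it to newly created files in the filesets that should be encrypted. */
  RULE 'applyEnc' SET ENCRYPTION 'E1'
       WHERE FILESET_NAME LIKE 'secure%'

The part after the colon is the RKM.conf stanza name and the part before it is the key UUID held in SKLM, which matches what the later reply from IBM in this thread confirms.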
> >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Jon.Edwards at newbase.com.au Fri Apr 7 05:56:33 2017 From: Jon.Edwards at newbase.com.au (Jon Edwards) Date: Fri, 7 Apr 2017 04:56:33 +0000 Subject: [gpfsug-discuss] Spectrum scale sending cluster traffic across the management network Message-ID: <7929c064d6df4d7b88065b4d882daa98@newbase.com.au> Hi All, Just getting started with spectrum scale, Just wondering if anyone has come across the issue where when doing a mmcrfs or mmdelfs you get the error Failed to connect to file system daemon: Connection timed out mmdelfs: tsdelfs failed. mmdelfs: Command failed. Examine previous error messages to determine cause. When viewing the logs in /var/mmfs/gen/mmfslog on a node other than the one I am running the command on i get: 2017-04-07_14:03:13.354+1000: [N] Filtered log entry: 'connect to node 192.168.0.1:1191' occurred 10 times between 2017-04-07_11:38:19.058+1000 and 2017-04-07_11:54:58.649+1000 192.168.0.0/24 In this case is the management network configured on eth0 of all the nodes. It is failing because port 1191 is not allowed on this interface. The dns and hostname for each node resolves to a dedicated cluster network, lets say 10.0.0.0/24 (ETH1) For some reason when I run the mmcrfs or mmdelfs it tries to talk back over the management network instead of the cluster network which fails to connect due to firewall blocking cluster traffic over management. Anyone seen this before? Kind Regards, Jon Edwards Senior Systems Engineer NewBase Email: jon.edwards at newbase.com.au Ph: + 61 7 3216 0776 Fax: + 61 7 3216 0779 http://www.newbase.com.au Opinions contained in this e-mail do not necessarily reflect the opinions of NewBase Computer Services Pty Ltd. This e-mail is for the exclusive use of the addressee and should not be disseminated further or copied without permission of the sender. If you have received this message in error, please immediately notify the sender and delete the message from your computer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Jon.Edwards at newbase.com.au Fri Apr 7 06:26:56 2017 From: Jon.Edwards at newbase.com.au (Jon Edwards) Date: Fri, 7 Apr 2017 05:26:56 +0000 Subject: [gpfsug-discuss] Spectrum scale sending cluster traffic across the management network Message-ID: <6e02ed91cb404d46b7b5cd3515ad8fe9@newbase.com.au> Please disregard, found the solution. Found the subnets= parameter for the cluster config mmchconfig subnets="192.168.0.0/24 192.168.1.0/24" Which forces it to use this subnet. Kind Regards, Jon Edwards | Senior Systems Engineer NewBase Ph: + 61 7 3216 0776 | Email: jon.edwards at newbase.com.au http://www.newbase.com.au From: Jon Edwards Sent: Friday, 7 April 2017 2:56 PM To: 'gpfsug-discuss at spectrumscale.org' Cc: 'Andrew Beattie' Subject: Spectrum scale sending cluster traffic across the management network Hi All, Just getting started with spectrum scale, Just wondering if anyone has come across the issue where when doing a mmcrfs or mmdelfs you get the error Failed to connect to file system daemon: Connection timed out mmdelfs: tsdelfs failed. mmdelfs: Command failed. Examine previous error messages to determine cause. 
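For anyone who ends up at the same subnets= fix described above, two read-only checks that help confirm the daemon traffic has actually moved to the intended network (treat this as a sketch; output formats differ between releases):

  # Show the subnets value now stored in the cluster configuration.
  mmlsconfig subnets
  # Show the daemon's node-to-node connections and the addresses in use;
  # port 1191 has to be reachable on whichever interface these land on.
  mmdiag --network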
When viewing the logs in /var/mmfs/gen/mmfslog on a node other than the one I am running the command on i get: 2017-04-07_14:03:13.354+1000: [N] Filtered log entry: 'connect to node 192.168.0.1:1191' occurred 10 times between 2017-04-07_11:38:19.058+1000 and 2017-04-07_11:54:58.649+1000 192.168.0.0/24 In this case is the management network configured on eth0 of all the nodes. It is failing because port 1191 is not allowed on this interface. The dns and hostname for each node resolves to a dedicated cluster network, lets say 10.0.0.0/24 (ETH1) For some reason when I run the mmcrfs or mmdelfs it tries to talk back over the management network instead of the cluster network which fails to connect due to firewall blocking cluster traffic over management. Anyone seen this before? Kind Regards, Jon Edwards Senior Systems Engineer NewBase Email: jon.edwards at newbase.com.au Ph: + 61 7 3216 0776 Fax: + 61 7 3216 0779 http://www.newbase.com.au Opinions contained in this e-mail do not necessarily reflect the opinions of NewBase Computer Services Pty Ltd. This e-mail is for the exclusive use of the addressee and should not be disseminated further or copied without permission of the sender. If you have received this message in error, please immediately notify the sender and delete the message from your computer. -------------- next part -------------- An HTML attachment was scrubbed... URL: From knop at us.ibm.com Fri Apr 7 15:00:09 2017 From: knop at us.ibm.com (Felipe Knop) Date: Fri, 7 Apr 2017 10:00:09 -0400 Subject: [gpfsug-discuss] Spectrum Scale Encryption In-Reply-To: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> References: <9DA9EC7A281AC7428A9618AFDC490499591F4BDB@CIO-KRC-D1MBX02.osuad.osu.edu> Message-ID: All, A few comments on the topics raised below. 1) All nodes that mount an encrypted file system, and also the nodes with management roles on the file system will need access to the keys have the proper setup (RKM.conf, etc). Edward is correct that there was some change in behavior, introduced in 4.2.1 . Before the change, a mount would fail unless RKM.conf is present on the node. In addition, once a policy with encryption rules was applied, nodes without the proper encryption setup would unmount the file system. With the change, the error gets delayed to when encrypted files are accessed. The change in behavior was introduced based on feedback that unmounting the file system in that case was too drastic in that scenario. >> So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? All nodes which mount an encrypted file system should have proper setup for encryption, even for a node from where only unencrypted files are being accessed. 2) >> Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Correct. Data is not stored in the inode for encrypted files. On the other hand, since encryption metadata is stored as an extended attribute in the inode, 4K inodes are still recommended -- especially in cases where a more complicated encryption policy is used. 3) >> Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. 
Logs will fill with Key could not be fetched. errors. Using a backup key server is strongly recommended. While it's true that the files may still be accessed for a while if the key server becomes unreachable, this was not something to be counted on. First because keys (MEK) may expire at any time, requiring the key to be retrieved from the key server again. Second because a file may require a key may be needed that has not been cached before. 4) >> Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? Correct. >> If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Correct. We'll review the documentation to ensure that the meaning of the RkmId in the examples is clear. 5) >> Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? I'll work on getting clarifications from the ISKLM folks on this aspect. Felipe ---- Felipe Knop knop at us.ibm.com GPFS Development and Security IBM Systems IBM Building 008 2455 South Rd, Poughkeepsie, NY 12601 (845) 433-9314 T/L 293-9314 From: "Wahl, Edward" To: gpfsug main discussion list Date: 04/06/2017 10:55 AM Subject: Re: [gpfsug-discuss] Spectrum Scale Encryption Sent by: gpfsug-discuss-bounces at spectrumscale.org This is rather dependant on SS version. So what used to happen before 4.2.2.* is that a client would be unable to mount the filesystem in question and would give an error in the mmfs.log.latest for an SGPanic, In 4.2.2.* It appears it will now mount the file system and then give errors on file access instead. (just tested this on 4.2.2.3) I'll have to read through the changelogs looking for this one. Depending on your policy for encryption then, this might be exactly what you want, but I REALLY REALLY dislike this behaviour. To me this means clients can now mount an encrypted FS now and then fail during operation. If I get a client node that comes up improperly, user work will start, and it will fail with "Operation not permitted" errors on file access. I imagine my batch system could run through a massive amount of jobs on a bad client without anyone noticing immeadiately. Yet another thing we now have to monitor now I guess. *shrug* A couple other gotcha's we've seen with Encryption: Encrypted file systems do not store data in large MD blocks. Makes sense. This means large MD blocks aren't as useful as they are in unencrypted FS, if you are using this. Having at least one backup SKLM server is a good idea. "kmipServerUri[N+1]" in the conf. While the documentation claims the FS can continue operation once it caches the MEK if an SKLM server goes away, in operation this does NOT work as you may expect. Your users still need access to the FEKs for the files your clients work on. Logs will fill with Key could not be fetched. errors. 
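Given the 'yet another thing we now have to monitor' comment above, a crude probe for this failure mode is to watch for the key-fetch errors in the GPFS log and test the KMIP port from each node -- the log path, host name and port here are common defaults rather than anything taken from this thread:

  # Count key-fetch failures of the kind described above.
  grep -c "could not be fetched" /var/adm/ras/mmfs.log.latest
  # Check that the key server's KMIP port answers from this node
  # (5696 is the conventional KMIP port; use the host/port from kmipServerUri).
  timeout 5 bash -c '</dev/tcp/sklm01.example.com/5696' && echo "key server reachable"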
Ed Wahl OSC ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Simon Thompson (Research Computing - IT Services) [S.J.Thompson at bham.ac.uk] Sent: Thursday, April 06, 2017 4:20 AM To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Spectrum Scale Encryption We are currently looking at adding encryption to our deployment for some of our data sets and for some of our nodes. Apologies in advance if some of this is a bit vague, we're not yet at the point where we can test this stuff out, so maybe some of it will become clear when we try it out. For a node that we don't want to have access to any encrypted data, what do we need to set up? According to the docs: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.s cale.v4r22.doc/bl1adv_encryption_prep.htm "After the file system is configured with encryption policy rules, the file system is considered encrypted. From that point on, each node that has access to that file system must have an RKM.conf file present. Otherwise, the file system might not be mounted or might become unmounted." So on a node which I don't want to have access to any encrypted files, do I just need to have an empty RKM.conf file? (If this is the case, would be good to have this added to the docs) Secondly ... (and maybe I'm misunderstanding the docs here) For the Policy https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.2/com.ibm.spectru m.scale.v4r22.doc/bl1adv_encryptionpolicyrules.htm KEYS ('Keyname'[, 'Keyname', ... ]) KeyId:RkmId RkmId should match the stanza name in RKM.conf? If so, it would be useful if the docs used the same names in the examples (RKMKMIP3 vs rkmname3) And KeyId should match a "Key UUID" in SKLM? Third. My understanding from talking to various IBM people is that we need ISKLM entitlements for NSD Servers, Protocol nodes and AFM gateways (probably), do we have to do any kind of node registration in ISKLM? Or is this purely based on the certificates being distributed to clients and keys are mapped in ISKLM to the client cert to determine if the node is able to request the key? Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Fri Apr 7 15:58:29 2017 From: mweil at wustl.edu (Matt Weil) Date: Fri, 7 Apr 2017 09:58:29 -0500 Subject: [gpfsug-discuss] AFM gateways Message-ID: Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. 
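For anyone wondering how the gateway role is assigned in the first place: it is a per-node designation rather than a per-fileset one, which is part of the discussion in the replies that follow. A rough sketch, with made-up node and file system names:
---
# designate dedicated gateway nodes (rather than the NSD servers)
mmchnode --gateway -N afmgw01,afmgw02

# confirm the designation and see which gateway each cache fileset is using
mmlscluster
mmafmctl fs0 getstate
---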
From vpuvvada at in.ibm.com Mon Apr 10 11:56:16 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Mon, 10 Apr 2017 16:26:16 +0530 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: gpfsug main discussion list Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Sandra.McLaughlin at astrazeneca.com Mon Apr 10 12:20:53 2017 From: Sandra.McLaughlin at astrazeneca.com (McLaughlin, Sandra M) Date: Mon, 10 Apr 2017 11:20:53 +0000 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn't do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. 
More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From Christian.Fey at sva.de Mon Apr 10 17:04:31 2017 From: Christian.Fey at sva.de (Fey, Christian) Date: Mon, 10 Apr 2017 16:04:31 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: Message-ID: <455e54150cd04cd8808619acbf7d8d2b@sva.de> Hi, I'm just dealing with a maybe similar issue that also seems to be related to the output of "tsctl shownodes up" (before CES i actually never had to do with this command). In my case the output of a "mmlscluster" for example shows the nodes like "node1.acme.local" but in " tsctl shownodes up" they are displayed as "node1.acme.local.acme.local" for example. This maybe causes a fresh CES implementation in a existing GPFS cluster to also not spread ip-adresses. It instead loops in the same way as it did in your case @jonathon. I think it tries to search for "node1.acme.local" but doesn't find it since tsctl shows it with doubled suffix. Can anyone explain, from where the "tsctl shownodes up" reads the data? Additionally does anyone have an idea why the dns suffix is doubled? Kind regards Christian -----Urspr?ngliche Nachricht----- Von: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Im Auftrag von Jonathon A Anderson Gesendet: Donnerstag, 23. M?rz 2017 16:02 An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Achtung! Die Absender-Adresse ist m?glicherweise gef?lscht. 
Bitte ?berpr?fen Sie die Plausibilit?t der Email und lassen bei enthaltenen Anh?ngen und Links besondere Vorsicht walten. Wenden Sie sich im Zweifelsfall an das CIT unter cit at sva.de oder 06122 536 350. (Stichwort: DKIM Test Fehlgeschlagen) ---------------------------------------------------------------------------------------------------------------- Thanks! I?m looking forward to upgrading our CES nodes and resuming work on the project. ~jonathon On 3/23/17, 8:24 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Olaf Weiser" wrote: the issue is fixed, an APAR will be released soon - IV93100 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von:"Simon Thompson (Research Computing - IT Services)" An:"gpfsug main discussion list" Datum:Di. 31.01.2017 21:07Betreff:Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. 
~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? 
Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
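When chasing this sort of thing, a few commands that show how CES itself sees the nodes and addresses can be useful alongside the log entries below. This is only a sketch and the output format varies a little between releases:
---
mmces node list          # which nodes CES considers eligible / suspended
mmces address list       # where each CES IP is currently assigned
mmces state show -a      # per-service state on every protocol node

# compare what the daemon reports as up with the cluster membership;
# this is exactly where the truncation described above shows up
tsctl shownodes up | tr ',' '\n' | wc -l
mmlscluster | grep -c opa   # node count (matches the '-opa' names in this cluster)
---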
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/x-pkcs7-signature Size: 5467 bytes Desc: not available URL: From service at metamodul.com Mon Apr 10 17:47:41 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 10 Apr 2017 18:47:41 +0200 (CEST) Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network Message-ID: <788130355.197989.1491842861235@email.1und1.de> An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Apr 10 17:58:36 2017 From: eric.wonderley at vt.edu (J. 
Eric Wonderley) Date: Mon, 10 Apr 2017 12:58:36 -0400 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: <788130355.197989.1491842861235@email.1und1.de> References: <788130355.197989.1491842861235@email.1und1.de> Message-ID: 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers wrote: > My understanding of the GPFS networks is not quite clear. > > For an GPFS setup i would like to use 2 Networks > > 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24 > > 2 Admin Network using for example: 192.168.1.0/24 network > > Questions > > 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - > Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. > network or is it sufficient for the tiebreaker node to be configured as > part of the admin 192.168.1 network ? > > 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 > network or is it sufficient for the remote cluster to access the 10.1.1 > network ? If so i assume that remotecluster commands and ping to/from > remote cluster are going via the Daemon network ? > > Note: > > I am aware and read https://www.ibm.com/developerworks/community/ > wikis/home?lang=en#!/wiki/General%20Parallel%20File% > 20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview > > -- > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 <+49%20177%204393994> > Mail: service at metamodul.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From laurence at qsplace.co.uk Mon Apr 10 18:13:08 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 10 Apr 2017 18:13:08 +0100 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de> Message-ID: <3a8f72c6-407a-0f4d-cf3c-f4698ca7b8e5@qsplace.co.uk> All nodes in a GPFS cluster need to be able to communicate over the data and admin network with the exception of remote clusters which can have their own separate admin network (for their own cluster that they are a member of) but still require communications over the daemon network. The networks can be routed and on different subnets, however the each member of the cluster will need to be able to communicate with every other member. With this in mind: 1) The quorum node will need to be accessible on both the 10.1.1.0/24 and 192.168.1.0/24 however again the network that the quorum node is on could be routed. 2) Remote clusters don't need access to the home clusters admin network, as they will use their own clusters admin network. 
As Eric has mentioned I would double check your 2+1 cluster suggestion, do you mean 2 x Servers with NSD's (with a quorum role) and 1 quorum node without NSD's? which gives you 3 quorum, or are you only going to have 1 quorum? If the latter that I would suggest using all 3 servers for quorum as they should be licensed as GPFS servers anyway due to their roles. -- Lauz On 10/04/2017 17:58, J. Eric Wonderley wrote: > 1) You want more that one quorum node on your server cluster. The > non-quorum node does need a daemon network interface exposed to the > client cluster as does the quorum nodes. > > 2) No. Admin network is for intra cluster communications...not inter > cluster(between clusters). Daemon interface(port 1191) is used for > communications between clusters. I think there is little benefit > gained by having designated an admin network...maybe someone can point > out benefits of an admin network. > > > > Eric Wonderley > > On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > > wrote: > > My understanding of the GPFS networks is not quite clear. > > For an GPFS setup i would like to use 2 Networks > > 1 Daemon (data) network using port 1191 using for example. > 10.1.1.0/24 > > 2 Admin Network using for example: 192.168.1.0/24 > network > > Questions > > 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) > Config - Does the Tiebreaker Node needs to have access to the > daemon(data) 10.1.1. network or is it sufficient for the > tiebreaker node to be configured as part of the admin 192.168.1 > network ? > > 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 > network or is it sufficient for the remote cluster to access the > 10.1.1 network ? If so i assume that remotecluster commands and > ping to/from remote cluster are going via the Daemon network ? > > Note: > > I am aware and read > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview > > > -- > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 > Mail: service at metamodul.com > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Apr 10 18:26:42 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Mon, 10 Apr 2017 17:26:42 +0000 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de>, Message-ID: If you have network congestion, then a separate admin network is of benefit. Maybe less important if you have 10GbE networks, but if (for example), you normally rely on IB to talk data, and gpfs fails back to the Ethernet (which may be only 1GbE), then you may have cluster issues, for example missing gpfs pings. Having a separate physical admin network can protect you from this. Having been bitten by this several years back, it's a good idea IMHO to have a separate admin network. 
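For reference, a sketch of how a separate admin network is normally expressed (host names here are invented; check the mmcrcluster and mmchnode man pages for your release): each node descriptor carries both a daemon node name and an admin node name, and an existing node can be switched later with mmchnode.
---
# node descriptor format for mmcrcluster / mmaddnode:
#   DaemonNodeName:Designation:AdminNodeName
nsd01-data.example.com:quorum-manager:nsd01-admin.example.com
nsd02-data.example.com:quorum-manager:nsd02-admin.example.com

# change the admin interface of an existing node
# (GPFS may need to be stopped on that node first; check the mmchnode man page)
mmchnode --admin-interface=nsd01-admin.example.com -N nsd01-data.example.com
---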
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of J. Eric Wonderley [eric.wonderley at vt.edu] Sent: 10 April 2017 17:58 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > wrote: My understanding of the GPFS networks is not quite clear. For an GPFS setup i would like to use 2 Networks 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24 2 Admin Network using for example: 192.168.1.0/24 network Questions 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. network or is it sufficient for the tiebreaker node to be configured as part of the admin 192.168.1 network ? 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 network or is it sufficient for the remote cluster to access the 10.1.1 network ? If so i assume that remotecluster commands and ping to/from remote cluster are going via the Daemon network ? Note: I am aware and read https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From service at metamodul.com Mon Apr 10 18:44:47 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 10 Apr 2017 19:44:47 +0200 (CEST) Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: References: <788130355.197989.1491842861235@email.1und1.de>, Message-ID: <795203366.199195.1491846287405@email.1und1.de> An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Apr 10 19:02:30 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 10 Apr 2017 21:02:30 +0300 Subject: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network In-Reply-To: <795203366.199195.1491846287405@email.1und1.de> References: <788130355.197989.1491842861235@email.1und1.de>, <795203366.199195.1491846287405@email.1und1.de> Message-ID: Hi Out of curiosity. Are you using Failure groups and doing replication of data/metadata too? If you you do need to deal with the file system descriptors as well on the 3rd node. Thanks From: Hans-Joachim Ehlers To: gpfsug main discussion list Date: 10/04/2017 20:44 Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network Sent by: gpfsug-discuss-bounces at spectrumscale.org Sorry for not being clear. 
The setup is of course a 3 Node Cluster where each node is a quorum node - 2 NSD Server and 1 TieBreaker/Quorum Buster node. For me it was not clear if the Tiebreaker/Quorum Buster node - which does nothing in terms of data serving - must be part of the daemon/data network or not. So i get the understanding that a Tiebreaker Node must be also part of the Daemon network. Thx a lot to all Hajo "Simon Thompson (IT Research Support)" hat am 10. April 2017 um 19:26 geschrieben: If you have network congestion, then a separate admin network is of benefit. Maybe less important if you have 10GbE networks, but if (for example), you normally rely on IB to talk data, and gpfs fails back to the Ethernet (which may be only 1GbE), then you may have cluster issues, for example missing gpfs pings. Having a separate physical admin network can protect you from this. Having been bitten by this several years back, it's a good idea IMHO to have a separate admin network. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of J. Eric Wonderley [eric.wonderley at vt.edu] Sent: 10 April 2017 17:58 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] GPFS Network Configuration - 1 Daemon Network , 1 Admin Network 1) You want more that one quorum node on your server cluster. The non-quorum node does need a daemon network interface exposed to the client cluster as does the quorum nodes. 2) No. Admin network is for intra cluster communications...not inter cluster(between clusters). Daemon interface(port 1191) is used for communications between clusters. I think there is little benefit gained by having designated an admin network...maybe someone can point out benefits of an admin network. Eric Wonderley On Mon, Apr 10, 2017 at 12:47 PM, Hans-Joachim Ehlers > wrote: My understanding of the GPFS networks is not quite clear. For an GPFS setup i would like to use 2 Networks 1 Daemon (data) network using port 1191 using for example. 10.1.1.0/24< http://10.1.1.0/24> 2 Admin Network using for example: 192.168.1.0/24 network Questions 1) Thus in a 2+1 Cluster ( 2 GPFS Server + 1 Quorum Server ) Config - Does the Tiebreaker Node needs to have access to the daemon(data) 10.1.1. network or is it sufficient for the tiebreaker node to be configured as part of the admin 192.168.1 network ? 2) Does a remote cluster needs access to the GPFS Admin 192.168.1 network or is it sufficient for the remote cluster to access the 10.1.1 network ? If so i assume that remotecluster commands and ping to/from remote cluster are going via the Daemon network ? Note: I am aware and read https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20Network%20Communication%20Overview -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Apr 10 21:15:38 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 10 Apr 2017 15:15:38 -0500 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: Message-ID: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. 
This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mitsugi at linux.vnet.ibm.com Tue Apr 11 05:29:16 2017 From: mitsugi at linux.vnet.ibm.com (Masanori Mitsugi) Date: Tue, 11 Apr 2017 13:29:16 +0900 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Message-ID: Hello, Does anyone have experience to do mmapplypolicy against billion files for ILM/HSM? Currently I'm planning/designing * 1 Scale filesystem (5-10 PB) * 10-20 filesets which includes 1 billion files each And our biggest concern is "How log does it take for mmapplypolicy policy scan against billion files?" I know it depends on how to write the policy, but I don't have no billion files policy scan experience, so I'd like to know the order of time (min/hour/day...). It would be helpful if anyone has experience of such large number of files scan and let me know any considerations or points for policy design. -- Masanori Mitsugi mitsugi at linux.vnet.ibm.com From zgiles at gmail.com Tue Apr 11 05:49:10 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 11 Apr 2017 00:49:10 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. 
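Putting those flags together, a sketch of the kind of invocation described above. The file system name, node class, policy file name and work directory are placeholders, and the thread counts should be sized to the machines actually running the scan:
---
# dry-run scan driven from a dedicated node class, with the global work
# directory kept on GPFS so the sort buckets can be shared across nodes
mmapplypolicy gpfs0 -P listall.pol -I test -L 1 \
    -N policyNodes \
    -g /gpfs0/.policytmp \
    --choice-algorithm fast \
    -a 16 -m 24 -n 24 -A 1000

# listall.pol: a do-nothing external list rule, handy for timing the scan itself
RULE EXTERNAL LIST 'all' EXEC ''
RULE 'listEverything' LIST 'all'
---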
On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com From olaf.weiser at de.ibm.com Tue Apr 11 07:51:48 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Tue, 11 Apr 2017 08:51:48 +0200 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <455e54150cd04cd8808619acbf7d8d2b@sva.de> References: <455e54150cd04cd8808619acbf7d8d2b@sva.de> Message-ID: An HTML attachment was scrubbed... URL: From ckrafft at de.ibm.com Tue Apr 11 09:24:35 2017 From: ckrafft at de.ibm.com (Christoph Krafft) Date: Tue, 11 Apr 2017 10:24:35 +0200 Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? Message-ID: Hi folks, there is a list of storage devices that support SCSI-3 PR in the GPFS FAQ Doc (see Answer 4.5). https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html#scsi3 Since this list contains IBM V-model storage subsystems that include Storage Virtualization - I was wondering if SVC / Spectrum Virtualize can also support SCSI-3 PR (although not explicitly on the list)? Any hints and help is warmla welcome - thank you in advance. Mit freundlichen Gr??en / Sincerely Christoph Krafft Client Technical Specialist - Power Systems, IBM Systems Certified IT Specialist @ The Open Group Phone: +49 (0) 7034 643 2171 IBM Deutschland GmbH Mobile: +49 (0) 160 97 81 86 12 Am Weiher 24 Email: ckrafft at de.ibm.com 65451 Kelsterbach Germany IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Nicole Reimer, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Stefan Lutz Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 1A788784.gif Type: image/gif Size: 1851 bytes Desc: not available URL: From p.childs at qmul.ac.uk Tue Apr 11 09:57:44 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Tue, 11 Apr 2017 08:57:44 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. 
The older one, which was upgraded from GPFS 3.5, works fine: creating a directory is always fast with no issue. The new one, which has nice new SSDs for metadata and hence should be faster, can take up to 30 seconds to create a directory, although it usually takes less than a second. The longer directory creates usually happen on busy nodes that have not used the new storage in a while (it's new, so we've not moved much of the data over yet), but it can also happen randomly anywhere, including from the NSD servers themselves (times of 3-4 seconds for a single directory create have been seen from the NSD servers). We've been pointed at the network and told to check all network settings, and it's been suggested that we build an admin network, but I'm not sure I entirely understand why and how this would help. It's a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However, as I say, the older filesystem is fine, and it does not matter whether the nodes are connected to the old GPFS cluster or the new one (although the delay is worst on the old GPFS cluster). So I'm really playing spot the difference, and the network is not really an obvious difference. It's been suggested that we look at a trace when it occurs, but as it's difficult to recreate, collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London
From jonathan at buzzard.me.uk Tue Apr 11 11:21:05 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Tue, 11 Apr 2017 11:21:05 +0100 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <1491906065.4102.87.camel@buzzard.me.uk>
On Tue, 2017-04-11 at 00:49 -0400, Zachary Giles wrote: [SNIP] > * Then throw ~8 well tuned Infiniband attached nodes at it using -N, > If they're the same as the NSD servers serving the flash, even better. >
Exactly how much are you going to gain from Infiniband over 40Gbps or even 100Gbps Ethernet? Not a lot I would have thought. Even with flash, all your latency is going to be in the flash, not the Ethernet.
Unless you have a compute cluster and need Infiniband for the MPI traffic, it is surely better to stick to Ethernet. Infiniband is rather esoteric, what I call a minority sport best avoided if at all possible.
Even if you have an Infiniband fabric, I would argue that, given current core counts and price points for 10Gbps Ethernet, you are actually better off keeping your storage traffic on the Ethernet and reserving the Infiniband for MPI duties. That is 10Gbps Ethernet to the compute nodes and 40/100Gbps Ethernet on the storage nodes.
JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom.
From zgiles at gmail.com Tue Apr 11 12:50:26 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 11 Apr 2017 07:50:26 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: <1491906065.4102.87.camel@buzzard.me.uk> References: <1491906065.4102.87.camel@buzzard.me.uk> Message-ID:
Yeah, that can be true. I was just trying to show the size/shape that can achieve this. There's a good chance 10G or 40G Ethernet would yield similar results, especially if you're running the policy on the NSD servers.
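If it helps with either the Ethernet-versus-Infiniband question or the directory-create latency above, a couple of commands show what the daemon is actually using rather than what the cabling suggests. A sketch only, not a tuning guide:
---
mmlsconfig verbsRdma      # is RDMA enabled at all?
mmlsconfig subnets        # is a separate daemon subnet configured?
mmdiag --network          # which addresses/fabrics each connection is really on
mmdiag --waiters          # long waiters are usually more telling than bandwidth
                          # for multi-second mkdir-style latency
---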
On Tue, Apr 11, 2017 at 6:21 AM, Jonathan Buzzard wrote: > On Tue, 2017-04-11 at 00:49 -0400, Zachary Giles wrote: > > [SNIP] > >> * Then throw ~8 well tuned Infiniband attached nodes at it using -N, >> If they're the same as the NSD servers serving the flash, even better. >> > > Exactly how much are you going to gain from Infiniband over 40Gbps or > even 100Gbps Ethernet? Not a lot I would have thought. Even with flash > all your latency is going to be in the flash not the Ethernet. > > Unless you have a compute cluster and need Infiniband for the MPI > traffic, it is surely better to stick to Ethernet. Infiniband is rather > esoteric, what I call a minority sport best avoided if at all possible. > > Even if you have an Infiniband fabric, I would argue that give current > core counts and price points for 10Gbps Ethernet, that actually you are > better off keeping your storage traffic on the Ethernet, and reserving > the Infiniband for MPI duties. That is 10Gbps Ethernet to the compute > nodes and 40/100Gbps Ethernet on the storage nodes. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com From stockf at us.ibm.com Tue Apr 11 12:53:33 2017 From: stockf at us.ibm.com (Frederick Stock) Date: Tue, 11 Apr 2017 07:53:33 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: As Zachary noted the location of your metadata is the key and for the scanning you have planned flash is necessary. If you have the resources you may consider setting up your flash in a mirrored RAID configuration (RAID1/RAID10) and have GPFS only keep one copy of metadata since the underlying storage is replicating it via the RAID. This should improve metadata write performance but likely has little impact on your scanning, assuming you are just reading through the metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Zachary Giles To: gpfsug main discussion list Date: 04/11/2017 12:49 AM Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Sent by: gpfsug-discuss-bounces at spectrumscale.org It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. 
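A sketch of what the mirrored-flash, single-metadata-copy suggestion looks like in practice (device names, server names and file system name are placeholders): the flash LUNs are declared metadataOnly in the system pool, and the file system is created with one metadata replica while leaving room to raise it later.
---
%nsd: nsd=md_flash01
  device=/dev/mapper/flash01
  servers=nsd01,nsd02
  usage=metadataOnly
  failureGroup=1
  pool=system

mmcrfs gpfs1 -F nsd_stanzas.txt -m 1 -M 2 -r 1 -R 2
---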
On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Apr 11 16:18:01 2017 From: chair at spectrumscale.org (Spectrum Scale UG Chair (Simon Thompson)) Date: Tue, 11 Apr 2017 16:18:01 +0100 Subject: [gpfsug-discuss] May Meeting Registration Message-ID: Hi all, Just a reminder that the next UK user group meeting is taking place on 9th/10th May. If you are planning on attending, please do register at: https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi stration-32113696932 (or try https://goo.gl/tRptru ) As last year, this is a 2 day event and we're planning a fun evening event on the Tuesday night at Manchester Museum of Science. Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, OCF and Seagate for helping make this happen! We also still have some customer talk slots to fill, so please let me know if you are interested in speaking. Thanks Simon From bbanister at jumptrading.com Tue Apr 11 16:29:25 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:29:25 +0000 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <1e86aa0c2e4344f19cb5eedf8f03efa9@jumptrading.com> A word of caution, be careful about where you run this kind of policy scan as the sort process can consume all memory on your hosts and that could lead to issues with the OS deciding to kill off GPFS or other similar bad things can occur. I recommend restricting the ILM policy scan to a subset of servers, no quorum nodes, and ensuring at least one NSD server is available for all NSDs in the file system(s). Watch the memory consumption on your nodes during the sort operations to see if you need to tune that down in the mmapplypolicy options. Hope that helps, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frederick Stock Sent: Tuesday, April 11, 2017 6:54 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM As Zachary noted the location of your metadata is the key and for the scanning you have planned flash is necessary. 
If you have the resources you may consider setting up your flash in a mirrored RAID configuration (RAID1/RAID10) and have GPFS only keep one copy of metadata since the underlying storage is replicating it via the RAID. This should improve metadata write performance but likely has little impact on your scanning, assuming you are just reading through the metadata. Fred __________________________________________________ Fred Stock | IBM Pittsburgh Lab | 720-430-8821 stockf at us.ibm.com From: Zachary Giles > To: gpfsug main discussion list > Date: 04/11/2017 12:49 AM Subject: Re: [gpfsug-discuss] Policy scan against billion files for ILM/HSM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ It's definitely doable, and these days not too hard. Flash for metadata is the key. The basics of it are: * Latest GPFS for performance benefits. * A few 10's of TBs of flash ( or more ! ) setup in a good design.. lots of SAS, well balanced RAID that can consume the flash fully, tuned for IOPs, and available in parallel from multiple servers. * Tune up mmapplypolicy with -g somewhere-on-gpfs; --choice-algorithm fast; -a, -m and -n to reasonable values ( number of cores on the servers ); -A to ~1000 * Test first on a smaller fileset to confirm you like it. -I test should work well and be around the same speed minus the migration phase. * Then throw ~8 well tuned Infiniband attached nodes at it using -N, If they're the same as the NSD servers serving the flash, even better. Should be able to do 1B in 5-30m depending on the idiosyncrasies of above choices. Even 60m isn't bad and quite respectable if less gear is used or if they system is busy while the policy is running. Parallel metadata, it's a beautiful thing. On Tue, Apr 11, 2017 at 12:29 AM, Masanori Mitsugi > wrote: > Hello, > > Does anyone have experience to do mmapplypolicy against billion files for > ILM/HSM? > > Currently I'm planning/designing > > * 1 Scale filesystem (5-10 PB) > * 10-20 filesets which includes 1 billion files each > > And our biggest concern is "How log does it take for mmapplypolicy policy > scan against billion files?" > > I know it depends on how to write the policy, > but I don't have no billion files policy scan experience, > so I'd like to know the order of time (min/hour/day...). > > It would be helpful if anyone has experience of such large number of files > scan and let me know any considerations or points for policy design. > > -- > Masanori Mitsugi > mitsugi at linux.vnet.ibm.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Zach Giles zgiles at gmail.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From k.leach at ed.ac.uk Tue Apr 11 16:32:41 2017 From: k.leach at ed.ac.uk (Kieran Leach) Date: Tue, 11 Apr 2017 16:32:41 +0100 Subject: [gpfsug-discuss] May Meeting Registration In-Reply-To: References: Message-ID: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> Hi Simon, would you be interested in a customer talk about the RDF (http://rdf.ac.uk/). We manage the RDF at EPCC, providing a 23PB filestore to complement ARCHER (the national research HPC service) and other UK Research HPC services. This is of course a GPFS system. If you've any questions or want more info please let me know but I thought I'd get an email off to you while I remember. Cheers Kieran On 11/04/17 16:18, Spectrum Scale UG Chair (Simon Thompson) wrote: > Hi all, > > Just a reminder that the next UK user group meeting is taking place on > 9th/10th May. If you are planning on attending, please do register at: > > https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi > stration-32113696932 > > > (or try https://goo.gl/tRptru ) > > As last year, this is a 2 day event and we're planning a fun evening event > on the Tuesday night at Manchester Museum of Science. > > Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, > OCF and Seagate for helping make this happen! > > We also still have some customer talk slots to fill, so please let me know > if you are interested in speaking. > > Thanks > > Simon > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From k.leach at ed.ac.uk Tue Apr 11 16:33:29 2017 From: k.leach at ed.ac.uk (Kieran Leach) Date: Tue, 11 Apr 2017 16:33:29 +0100 Subject: [gpfsug-discuss] May Meeting Registration In-Reply-To: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> References: <275b54d9-6779-774e-69bb-d26fead278a2@ed.ac.uk> Message-ID: Apologies all, wrong reply button. Cheers Kieran On 11/04/17 16:32, Kieran Leach wrote: > Hi Simon, > would you be interested in a customer talk about the RDF > (http://rdf.ac.uk/). We manage the RDF at EPCC, providing a 23PB > filestore to complement ARCHER (the national research HPC service) and > other UK Research HPC services. This is of course a GPFS system. If > you've any questions or want more info please let me know but I > thought I'd get an email off to you while I remember. > > Cheers > > Kieran > > On 11/04/17 16:18, Spectrum Scale UG Chair (Simon Thompson) wrote: >> Hi all, >> >> Just a reminder that the next UK user group meeting is taking place on >> 9th/10th May. If you are planning on attending, please do register at: >> >> https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi >> >> stration-32113696932 >> >> >> (or try https://goo.gl/tRptru ) >> >> As last year, this is a 2 day event and we're planning a fun evening >> event >> on the Tuesday night at Manchester Museum of Science. >> >> Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, >> OCF and Seagate for helping make this happen! 
>> >> We also still have some customer talk slots to fill, so please let me >> know >> if you are interested in speaking. >> >> Thanks >> >> Simon >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > -- The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336. From makaplan at us.ibm.com Tue Apr 11 16:36:47 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 11 Apr 2017 11:36:47 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: As primary developer of mmapplypolicy, please allow me to comment: 1) Fast access to metadata in system pool is most important, as several have commented on. These days SSD is the favorite, but you can still go with "spinning" media. If you do go with disks, it's extremely important to spread your metadata over independent disk "arms" -- so you can have many concurrent seeks in progress at the same time. IOW, if there is a virtualization/mapping layer, watchout that your logical disks don't get mapped to the same physical disk. 2) Crucial to use both -g and -N :: -g /gpfs-not-necessarily-the-same-fs-as-Im-scanning/tempdir and -N several-nodes-that-will-be-accessing-the-system-pool 3a) If at all possible, encourage your data and application designers to "pack" their directories with lots of files. Keep in mind that, mmapplypolicy will read every directory. The more directories, the more seeks, more time spent waiting for IO. OTOH, in more typical Unix/Linux usage, we tend to low average number of files per directory. 3b) As admin, you may not be able to change your data design to pack hundreds of files per directory, BUT you can make sure you are running a sufficiently modern release of Spectrum Scale that supports "data in inode" -- "Data in inode" also means "directory entries in inode" -- which means practically any small directory, up to a few hundred files, will fit in an an inode -- which means mmapplypolicy can read small directories with one seek, instead of two. (Someone will please remind us of the release number that first supported "directories in inode".) 4) Sorry, Fred, but the recommendation to use RAID mirroring of metadata on SSD, is not necessarily, important for metadata scanning. In fact it may work against you. If you use GPFS replication of metadata - that can work for you -- since then GPFS can direct read operations to either copy, preferring a locally attached copy, depending on how storage is attached to node, etc, etc. Choice of how to replicate metadata - either using GPFS replication or the RAID controller - is probably best made based on reliability and recoverability requirements. 5) YMMV - We'd love to hear/see your performance results for mmapplypolicy, especially if they're good. Even if they're bad, come back here for more tuning tips! -- marc of Spectrum Scale (ne GPFS) -------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Tue Apr 11 16:51:56 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:51:56 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). 
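For a first pass (illustrative only; options and output vary a little by release), that usually means something along the lines of:

    mmdiag --waiters      # long waiters usually point at the node or disk holding things up
    mmdiag --iohist       # recent I/O service times as GPFS sees them
    mmhealth node show    # component health, available on 4.2.1 and later
    mmlsconfig pagepool   # sanity-check key tunables; likewise maxFilesToCache etc.
    iostat -xm 5          # confirm whether the underlying LUNs are actually busy

run on both an NSD server and one of the slow clients while the problem is happening.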
I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. From S.J.Thompson at bham.ac.uk Tue Apr 11 16:55:35 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 11 Apr 2017 15:55:35 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. 
We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. 
This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jonathon.anderson at colorado.edu Tue Apr 11 16:56:56 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Tue, 11 Apr 2017 15:56:56 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories Message-ID: Bryan, That looks like a really useful set of presentation slides! Thanks for sharing! Which one in particular is the one Yuri gave that you?re referring to? ~jonathon On 4/11/17, 9:51 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference. Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult. Any ideas would be most helpful. Thanks Peter Childs ITS Research Infrastructure Queen Mary, University of London _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From bbanister at jumptrading.com Tue Apr 11 16:59:51 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Tue, 11 Apr 2017 15:59:51 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Problem Determination and GPFS Internals. My security group won't let me go to the google docs site from my work compute... I'm sure there is malicious malware on that site!! j/k, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Jonathon A Anderson Sent: Tuesday, April 11, 2017 10:57 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories Bryan, That looks like a really useful set of presentation slides! Thanks for sharing! Which one in particular is the one Yuri gave that you?re referring to? ~jonathon On 4/11/17, 9:51 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: There are so many things to look at and many tools for doing so (iostat, htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would recommend a review of the presentation that Yuri gave at the most recent GPFS User Group: https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs Cheers, -Bryan -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter Childs Sent: Tuesday, April 11, 2017 3:58 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories This is a curious issue which I'm trying to get to the bottom of. We currently have two Spectrum Scale file systems, both are running GPFS 4.2.1-1 some of the servers have been upgraded to 4.2.1-2. The older one which was upgraded from GPFS 3.5 works find create a directory is always fast and no issue. The new one, which has nice new SSD for metadata and hence should be faster. can take up to 30 seconds to create a directory but usually takes less than a second, The longer directory creates usually happen on busy nodes that have not used the new storage in a while. (Its new so we've not moved much of the data over yet) But it can also happen randomly anywhere, including from the NSD servers them selves. (times of 3-4 seconds from the NSD servers have been seen, on a single directory create) We've been pointed at the network and suggested we check all network settings, and its been suggested to build an admin network, but I'm not sure I entirely understand why and how this would help. Its a mixed 1G/10G network with the NSD servers connected at 40G with an MTU of 9000. 
However as I say, the older filesystem is fine, and it does not matter if the nodes are connected to the old GPFS cluster or the new one, (although the delay is worst on the old gpfs cluster), So I'm really playing spot the difference. and the network is not really an obvious difference.

Its been suggested to look at a trace when it occurs but as its difficult to recreate collecting one is difficult.

Any ideas would be most helpful.

Thanks

Peter Childs
ITS Research Infrastructure
Queen Mary, University of London
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

________________________________

Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product.

From p.childs at qmul.ac.uk Tue Apr 11 20:35:40 2017
From: p.childs at qmul.ac.uk (Peter Childs)
Date: Tue, 11 Apr 2017 19:35:40 +0000
Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories
In-Reply-To:
References:
Message-ID:

Can you remember what version you were running? Don't worry if you can't remember.

It looks like IBM may have withdrawn 4.2.1 from Fix Central and wish to forget its existence. Never a good sign; 4.2.0, 4.2.2, 4.2.3 and even 3.5 are still there, so maybe upgrading is worth a try.

I've looked at all the standard troubleshooting guides and got nowhere, hence why I asked. But another set of slides always helps.

Thank you for the help, still head scratching.... Which only makes the issue more random.
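For illustration only (node names are hypothetical and the exact mmtracectl options vary by release), catching one slow create in the act might look something like:

    mmtracectl --start -N client01,nsd01    # start tracing on a suspect client and NSD server
    time mkdir /gpfs/newfs/testdir.$$       # try to reproduce, and note how long it takes
    mmdiag --waiters                        # run on the affected node while the mkdir stalls
    mmtracectl --stop -N client01,nsd01     # stop, then gather the resulting trace files

Even a single capture plus the waiters output is usually enough to show where the time is going.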
Peter Childs Research Storage ITS Research and Teaching Support Queen Mary, University of London ---- Simon Thompson (IT Research Support) wrote ---- We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. 
Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mitsugi at linux.vnet.ibm.com Wed Apr 12 02:51:03 2017 From: mitsugi at linux.vnet.ibm.com (Masanori Mitsugi) Date: Wed, 12 Apr 2017 10:51:03 +0900 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <0851d194-088e-d93a-303d-ceb0de3dbaa8@linux.vnet.ibm.com> Marc, Zachary, Fred, Bryan, Thank you for providing great advice! It's pretty useful for me to tune our policy with best performance. As for "directories in inode", we plan to use latest version, so I believe we can leverage this function. -- Masanori Mitsugi mitsugi at linux.vnet.ibm.com From vpuvvada at in.ibm.com Wed Apr 12 10:53:25 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Wed, 12 Apr 2017 15:23:25 +0530 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> References: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Message-ID: Gateway node requires server license. ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: Date: 04/11/2017 01:46 AM Subject: Re: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. 
Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: gpfsug main discussion list Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Wed Apr 12 15:52:48 2017 From: mweil at wustl.edu (Matt Weil) Date: Wed, 12 Apr 2017 09:52:48 -0500 Subject: [gpfsug-discuss] AFM gateways In-Reply-To: References: <524d253e-b825-4e6a-7cbf-884af394ddc5@wustl.edu> Message-ID: yes it tells you that when you attempt to make the node a gateway and is does not have a server license designation. On 4/12/17 4:53 AM, Venkateswara R Puvvada wrote: Gateway node requires server license. ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil To: Date: 04/11/2017 01:46 AM Subject: Re: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Thanks for the answers.. For fail over I believe we will want to keep it separate then. Next question. Is it licensed as a client or a server? 
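For what it's worth, designating a dedicated (non-NSD) gateway looks roughly like this -- the node name is hypothetical, and the server license has to be in place before the gateway role is accepted:

    mmchlicense server --accept -N afmgw01    # gateway nodes need a server license
    mmchnode --gateway -N afmgw01             # mark the node as an AFM gateway
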
On 4/10/17 6:20 AM, McLaughlin, Sandra M wrote: Hi, I agree with Venkat. I did exactly what you said below, enabled my NSD servers as gateways to get additional throughput (with both native gpfs protocol and NFS protocol), which worked well; we definitely got the increased traffic. However, I wouldn?t do it again through choice. As Venkat says, if there is a problem with the remote cluster, that can affect any of the gateway nodes (if using gpfs protocol), but also, we had a problem with one of the gateway nodes, where it kept crashing (which is now resolved) and then all filesets for which that node was the gateway had to failover to other gateway servers and this really messes everything up while the failover is taking place. I am also, stupidly, serving NFS and samba from the NSD servers (via ctdb) which I also, would not do again ! It would be nice if there was a way to specify which gateway server is the primary gateway for a specific fileset. Regards, Sandra From: gpfsug-discuss-bounces at spectrumscale.org[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Venkateswara R Puvvada Sent: 10 April 2017 11:56 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM gateways It is not recommended to make NSD servers as gateway nodes for native GPFS protocol. Unresponsive remote cluster mount might cause gateway node to hang on synchronous operations (ex. Lookup, Read, Open etc..), this will affect NSD server functionality. More information is documented @ https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1ins_NFSvsGPFSAFM.htm ~Venkat (vpuvvada at in.ibm.com) From: Matt Weil > To: gpfsug main discussion list > Date: 04/07/2017 08:28 PM Subject: [gpfsug-discuss] AFM gateways Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hello, any reason to not enable all NSD servers as gateway when using native gpfs AFM? Will they all pass traffic? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ AstraZeneca UK Limited is a company incorporated in England and Wales with registered number:03674842 and its registered office at 1 Francis Crick Avenue, Cambridge Biomedical Campus, Cambridge, CB2 0AA. This e-mail and its attachments are intended for the above named recipient only and may contain confidential and privileged information. If they have come to you in error, you must not copy or show them to anyone; instead, please reply to this e-mail, highlighting the error to the sender and then immediately delete the message. 
For information about how AstraZeneca UK Limited and its affiliates may process information, personal data and monitor communications, please see our privacy notice at www.astrazeneca.com _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail._______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Apr 12 22:01:45 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 12 Apr 2017 14:01:45 -0700 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: References: Message-ID: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> On 4/11/17 8:36 AM, Marc A Kaplan wrote: > > 5) YMMV - We'd love to hear/see your performance results for > mmapplypolicy, especially if they're good. Even if they're bad, come > back here for more tuning tips! I have a filesystem that currently has 267919775 (roughly quarter billion, 250 million) used inodes. The metadata is on SSD behind a DDN 12K. We do use 4K inodes, and files smaller than 4K fit into the inodes. Here is the command I use to apply a policy: mmapplypolicy gsfs0 -P policy.txt -N scg-gs0,scg-gs1,scg-gs2,scg-gs3,scg-gs4,scg-gs5,scg-gs6,scg-gs7 -g /srv/gsfs0/admin_stuff/ -I test -B 500 -A 61 -a 4 That takes approximately 10 minutes to do the whole scan. The "-B 500 -A 61 -a 4" numbers we determined just by trying different values with the same policy file and seeing the resulting scan duration. 10mins is short enough to do almost "interactive" type of file list policies and look at the results. E.g. list all files over 1TB in size. This was a couple of years ago, probably on a different GPFS version, but on same storage and NSD hardware, so now I just copy those parameters. You should probably not just copy them but try some other values yourself. 
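As an aside, the "list all files over 1TB in size" example mentioned above can be done with a tiny policy file along these lines (the rule and list names are arbitrary, and the threshold is just 1 TB expressed in bytes):

    /* illustrative only: list every file larger than 1 TB */
    RULE EXTERNAL LIST 'biglist' EXEC ''
    RULE 'bigfiles' LIST 'biglist' WHERE FILE_SIZE > 1099511627776

run with the same -N/-g/-B/-A/-a options as the command above, plus something like "-I defer -f /srv/gsfs0/admin_stuff/biglist" so the matching paths land in a plain text file rather than being acted on.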
Regards, Alex From makaplan at us.ibm.com Wed Apr 12 23:43:20 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 12 Apr 2017 18:43:20 -0400 Subject: [gpfsug-discuss] Policy scan against billion files for ILM/HSM In-Reply-To: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> References: <284246a2-b14b-0a73-6dad-4c73caef58c9@stanford.edu> Message-ID: >>>Here is the command I use to apply a policy: mmapplypolicy gsfs0 -P policy.txt -N scg-gs0,scg-gs1,scg-gs2,scg-gs3,scg-gs4,scg-gs5,scg-gs6,scg-gs7 -g /srv/gsfs0/admin_stuff/ -I test -B 500 -A 61 -a 4 That takes approximately 10 minutes to do the whole scan. The "-B 500 -A 61 -a 4" numbers we determined just by trying different values with the same policy file and seeing the resulting scan duration. <<< That's pretty good. BUT, FYI, the -A number-of-buckets parameter should be scaled with the total number of files you expect to find in the argument filesystem or directory. If you don't set it the command will default to number-of-inodes-allocated / million, but capped at a minimum of 7 and a maximum of 4096. -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.childs at qmul.ac.uk Thu Apr 13 11:35:19 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 13 Apr 2017 10:35:19 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: , Message-ID: After a load more debugging, and switching off the quota's the issue looks to be quota related. in that the issue has gone away since I switched quota's off. I will need to switch them back on, but at least we know the issue is not the network and is likely to be fixed by upgrading..... Peter Childs ITS Research Infrastructure Queen Mary, University of London ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Peter Childs Sent: Tuesday, April 11, 2017 8:35:40 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories Can you remember what version you were running? Don't worry if you can't remember. It looks like ibm may have withdrawn 4.2.1 from fix central and wish to forget its existences. Never a good sign, 4.2.0, 4.2.2, 4.2.3 and even 3.5, so maybe upgrading is worth a try. I've looked at all the standard trouble shouting guides and got nowhere hence why I asked. But another set of slides always helps. Thank-you for the help, still head scratching.... Which only makes the issue more random. Peter Childs Research Storage ITS Research and Teaching Support Queen Mary, University of London ---- Simon Thompson (IT Research Support) wrote ---- We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). 
I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. 
>_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From scale at us.ibm.com Fri Apr 14 08:34:06 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Fri, 14 Apr 2017 15:34:06 +0800 Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? In-Reply-To: References: Message-ID: If you can use " mmchconfig usePersistentReserve=yes" successfully, then it is supported, we will check the compatibility during the command, and you can also use "tsprinquiry device(no /dev prefix)" check the vendor output. Thanks. Regards, The Spectrum Scale (GPFS) team ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Christoph Krafft" To: "gpfsug main discussion list" Cc: Achim Christ , Petra Christ Date: 04/11/2017 04:25 PM Subject: [gpfsug-discuss] Does SVC / Spectrum Virtualize support IBM Spectrum Scale with SCSI-3 Persistent Reservations? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi folks, there is a list of storage devices that support SCSI-3 PR in the GPFS FAQ Doc (see Answer 4.5). https://www.ibm.com/support/knowledgecenter/STXKQY/gpfsclustersfaq.html#scsi3 Since this list contains IBM V-model storage subsystems that include Storage Virtualization - I was wondering if SVC / Spectrum Virtualize can also support SCSI-3 PR (although not explicitly on the list)? Any hints and help is warmla welcome - thank you in advance. Mit freundlichen Gr??en / Sincerely Christoph Krafft Client Technical Specialist - Power Systems, IBM Systems Certified IT Specialist @ The Open Group Phone: +49 (0) 7034 643 2171 IBM Deutschland GmbH Mobile: +49 (0) 160 97 81 86 12 Am Weiher 24 Email: ckrafft at de.ibm.com 65451 Kelsterbach Germany IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Nicole Reimer, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Stefan Lutz Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: graycol.gif Type: image/gif Size: 105 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 1A696179.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed...
Name: 1A223532.gif Type: image/gif Size: 1851 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Sun Apr 16 14:47:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Sun, 16 Apr 2017 13:47:20 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Message-ID: Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? 
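(For what it's worth, a quick way to spot-check whether an individual file's data really sits in the pool a policy run claims -- the path below is only an example -- is:

mmlsattr -L /gpfs23/some/old/file     # reports the storage pool holding the file's data, plus flags such as illplaced
mmdf gpfs23 -P gpfs23capacity         # per-disk occupancy of just the capacity pool

keeping in mind that mmdf can lag a little behind recent data movement.)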
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Sun Apr 16 17:20:15 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Sun, 16 Apr 2017 16:20:15 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Message-ID: <252ABBB2-7E94-41F6-AD76-B6D836E5C916@nuance.com> I think the first thing I would do is turn up the ?-L? level to a large value (like ?6?) and see what it tells you about files that are being chosen and which ones aren?t being migrated and why. You could run it in test mode, write the output to a file and see what it says. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of "Buterbaugh, Kevin L" Reply-To: gpfsug main discussion list Date: Sunday, April 16, 2017 at 8:47 AM To: gpfsug main discussion list Subject: [EXTERNAL] [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Sun Apr 16 20:15:40 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sun, 16 Apr 2017 15:15:40 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! 
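(To spell out that prediction with the numbers above: the 'OldStuff' rule would move 67355430720 KB into gpfs23capacity, the 'INeedThatAfterAll' rule would move 236745504 KB back out, and the pool already holds 55365193728 KB, so the prediction is evidently just

   55365193728 + 67355430720 - 236745504 = 122483878944 KB
   122483878944 / 124983549952 = 97.999999993%

which matches the predicted gpfs23capacity utilization shown above exactly; the LIMIT(98) on the first rule is what caps the number of files chosen.)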
From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? 
Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From makaplan at us.ibm.com Sun Apr 16 20:39:21 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Sun, 16 Apr 2017 15:39:21 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Correction: So that's why it chooses to migrate "only" 67TB.... (67000 GB) -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Apr 17 16:24:02 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 17 Apr 2017 15:24:02 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? 
well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. 
RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Mon Apr 17 19:49:12 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Mon, 17 Apr 2017 11:49:12 -0700 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: <09e154ef-15ed-3217-db65-51e693e28faa@stanford.edu> Hi Kevin, IMHO, safe to just run it again. You can also run it with '-I test -L 6' again and look through the output. But I don't think you can "break" anything by having it scan and/or move data. Can you post the full command line that you use to run it? The behavior you describe is odd; you say it prints out the "files migrated successfully" message, but the files didn't actually get migrated? Turn up the debug param and have it print every file as it is moving it or something. 
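For example, a rough sketch of a dry run (filesystem and policy file names taken from your earlier mail, the output path is just a placeholder; running on a single node like this is slower than your parallel production run but fine for a test):

/usr/lpp/mmfs/bin/mmapplypolicy gpfs23 \
    -P ~/gpfs/gpfs23_migration.policy \
    -I test -L 3 2>&1 | tee /tmp/gpfs23_policy_test.out

# -I test does the scan and the choosing but skips the actual data movement.
# -L 3 (or higher) prints per-file detail -- which rule matched and whether
# the file was chosen -- so the output can be large, hence the tee to a file.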
Regards, Alex On 4/17/17 8:24 AM, Buterbaugh, Kevin L wrote: > Hi Marc, > > I do understand what you?re saying about mmapplypolicy deciding it only > needed to move ~1.8 million files to fill the capacity pool to ~98% > full. However, it is now more than 24 hours since the mmapplypolicy > finished ?successfully? and: > > Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) > eon35Ansd 58.2T 35 No Yes 29.66T ( > 51%) 64.16G ( 0%) > eon35Dnsd 58.2T 35 No Yes 29.66T ( > 51%) 64.61G ( 0%) > ------------- > -------------------- ------------------- > (pool total) 116.4T 59.33T ( > 51%) 128.8G ( 0%) > > And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the > partially redacted command line: > > /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g another gpfs filesystem> -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy > -N some,list,of,NSD,server,nodes > > And here?s that policy file: > > define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) > define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) > > RULE 'OldStuff' > MIGRATE FROM POOL 'gpfs23data' > TO POOL 'gpfs23capacity' > LIMIT(98) > WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) > > RULE 'INeedThatAfterAll' > MIGRATE FROM POOL 'gpfs23capacity' > TO POOL 'gpfs23data' > LIMIT(75) > WHERE (access_age < 14) > > The one thing that has changed is that formerly I only ran the migration > in one direction at a time ? i.e. I used to have those two rules in two > separate files and would run an mmapplypolicy using the OldStuff rule > the 1st weekend of the month and run the other rule the other weekends > of the month. This is the 1st weekend that I attempted to run an > mmapplypolicy that did both at the same time. Did I mess something up > with that? > > I have not run it again yet because we also run migrations on the other > filesystem that we are still in the process of migrating off of. So > gpfs23 goes 1st and as soon as it?s done the other filesystem migration > kicks off. I don?t like to run two migrations simultaneously if at all > possible. The 2nd migration ran until this morning, when it was > unfortunately terminated by a network switch crash that has also had me > tied up all morning until now. :-( > > And yes, there is something else going on ? well, was going on - the > network switch crash killed this too ? I have been running an rsync on > one particular ~80TB directory tree from the old filesystem to gpfs23. > I understand that the migration wouldn?t know about those files and > that?s fine ? I just don?t understand why mmapplypolicy said it was > going to fill the capacity pool to 98% but didn?t do it ? wait, > mmapplypolicy hasn?t gone into politics, has it?!? ;-) > > Thanks - and again, if I should open a PMR for this please let me know... > > Kevin > >> On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > > wrote: >> >> Let's look at how mmapplypolicy does the reckoning. >> Before it starts, it see your pools as: >> >> [I] GPFS Current Data Pool Utilization in KB and % >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 55365193728 124983549952 44.297984614% >> gpfs23data 166747037696 343753326592 48.507759721% >> system 0 0 >> 0.000000000% (no user data) >> [I] 75142046 of 209715200 inodes used: 35.830520%. >> >> Your rule says you want to migrate data to gpfs23capacity, up to 98% full: >> >> RULE 'OldStuff' >> MIGRATE FROM POOL 'gpfs23data' >> TO POOL 'gpfs23capacity' >> LIMIT(98) WHERE ... >> >> We scan your files and find and reckon... 
>> [I] Summary of Rule Applicability and File Choices: >> Rule# Hit_Cnt KB_Hit Chosen KB_Chosen >> KB_Ill Rule >> 0 5255960 237675081344 1868858 67355430720 >> 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO >> POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) >> >> So yes, 5.25Million files match the rule, but the utility chooses >> 1.868Million files that add up to 67,355GB and figures that if it >> migrates those to gpfs23capacity, >> (and also figuring the other migrations by your second rule)then >> gpfs23 will end up 97.9999% full. >> We show you that with our "predictions" message. >> >> Predicted Data Pool Utilization in KB and %: >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 122483878944 124983549952 97.999999993% >> gpfs23data 104742360032 343753326592 30.470209865% >> >> So that's why it chooses to migrate "only" 67GB.... >> >> See? Makes sense to me. >> >> Questions: >> Did you run with -I yes or -I defer ? >> >> Were some of the files illreplicated or illplaced? >> >> Did you give the cluster-wide space reckoning protocols time to see >> the changes? mmdf is usually "behind" by some non-neglible amount of >> time. >> >> What else is going on? >> If you're moving or deleting or creating data by other means while >> mmapplypolicy is running -- it doesn't "know" about that! >> >> Run it again! >> >> >> >> >> >> From: "Buterbaugh, Kevin L" > > >> To: gpfsug main discussion list >> > > >> Date: 04/16/2017 09:47 AM >> Subject: [gpfsug-discuss] mmapplypolicy didn't migrate >> everything it should have - why not? >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> >> ------------------------------------------------------------------------ >> >> >> >> Hi All, >> >> First off, I can open a PMR for this if I need to. Second, I am far >> from an mmapplypolicy guru. With that out of the way ? I have an >> mmapplypolicy job that didn?t migrate anywhere close to what it could >> / should have. From the log file I have it create, here is the part >> where it shows the policies I told it to invoke: >> >> [I] Qos 'maintenance' configured as inf >> [I] GPFS Current Data Pool Utilization in KB and % >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 55365193728 124983549952 44.297984614% >> gpfs23data 166747037696 343753326592 48.507759721% >> system 0 0 >> 0.000000000% (no user data) >> [I] 75142046 of 209715200 inodes used: 35.830520%. >> [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. >> Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC >> Parsed 2 policy rules. >> >> RULE 'OldStuff' >> MIGRATE FROM POOL 'gpfs23data' >> TO POOL 'gpfs23capacity' >> LIMIT(98) >> WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND >> (KB_ALLOCATED > 3584)) >> >> RULE 'INeedThatAfterAll' >> MIGRATE FROM POOL 'gpfs23capacity' >> TO POOL 'gpfs23data' >> LIMIT(75) >> WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) >> >> And then the log shows it scanning all the directories and then says, >> "OK, here?s what I?m going to do": >> >> [I] Summary of Rule Applicability and File Choices: >> Rule# Hit_Cnt KB_Hit Chosen KB_Chosen >> KB_Ill Rule >> 0 5255960 237675081344 1868858 67355430720 >> 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO >> POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) >> 1 611 236745504 611 236745504 >> 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL >> 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) >> >> [I] Filesystem objects with no applicable rules: 414911602. 
>> >> [I] GPFS Policy Decisions and File Choice Totals: >> Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; >> Predicted Data Pool Utilization in KB and %: >> Pool_Name KB_Occupied KB_Total Percent_Occupied >> gpfs23capacity 122483878944 124983549952 97.999999993% >> gpfs23data 104742360032 343753326592 30.470209865% >> system 0 0 >> 0.000000000% (no user data) >> >> Notice that it says it?s only going to migrate less than 2 million of >> the 5.25 million candidate files!! And sure enough, that?s all it did: >> >> [I] A total of 1869469 files have been migrated, deleted or processed >> by an EXTERNAL EXEC/script; >> 0 'skipped' files and/or errors. >> >> And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere >> near 98% full: >> >> Disks in storage pool: gpfs23capacity (Maximum disk size allowed is >> 519 TB) >> eon35Ansd 58.2T 35 No Yes 29.54T ( >> 51%) 63.93G ( 0%) >> eon35Dnsd 58.2T 35 No Yes 29.54T ( >> 51%) 64.39G ( 0%) >> ------------- >> -------------------- ------------------- >> (pool total) 116.4T 59.08T ( >> 51%) 128.3G ( 0%) >> >> I don?t understand why it only migrated a small subset of what it >> could / should have? >> >> We are doing a migration from one filesystem (gpfs21) to gpfs23 and I >> really need to stuff my gpfs23capacity pool as full of data as I can >> to keep the migration going. Any ideas anyone? Thanks in advance? >> >> ? >> Kevin Buterbaugh - Senior System Administrator >> Vanderbilt University - Advanced Computing Center for Research and >> Education >> _Kevin.Buterbaugh at vanderbilt.edu_ >> - (615)875-9633 From makaplan at us.ibm.com Mon Apr 17 21:11:18 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 17 Apr 2017 16:11:18 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? 
GPFS) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... 
[I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. 
[I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Mon Apr 17 21:18:42 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 17 Apr 2017 16:18:42 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: Oops... If you want to see the list of what would be migrated '-I test -L 2' If you want to migrate and see each file migrated '-I yes -L 2' I don't recommend -L 4 or higher, unless you want to see the files that do not match your rules. -L 3 will show you all the files that match the rules, including those that are NOT chosen for migration. See the command gu -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Apr 17 22:16:57 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 17 Apr 2017 21:16:57 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: Message-ID: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? 
the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan > wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? 
and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. 
Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. 
And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Tue Apr 18 14:31:20 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Tue, 18 Apr 2017 13:31:20 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> Message-ID: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Hi All, but especially Marc, I ran the mmapplypolicy again last night and, unfortunately, it again did not fill the capacity pool like it said it would. From the log file: [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 3632859 181380873184 1620175 61434283936 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 88 99230048 88 99230048 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 442962867. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 61533513984KB: 1620263 of 3632947 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878464 124983549952 97.999999609% gpfs23data 128885076416 343753326592 37.493477574% system 0 0 0.000000000% (no user data) [I] 2017-04-18 at 02:52:48.402 Policy execution. 0 files dispatched. And the tail end of the log file says that it moved those files: [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. 
But mmdf (and how quickly the mmapplypolicy itself ran) say otherwise: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.73T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.73T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.45T ( 51%) 128.8G ( 0%) Ideas? Or is it time for me to open a PMR? Thanks? Kevin On Apr 17, 2017, at 4:16 PM, Buterbaugh, Kevin L > wrote: Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan > wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? 
Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... 
[I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. 
[I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Tue Apr 18 14:56:43 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 18 Apr 2017 09:56:43 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: Kevin, Here's a silly theory: Have you tried putting a weight value in? I wonder if during migration it hits some large file that would go over the threshold and stops. With a weight flag you could move all small files in first or by lack of heat etc to pack the tier more tightly. Just something else to try before the PMR process. Zach On Apr 18, 2017 9:32 AM, "Buterbaugh, Kevin L" < Kevin.Buterbaugh at vanderbilt.edu> wrote: Hi All, but especially Marc, I ran the mmapplypolicy again last night and, unfortunately, it again did not fill the capacity pool like it said it would. 
From the log file: [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 3632859 181380873184 1620175 61434283936 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 88 99230048 88 99230048 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 442962867. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 61533513984KB: 1620263 of 3632947 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878464 124983549952 97.999999609% gpfs23data 128885076416 343753326592 37.493477574% system 0 0 0.000000000% (no user data) [I] 2017-04-18 at 02:52:48.402 Policy execution. 0 files dispatched. And the tail end of the log file says that it moved those files: [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. But mmdf (and how quickly the mmapplypolicy itself ran) say otherwise: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.73T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.73T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.45T ( 51%) 128.8G ( 0%) Ideas? Or is it time for me to open a PMR? Thanks? Kevin On Apr 17, 2017, at 4:16 PM, Buterbaugh, Kevin L < Kevin.Buterbaugh at Vanderbilt.Edu> wrote: Hi Marc, Alex, all, Thank you for the responses. To answer Alex?s questions first ? the full command line I used (except for some stuff I?m redacting but you don?t need the exact details anyway) was: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And yes, it printed out the very normal, ?Hey, I migrated all 1.8 million files I said I would successfully, so I?m done here? message: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. Marc - I ran what you suggest in your response below - section 3a. The output of a ?test? mmapplypolicy and mmdf was very consistent. Therefore, I?m moving on to 3b and running against the full filesystem again ? the only difference between the command line above and what I?m doing now is that I?m running with ?-L 2? this time around. I?m not fond of doing this during the week but I need to figure out what?s going on and I *really* need to get some stuff moved from my ?data? pool to my ?capacity? pool. I will respond back on the list again where there?s something to report. Thanks again, all? Kevin On Apr 17, 2017, at 3:11 PM, Marc A Kaplan wrote: Kevin, 1. Running with both fairly simple rules so that you migrate "in both directions" is fine. It was designed to do that! 2. Glad you understand the logic of "rules hit" vs "files chosen". 3. To begin to understand "what the hxxx is going on" (as our fearless leader liked to say before he was in charge ;-) ) I suggest: (a) Run mmapplypolicy on directory of just a few files `mmapplypolicy /gpfs23/test-directory -I test ...` and check that the [I] ... Current data pool utilization message is consistent with the output of `mmdf gpfs23`. 
They should be, but if they're not, that's a weird problem right there since they're supposed to be looking at the same metadata! You can do this anytime, should complete almost instantly... (b) When time and resources permit, re-run mmapplypolicy on the full FS with your desired migration policy. Again, do the "Current", "Chosen" and "Predicted" messages make sense, and "add up"? Do the file counts seem reasonable, considering that you recently did migrations/deletions that should have changed the counts compared to previous runs of mmapplypolicy? If you just want to look and not actually change anything, use `-I test` which will skip the migration steps. If you want to see the list of files chosen (c) If you continue to see significant discrepancies between mmapplypolicy and mmdf, let us know. (d) Also at some point you may consider running mmrestripefs with options to make sure every file has its data blocks where they are supposed to be and is replicated as you have specified. Let's see where those steps take us... -- marc of Spectrum Scale (n? GPFS) From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/17/2017 11:25 AM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org ------------------------------ Hi Marc, I do understand what you?re saying about mmapplypolicy deciding it only needed to move ~1.8 million files to fill the capacity pool to ~98% full. However, it is now more than 24 hours since the mmapplypolicy finished ?successfully? and: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.66T ( 51%) 64.16G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.66T ( 51%) 64.61G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.33T ( 51%) 128.8G ( 0%) And yes, I did run the mmapplypolicy with ?-I yes? ? here?s the partially redacted command line: /usr/lpp/mmfs/bin/mmapplypolicy gpfs23 -A 75 -a 4 -g -I yes -L 1 -P ~/gpfs/gpfs23_migration.policy -N some,list,of,NSD,server,nodes And here?s that policy file: define(access_age,(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))) define(GB_ALLOCATED,(KB_ALLOCATED/1048576.0)) RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE (access_age < 14) The one thing that has changed is that formerly I only ran the migration in one direction at a time ? i.e. I used to have those two rules in two separate files and would run an mmapplypolicy using the OldStuff rule the 1st weekend of the month and run the other rule the other weekends of the month. This is the 1st weekend that I attempted to run an mmapplypolicy that did both at the same time. Did I mess something up with that? I have not run it again yet because we also run migrations on the other filesystem that we are still in the process of migrating off of. So gpfs23 goes 1st and as soon as it?s done the other filesystem migration kicks off. I don?t like to run two migrations simultaneously if at all possible. The 2nd migration ran until this morning, when it was unfortunately terminated by a network switch crash that has also had me tied up all morning until now. :-( And yes, there is something else going on ? well, was going on - the network switch crash killed this too ? 
I have been running an rsync on one particular ~80TB directory tree from the old filesystem to gpfs23. I understand that the migration wouldn?t know about those files and that?s fine ? I just don?t understand why mmapplypolicy said it was going to fill the capacity pool to 98% but didn?t do it ? wait, mmapplypolicy hasn?t gone into politics, has it?!? ;-) Thanks - and again, if I should open a PMR for this please let me know... Kevin On Apr 16, 2017, at 2:15 PM, Marc A Kaplan <*makaplan at us.ibm.com* > wrote: Let's look at how mmapplypolicy does the reckoning. Before it starts, it see your pools as: [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. Your rule says you want to migrate data to gpfs23capacity, up to 98% full: RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE ... We scan your files and find and reckon... [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) So yes, 5.25Million files match the rule, but the utility chooses 1.868Million files that add up to 67,355GB and figures that if it migrates those to gpfs23capacity, (and also figuring the other migrations by your second rule)then gpfs23 will end up 97.9999% full. We show you that with our "predictions" message. Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% So that's why it chooses to migrate "only" 67GB.... See? Makes sense to me. Questions: Did you run with -I yes or -I defer ? Were some of the files illreplicated or illplaced? Did you give the cluster-wide space reckoning protocols time to see the changes? mmdf is usually "behind" by some non-neglible amount of time. What else is going on? If you're moving or deleting or creating data by other means while mmapplypolicy is running -- it doesn't "know" about that! Run it again! From: "Buterbaugh, Kevin L" <*Kevin.Buterbaugh at Vanderbilt.Edu* > To: gpfsug main discussion list <*gpfsug-discuss at spectrumscale.org* > Date: 04/16/2017 09:47 AM Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: *gpfsug-discuss-bounces at spectrumscale.org* ------------------------------ Hi All, First off, I can open a PMR for this if I need to. Second, I am far from an mmapplypolicy guru. With that out of the way ? I have an mmapplypolicy job that didn?t migrate anywhere close to what it could / should have. >From the log file I have it create, here is the part where it shows the policies I told it to invoke: [I] Qos 'maintenance' configured as inf [I] GPFS Current Data Pool Utilization in KB and % Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 55365193728 124983549952 44.297984614% gpfs23data 166747037696 343753326592 48.507759721% system 0 0 0.000000000% (no user data) [I] 75142046 of 209715200 inodes used: 35.830520%. [I] Loaded policy rules from /root/gpfs/gpfs23_migration.policy. Evaluating policy rules with CURRENT_TIMESTAMP = 2017-04-15 at 01:13:02 UTC Parsed 2 policy rules. 
RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) WHERE (((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) > 14) AND (KB_ALLOCATED > 3584)) RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75) WHERE ((DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) < 14) And then the log shows it scanning all the directories and then says, "OK, here?s what I?m going to do": [I] Summary of Rule Applicability and File Choices: Rule# Hit_Cnt KB_Hit Chosen KB_Chosen KB_Ill Rule 0 5255960 237675081344 1868858 67355430720 0 RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98.000000) WHERE(.) 1 611 236745504 611 236745504 0 RULE 'INeedThatAfterAll' MIGRATE FROM POOL 'gpfs23capacity' TO POOL 'gpfs23data' LIMIT(75.000000) WHERE(.) [I] Filesystem objects with no applicable rules: 414911602. [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 67592176224KB: 1869469 of 5256571 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 122483878944 124983549952 97.999999993% gpfs23data 104742360032 343753326592 30.470209865% system 0 0 0.000000000% (no user data) Notice that it says it?s only going to migrate less than 2 million of the 5.25 million candidate files!! And sure enough, that?s all it did: [I] A total of 1869469 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. And, not surprisingly, the gpfs23capacity pool on gpfs23 is nowhere near 98% full: Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 29.54T ( 51%) 63.93G ( 0%) eon35Dnsd 58.2T 35 No Yes 29.54T ( 51%) 64.39G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 59.08T ( 51%) 128.3G ( 0%) I don?t understand why it only migrated a small subset of what it could / should have? We are doing a migration from one filesystem (gpfs21) to gpfs23 and I really need to stuff my gpfs23capacity pool as full of data as I can to keep the migration going. Any ideas anyone? Thanks in advance? ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education *Kevin.Buterbaugh at vanderbilt.edu* - (615)875-9633 <(615)%20875-9633> _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at *spectrumscale.org* *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at *spectrumscale.org* *http://gpfsug.org/mailman/listinfo/gpfsug-discuss* _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? 
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From makaplan at us.ibm.com  Tue Apr 18 16:11:19 2017
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Tue, 18 Apr 2017 11:11:19 -0400
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To: <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>
References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu>
 <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>
Message-ID:

ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected?

------

Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of the gpfs23capacity pool to 98% and then we're done.

So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS?

Sure you can put in a PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us...

While we're waiting for that... Here's what I suggest next.

Add a clause ...

SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK))

before the WHERE clause to each of your rules.

Re-run the command with options '-I test -L 2' and collect the output.

We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"...

You should see 1.6 million lines that look kind of like this:

/yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1)

Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion).

That sanity checks the policy arithmetic. Let's assume that's okay.

Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big....

At this point, I really don't know, but I'm guessing there are some discrepancies in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ...

HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files?

The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers...
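A minimal sketch of that summing step - assuming the '-I test -L 2' output has been captured to a file (the path below is hypothetical) and that the per-file lines follow the SHOW( KB n=NLINK ) format shown above - could be a Perl one-liner along these lines:

  # sum the SHOW() KB values on lines that choose the gpfs23capacity pool
  perl -ne 'if (/TO POOL .gpfs23capacity./ and /SHOW\(\s*(\d+)/) { $kb += $1 } END { print "KB chosen for gpfs23capacity: $kb\n" }' /tmp/gpfs23.test.out

If the printed total comes out near 61 billion KB, the policy arithmetic itself checks out and the discrepancy lies somewhere else.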
So sorry, something unusual about your installation or usage... -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Tue Apr 18 16:31:12 2017 From: david_johnson at brown.edu (David D. Johnson) Date: Tue, 18 Apr 2017 11:31:12 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> Message-ID: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> I have an observation, which may merely serve to show my ignorance: Is it significant that the words "EXTERNAL EXEC/script? are seen below? If migrating between storage pools within the cluster, I would expect the PIT engine to do the migration. When doing HSM (off cluster, tape libraries, etc) is where I would expect to need a script to actually do the work. > [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. > [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; > 0 'skipped' files and/or errors. ? ddj Dave Johnson Brown University > On Apr 18, 2017, at 11:11 AM, Marc A Kaplan wrote: > > ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? > > ------ > > Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. > > So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? > > Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... > > While we're waiting for that... Here's what I suggest next. > > Add a clause ... > > SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) > > before the WHERE clause to each of your rules. > > Re-run the command with options '-I test -L 2' and collect the output. > > We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... > > You should see 1.6 million lines that look kind of like this: > > /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) > > Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed > add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). > > That sanity checks the policy arithmetic. Let's assume that's okay. > > Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as > find some of the biggest of those files and check that they really are that big.... > > At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... > and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... > > HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are > not recognized by mmapplypolicy as sharing storage... 
> This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files?
> The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ...
> Optimistically that means it works fine for most customers...
>
> So sorry, something unusual about your installation or usage...
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From makaplan at us.ibm.com  Tue Apr 18 17:06:16 2017
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Tue, 18 Apr 2017 12:06:16 -0400
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu>
References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu>
 <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>
 <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu>
Message-ID:

That is a summary message. It says one way or another, the command has dealt with 1.6 million files. For the case under discussion there are no EXTERNAL pools, nor any DELETions, just intra-GPFS MIGRATions.

[I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script;
    0 'skipped' files and/or errors.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From Kevin.Buterbaugh at Vanderbilt.Edu  Tue Apr 18 17:32:24 2017
From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L)
Date: Tue, 18 Apr 2017 16:32:24 +0000
Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not?
In-Reply-To:
References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu>
 <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu>
Message-ID: <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu>

Hi Marc,

Two things:

1. I have a PMR open now.

2. You *may* have identified the problem - I'm still checking - but files with hard links may be our problem.

I wrote a simple Perl script to iterate over the log file I had mmapplypolicy create. Here's the code (don't laugh, I'm a SysAdmin, not a programmer, and I whipped this out in < 5 minutes - and yes, I realize the fact that I used Perl instead of Python shows my age as well):

#!/usr/bin/perl
#
use strict;
use warnings;

my $InputFile = "/tmp/mmapplypolicy.gpfs23.log";

my $TotalFiles = 0;
my $TotalLinks = 0;
my $TotalSize  = 0;

open INPUT, $InputFile or die "Couldn't open $InputFile for read: $!\n";
while (<INPUT>) {
   next unless /MIGRATED/;
   $TotalFiles++;
   my $FileName = (split / /)[3];
   if ( -f $FileName ) {   # some files may have been deleted since mmapplypolicy ran
      my ($NumLinks, $FileSize) = (stat($FileName))[3,7];   # link count and size fields from stat()
      $TotalLinks += $NumLinks;
      $TotalSize  += $FileSize;
   }
}
close INPUT;

print "Number of files / links = $TotalFiles / $TotalLinks, Total size = $TotalSize\n";
exit 0;

And here's what it kicked out:

Number of files / links = 1620263 / 80818483, Total size = 53966202814094

1.6 million files but 80 million hard links!!! I'm doing some checking right now, but it appears that it is one particular group - and therefore one particular fileset - that is responsible for this - they've got thousands of files with 50 or more hard links each - and they're not inconsequential in size.
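As one quick way to do that kind of checking - confirming where the heavily hard-linked files live - find(1) can select on link count directly; the path here is just a placeholder:

  # list a sample of regular files carrying more than 50 hard links
  find /gpfs23/suspect_fileset -type f -links +50 | head -25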
IIRC (and keep in mind I?m far from a GPFS policy guru), there is a way to say something to the effect of ?and the path does not contain /gpfs23/fileset/path? ? may need a little help getting that right. I?ll post this information to the ticket as well but wanted to update the list. This wouldn?t be the first time we were an ?edge case? for something in GPFS? ;-) Thanks... Kevin On Apr 18, 2017, at 10:11 AM, Marc A Kaplan > wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Tue Apr 18 17:56:11 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Tue, 18 Apr 2017 12:56:11 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? hard links! 
A workaround In-Reply-To: <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <968C356B-8FDD-44F8-9814-F3D2470369B0@Vanderbilt.Edu> Message-ID: Kevin, Wow. Never underestimate the power of ... Anyhow try this as a fix. Add the clause SIZE(KB_ALLOCATED/NLINK) to your MIGRATE rules. This spreads the total actual size over each hardlink... From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 04/18/2017 12:33 PM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Marc, Two things: 1. I have a PMR open now. 2. You *may* have identified the problem ? I?m still checking ? but files with hard links may be our problem. I wrote a simple Perl script to interate over the log file I had mmapplypolicy create. Here?s the code (don?t laugh, I?m a SysAdmin, not a programmer, and I whipped this out in < 5 minutes ? and yes, I realize the fact that I used Perl instead of Python shows my age as well ): #!/usr/bin/perl # use strict; use warnings; my $InputFile = "/tmp/mmapplypolicy.gpfs23.log"; my $TotalFiles = 0; my $TotalLinks = 0; my $TotalSize = 0; open INPUT, $InputFile or die "Couldn\'t open $InputFile for read: $!\n"; while () { next unless /MIGRATED/; $TotalFiles++; my $FileName = (split / /)[3]; if ( -f $FileName ) { # some files may have been deleted since mmapplypolicy ran my ($NumLinks, $FileSize) = (stat($FileName))[3,7]; $TotalLinks += $NumLinks; $TotalSize += $FileSize; } } close INPUT; print "Number of files / links = $TotalFiles / $TotalLinks, Total size = $TotalSize\n"; exit 0; And here?s what it kicked out: Number of files / links = 1620263 / 80818483, Total size = 53966202814094 1.6 million files but 80 million hard links!!! I?m doing some checking right now, but it appears that it is one particular group - and therefore one particular fileset - that is responsible for this ? they?ve got thousands of files with 50 or more hard links each ? and they?re not inconsequential in size. IIRC (and keep in mind I?m far from a GPFS policy guru), there is a way to say something to the effect of ?and the path does not contain /gpfs23/fileset/path? ? may need a little help getting that right. I?ll post this information to the ticket as well but wanted to update the list. This wouldn?t be the first time we were an ?edge case? for something in GPFS? ;-) Thanks... Kevin On Apr 18, 2017, at 10:11 AM, Marc A Kaplan wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. 
We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 14:12:16 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 13:12:16 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu> <764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu> <4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> Message-ID: <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Hi All, I think we *may* be able to wrap this saga up? ;-) Dave - in regards to your question, all I know is that the tail end of the log file is ?normal? for all the successful pool migrations I?ve done in the past few years. It looks like the hard links were the problem. We have one group with a fileset on our filesystem that they use for backing up Linux boxes in their lab. That one fileset has thousands and thousands (I haven?t counted, but based on the output of that Perl script I wrote it could well be millions) of files with anywhere from 50 to 128 hard links each ? those files ranged from a few KB to a few MB in size. From what Marc said, my understanding is that with the way I had my policy rule written mmapplypolicy was seeing each of those as separate files and therefore thinking it was moving 50 to 128 times as much space to the gpfs23capacity pool as it really was for those files. Marc can correct me or clarify further if necessary. 
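To make that arithmetic concrete with a made-up example: a 4 MB file (4096 KB allocated) with 100 hard links shows up as 100 separate candidates of 4096 KB each, so the planner counts roughly 400 MB toward the LIMIT(98) target even though only 4 MB of data can actually move. With SIZE(KB_ALLOCATED/NLINK) each of those candidates is reckoned at about 41 KB, and the per-link amounts once again add up to the real 4 MB. Applied to the OldStuff rule posted earlier, the amended rule presumably looks something like this (with SIZE placed ahead of the WHERE clause, the same placement Marc suggested for SHOW):

  RULE 'OldStuff' MIGRATE FROM POOL 'gpfs23data' TO POOL 'gpfs23capacity' LIMIT(98) SIZE(KB_ALLOCATED/NLINK) WHERE ((access_age > 14) AND (KB_ALLOCATED > 3584))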
He directed me to add: SIZE(KB_ALLOCATED/NLINK) to both of my migrate rules in my policy file. I did so and kicked off another mmapplypolicy last night, which is still running. However, the prediction section now says: [I] GPFS Policy Decisions and File Choice Totals: Chose to migrate 40050141920KB: 2051495 of 2051495 candidates; Predicted Data Pool Utilization in KB and %: Pool_Name KB_Occupied KB_Total Percent_Occupied gpfs23capacity 104098980256 124983549952 83.290145220% gpfs23data 168478368352 343753326592 49.011414674% system 0 0 0.000000000% (no user data) So now it?s going to move every file it can that matches my policies because it?s figured out that a lot of those are hard links ? and I don?t have enough files matching the criteria to fill the gpfs23capacity pool to the 98% limit like mmapplypolicy thought I did before. According to the log file, it?s happily chugging along migrating files, and mmdf agrees that my gpfs23capacity pool is gradually getting more full (I have it QOSed, of course): Disks in storage pool: gpfs23capacity (Maximum disk size allowed is 519 TB) eon35Ansd 58.2T 35 No Yes 25.33T ( 44%) 68.13G ( 0%) eon35Dnsd 58.2T 35 No Yes 25.33T ( 44%) 68.49G ( 0%) ------------- -------------------- ------------------- (pool total) 116.4T 50.66T ( 44%) 136.6G ( 0%) My sincere thanks to all who took the time to respond to my questions. Of course, that goes double for Marc. We (Vanderbilt) seem to have a long tradition of finding some edge cases in GPFS going all the way back to when we originally moved off of an NFS server to GPFS (2.2, 2.3?) back in 2005. I was creating individual tarballs of each users? home directory on the NFS server, copying the tarball to one of the NSD servers, and untarring it there (don?t remember why we weren?t rsync?ing, but there was a reason). Everything was working just fine except for one user. Every time I tried to untar her home directory on GPFS it barfed part of the way thru ? turns out that until then IBM hadn?t considered that someone would want to put 6 million files in one directory. Gotta love those users! ;-) Kevin On Apr 18, 2017, at 10:31 AM, David D. Johnson > wrote: I have an observation, which may merely serve to show my ignorance: Is it significant that the words "EXTERNAL EXEC/script? are seen below? If migrating between storage pools within the cluster, I would expect the PIT engine to do the migration. When doing HSM (off cluster, tape libraries, etc) is where I would expect to need a script to actually do the work. [I] 2017-04-18 at 09:06:51.124 Policy execution. 1620263 files dispatched. [I] A total of 1620263 files have been migrated, deleted or processed by an EXTERNAL EXEC/script; 0 'skipped' files and/or errors. ? ddj Dave Johnson Brown University On Apr 18, 2017, at 11:11 AM, Marc A Kaplan > wrote: ANYONE else reading this saga? Who uses mmapplypolicy to migrate files within multi-TB file systems? Problems? Or all working as expected? ------ Well, again mmapplypolicy "thinks" it has "chosen" 1.6 million files whose total size is 61 Terabytes and migrating those will bring the occupancy of gpfs23capacity pool to 98% and then we're done. So now I'm wondering where this is going wrong. Is there some bug in the reckoning inside of mmapplypolicy or somewhere else in GPFS? Sure you can put in an PMR, and probably should. I'm guessing whoever picks up the PMR will end up calling or emailing me ... but maybe she can do some of the clerical work for us... While we're waiting for that... Here's what I suggest next. 
Add a clause ... SHOW(varchar(KB_ALLOCATED) || ' n=' || varchar(NLINK)) before the WHERE clause to each of your rules. Re-run the command with options '-I test -L 2' and collect the output. We're not actually going to move any data, but we're going to look at the files and file sizes that are "chosen"... You should see 1.6 million lines that look kind of like this: /yy/dat/bigC RULE 'msx' MIGRATE FROM POOL 'system' TO POOL 'xtra' WEIGHT(inf) SHOW( 1024 n=1) Run a script over the output to add up all the SHOW() values in the lines that contain TO POOL 'gpfs23capacity' and verify that they do indeed add up to 61TB... (The show is in KB so the SHOW numbers should add up to 61 billion). That sanity checks the policy arithmetic. Let's assume that's okay. Then the next question is whether the individual numbers are correct... Zach Giles made a suggestion... which I'll interpret as find some of the biggest of those files and check that they really are that big.... At this point, I really don't know, but I'm guessing there's some discrepances in the reported KB_ALLOCATED numbers for many of the files... and/or they are "illplaced" - the data blocks aren't all in the pool FROM POOL ... HMMMM.... I just thought about this some more and added the NLINK statistic. It would be unusual for this to be a big problem, but files that are hard linked are not recognized by mmapplypolicy as sharing storage... This has not come to my attention as a significant problem -- does the file system in question have significant GBs of hard linked files? The truth is that you're the first customer/user/admin in a long time to question/examine how mmapplypolicy does its space reckoning ... Optimistically that means it works fine for most customers... So sorry, something unusual about your installation or usage... _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 15:37:29 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 10:37:29 -0400 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. 
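For the hard-link-heavy backup/image use case discussed earlier, the mmclone flow is roughly the following sketch (file names are hypothetical, and the exact invocations should be double-checked against the mmclone man page):

  mmclone snap golden.img golden.img.base    # create a read-only clone parent from the source file
  mmclone copy golden.img.base node01.img    # make a writable clone that shares the parent's blocks
  mmclone show node01.img                    # report whether a file is a clone and what its parent is

Because clones share unchanged blocks with their parent via copy-on-write, near-identical copies consume little additional space.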
-------------- next part -------------- An HTML attachment was scrubbed... URL: From bbanister at jumptrading.com Wed Apr 19 17:18:50 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 19 Apr 2017 16:18:50 +0000 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu> <458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Hey Marc, I'm having some issues where a simple ILM list policy never completes, but I have yet to open a PMR or enable additional logging. But I was wondering if there are known reasons that this would not complete, such as when there is a symbolic link that creates a loop within the directory structure or something simple like that. Do you know of any cases like this, Marc, that I should try to find in my file systems? Thanks in advance! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, April 19, 2017 9:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Wed Apr 19 17:23:12 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Wed, 19 Apr 2017 19:23:12 +0300 Subject: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu><458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: Hi Maybe the temp list file - fill the FS that they build on. Try to monitor the FS where the temp filelist is created. 
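A quick way to act on that suggestion (the file system name, policy file, and directory below are placeholders; -s names mmapplypolicy's local work directory, which defaults to /tmp):

  # watch free space where the temporary file lists are built while the scan runs
  df -h /tmp

  # or re-run the list policy pointing its temporary files at a roomier directory
  /usr/lpp/mmfs/bin/mmapplypolicy fsname -P list.policy -I test -s /some/roomier/dir -L 2

If the run then gets noticeably further before stalling, the temporary-space theory is probably the right one.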
Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: Bryan Banister To: gpfsug main discussion list Date: 04/19/2017 07:19 PM Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Sent by: gpfsug-discuss-bounces at spectrumscale.org Hey Marc, I?m having some issues where a simple ILM list policy never completes, but I have yet to open a PMR or enable additional logging. But I was wondering if there are known reasons that this would not complete, such as when there is a symbolic link that creates a loop within the directory structure or something simple like that. Do you know of any cases like this, Marc, that I should try to find in my file systems? Thanks in advance! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [ mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Marc A Kaplan Sent: Wednesday, April 19, 2017 9:37 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] mmapplypolicy didn't migrate everything it should have - why not? Well I'm glad we followed Mr. S. Holmes dictum which I'll paraphrase... eliminate the impossible and what remains, even if it seems improbable, must hold. BTW - you may want to look at mmclone. Personally, I find the doc and terminology confusing, but mmclone was designed to efficiently store copies and near-copies of large (virtual machine) images. Uses copy-on-write strategy, similar to GPFS snapshots, but at a file by file granularity. BBTW - we fixed directories - they can now be huge (up to about 2^30 files) and automagically, efficiently grow and shrink in size. Also small directories can be stored efficiently in the inode. The last major improvement was just a few years ago. Before that they could be huge, but would never shrink. Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From makaplan at us.ibm.com Wed Apr 19 18:10:28 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 13:10:28 -0400 Subject: [gpfsug-discuss] mmapplypolicy not terminating properly? 
In-Reply-To: References: <90BBFFED-C308-41E2-A614-A0AE5DA764CD@vanderbilt.edu><764081F7-56BD-40D4-862D-9BBBD02ED214@vanderbilt.edu><4EC20B6E-8172-492D-B2ED-017359A48D03@brown.edu><458DAA01-0766-4ACB-964C-255BAC6E7975@vanderbilt.edu> Message-ID: (Bryan B asked...) Open a PMR. The first response from me will be ... Run the mmapplypolicy command again, except with additional option `-d 017` and collect output with something equivalent to `2>&1 | tee /tmp/save-all-command-output-here-to-be-passed-along-to-IBM-service ` If you are convinced that mmapplypolicy is "looping" or "hung" - wait another 2 minutes, terminate, and then pass along the saved-all-command-output. -d 017 will dump a lot of additional diagnostics -- If you want to narrow it by baby steps we could try `-d 03` first and see if there are enough clues in that. To answer two of your questions: 1. mmapplypolicy does not follow symlinks, so no "infinite loop" possible with symlinks. 2a. loops in directory are file system bugs in GPFS, (in fact in any posixish file system), (mm)fsck! 2b. mmapplypolicy does impose a limit on total length of pathnames, so even if there is a loop in the directory, mmapplypolicy will "trim" the directory walk. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 20:53:42 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 19:53:42 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data Message-ID: Hi All, We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. Now lets just say that you have a little bit of money to spend. Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Wed Apr 19 20:59:18 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 19 Apr 2017 19:59:18 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: Message-ID: Hi I'll give my opinion. Worth what you pay for. 
Do as many as you can, six in this case for the good reason you mentioned. But play with the callbacks so the migration happens on watermarks when it happens. Otherwise you might hit no space till your next policy run. The second is well documented on the redbook AFAIK Cheers -- Cheers > On 19 Apr 2017, at 22.54, Buterbaugh, Kevin L wrote: > > Hi All, > > We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. > > Now lets just say that you have a little bit of money to spend. Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. > > So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. > > Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Apr 19 21:05:49 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 19 Apr 2017 20:05:49 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Buterbaugh, Kevin L [Kevin.Buterbaugh at Vanderbilt.Edu] Sent: 19 April 2017 20:53 To: gpfsug main discussion list Subject: [gpfsug-discuss] RAID config for SSD's used for data Hi All, We currently have what I believe is a fairly typical setup ? metadata for our GPFS filesystems is the only thing in the system pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB usable space. Now lets just say that you have a little bit of money to spend. 
Your I/O demands aren?t great - in fact, they?re way on the low end ? typical (cumulative) usage is 200 - 600 MB/sec read, less than that for writes. But while GPFS has always been great and therefore you don?t need to Make GPFS Great Again, you do want to provide your users with the best possible environment. So you?re considering the purchase of a dual-controller FC storage array with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage would be in its? own storage pool and that pool would be the default location for I/O for your main filesystem ? at least for smaller files. You intend to use mmapplypolicy nightly to move data to / from this pool and the spinning disk pools. Given all that ? would you configure those disks as 6 RAID 1 mirrors and have 6 different primary NSD servers or would it be feasible to configure one big RAID 6 LUN? I?m thinking the latter is not a good idea as there could only be one primary NSD server for that one LUN, but given that: 1) I have no experience with this, and 2) I have been wrong once or twice before (), I?m looking for advice. Thanks! ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 From aaron.s.knister at nasa.gov Wed Apr 19 21:13:14 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Wed, 19 Apr 2017 16:13:14 -0400 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: On 4/19/17 4:05 PM, Simon Thompson (IT Research Support) wrote: > By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... > > And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) You mean like HAWC but for writes larger than 64K? ;-) Or I guess "HARC" as it might be called for a read cache... -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From luis.bolinches at fi.ibm.com Wed Apr 19 21:20:20 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Wed, 19 Apr 2017 20:20:20 +0000 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: Message-ID: I assume you are making the joke of external LROC. But not sure I would use external storage for LROC, as the whole point is to have really fast storage as close to the node (L for local) as possible. Maybe those SSD that will get replaced with the fancy external storage? -- Cheers > On 19 Apr 2017, at 23.13, Aaron Knister wrote: > > > >> On 4/19/17 4:05 PM, Simon Thompson (IT Research Support) wrote: >> By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... >> >> And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) > > You mean like HAWC but for writes larger than 64K? ;-) > > Or I guess "HARC" as it might be called for a read cache... > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 21:49:56 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 16:49:56 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Kevin.Buterbaugh at Vanderbilt.Edu Wed Apr 19 22:12:35 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Wed, 19 Apr 2017 21:12:35 +0000 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Hi Marc, But the limitation on GPFS replication is that I can set replication separately for metadata and data, but no matter whether I have one data pool or ten data pools they all must have the same replication, correct? And believe me I *love* GPFS replication ? I would hope / imagine that I am one of the few people on this mailing list who has actually gotten to experience a ?fire scenario? ? electrical fire, chemical suppressant did it?s thing, and everything in the data center had a nice layer of soot, ash, and chemical suppressant on and in it and therefore had to be professionally cleaned. Insurance bought us enough disk space that we could (temporarily) turn on GPFS data replication and clean storage arrays one at a time! But in my current hypothetical scenario I?m stretching the budget just to get that one storage array with 12 x 1.8 TB SSD?s in it. Two are out of the question. My current metadata that I?ve got on SSDs is on RAID 1 mirrors and has GPFS replication set to 2. I thought the multiple RAID 1 mirrors approach was the way to go for SSDs for data as well, as opposed to one big RAID 6 LUN, but wanted to get the advice of those more knowledgeable than me. Thanks! Kevin On Apr 19, 2017, at 3:49 PM, Marc A Kaplan > wrote: As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. 
And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: * Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. * GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From chekh at stanford.edu Wed Apr 19 22:23:15 2017 From: chekh at stanford.edu (Alex Chekholko) Date: Wed, 19 Apr 2017 14:23:15 -0700 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> On 04/19/2017 12:53 PM, Buterbaugh, Kevin L wrote: > > So you?re considering the purchase of a dual-controller FC storage array > with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage > would be in its? own storage pool and that pool would be the default > location for I/O for your main filesystem ? at least for smaller files. > You intend to use mmapplypolicy nightly to move data to / from this > pool and the spinning disk pools. We did this and failed in interesting (but in retrospect obvious) ways. You will want to ensure that your users cannot fill your write target pool within a day. The faster the storage, the more likely that is to happen. Or else your users will get ENOSPC. You will want to ensure that your pools can handle the additional I/O from the migration in aggregate with all the user I/O. Or else your users will see worse performance from the fast pool than the slow pool while the migration is running. You will want to make sure that the write throughput of your slow pool is faster than the read throughput of your fast pool. In our case, the fast pool was undersized in capacity, and oversized in terms of performance. And overall the filesystem was oversubscribed (~100 10GbE clients, 8 x 10GbE NSD servers) So the fast pool would fill very quickly. Then I would switch the placement policy to the big slow pool and performance would drop dramatically, and then if I ran a migration it would either (depending on parameters) take up all the I/O to the slow pool (leaving none for the users), or else take forever (weeks) because the user I/O was maxing out the slow pool. 
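For reference, the QoS feature added in Spectrum Scale 4.2 gives a knob for exactly this contention between migration and user I/O -- a rough sketch only, where the file system name, pool name and the 1000 IOPS cap are made-up values to tune for your own hardware:

mmchqos gpfs0 --enable pool=slow,maintenance=1000IOPS,other=unlimited
mmlsqos gpfs0 --seconds 60    # see how much of the cap the migration is actually using
# mmapplypolicy runs in the 'maintenance' QoS class by default once QoS is enabled,
# so a nightly migration should leave most of the slow pool's IOPS to user traffic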
Things should better today with QoS stuff, but your relative pool capacities (in our case it was like 1% fast, 99% slow) and your relative pool performance (in our case, slow pool had fewer IOPS than fast pool) are still going to matter a lot. -- Alex Chekholko chekh at stanford.edu From makaplan at us.ibm.com Wed Apr 19 22:58:24 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 17:58:24 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Kevin asked: " ... data pools they all must have the same replication, correct?" Actually no! You can use policy RULE ... SET POOL 'x' REPLICATE(2) to set the replication factor when a file is created. Use mmchattr or mmapplypolicy to change the replication factor after creation. You specify the maximum data replication factor when you create the file system (1,2,3), but any given file can have replication factor set to 1 or 2 or 3. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 21994 bytes Desc: not available URL: From kums at us.ibm.com Wed Apr 19 23:03:33 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Wed, 19 Apr 2017 18:03:33 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Hi, >> As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: >>Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. >>This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. As you pointed out, the RAID choices for GPFS may not be simple and we need to take into consideration factors such as storage subsystem configuration/capabilities such as if all drives are homogenous or there is mix of drives. If all the drives are homogeneous, then create dataAndMetadata NSDs across RAID-6 and if the storage controller supports write-cache + write-cache mirroring (WC + WM) then enable this (WC +WM) can alleviate read-modify-write for small writes (typical in metadata). If there is MIX of SSD and HDD (e.g. 15K RPM), then we need to take into consideration the aggregate IOPS of RAID-1 SSD volumes vs. RAID-6 HDDs before separating data and metadata into separate media. For example, if the storage subsystem has 2 x SSDs and ~300 x 15K RPM or NL_SAS HDDs then most likely aggregate IOPS of RAID-6 HDD volumes will be higher than RAID-1 SSD volumes. It would be recommended to also assess the I/O performance on different configuration (dataAndMetadata vs dataOnly/metadataOnly NSDs) with some application workload + production scenarios before deploying the final solution. >> GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). 
GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more >>robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. For high-resiliency (for e.g. metadataOnly) and if there are multiple storage across different failure domains (different racks/rooms/DC etc), it will be good to enable BOTH hardware RAID-1 as well as GPFS metadata replication enabled (at the minimum, -m 2). If there is single shared storage for GPFS file-system storage and metadata is separated from data, then RAID-1 would minimize administrative overhead compared to GPFS replication in the event of drive failure (since with GPFS replication across single SSD would require mmdeldisk/mmdelnsd/mmcrnsd/mmadddisk every time disk goes faulty and needs to be replaced). Best, -Kums From: Marc A Kaplan/Watson/IBM at IBMUS To: gpfsug main discussion list Date: 04/19/2017 04:50 PM Subject: Re: [gpfsug-discuss] RAID config for SSD's - potential pitfalls Sent by: gpfsug-discuss-bounces at spectrumscale.org As I've mentioned before, RAID choices for GPFS are not so simple. Here are a couple points to consider, I'm sure there's more. And if I'm wrong, someone will please correct me - but I believe the two biggest pitfalls are: Some RAID configurations (classically 5 and 6) work best with large, full block writes. When the file system does a partial block write, RAID may have to read a full "stripe" from several devices, compute the differences and then write back the modified data to several devices. This is certainly true with RAID that is configured over several storage devices, with error correcting codes. SO, you do NOT want to put GPFS metadata (system pool!) on RAID configured with large stripes and error correction. This is the Read-Modify-Write Raid pitfall. GPFS has built-in replication features - consider using those instead of RAID replication (classically Raid-1). GPFS replication can work with storage devices that are in different racks, separated by significant physical space, and from different manufacturers. This can be more robust than RAID in a single box or single rack. Consider a fire scenario, or exploding power supply or similar physical disaster. Consider that storage devices and controllers from the same manufacturer may have the same bugs, defects, failures. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From makaplan at us.ibm.com Wed Apr 19 23:41:19 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Wed, 19 Apr 2017 18:41:19 -0400 Subject: [gpfsug-discuss] RAID config for SSD's - potential pitfalls In-Reply-To: References: Message-ID: Kums is our performance guru, so weigh that appropriately and relative to my own remarks... Nevertheless, I still think RAID-5or6 is a poor choice for GPFS metadata. The write cache will NOT mitigate the read-modify-write problem of a workload that has a random or hop-scotch access pattern of small writes. In the end you've still got to read and write several times more disk blocks than you actually set out to modify. 
Same goes for any large amount of data that will be written in a pattern of non-sequential small writes. (Define a small write as less than a full RAID stripe). For sure, non-volatile write caches are a good thing - but not a be all end all solution. Relying on RAID-1 to protect your metadata may well be easier to administer, but still GPFS replication can be more robust. Doing both - belt and suspenders is fine -- if you can afford it. Either is buying 2x storage, both is 4x. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Apr 20 00:16:08 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Wed, 19 Apr 2017 23:16:08 +0000 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Message-ID: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Thu Apr 20 01:10:51 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 19 Apr 2017 20:10:51 -0400 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values In-Reply-To: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> References: <3F3E9259-1601-4473-A827-7CD5418B8C58@nuance.com> Message-ID: Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. 
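To make that concrete: once a 32-bit counter climbs past 2^31, formatting it with a signed conversion such as %d instead of %u yields a negative number with the same bit pattern. A small shell illustration -- the counter value here is back-calculated from the table above, not read from the code:

count=3882823744      # hypothetical true value of the 'read' call counter
echo "as unsigned 32-bit: $count"
echo "as signed   32-bit: $(( count >= 2**31 ? count - 2**32 : count ))"   # -412143552, matching the 'read' row above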
To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Robert.Oesterlin at nuance.com Thu Apr 20 01:21:04 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 20 Apr 2017 00:21:04 +0000 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Message-ID: Hi Eric Looks like your assumption is correct - no negative values from ?mmfsadm eventsExporter mmpmon vfss?. I don?t normally view these via ?mmfsadm?, I use the zimon stats. But, It?s a bug that should be fixed. What?s the best way to get this fixed? root at cnt-r01r07u15 ~]# mmfsadm eventsExporter mmpmon vfss _response_ begin mmpmon vfss _mmpmon::vfss_ _n_ 10.30.100.193 _nn_ cnt-r01r07u15 _rc_ 0 _t_ 1492647309 _tu_ 311964 _access_ 8472897 56529.874886 _close_ 1460223848 49854.938090 _create_ 2101927 155055.515041 _fclear_ 0 0.000000 _fsync_ 20 0.024288 _fsync_range_ 0 0.000000 _ftrunc_ 0 0.000000 _getattr_ 859626332 101183.720281 _link_ 2175473 625.343799 _lockctl_ 17326 5.229828 _lookup_ 200378610 1201985.264220 _map_lloff_ 220854519 8561.860515 _mkdir_ 817943 217390.170859 _mknod_ 3 0.001422 _open_ 1460217712 134812.649162 _read_ 3883163461 3971457.463527 _write_ 186078410 137927.496812 _mmapRead_ 17108947 10665.929860 _mmapWrite_ 0 0.000000 _aioRead_ 0 0.000000 _aioWrite_ 0 0.000000 _readdir_ 142262897 6999.189450 _readlink_ 485337171 2111.634286 _readpage_ 3646233600 14346.331414 _remove_ 4241324 93277.463798 _rename_ 350679 19334.235924 _rmdir_ 342042 2736.048976 _setacl_ 0 0.000000 _setattr_ 3709289 16963.901179 _symlink_ 161336 8522.670079 _unmap_ 3929805828 1735.740690 _writepage_ 0 0.000000 _tsfattr_ 0 0.000000 _tsfsattr_ 0 0.000000 _flock_ 0 0.000000 _setxattr_ 119 0.001042 _getxattr_ 4077218348 628418.213008 _listxattr_ 0 0.000000 _removexattr_ 15 0.000042 _encode_fh_ 0 0.000000 _decode_fh_ 0 0.000000 _get_dentry_ 0 0.000000 _get_parent_ 0 0.000000 _mount_ 0 0.000000 _statfs_ 2625497 214.309671 _sync_ 0 0.000000 _vget_ 0 0.000000 _response_ end Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Wednesday, April 19, 2017 at 7:10 PM To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479. 
If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From scale at us.ibm.com Thu Apr 20 02:03:16 2017 From: scale at us.ibm.com (IBM Spectrum Scale) Date: Wed, 19 Apr 2017 21:03:16 -0400 Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values In-Reply-To: References: Message-ID: Thanks Bob. Yes, it looks good for the hypothesis. ZIMon gets its VFSS stats from the mmpmon code that we just exercised with "mmfsadm eventsExporter mmpmon vfss"; so the ZIMon stats are also probably correct. Having said that, I agree with you that the "mmfsadm vfsstats" problem is a bug that should be fixed. If you would like to open a PMR so an APAR gets generated, it might help speed the routing of the PMR if you include in the PMR text our email exchange, and highlight Eric Agar is the GPFS developer with whom you've already discussed this issue. You could also mention that I believe I have no need for a gpfs snap. Having an APAR will help ensure the fix makes it into a PTF for the release you are using. 
If you do not want to open a PMR, I still intend to fix the problem in the development stream. Thanks again. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Cc: IBM Spectrum Scale/Poughkeepsie/IBM at IBMUS Date: 04/19/2017 08:21 PM Subject: Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Hi Eric Looks like your assumption is correct - no negative values from ?mmfsadm eventsExporter mmpmon vfss?. I don?t normally view these via ?mmfsadm?, I use the zimon stats. But, It?s a bug that should be fixed. What?s the best way to get this fixed? root at cnt-r01r07u15 ~]# mmfsadm eventsExporter mmpmon vfss _response_ begin mmpmon vfss _mmpmon::vfss_ _n_ 10.30.100.193 _nn_ cnt-r01r07u15 _rc_ 0 _t_ 1492647309 _tu_ 311964 _access_ 8472897 56529.874886 _close_ 1460223848 49854.938090 _create_ 2101927 155055.515041 _fclear_ 0 0.000000 _fsync_ 20 0.024288 _fsync_range_ 0 0.000000 _ftrunc_ 0 0.000000 _getattr_ 859626332 101183.720281 _link_ 2175473 625.343799 _lockctl_ 17326 5.229828 _lookup_ 200378610 1201985.264220 _map_lloff_ 220854519 8561.860515 _mkdir_ 817943 217390.170859 _mknod_ 3 0.001422 _open_ 1460217712 134812.649162 _read_ 3883163461 3971457.463527 _write_ 186078410 137927.496812 _mmapRead_ 17108947 10665.929860 _mmapWrite_ 0 0.000000 _aioRead_ 0 0.000000 _aioWrite_ 0 0.000000 _readdir_ 142262897 6999.189450 _readlink_ 485337171 2111.634286 _readpage_ 3646233600 14346.331414 _remove_ 4241324 93277.463798 _rename_ 350679 19334.235924 _rmdir_ 342042 2736.048976 _setacl_ 0 0.000000 _setattr_ 3709289 16963.901179 _symlink_ 161336 8522.670079 _unmap_ 3929805828 1735.740690 _writepage_ 0 0.000000 _tsfattr_ 0 0.000000 _tsfsattr_ 0 0.000000 _flock_ 0 0.000000 _setxattr_ 119 0.001042 _getxattr_ 4077218348 628418.213008 _listxattr_ 0 0.000000 _removexattr_ 15 0.000042 _encode_fh_ 0 0.000000 _decode_fh_ 0 0.000000 _get_dentry_ 0 0.000000 _get_parent_ 0 0.000000 _mount_ 0 0.000000 _statfs_ 2625497 214.309671 _sync_ 0 0.000000 _vget_ 0 0.000000 _response_ end Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: on behalf of IBM Spectrum Scale Reply-To: gpfsug main discussion list Date: Wednesday, April 19, 2017 at 7:10 PM To: gpfsug main discussion list Cc: "gpfsug-discuss-bounces at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Bob, I also noticed this recently. I think it may be a simple matter of a printf()-like statement in the code that handles "mmfsadm vfsstats" using an incorrect conversion specifier --- one that treats the counter as signed instead of unsigned and treats the counter as being smaller than it really is. 
To help confirm that hypothesis, could you please run the following commands on the node, at the same time, so the output can be compared: # mmfsadm vfsstats # mmfsadm eventsExporter mmpmon vfss I believe the code that handles "mmfsadm eventsExporter mmpmon vfss" uses the correct printf()-like conversion specifier. So, it should so good numbers where "mmfsadm vfsstats" shows negative numbers. Regards, The Spectrum Scale (GPFS) team Eric Agar ------------------------------------------------------------------------------------------------------------------ If you feel that your question can benefit other users of Spectrum Scale (GPFS), then please post it to the public IBM developerWroks Forum at https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479 . If your query concerns a potential software error in Spectrum Scale (GPFS) and you have an IBM software maintenance contract please contact 1-800-237-5511 in the United States or your local IBM Service Center in other countries. The forum is informally monitored as time permits and should not be used for priority messages to the Spectrum Scale (GPFS) team. From: "Oesterlin, Robert" To: gpfsug main discussion list Date: 04/19/2017 07:16 PM Subject: [gpfsug-discuss] mmfsadm dump vfsstats - negative values Sent by: gpfsug-discuss-bounces at spectrumscale.org I assume the counter has wrapped on some of these - would a PMR fix this? (4.2.1) [root at cnt-r01r07u15 ~]# mmfsadm vfsstats vfs statistics currently enabled started at: Fri Jan 27 16:22:02.702 2017 duration: 7091405.800 sec name calls time per call total time -------------------- -------- -------------- -------------- access 8472691 0.006672 56529.863993 close 1460175509 0.000034 49854.695358 create 2101110 0.073797 155055.263775 fsync 20 0.001214 0.024288 getattr 859449161 0.000118 101183.699413 link 2175473 0.000287 625.343799 lockctl 17326 0.000302 5.229828 lookup 200369809 0.005999 1201980.046683 map_lloff 220850355 0.000039 8561.791963 mkdir 817894 0.265793 217390.095681 mknod 3 0.000474 0.001422 open 1460169409 0.000092 134811.724068 read -412143552 0.001023 3971403.879911 write 164739329 0.000829 136616.948900 mmapRead 17108252 0.000623 10665.877349 readdir 142261835 0.000049 6999.159121 readlink 485335656 0.000004 2111.627292 readpage -648839570 0.000004 14346.195128 remove 4239806 0.022000 93277.124289 rename 350671 0.055135 19334.226490 rmdir 342019 0.008000 2736.037074 setattr 3709237 0.004573 16963.899331 symlink 160610 0.053061 8522.185175 unmap -365476297 0.000000 1735.669373 setxattr 119 0.000009 0.001042 getxattr -218316996 0.000154 628416.355002 removexattr 15 0.000003 0.000042 statfs 2624067 0.000082 214.306646 fastOpen 1456944934 0.000000 0.000000 fastClose 1515612004 0.000000 0.000000 fastLookup 77981387 0.000000 0.000000 fastRead -922882405 0.000000 0.000000 fastWrite 102606402 0.000000 0.000000 revalidate 899677 0.000000 0.000000 aio write sync 21331080 0.000061 1309.773528 Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From UWEFALKE at de.ibm.com Thu Apr 20 09:11:15 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 20 Apr 2017 10:11:15 +0200 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: Some thoughts: you give typical cumulative usage values. However, a fast pool might matter most for spikes of the traffic. Do you have spikes driving your current system to the edge? Then: using the SSD pool for writes is straightforward (placement), using it for reads will only pay off if data are either pre-fetched to the pool somehow, or read more than once before getting migrated back to the HDD pool(s). Write traffic is less than read as you wrote. RAID1 vs RAID6: RMW penalty of parity-based RAIDs was mentioned, which strikes at writes smaller than the full stripe width of your RAID - what type of write I/O do you have (or expect)? (This may also be important for choosing the quality of SSDs, with RMW in mind you will have a comparably huge amount of data written on the SSD devices if your I/O traffic consists of myriads of small IOs and you organized the SSDs in a RAID5 or RAID6) I suppose your current system is well set to provide the required aggregate throughput. Now, what kind of improvement do you expect? How are the clients connected? Would they have sufficient network bandwidth to see improvements at all? Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 gpfsug-discuss-bounces at spectrumscale.org wrote on 04/19/2017 09:53:42 PM: > From: "Buterbaugh, Kevin L" > To: gpfsug main discussion list > Date: 04/19/2017 09:54 PM > Subject: [gpfsug-discuss] RAID config for SSD's used for data > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > Hi All, > > We currently have what I believe is a fairly typical setup ? > metadata for our GPFS filesystems is the only thing in the system > pool and it?s on SSD, while data is on spinning disk (RAID 6 LUNs). > Everything connected via 8 Gb FC SAN. 8 NSD servers. Roughly 1 PB > usable space. > > Now lets just say that you have a little bit of money to spend. > Your I/O demands aren?t great - in fact, they?re way on the low end > ? typical (cumulative) usage is 200 - 600 MB/sec read, less than > that for writes. But while GPFS has always been great and therefore > you don?t need to Make GPFS Great Again, you do want to provide your > users with the best possible environment. > > So you?re considering the purchase of a dual-controller FC storage > array with 12 or so 1.8 TB SSD?s in it, with the idea being that > that storage would be in its? own storage pool and that pool would > be the default location for I/O for your main filesystem ? at least > for smaller files. You intend to use mmapplypolicy nightly to move > data to / from this pool and the spinning disk pools. > > Given all that ? 
would you configure those disks as 6 RAID 1 mirrors > and have 6 different primary NSD servers or would it be feasible to > configure one big RAID 6 LUN? I?m thinking the latter is not a good > idea as there could only be one primary NSD server for that one LUN, > but given that: 1) I have no experience with this, and 2) I have > been wrong once or twice before (), I?m looking for advice. Thanks! > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From jonathan at buzzard.me.uk Thu Apr 20 10:25:40 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 20 Apr 2017 10:25:40 +0100 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> References: <4f16617c-0ae9-18ef-bfb5-206507762fd9@stanford.edu> Message-ID: <1492680340.4102.120.camel@buzzard.me.uk> On Wed, 2017-04-19 at 14:23 -0700, Alex Chekholko wrote: > On 04/19/2017 12:53 PM, Buterbaugh, Kevin L wrote: > > > > So you?re considering the purchase of a dual-controller FC storage array > > with 12 or so 1.8 TB SSD?s in it, with the idea being that that storage > > would be in its? own storage pool and that pool would be the default > > location for I/O for your main filesystem ? at least for smaller files. > > You intend to use mmapplypolicy nightly to move data to / from this > > pool and the spinning disk pools. > > We did this and failed in interesting (but in retrospect obvious) ways. > You will want to ensure that your users cannot fill your write target > pool within a day. The faster the storage, the more likely that is to > happen. Or else your users will get ENOSPC. Eh? Seriously you should have a fail over rule so that when your "fast" pool is filled up it starts allocating in the "slow" pool (nice good names that are descriptive and less than 8 characters including termination character). Now there are issues when you get close to very full so you need to set the fail over to as sizeable bit less than the full size, 95% is a good starting point. The pool names size is important because if the fast pool is less than eight characters and the slow is more because you called in "nearline" (which is 9 including termination character) once the files get moved they get backed up again by TSM, yeah!!! The 95% bit comes about from this. Imagine you had 12KB left in the fast pool and you go to write a file. You open the file with 0B in size and then start writing. At 12KB you run out of space in the fast pool and as the file can only be in one pool you get a ENOSPC, and the file gets canned. This then starts repeating on a regular basis. So if you start allocating at significantly less than 100%, say 95% where that 5% is larger than the largest file you expect that file works, but all subsequent files get allocated in the slow pool, till you flush the fast pool. Something like this as the last two rules in your policy should do the trick. /* by default new files to the fast disk unless full, then to slow */ RULE 'new' SET POOL 'fast' LIMIT(95) RULE 'spillover' SET POOL 'slow' However in general your fast pool needs to have sufficient capacity to take your daily churn and then some. JAB. -- Jonathan A. 
Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From jonathan at buzzard.me.uk Thu Apr 20 10:32:20 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Thu, 20 Apr 2017 10:32:20 +0100 Subject: [gpfsug-discuss] RAID config for SSD's used for data In-Reply-To: References: Message-ID: <1492680740.4102.126.camel@buzzard.me.uk> On Wed, 2017-04-19 at 20:05 +0000, Simon Thompson (IT Research Support) wrote: > By having many LUNs, you get many IO queues for Linux to play with. Also the raid6 overhead can be quite significant, so it might be better to go with raid1 anyway depending on the controller... > > And if only gpfs had some sort of auto tier back up the pools for hot or data caching :-) > If you have sized the "fast" pool correctly then the "slow" pool will be spending most of it's time doing diddly squat, aka under 10 IOPS per second unless you are flushing the pool of old files to make space. I have graphs that show this. Then two things happen, if you are just reading the file then fine, probably coming from the cache or the disks are not very busy anyway so you won't notice. If you happen to *change* the file and start doing things actively with it again, then because most programs approach this by creating an entirely new file with a temporary name, then doing a rename and delete shuffle so a crash will leave you with a valid file somewhere then the changed version ends up on the fast disk by virtue of being a new file. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From p.childs at qmul.ac.uk Thu Apr 20 12:38:09 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 20 Apr 2017 11:38:09 +0000 Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories In-Reply-To: References: Message-ID: Simon, We've managed to resolve this issue by switching off quota's and switching them back on again and rebuilding the quota file. Can I check if you run quota's on your cluster. See you 2 weeks in Manchester Thanks in advance. Peter Childs Research Storage Expert ITS Research Infrastructure Queen Mary, University of London Phone: 020 7882 8393 ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Simon Thompson (IT Research Support) Sent: Tuesday, April 11, 2017 4:55:35 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Spectrum Scale Slow to create directories We actually saw this for a while on one of our clusters which was new. But by the time I'd got round to looking deeper, it had gone, maybe we were using the NSDs more heavily, or possibly we'd upgraded. We are at 4.2.2-2, so might be worth trying to bump the version and see if it goes away. We saw it on the NSD servers directly as well, so not some client trying to talk to it, so maybe there was some buggy code? Simon On 11/04/2017, 16:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Bryan Banister" wrote: >There are so many things to look at and many tools for doing so (iostat, >htop, nsdperf, mmdiag, mmhealth, mmlsconfig, mmlsfs, etc). 
I would >recommend a review of the presentation that Yuri gave at the most recent >GPFS User Group: >https://drive.google.com/drive/folders/0B124dhp9jJC-UjFlVjJTa2ZaVWs > >Cheers, >-Bryan > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Peter >Childs >Sent: Tuesday, April 11, 2017 3:58 AM >To: gpfsug main discussion list >Subject: [gpfsug-discuss] Spectrum Scale Slow to create directories > >This is a curious issue which I'm trying to get to the bottom of. > >We currently have two Spectrum Scale file systems, both are running GPFS >4.2.1-1 some of the servers have been upgraded to 4.2.1-2. > >The older one which was upgraded from GPFS 3.5 works find create a >directory is always fast and no issue. > >The new one, which has nice new SSD for metadata and hence should be >faster. can take up to 30 seconds to create a directory but usually takes >less than a second, The longer directory creates usually happen on busy >nodes that have not used the new storage in a while. (Its new so we've >not moved much of the data over yet) But it can also happen randomly >anywhere, including from the NSD servers them selves. (times of 3-4 >seconds from the NSD servers have been seen, on a single directory create) > >We've been pointed at the network and suggested we check all network >settings, and its been suggested to build an admin network, but I'm not >sure I entirely understand why and how this would help. Its a mixed >1G/10G network with the NSD servers connected at 40G with an MTU of 9000. > >However as I say, the older filesystem is fine, and it does not matter if >the nodes are connected to the old GPFS cluster or the new one, (although >the delay is worst on the old gpfs cluster), So I'm really playing spot >the difference. and the network is not really an obvious difference. > >Its been suggested to look at a trace when it occurs but as its difficult >to recreate collecting one is difficult. > >Any ideas would be most helpful. > >Thanks > > > >Peter Childs >ITS Research Infrastructure >Queen Mary, University of London >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >________________________________ > >Note: This email is for the confidential use of the named addressee(s) >only and may contain proprietary, confidential or privileged information. >If you are not the intended recipient, you are hereby notified that any >review, dissemination or copying of this email is strictly prohibited, >and to please notify the sender immediately and destroy this email and >any attachments. Email transmission cannot be guaranteed to be secure or >error-free. The Company, therefore, does not make any guarantees as to >the completeness or accuracy of this email or any attachments. This email >is for informational purposes only and does not constitute a >recommendation, offer, request or solicitation of any kind to buy, sell, >subscribe, redeem or perform any type of transaction of a financial >product. 
>_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From kenneth.waegeman at ugent.be Thu Apr 20 15:53:29 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Thu, 20 Apr 2017 16:53:29 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Message-ID: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for > writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on > the order of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that > reducing maxMBpS from 10000 to 100 fixed the problem! Digging further > I found that reducing prefetchThreads from default=72 to 32 also fixed > it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > >: > > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the > load up on one socket, you push all the interrupt handling to the > other socket where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org > > [gpfsug-discuss-bounces at spectrumscale.org > ] on behalf of > Aaron Knister [aaron.s.knister at nasa.gov > ] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going > out to > > the clients. I was having a really hard time getting anything > resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do > better than > > that. 
> > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load > I saw > > an almost 4x performance jump which is pretty much goes against > every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated > crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling > shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 > processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I > still have > > to run something to drive up the CPU load and then performance > improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm > curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Thu Apr 20 16:04:20 2017 From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]) Date: Thu, 20 Apr 2017 15:04:20 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> , <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). 
We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From UWEFALKE at de.ibm.com Thu Apr 20 16:07:32 2017 From: UWEFALKE at de.ibm.com (Uwe Falke) Date: Thu, 20 Apr 2017 17:07:32 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: Hi Kennmeth, is prefetching off or on at your storage backend? Raw sequential is very different from GPFS sequential at the storage device ! GPFS does its own prefetching, the storage would never know what sectors sequential read at GPFS level maps to at storage level! Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/20/2017 04:53 PM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. 
After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > -Aaron
> >
> > --
> > Aaron Knister
> > NASA Center for Climate Simulation (Code 606.2)
> > Goddard Space Flight Center
> > (301) 286-2776
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> --
> Aaron Knister
> NASA Center for Climate Simulation (Code 606.2)
> Goddard Space Flight Center
> (301) 286-2776
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From marcusk at nz1.ibm.com Fri Apr 21 02:21:51 2017
From: marcusk at nz1.ibm.com (Marcus Koenig1)
Date: Fri, 21 Apr 2017 14:21:51 +1300
Subject: [gpfsug-discuss] bizarre performance behavior
In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be>
Message-ID:

Hi Kenneth,

we also had similar performance numbers in our tests. Native was far
quicker than through GPFS. When we learned though that the client tested
the performance on the FS at a big blocksize (512k) with small files - we
were able to speed it up significantly using a smaller FS blocksize
(obviously we had to recreate the FS).

So really depends on how you do your tests.

Cheers,

Marcus Koenig
Lab Services Storage & Power Specialist
IBM Australia & New Zealand Advanced Technical Skills
IBM Systems-Hardware
Mobile: +64 21 67 34 27
E-mail: marcusk at nz1.ibm.com
82 Wyndham Street
Auckland, AUK 1010
New Zealand

From: "Uwe Falke"
To: gpfsug main discussion list
Date:
04/21/2017 03:07 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Kennmeth, is prefetching off or on at your storage backend? Raw sequential is very different from GPFS sequential at the storage device ! GPFS does its own prefetching, the storage would never know what sectors sequential read at GPFS level maps to at storage level! Mit freundlichen Gr??en / Kind regards Dr. Uwe Falke IT Specialist High Performance Computing Services / Integrated Technology Services / Data Center Services ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Rathausstr. 7 09111 Chemnitz Phone: +49 371 6978 2165 Mobile: +49 175 575 2877 E-Mail: uwefalke at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: Andreas Hasse, Thorsten Moehring Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 17122 From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/20/2017 04:53 PM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) 
> > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: ecblank.gif Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: 17773863.gif Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: 17405449.jpg
Type: image/jpeg
Size: 2741 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 17997200.gif
Type: image/gif
Size: 13421 bytes
Desc: not available
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: graycol.gif
Type: image/gif
Size: 105 bytes
Desc: not available
URL:

From olaf.weiser at de.ibm.com Fri Apr 21 08:25:22 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Fri, 21 Apr 2017 09:25:22 +0200
Subject: Re: [gpfsug-discuss] bizarre performance behavior
In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be>
Message-ID:

An HTML attachment was scrubbed...
URL:

From kenneth.waegeman at ugent.be Fri Apr 21 10:43:25 2017
From: kenneth.waegeman at ugent.be (Kenneth Waegeman)
Date: Fri, 21 Apr 2017 11:43:25 +0200
Subject: Re: [gpfsug-discuss] bizarre performance behavior
In-Reply-To: <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>
References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>
Message-ID: <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be>

Hi,

We are running a test setup with 2 NSD Servers backed by 4 Dell
Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the
4 powervaults, nsd02 is primary serving LUNS of controller B.

We are testing from 2 testing machines connected to the nsds with
infiniband, verbs enabled.

When we do dd from the NSD servers, we see indeed performance going to
5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to
get the data at a decent speed. Since we can write from the clients at a
good speed, I didn't suspect the communication between clients and nsds
being the issue, especially since total performance stays the same using 1
or multiple clients.

I'll use the nsdperf tool to see if we can find anything, thanks!
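In case it is useful for comparison, the local reads on the NSD servers
were plain streaming dd reads of big files in the filesystem, roughly like
this (paths, block sizes and counts are just illustrative, not the exact
commands we ran):

  dd if=/gpfs/fs0/testdir/bigfile0 of=/dev/null bs=16M count=2048 &
  dd if=/gpfs/fs0/testdir/bigfile1 of=/dev/null bs=16M count=2048 &
  wait

with one or more streams per server: on nsd00 only for the single-server
number, and on nsd00 and nsd02 together for the combined one.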
K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Interesting. Could you share a little more about your architecture? Is > it possible to mount the fs on an NSD server and do some dd's from the > fs on the NSD server? If that gives you decent performance perhaps try > NSDPERF next > https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf > > > -Aaron > > > > > On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman > wrote: >> >> Hi, >> >> >> Having an issue that looks the same as this one: >> >> We can do sequential writes to the filesystem at 7,8 GB/s total , >> which is the expected speed for our current storage >> backend. While we have even better performance with sequential reads >> on raw storage LUNS, using GPFS we can only reach 1GB/s in total >> (each nsd server seems limited by 0,5GB/s) independent of the number >> of clients >> (1,2,4,..) or ways we tested (fio,dd). We played with blockdev >> params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as >> discussed in this thread, but nothing seems to impact this read >> performance. >> >> Any ideas? >> >> Thanks! >> >> Kenneth >> >> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>> I just had a similar experience from a sandisk infiniflash system >>> SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for >>> writes. and 250-300 Mbyte/s on sequential reads!! Random reads were >>> on the order of 2 Gbyte/s. >>> >>> After a bit head scratching snd fumbling around I found out that >>> reducing maxMBpS from 10000 to 100 fixed the problem! Digging >>> further I found that reducing prefetchThreads from default=72 to 32 >>> also fixed it, while leaving maxMBpS at 10000. Can now also read at >>> 3,2 GByte/s. >>> >>> Could something like this be the problem on your box as well? >>> >>> >>> >>> -jf >>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >>> >: >>> >>> Well, I'm somewhat scrounging for hardware. This is in our test >>> environment :) And yep, it's got the 2U gpu-tray in it although even >>> without the riser it has 2 PCIe slots onboard (excluding the >>> on-board >>> dual-port mezz card) so I think it would make a fine NSD server even >>> without the riser. >>> >>> -Aaron >>> >>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT >>> Services) >>> wrote: >>> > Maybe its related to interrupt handlers somehow? You drive the >>> load up on one socket, you push all the interrupt handling to >>> the other socket where the fabric card is attached? >>> > >>> > Dunno ... (Though I am intrigued you use idataplex nodes as >>> NSD servers, I assume its some 2U gpu-tray riser one or something !) >>> > >>> > Simon >>> > ________________________________________ >>> > From: gpfsug-discuss-bounces at spectrumscale.org >>> >>> [gpfsug-discuss-bounces at spectrumscale.org >>> ] on behalf of >>> Aaron Knister [aaron.s.knister at nasa.gov >>> ] >>> > Sent: 17 February 2017 15:52 >>> > To: gpfsug main discussion list >>> > Subject: [gpfsug-discuss] bizarre performance behavior >>> > >>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>> > connections coming in and 1x FDR10 and 1x QDR connection going >>> out to >>> > the clients. I was having a really hard time getting anything >>> resembling >>> > sensible performance out of it (4-5Gb/s writes but maybe >>> 1.2Gb/s for >>> > reads). The back-end is a DDN SFA12K and I *know* it can do >>> better than >>> > that. 
>>> > >>> > I don't remember quite how I figured this out but simply by >>> running >>> > "openssl speed -multi 16" on the nsd server to drive up the >>> load I saw >>> > an almost 4x performance jump which is pretty much goes >>> against every >>> > sysadmin fiber in me (i.e. "drive up the cpu load with >>> unrelated crap to >>> > quadruple your i/o performance"). >>> > >>> > This feels like some type of C-states frequency scaling >>> shenanigans that >>> > I haven't quite ironed down yet. I booted the box with the >>> following >>> > kernel parameters "intel_idle.max_cstate=0 >>> processor.max_cstate=0" which >>> > didn't seem to make much of a difference. I also tried setting the >>> > frequency governer to userspace and setting the minimum >>> frequency to >>> > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I >>> still have >>> > to run something to drive up the CPU load and then performance >>> improves. >>> > >>> > I'm wondering if this could be an issue with the C1E state? >>> I'm curious >>> > if anyone has seen anything like this. The node is a dx360 M4 >>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>> > >>> > -Aaron >>> > >>> > -- >>> > Aaron Knister >>> > NASA Center for Climate Simulation (Code 606.2) >>> > Goddard Space Flight Center >>> > (301) 286-2776 >>> > _______________________________________________ >>> > gpfsug-discuss mailing list >>> > gpfsug-discuss at spectrumscale.org >>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > _______________________________________________ >>> > gpfsug-discuss mailing list >>> > gpfsug-discuss at spectrumscale.org >>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> > >>> >>> -- >>> Aaron Knister >>> NASA Center for Climate Simulation (Code 606.2) >>> Goddard Space Flight Center >>> (301) 286-2776 >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 10:50:55 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:50:55 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <2b0824a1-e1a2-8dd8-4a55-a57d7b00e09f@ugent.be> Hi, prefetching was already disabled at our storage backend, but a good thing to recheck :) thanks! On 20/04/17 17:07, Uwe Falke wrote: > Hi Kennmeth, > > is prefetching off or on at your storage backend? > Raw sequential is very different from GPFS sequential at the storage > device ! > GPFS does its own prefetching, the storage would never know what sectors > sequential read at GPFS level maps to at storage level! > > > Mit freundlichen Gr??en / Kind regards > > > Dr. 
Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Andreas Hasse, Thorsten Moehring > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Kenneth Waegeman > To: gpfsug main discussion list > Date: 04/20/2017 04:53 PM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > > Having an issue that looks the same as this one: > We can do sequential writes to the filesystem at 7,8 GB/s total , which is > the expected speed for our current storage > backend. While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd > server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in > this thread, but nothing seems to impact this read performance. > Any ideas? > Thanks! > > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. > and 250-300 Mbyte/s on sequential reads!! Random reads were on the order > of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that reducing > maxMBpS from 10000 to 100 fixed the problem! Digging further I found that > reducing prefetchThreads from default=72 to 32 also fixed it, while > leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > : > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: >> Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? >> Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, > I assume its some 2U gpu-tray riser one or something !) >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] >> Sent: 17 February 2017 15:52 >> To: gpfsug main discussion list >> Subject: [gpfsug-discuss] bizarre performance behavior >> >> This is a good one. I've got an NSD server with 4x 16GB fibre >> connections coming in and 1x FDR10 and 1x QDR connection going out to >> the clients. 
I was having a really hard time getting anything resembling >> sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for >> reads). The back-end is a DDN SFA12K and I *know* it can do better than >> that. >> >> I don't remember quite how I figured this out but simply by running >> "openssl speed -multi 16" on the nsd server to drive up the load I saw >> an almost 4x performance jump which is pretty much goes against every >> sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to >> quadruple your i/o performance"). >> >> This feels like some type of C-states frequency scaling shenanigans that >> I haven't quite ironed down yet. I booted the box with the following >> kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which >> didn't seem to make much of a difference. I also tried setting the >> frequency governer to userspace and setting the minimum frequency to >> 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have >> to run something to drive up the CPU load and then performance improves. >> >> I'm wondering if this could be an issue with the C1E state? I'm curious >> if anyone has seen anything like this. The node is a dx360 M4 >> (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >> >> -Aaron >> >> -- >> Aaron Knister >> NASA Center for Climate Simulation (Code 606.2) >> Goddard Space Flight Center >> (301) 286-2776 >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss From kenneth.waegeman at ugent.be Fri Apr 21 10:52:58 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 11:52:58 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> Message-ID: <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be> Hi, Tried these settings, but sadly I'm not seeing any changes. Thanks, Kenneth On 21/04/17 09:25, Olaf Weiser wrote: > pls check > workerThreads (assuming you 're > 4.2.2) start with 128 .. increase > iteratively > pagepool at least 8 G > ignorePrefetchLunCount=yes (1) > > then you won't see a difference and GPFS is as fast or even faster .. 
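For reference, "tried these settings" means I applied them more or less as
below - take this as a sketch rather than the exact commands (syntax from
memory, and some of these may need a daemon restart rather than taking
effect immediately on our level); nsd00/nsd02 are our two NSD servers:

  mmchconfig workerThreads=128,ignorePrefetchLunCount=yes -N nsd00,nsd02
  mmchconfig pagepool=8G -N nsd00,nsd02
  mmshutdown -N nsd00,nsd02 && mmstartup -N nsd00,nsd02

plus the equivalent change on the test clients for the client-side values.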
> > > > From: "Marcus Koenig1" > To: gpfsug main discussion list > Date: 04/21/2017 03:24 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Hi Kennmeth, > > we also had similar performance numbers in our tests. Native was far > quicker than through GPFS. When we learned though that the client > tested the performance on the FS at a big blocksize (512k) with small > files - we were able to speed it up significantly using a smaller FS > blocksize (obviously we had to recreate the FS). > > So really depends on how you do your tests. > > *Cheers,* > * > Marcus Koenig* > Lab Services Storage & Power Specialist/ > IBM Australia & New Zealand Advanced Technical Skills/ > IBM Systems-Hardware > ------------------------------------------------------------------------ > > *Mobile:*+64 21 67 34 27* > E-mail:*_marcusk at nz1.ibm.com_ > > 82 Wyndham Street > Auckland, AUK 1010 > New Zealand > > > > > > > > > > Inactive hide details for "Uwe Falke" ---04/21/2017 03:07:48 AM---Hi > Kennmeth, is prefetching off or on at your storage backe"Uwe Falke" > ---04/21/2017 03:07:48 AM---Hi Kennmeth, is prefetching off or on at > your storage backend? > > From: "Uwe Falke" > To: gpfsug main discussion list > Date: 04/21/2017 03:07 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------------------------------------------------ > > > > Hi Kennmeth, > > is prefetching off or on at your storage backend? > Raw sequential is very different from GPFS sequential at the storage > device ! > GPFS does its own prefetching, the storage would never know what sectors > sequential read at GPFS level maps to at storage level! > > > Mit freundlichen Gr??en / Kind regards > > > Dr. Uwe Falke > > IT Specialist > High Performance Computing Services / Integrated Technology Services / > Data Center Services > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland > Rathausstr. 7 > 09111 Chemnitz > Phone: +49 371 6978 2165 > Mobile: +49 175 575 2877 > E-Mail: uwefalke at de.ibm.com > ------------------------------------------------------------------------------------------------------------------------------------------- > IBM Deutschland Business & Technology Services GmbH / Gesch?ftsf?hrung: > Andreas Hasse, Thorsten Moehring > Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, > HRB 17122 > > > > > From: Kenneth Waegeman > To: gpfsug main discussion list > Date: 04/20/2017 04:53 PM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > > > Hi, > > Having an issue that looks the same as this one: > We can do sequential writes to the filesystem at 7,8 GB/s total , > which is > the expected speed for our current storage > backend. While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd > server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in > this thread, but nothing seems to impact this read performance. > Any ideas? > Thanks! 
> > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. > and 250-300 Mbyte/s on sequential reads!! Random reads were on the order > of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that reducing > maxMBpS from 10000 to 100 fixed the problem! Digging further I found that > reducing prefetchThreads from default=72 to 32 also fixed it, while > leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, > I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org_ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 3720 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/jpeg Size: 2741 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 13421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 45 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available
Type: image/gif
Size: 105 bytes
Desc: not available
URL:

From makaplan at us.ibm.com Fri Apr 21 13:58:26 2017
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Fri, 21 Apr 2017 08:58:26 -0400
Subject: [gpfsug-discuss] bizarre performance behavior - prefetchThreads
In-Reply-To: <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be>
References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <94f2ca6e-cf6b-ef6a-1b27-45d7a449a379@ugent.be>
Message-ID:

Seems counter-intuitive, but we have testimony that you may need to reduce
the prefetchThreads parameter. Of all the parameters, that's the one that
directly affects prefetching, so it is worth trying.

Jan-Frode Myklebust wrote: ...Digging further I found that reducing
prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS
at 10000. Can now also read at 3,2 GByte/s....

I can speculate that setting prefetchThreads too high may create a
contention situation where more threads cause overall degradation in
system performance.

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: image/gif
Size: 21994 bytes
Desc: not available
URL:

From aaron.s.knister at nasa.gov Fri Apr 21 14:10:49 2017
From: aaron.s.knister at nasa.gov (Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP])
Date: Fri, 21 Apr 2017 13:10:49 +0000
Subject: [gpfsug-discuss] bizarre performance behavior
In-Reply-To: <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be>
References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be>
Message-ID:

Fantastic news! It might also be worth running "cpupower monitor" or
"turbostat" on your NSD servers while you're running dd tests from the
clients to see what CPU frequency your cores are actually running at.

A typical NSD server workload (especially with IB verbs and for reads) can
be pretty light on CPU, which might not prompt your CPU frequency governor
to up the frequency (which can affect throughput). If your frequency
scaling governor isn't kicking up the frequency of your CPUs, I've seen
that cause this behavior in my testing.

-Aaron

On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote:
Hi,
We are running a test setup with 2 NSD Servers backed by 4 Dell
Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the
4 powervaults, nsd02 is primary serving LUNS of controller B.
We are testing from 2 testing machines connected to the nsds with
infiniband, verbs enabled.
When we do dd from the NSD servers, we see indeed performance going to
5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to
get the data at a decent speed. Since we can write from the clients at a
good speed, I didn't suspect the communication between clients and nsds
being the issue, especially since total performance stays the same using 1
or multiple clients.
I'll use the nsdperf tool to see if we can find anything, thanks!
K
On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote:
Interesting. Could you share a little more about your architecture? Is it
possible to mount the fs on an NSD server and do some dd's from the fs on
the NSD server?
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister <aaron.s.knister at nasa.gov>: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. 
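A quick way to check whether the cores really are dropping into deep C-states while a test runs (a sketch -- assumes the cpupower and turbostat utilities are installed):

  # which C-states the kernel currently has enabled
  cpupower idle-info
  # per-core frequency and C-state residency sampled over ~10 seconds of a dd run
  turbostat sleep 10
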
I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Fri Apr 21 14:18:34 2017 From: david_johnson at brown.edu (David D Johnson) Date: Fri, 21 Apr 2017 09:18:34 -0400 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <02C0BD31-E743-4F1C-91E7-20555099CBF5@brown.edu> We had some luck making the client and server IB performance consistently decent by configuring tuned with the profile "latency-performance". The key is the line /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=1 which prevents cpu from going to sleep just when the next burst of IB traffic is about to arrive. -- ddj Dave Johnson Brown University CCV On Apr 21, 2017, at 9:10 AM, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > > Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. > > A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: >> Hi, >> We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. 
>> We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. >> When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. >> >> I'll use the nsdperf tool to see if we can find anything, >> >> thanks! >> >> K >> >> On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: >>> Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf >>> >>> -Aaron >>> >>> >>> >>> >>> On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: >>>> Hi, >>>> >>>> >>>> Having an issue that looks the same as this one: >>>> >>>> We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage >>>> backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>>> (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. >>>> Any ideas? >>>> >>>> Thanks! >>>> >>>> Kenneth >>>> >>>> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>>>> I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. >>>>> >>>>> After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. >>>>> >>>>> Could something like this be the problem on your box as well? >>>>> >>>>> >>>>> >>>>> -jf >>>>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister < aaron.s.knister at nasa.gov >: >>>>> Well, I'm somewhat scrounging for hardware. This is in our test >>>>> environment :) And yep, it's got the 2U gpu-tray in it although even >>>>> without the riser it has 2 PCIe slots onboard (excluding the on-board >>>>> dual-port mezz card) so I think it would make a fine NSD server even >>>>> without the riser. >>>>> >>>>> -Aaron >>>>> >>>>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) >>>>> wrote: >>>>> > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? >>>>> > >>>>> > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) 
>>>>> > >>>>> > Simon >>>>> > ________________________________________ >>>>> > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org ] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov ] >>>>> > Sent: 17 February 2017 15:52 >>>>> > To: gpfsug main discussion list >>>>> > Subject: [gpfsug-discuss] bizarre performance behavior >>>>> > >>>>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>>>> > connections coming in and 1x FDR10 and 1x QDR connection going out to >>>>> > the clients. I was having a really hard time getting anything resembling >>>>> > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for >>>>> > reads). The back-end is a DDN SFA12K and I *know* it can do better than >>>>> > that. >>>>> > >>>>> > I don't remember quite how I figured this out but simply by running >>>>> > "openssl speed -multi 16" on the nsd server to drive up the load I saw >>>>> > an almost 4x performance jump which is pretty much goes against every >>>>> > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to >>>>> > quadruple your i/o performance"). >>>>> > >>>>> > This feels like some type of C-states frequency scaling shenanigans that >>>>> > I haven't quite ironed down yet. I booted the box with the following >>>>> > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which >>>>> > didn't seem to make much of a difference. I also tried setting the >>>>> > frequency governer to userspace and setting the minimum frequency to >>>>> > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have >>>>> > to run something to drive up the CPU load and then performance improves. >>>>> > >>>>> > I'm wondering if this could be an issue with the C1E state? I'm curious >>>>> > if anyone has seen anything like this. The node is a dx360 M4 >>>>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>>>> > >>>>> > -Aaron >>>>> > >>>>> > -- >>>>> > Aaron Knister >>>>> > NASA Center for Climate Simulation (Code 606.2) >>>>> > Goddard Space Flight Center >>>>> > (301) 286-2776 >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From kums at us.ibm.com Fri Apr 21 15:01:33 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 21 Apr 2017 14:01:33 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) Turbo Mode - Enable QPI Link Frequency - Max Performance Operating Mode - Maximum Performance >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" To: gpfsug main discussion list Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? 
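A sketch of the verbs checks suggested above, run on the NSD client while a dd test is active (the log path shown is the usual GPFS default):

  # is verbs RDMA started on this node?
  mmfsadm test verbs status
  # are the connections to the NSD servers actually using RDMA, and moving data?
  mmfsadm test verbs conn
  # any RDMA-related errors logged?
  grep -i verbs /var/adm/ras/mmfs.log.latest
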
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. 
I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From bbanister at jumptrading.com Fri Apr 21 16:01:54 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 21 Apr 2017 15:01:54 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov>, <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <7dcbac92e19043faa7968702d852668f@jumptrading.com> I think we have a new topic and new speaker for the next UG meeting at SC! Kums presenting "Performance considerations for Spectrum Scale"!! Kums, I have to say you do have a lot to offer here... 
;o) -Bryan Disclaimer: There are some selfish reasons of me wanting to hang out with you again involved in this suggestion From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Kumaran Rajaram Sent: Friday, April 21, 2017 9:02 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] bizarre performance behavior Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) * Turbo Mode - Enable * QPI Link Frequency - Max Performance * Operating Mode - Maximum Performance * >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). * [cid:image001.gif at 01D2BA86.4D4B4C10] [cid:image002.gif at 01D2BA86.4D4B4C10] [cid:image003.gif at 01D2BA86.4D4B4C10] Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" > To: gpfsug main discussion list > Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? 
If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >: Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. 
I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 61023 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 85131 bytes Desc: image002.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image003.gif Type: image/gif Size: 84819 bytes Desc: image003.gif URL: From g.mangeot at gmail.com Fri Apr 21 16:04:58 2017 From: g.mangeot at gmail.com (Guillaume Mangeot) Date: Fri, 21 Apr 2017 17:04:58 +0200 Subject: [gpfsug-discuss] HA on snapshot scheduling in GPFS GUI Message-ID: Hi, I'm looking for a way to get the GUI working in HA to schedule snapshots. I have 2 servers with gpfs.gui service running on them. I checked a bit with lssnaprule in /usr/lpp/mmfs/gui/cli and the file /var/lib/mmfs/gui/snapshots.json But it doesn't look to be shared between all the GUI servers. 
Is there a way to get GPFS GUI working in HA to schedule snapshots? (keeping the coherency: avoiding to trigger snapshots on both servers in the same time) Regards, Guillaume Mangeot DDN Storage -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 16:33:16 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 17:33:16 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <41475044-c195-5561-c94a-b54ee30c7e68@ugent.be> On 21/04/17 15:10, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: > Fantastic news! It might also be worth running "cpupower monitor" or > "turbostat" on your NSD servers while you're running dd tests from the > clients to see what CPU frequency your cores are actually running at. Thanks! I verified with turbostat and cpuinfo, our cpus are running in high performance mode and frequency is always at highest level. > > A typical NSD server workload (especially with IB verbs and for reads) > can be pretty light on CPU which might not prompt your CPU crew > governor to up the frequency (which can affect throughout). If your > frequency scaling governor isn't kicking up the frequency of your CPUs > I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: >> >> Hi, >> >> We are running a test setup with 2 NSD Servers backed by 4 Dell >> Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of >> the 4 powervaults, nsd02 is primary serving LUNS of controller B. >> >> We are testing from 2 testing machines connected to the nsds with >> infiniband, verbs enabled. >> >> When we do dd from the NSD servers, we see indeed performance going >> to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is >> able to get the data at a decent speed. Since we can write from the >> clients at a good speed, I didn't suspect the communication between >> clients and nsds being the issue, especially since total performance >> stays the same using 1 or multiple clients. >> >> I'll use the nsdperf tool to see if we can find anything, >> >> thanks! >> >> K >> >> On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE >> CORP] wrote: >>> Interesting. Could you share a little more about your architecture? >>> Is it possible to mount the fs on an NSD server and do some dd's >>> from the fs on the NSD server? If that gives you decent performance >>> perhaps try NSDPERF next >>> https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf >>> >>> >>> >>> -Aaron >>> >>> >>> >>> >>> On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman >>> wrote: >>>> >>>> Hi, >>>> >>>> >>>> Having an issue that looks the same as this one: >>>> >>>> We can do sequential writes to the filesystem at 7,8 GB/s total , >>>> which is the expected speed for our current storage >>>> backend. While we have even better performance with sequential >>>> reads on raw storage LUNS, using GPFS we can only reach 1GB/s in >>>> total (each nsd server seems limited by 0,5GB/s) independent of the >>>> number of clients >>>> (1,2,4,..) or ways we tested (fio,dd). 
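For reference, the kind of check meant here (a sketch, run on the NSD servers while a client dd test is going):

  # current per-core frequencies
  grep MHz /proc/cpuinfo
  # active governor and hardware frequency limits
  cpupower frequency-info
  # live per-core frequency/idle statistics
  cpupower monitor
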
We played with blockdev >>>> params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. >>>> as discussed in this thread, but nothing seems to impact this read >>>> performance. >>>> >>>> Any ideas? >>>> >>>> Thanks! >>>> >>>> Kenneth >>>> >>>> On 17/02/17 19:29, Jan-Frode Myklebust wrote: >>>>> I just had a similar experience from a sandisk infiniflash system >>>>> SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for >>>>> writes. and 250-300 Mbyte/s on sequential reads!! Random reads >>>>> were on the order of 2 Gbyte/s. >>>>> >>>>> After a bit head scratching snd fumbling around I found out that >>>>> reducing maxMBpS from 10000 to 100 fixed the problem! Digging >>>>> further I found that reducing prefetchThreads from default=72 to >>>>> 32 also fixed it, while leaving maxMBpS at 10000. Can now also >>>>> read at 3,2 GByte/s. >>>>> >>>>> Could something like this be the problem on your box as well? >>>>> >>>>> >>>>> >>>>> -jf >>>>> fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister >>>>> >: >>>>> >>>>> Well, I'm somewhat scrounging for hardware. This is in our test >>>>> environment :) And yep, it's got the 2U gpu-tray in it >>>>> although even >>>>> without the riser it has 2 PCIe slots onboard (excluding the >>>>> on-board >>>>> dual-port mezz card) so I think it would make a fine NSD >>>>> server even >>>>> without the riser. >>>>> >>>>> -Aaron >>>>> >>>>> On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT >>>>> Services) >>>>> wrote: >>>>> > Maybe its related to interrupt handlers somehow? You drive >>>>> the load up on one socket, you push all the interrupt handling >>>>> to the other socket where the fabric card is attached? >>>>> > >>>>> > Dunno ... (Though I am intrigued you use idataplex nodes as >>>>> NSD servers, I assume its some 2U gpu-tray riser one or >>>>> something !) >>>>> > >>>>> > Simon >>>>> > ________________________________________ >>>>> > From: gpfsug-discuss-bounces at spectrumscale.org >>>>> >>>>> [gpfsug-discuss-bounces at spectrumscale.org >>>>> ] on behalf >>>>> of Aaron Knister [aaron.s.knister at nasa.gov >>>>> ] >>>>> > Sent: 17 February 2017 15:52 >>>>> > To: gpfsug main discussion list >>>>> > Subject: [gpfsug-discuss] bizarre performance behavior >>>>> > >>>>> > This is a good one. I've got an NSD server with 4x 16GB fibre >>>>> > connections coming in and 1x FDR10 and 1x QDR connection >>>>> going out to >>>>> > the clients. I was having a really hard time getting >>>>> anything resembling >>>>> > sensible performance out of it (4-5Gb/s writes but maybe >>>>> 1.2Gb/s for >>>>> > reads). The back-end is a DDN SFA12K and I *know* it can do >>>>> better than >>>>> > that. >>>>> > >>>>> > I don't remember quite how I figured this out but simply by >>>>> running >>>>> > "openssl speed -multi 16" on the nsd server to drive up the >>>>> load I saw >>>>> > an almost 4x performance jump which is pretty much goes >>>>> against every >>>>> > sysadmin fiber in me (i.e. "drive up the cpu load with >>>>> unrelated crap to >>>>> > quadruple your i/o performance"). >>>>> > >>>>> > This feels like some type of C-states frequency scaling >>>>> shenanigans that >>>>> > I haven't quite ironed down yet. I booted the box with the >>>>> following >>>>> > kernel parameters "intel_idle.max_cstate=0 >>>>> processor.max_cstate=0" which >>>>> > didn't seem to make much of a difference. I also tried >>>>> setting the >>>>> > frequency governer to userspace and setting the minimum >>>>> frequency to >>>>> > 2.6ghz (it's a 2.6ghz cpu). 
None of that really matters-- I >>>>> still have >>>>> > to run something to drive up the CPU load and then >>>>> performance improves. >>>>> > >>>>> > I'm wondering if this could be an issue with the C1E state? >>>>> I'm curious >>>>> > if anyone has seen anything like this. The node is a dx360 M4 >>>>> > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. >>>>> > >>>>> > -Aaron >>>>> > >>>>> > -- >>>>> > Aaron Knister >>>>> > NASA Center for Climate Simulation (Code 606.2) >>>>> > Goddard Space Flight Center >>>>> > (301) 286-2776 >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > _______________________________________________ >>>>> > gpfsug-discuss mailing list >>>>> > gpfsug-discuss at spectrumscale.org >>>>> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> > >>>>> >>>>> -- >>>>> Aaron Knister >>>>> NASA Center for Climate Simulation (Code 606.2) >>>>> Goddard Space Flight Center >>>>> (301) 286-2776 >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>>> >>>>> >>>>> >>>>> _______________________________________________ >>>>> gpfsug-discuss mailing list >>>>> gpfsug-discuss at spectrumscale.org >>>>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>>> >>> >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From kenneth.waegeman at ugent.be Fri Apr 21 16:42:34 2017 From: kenneth.waegeman at ugent.be (Kenneth Waegeman) Date: Fri, 21 Apr 2017 17:42:34 +0200 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> <4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be> <67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov> <9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> Message-ID: <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> Hi, We already verified this on our nsds: [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --QpiSpeed QpiSpeed=maxdatarate [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --turbomode turbomode=enable [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg ?-SysProfile SysProfile=perfoptimized so sadly this is not the issue. Also the output of the verbs commands look ok, there are connections from the client to the nsds are there is data being read and writen. Thanks again! Kenneth On 21/04/17 16:01, Kumaran Rajaram wrote: > Hi, > > Try enabling the following in the BIOS of the NSD servers (screen > shots below) > > * Turbo Mode - Enable > * QPI Link Frequency - Max Performance > * Operating Mode - Maximum Performance > * > > >>>>While we have even better performance with sequential reads on > raw storage LUNS, using GPFS we can only reach 1GB/s in total > (each nsd server seems limited by 0,5GB/s) independent of the > number of clients > > >>We are testing from 2 testing machines connected to the nsds > with infiniband, verbs enabled. 
> > > Also, It will be good to verify that all the GPFS nodes have Verbs > RDMA started using "mmfsadm test verbs status" and that the NSD > client-server communication from client to server during "dd" is > actually using Verbs RDMA using "mmfsadm test verbs conn" command (on > NSD client doing dd). If not, then GPFS might be using TCP/IP network > over which the cluster is configured impacting performance (If this is > the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and > resolve). > > * > > > > > > > Regards, > -Kums > > > > > > > From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" > > To: gpfsug main discussion list > Date: 04/21/2017 09:11 AM > Subject: Re: [gpfsug-discuss] bizarre performance behavior > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------------------------------------------------ > > > > Fantastic news! It might also be worth running "cpupower monitor" or > "turbostat" on your NSD servers while you're running dd tests from the > clients to see what CPU frequency your cores are actually running at. > > A typical NSD server workload (especially with IB verbs and for reads) > can be pretty light on CPU which might not prompt your CPU crew > governor to up the frequency (which can affect throughout). If your > frequency scaling governor isn't kicking up the frequency of your CPUs > I've seen that cause this behavior in my testing. > > -Aaron > > > > > On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman > wrote: > > Hi, > > We are running a test setup with 2 NSD Servers backed by 4 Dell > Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of > the 4 powervaults, nsd02 is primary serving LUNS of controller B. > > We are testing from 2 testing machines connected to the nsds with > infiniband, verbs enabled. > > When we do dd from the NSD servers, we see indeed performance going to > 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is > able to get the data at a decent speed. Since we can write from the > clients at a good speed, I didn't suspect the communication between > clients and nsds being the issue, especially since total performance > stays the same using 1 or multiple clients. > > I'll use the nsdperf tool to see if we can find anything, > > thanks! > > K > > On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE > CORP] wrote: > Interesting. Could you share a little more about your architecture? Is > it possible to mount the fs on an NSD server and do some dd's from the > fs on the NSD server? If that gives you decent performance perhaps try > NSDPERF next > _https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf_ > > > -Aaron > > > > > On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman > __ wrote: > > Hi, > > Having an issue that looks the same as this one: > > We can do sequential writes to the filesystem at 7,8 GB/s total , > which is the expected speed for our current storage > backend. While we have even better performance with sequential reads > on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each > nsd server seems limited by 0,5GB/s) independent of the number of clients > (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, > MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed > in this thread, but nothing seems to impact this read performance. > > Any ideas? > > Thanks! 
> > Kenneth > > On 17/02/17 19:29, Jan-Frode Myklebust wrote: > I just had a similar experience from a sandisk infiniflash system > SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for > writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on > the order of 2 Gbyte/s. > > After a bit head scratching snd fumbling around I found out that > reducing maxMBpS from 10000 to 100 fixed the problem! Digging further > I found that reducing prefetchThreads from default=72 to 32 also fixed > it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. > > Could something like this be the problem on your box as well? > > > > -jf > fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister > <_aaron.s.knister at nasa.gov_ >: > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load > up on one socket, you push all the interrupt handling to the other > socket where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD > servers, I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: _gpfsug-discuss-bounces at spectrumscale.org_ > [_gpfsug-discuss-bounces at spectrumscale.org_ > ] on behalf of Aaron > Knister [_aaron.s.knister at nasa.gov_ ] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at _spectrumscale.org_ > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at _spectrumscale.org_ > > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at _spectrumscale.org_ _ > __http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > _http://gpfsug.org/mailman/listinfo/gpfsug-discuss_ > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From kums at us.ibm.com Fri Apr 21 21:27:49 2017 From: kums at us.ibm.com (Kumaran Rajaram) Date: Fri, 21 Apr 2017 20:27:49 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov><4896c9cd-16d0-234d-b867-4787b41910cd@ugent.be><67E31108-39CE-4F37-8EF4-F0B548A4735C@nasa.gov><9dbcde5d-c7b1-717c-f7b9-a5b9665cfa98@ugent.be> <7f7349c9-bdd3-5847-1cca-d98d221489fe@ugent.be> Message-ID: Hi Kenneth, As it was mentioned earlier, it will be good to first verify the raw network performance between the NSD client and NSD server using the nsdperf tool that is built with RDMA support. g++ -O2 -DRDMA -o nsdperf -lpthread -lrt -libverbs -lrdmacm nsdperf.C In addition, since you have 2 x NSD servers it will be good to perform NSD client file-system performance test with just single NSD server (mmshutdown the other server, assuming all the NSDs have primary, server NSD server configured + Quorum will be intact when a NSD server is brought down) to see if it helps to improve the read performance + if there are variations in the file-system read bandwidth results between NSD_server#1 'active' vs. NSD_server #2 'active' (with other NSD server in GPFS "down" state). If there is significant variation, it can help to isolate the issue to particular NSD server (HW or IB issue?). 
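A minimal sketch of such an nsdperf run between one client and one NSD server (host names are placeholders, and the commands follow the usage notes at the top of nsdperf.C, so double-check them against your build):

  # start the daemon on every node taking part in the test (nsd00 and client01 here)
  ./nsdperf -s
  # then, from one node, start the control instance and type the following at its prompt:
  ./nsdperf
  server nsd00
  client client01
  rdma on
  test
  quit
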
You can issue "mmdiag --waiters" on NSD client as well as NSD servers during your dd test, to verify if there are unsual long GPFS waiters. In addition, you may issue Linux "perf top -z" command on the GPFS node to see if there is high CPU usage by any particular call/event (for e.g., If GPFS config parameter verbsRdmaMaxSendBytes has been set to low value from the default 16M, then it can cause RDMA completion threads to go CPU bound ). Please verify some performance scenarios detailed in Chapter 22 in Spectrum Scale Problem Determination Guide (link below). https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/pdf/scale_pdg.pdf?view=kc Thanks, -Kums From: Kenneth Waegeman To: gpfsug main discussion list Date: 04/21/2017 11:43 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, We already verified this on our nsds: [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --QpiSpeed QpiSpeed=maxdatarate [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg --turbomode turbomode=enable [root at nsd00 ~]# /opt/dell/toolkit/bin/syscfg ?-SysProfile SysProfile=perfoptimized so sadly this is not the issue. Also the output of the verbs commands look ok, there are connections from the client to the nsds are there is data being read and writen. Thanks again! Kenneth On 21/04/17 16:01, Kumaran Rajaram wrote: Hi, Try enabling the following in the BIOS of the NSD servers (screen shots below) Turbo Mode - Enable QPI Link Frequency - Max Performance Operating Mode - Maximum Performance >>>>While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients >>We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. Also, It will be good to verify that all the GPFS nodes have Verbs RDMA started using "mmfsadm test verbs status" and that the NSD client-server communication from client to server during "dd" is actually using Verbs RDMA using "mmfsadm test verbs conn" command (on NSD client doing dd). If not, then GPFS might be using TCP/IP network over which the cluster is configured impacting performance (If this is the case, GPFS mmfs.log.latest for any Verbs RDMA related errors and resolve). Regards, -Kums From: "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" To: gpfsug main discussion list Date: 04/21/2017 09:11 AM Subject: Re: [gpfsug-discuss] bizarre performance behavior Sent by: gpfsug-discuss-bounces at spectrumscale.org Fantastic news! It might also be worth running "cpupower monitor" or "turbostat" on your NSD servers while you're running dd tests from the clients to see what CPU frequency your cores are actually running at. A typical NSD server workload (especially with IB verbs and for reads) can be pretty light on CPU which might not prompt your CPU crew governor to up the frequency (which can affect throughout). If your frequency scaling governor isn't kicking up the frequency of your CPUs I've seen that cause this behavior in my testing. -Aaron On April 21, 2017 at 05:43:40 EDT, Kenneth Waegeman wrote: Hi, We are running a test setup with 2 NSD Servers backed by 4 Dell Powervaults MD3460s. nsd00 is primary serving LUNS of controller A of the 4 powervaults, nsd02 is primary serving LUNS of controller B. We are testing from 2 testing machines connected to the nsds with infiniband, verbs enabled. 
When we do dd from the NSD servers, we see indeed performance going to 5.8GB/s for one nsd, 7.2GB/s for the two! So it looks like GPFS is able to get the data at a decent speed. Since we can write from the clients at a good speed, I didn't suspect the communication between clients and nsds being the issue, especially since total performance stays the same using 1 or multiple clients. I'll use the nsdperf tool to see if we can find anything, thanks! K On 20/04/17 17:04, Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP] wrote: Interesting. Could you share a little more about your architecture? Is it possible to mount the fs on an NSD server and do some dd's from the fs on the NSD server? If that gives you decent performance perhaps try NSDPERF next https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General+Parallel+File+System+(GPFS)/page/Testing+network+performance+with+nsdperf -Aaron On April 20, 2017 at 10:53:47 EDT, Kenneth Waegeman wrote: Hi, Having an issue that looks the same as this one: We can do sequential writes to the filesystem at 7,8 GB/s total , which is the expected speed for our current storage backend. While we have even better performance with sequential reads on raw storage LUNS, using GPFS we can only reach 1GB/s in total (each nsd server seems limited by 0,5GB/s) independent of the number of clients (1,2,4,..) or ways we tested (fio,dd). We played with blockdev params, MaxMBps, PrefetchThreads, hyperthreading, c1e/cstates, .. as discussed in this thread, but nothing seems to impact this read performance. Any ideas? Thanks! Kenneth On 17/02/17 19:29, Jan-Frode Myklebust wrote: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org[ gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. 
I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 61023 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 85131 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 84819 bytes Desc: not available URL: From frank.tower at outlook.com Thu Apr 20 13:27:13 2017 From: frank.tower at outlook.com (Frank Tower) Date: Thu, 20 Apr 2017 12:27:13 +0000 Subject: [gpfsug-discuss] Protocol node recommendations Message-ID: Hi, We have here around 2PB GPFS where users access oney through GPFS client (used by an HPC cluster), but we will have to setup protocols nodes. We will have to share GPFS data to ~ 1000 users, where each users will have different access usage, meaning: - some will do large I/O (e.g: store 1TB files) - some will read/write more than 10k files in a raw - other will do only sequential read I already read the following wiki page: https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node IBM Spectrum Scale Wiki - Sizing Guidance for Protocol Node www.ibm.com developerWorks wikis allow groups of people to jointly create and maintain content through contribution and collaboration. Wikis apply the wisdom of crowds to ... But I wondering if some people have recommendations regarding hardware sizing and software tuning for such situation ? Or better, if someone already such setup ? Thank you by advance, Frank. -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Sat Apr 22 05:30:29 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Sat, 22 Apr 2017 00:30:29 -0400 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: <52354.1492835429@turing-police.cc.vt.edu> On Thu, 20 Apr 2017 12:27:13 -0000, Frank Tower said: > - some will do large I/O (e.g: store 1TB files) > - some will read/write more than 10k files in a raw > - other will do only sequential read > But I wondering if some people have recommendations regarding hardware sizing > and software tuning for such situation ? The three most critical pieces of info are missing here: 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? 2) How many of the users are likely to be active at the same time? 1,000 users, each of whom are active an hour a week is entirely different from 200 users that are each active 140 hours a week. 3) What SLA/performance target are they expecting? If they want large 1TB I/O and 100MB/sec is acceptable, that's different than if they have a business need to go at 1.2GB/sec.... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From frank.tower at outlook.com Sat Apr 22 07:34:44 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 06:34:44 +0000 Subject: [gpfsug-discuss] Protocol node recommendations Message-ID: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. 
Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sat Apr 22 09:50:11 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sat, 22 Apr 2017 08:50:11 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower : > Hi, > > We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with > GPFS client on each node. > > We will have to open GPFS to all our users over CIFS and kerberized NFS > with ACL support for both protocol for around +1000 users > > All users have different use case and needs: > - some will do random I/O through a large set of opened files (~5k files) > - some will do large write with 500GB-1TB files > - other will arrange sequential I/O with ~10k opened files > > NFS and CIFS will share the same server, so I through to use SSD drive, at > least 128GB memory with 2 sockets. > > Regarding tuning parameters, I thought at: > > maxFilesToCache 10000 > syncIntervalStrict yes > workerThreads (8*core) > prefetchPct 40 (for now and update if needed) > > I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering > if someone could share his experience/best practice regarding hardware > sizing and/or tuning parameters. > > Thank by advance, > Frank > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sat Apr 22 19:47:59 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 18:47:59 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: <52354.1492835429@turing-police.cc.vt.edu> References: , <52354.1492835429@turing-police.cc.vt.edu> Message-ID: Hi, Thank for your answer. > 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? True, here the list: - 800 users that have 1 workstation through 1Gb/s ethernet and will use NFS/CIFS - 200 users that have 2 workstation through 1Gb/s ethernet, few have 10Gb/s ethernet and will use NFS/CIFS > 2) How many of the users are likely to be active at the same time? 1,000 > users, each of whom are active an hour a week is entirely different from > 200 users that are each active 140 hours a week. 
True again, around 200 users will actively use GPFS through NFS/CIFS during night and day, but we cannot control if people will use 2 workstations or more :( We will have peak during day with an average of 700 'workstations' > 3) What SLA/performance target are they expecting? If they want > large 1TB I/O and 100MB/sec is acceptable, that's different than if they > have a business need to go at 1.2GB/sec.... We just want to provide at normal throughput through an 1GB/s network. Users are aware of such situation and will mainly use HPC cluster for high speed and heavy computation. But they would like to do 'light' computation on their desktop. The main topic here is to sustain 'normal' throughput for all users during peak. Thank for your help. ________________________________ From: valdis.kletnieks at vt.edu Sent: Saturday, April 22, 2017 6:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Protocol node recommendations On Thu, 20 Apr 2017 12:27:13 -0000, Frank Tower said: > - some will do large I/O (e.g: store 1TB files) > - some will read/write more than 10k files in a raw > - other will do only sequential read > But I wondering if some people have recommendations regarding hardware sizing > and software tuning for such situation ? The three most critical pieces of info are missing here: 1) Do you mean 1,000 human users, or 1,000 machines doing NFS/CIFS mounts? 2) How many of the users are likely to be active at the same time? 1,000 users, each of whom are active an hour a week is entirely different from 200 users that are each active 140 hours a week. 3) What SLA/performance target are they expecting? If they want large 1TB I/O and 100MB/sec is acceptable, that's different than if they have a business need to go at 1.2GB/sec.... -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sat Apr 22 20:22:23 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sat, 22 Apr 2017 19:22:23 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. 
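As a side note on the round-robin DNS option above: the CES side of that setup is fairly small. A rough sketch, with made-up IP addresses and with the exact option spellings to be checked against the mmces man page:

# enable both protocol stacks on the existing CES nodes
mmces service enable NFS
mmces service enable SMB

# add CES IPs; a round-robin DNS name would then point at these addresses
mmces address add --ces-ip 10.10.10.11
mmces address add --ces-ip 10.10.10.12
mmces address add --ces-ip 10.10.10.13

# check how the addresses are distributed over the protocol nodes
mmces address list
mmces node list

CES moves these addresses between nodes on failure, so the DNS round robin only has to spread the initial connections.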
We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sun Apr 23 11:07:38 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sun, 23 Apr 2017 10:07:38 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: Message-ID: The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower : > Hi, > > > Thank for the recommendations. > > Now we deal with the situation of: > > > - take 3 nodes with round robin DNS that handle both protocols > > - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and > NFS services. > > > Regarding your recommendations, 256GB memory node could be a plus if we > mix both protocols for such case. > > > Is the spreadsheet publicly available or do we need to ask IBM ? > > > Thank for your help, > > Frank. > > > ------------------------------ > *From:* Jan-Frode Myklebust > *Sent:* Saturday, April 22, 2017 10:50 AM > *To:* gpfsug-discuss at spectrumscale.org > *Subject:* Re: [gpfsug-discuss] Protocol node recommendations > > That's a tiny maxFilesToCache... > > I would start by implementing the settings from > /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your > protocoll nodes, and leave further tuning to when you see you have issues. > > Regarding sizing, we have a spreadsheet somewhere where you can input some > workload parameters and get an idea for how many nodes you'll need. Your > node config seems fine, but one node seems too few to serve 1000+ users. We > support max 3000 SMB connections/node, and I believe the recommendation is > 4000 NFS connections/node. > > > -jf > l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower : > >> Hi, >> >> We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with >> GPFS client on each node. 
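For what it's worth, applying the kind of values Jan-Frode mentions (a 64GB pagepool and a much larger maxFilesToCache) to just the protocol nodes is a one-liner per parameter. The sketch below is illustrative only: the maxFilesToCache figure is a placeholder, not official sizing guidance, and "cesNodes" assumes the standard CES node class is in use:

# apply protocol-node settings only to the CES nodes
# (pagepool changes typically take effect after GPFS is restarted on those nodes)
mmchconfig pagepool=64G -N cesNodes
mmchconfig maxFilesToCache=1000000 -N cesNodes

# confirm what is actually set
mmlsconfig pagepool
mmlsconfig maxFilesToCache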
>> >> We will have to open GPFS to all our users over CIFS and kerberized NFS >> with ACL support for both protocol for around +1000 users >> >> All users have different use case and needs: >> - some will do random I/O through a large set of opened files (~5k files) >> - some will do large write with 500GB-1TB files >> - other will arrange sequential I/O with ~10k opened files >> >> NFS and CIFS will share the same server, so I through to use SSD >> drive, at least 128GB memory with 2 sockets. >> >> Regarding tuning parameters, I thought at: >> >> maxFilesToCache 10000 >> syncIntervalStrict yes >> workerThreads (8*core) >> prefetchPct 40 (for now and update if needed) >> >> I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering >> if someone could share his experience/best practice regarding hardware >> sizing and/or tuning parameters. >> >> Thank by advance, >> Frank >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > -------------- next part -------------- An HTML attachment was scrubbed... URL: From rreuscher at verizon.net Sun Apr 23 17:43:44 2017 From: rreuscher at verizon.net (Robert Reuscher) Date: Sun, 23 Apr 2017 11:43:44 -0500 Subject: [gpfsug-discuss] LUN expansion Message-ID: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> We run GPFS on z/Linux and have been using ECKD devices for disks. We are looking at implementing some new filesystems on FCP LUNS. One of the features of a LUN is we can expand a LUN instead of adding new LUNS, where as with ECKD devices. From what I?ve found searching to see if GPFS filesystem can be expanding to see the expanded LUN, it doesn?t seem that this will work, you have to add new LUNS (or new disks) and then add them to the filesystem. Everything I?ve found is at least 2-3 old (most of it much older), and just want to check that this is still is true before we make finalize our LUN/GPFS procedures. Robert Reuscher NR5AR -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Sun Apr 23 22:27:50 2017 From: frank.tower at outlook.com (Frank Tower) Date: Sun, 23 Apr 2017 21:27:50 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. 
Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sfadden at us.ibm.com Sun Apr 23 23:44:56 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Sun, 23 Apr 2017 22:44:56 +0000 Subject: [gpfsug-discuss] LUN expansion In-Reply-To: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> References: <4CBF459B-4008-4CA2-904F-1A48882F021E@verizon.net> Message-ID: An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Mon Apr 24 10:11:25 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Mon, 24 Apr 2017 09:11:25 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , Message-ID: What's your SSD going to help with... will you implement it as a LROC device? Otherwise I can't see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust ; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. 
If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Mon Apr 24 11:28:08 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 24 Apr 2017 12:28:08 +0200 (CEST) Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Message-ID: <416417651.114582.1493029688959@email.1und1.de> An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Apr 24 12:14:17 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 12:14:17 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <416417651.114582.1493029688959@email.1und1.de> References: <416417651.114582.1493029688959@email.1und1.de> Message-ID: <1493032457.11896.20.camel@buzzard.me.uk> On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From service at metamodul.com Mon Apr 24 13:21:09 2017 From: service at metamodul.com (service at metamodul.com) Date: Mon, 24 Apr 2017 14:21:09 +0200 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Message-ID: Hi Jonathan todays hardware is so powerful that imho it might make sense to split a CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). I think that such a server is a little bit to big ?just to be a single NSD server. Note that i use for each GPFS service a dedicated node. So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. Inhm 4xS822L could handle this and a little bit more quite well. Of course blade technology could be used or 1U server. With kind regards Hajo --? Unix Systems Engineer MetaModul GmbH +49 177 4393994
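As an aside, the kind of KVM test rig Jonathan describes above (LVs on the host handed to the guests as disks) needs very little setup; a rough sketch, with made-up volume group, LV and guest names:

# carve a couple of logical volumes on the KVM host to act as NSDs
lvcreate -L 50G -n gpfs_nsd1 vg_test
lvcreate -L 50G -n gpfs_nsd2 vg_test

# hand them to a running guest as extra virtio disks
virsh attach-disk gpfs-node1 /dev/vg_test/gpfs_nsd1 vdb --persistent
virsh attach-disk gpfs-node1 /dev/vg_test/gpfs_nsd2 vdc --persistent

Inside the guest the devices show up as /dev/vdb and /dev/vdc and can be used in an NSD stanza file like any other disk; obviously this is only for a sandbox, not for anything production-like.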
-------- Original message --------
From: Jonathan Buzzard
Date: 2017.04.24 13:14 (GMT+01:00)
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale
On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Apr 24 13:42:51 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 24 Apr 2017 15:42:51 +0300 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: Hi As tastes vary, I would not partition it so much for the backend. Assuming there is little to nothing overhead on the CPU at PHYP level, which it depends. On the protocols nodes, due the CTDB keeping locks together across all nodes (SMB), you would get more performance on bigger & less number of CES nodes than more and smaller. Certainly a 822 is quite a server if we go back to previous generations but I would still keep a simple backend (NSd servers), simple CES (less number of nodes the merrier) & then on the client part go as micro partitions as you like/can as the effect on the cluster is less relevant in the case of resources starvation. But, it depends on workloads, SLA and money so I say try, establish a baseline and it fills the requirements, go for it. If not change till does. Have fun From: "service at metamodul.com" To: gpfsug main discussion list Date: 24/04/2017 15:21 Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Jonathan todays hardware is so powerful that imho it might make sense to split a CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). I think that such a server is a little bit to big just to be a single NSD server. Note that i use for each GPFS service a dedicated node. So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. Inhm 4xS822L could handle this and a little bit more quite well. Of course blade technology could be used or 1U server. 
With kind regards Hajo -- Unix Systems Engineer MetaModul GmbH +49 177 4393994 -------- Urspr?ngliche Nachricht -------- Von: Jonathan Buzzard Datum:2017.04.24 13:14 (GMT+01:00) An: gpfsug main discussion list Betreff: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > @All > > > does anybody uses virtualization technologies for GPFS Server ? If yes > what kind and why have you selected your soulution. > > I think currently about using Linux on Power using 40G SR-IOV for > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > also assign only a certain amount of CPUs to GPFS. ( Lower license > cost / You pay for what you use) > > > I must admit that i am not familar how "good" KVM/ESX in respect to > direct assignment of hardware is. Thus the question to the group > For the most part GPFS is used at scale and in general all the components are redundant. As such why you would want to allocate less than a whole server into a production GPFS system in somewhat beyond me. That is you will have a bunch of NSD servers in the system and if one crashes, well the other NSD's take over. Similar for protocol nodes, and in general the total file system size is going to hundreds of TB otherwise why bother with GPFS. I guess there is currently potential value at sticking the GUI into a virtual machine to get redundancy. On the other hand if you want a test rig, then virtualization works wonders. I have put GPFS on a single Linux box, using LV's for the disks and mapping them into virtual machines under KVM. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Apr 24 14:04:26 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 14:04:26 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <1493039066.11896.30.camel@buzzard.me.uk> On Mon, 2017-04-24 at 14:21 +0200, service at metamodul.com wrote: > Hi Jonathan > todays hardware is so powerful that imho it might make sense to split > a CEC into more "piece". For example the IBM S822L has up to 2x12 > cores, 9 PCI3 slots ( 4?16 lans & 5?8 lan ). > I think that such a server is a little bit to big just to be a single > NSD server. So don't buy it for an NSD server then :-) > Note that i use for each GPFS service a dedicated node. > So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup > nodes and at least 3 test server a total of 11 server is needed. > Inhm 4xS822L could handle this and a little bit more quite well. > I think you are missing the point somewhat. Well by several country miles and quite possibly an ocean or two to be honest. Spectrum scale is supposed to be a "scale out" solution. More storage required add more arrays. More bandwidth add more servers etc. 
If you are just going to scale it all up on a *single* server then you might as well forget GPFS and do an old school standard scale up solution. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From janfrode at tanso.net Mon Apr 24 14:14:20 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 24 Apr 2017 15:14:20 +0200 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: I agree with Luis -- why so many nodes? """ So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes and at least 3 test server a total of 11 server is needed. """ If this is your whole cluster, why not just 3x P822L/P812L running single partition per node, hosting a cluster of 3x protocol-nodes that does both direct FC for disk access, and also run backups on same nodes ? No complications, full hw performance. Then separate node for test, or separate partition on same nodes with dedicated adapters. But back to your original question. My experience is that LPAR/NPIV works great, but it's a bit annoying having to also have VIOs. Hope we'll get FC SR-IOV eventually.. Also LPAR/Dedicated-adapters naturally works fine. VMWare/RDM can be a challenge in some failure situations. It likes to pause VMs in APD or PDL situations, which will affect all VMs with access to it :-o VMs without direct disk access is trivial. -jf On Mon, Apr 24, 2017 at 2:42 PM, Luis Bolinches wrote: > Hi > > As tastes vary, I would not partition it so much for the backend. Assuming > there is little to nothing overhead on the CPU at PHYP level, which it > depends. On the protocols nodes, due the CTDB keeping locks together across > all nodes (SMB), you would get more performance on bigger & less number of > CES nodes than more and smaller. > > Certainly a 822 is quite a server if we go back to previous generations > but I would still keep a simple backend (NSd servers), simple CES (less > number of nodes the merrier) & then on the client part go as micro > partitions as you like/can as the effect on the cluster is less relevant in > the case of resources starvation. > > But, it depends on workloads, SLA and money so I say try, establish a > baseline and it fills the requirements, go for it. If not change till does. > Have fun > > > > From: "service at metamodul.com" > To: gpfsug main discussion list > Date: 24/04/2017 15:21 > Subject: Re: [gpfsug-discuss] Used virtualization technologies for > GPFS/Spectrum Scale > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Hi Jonathan > todays hardware is so powerful that imho it might make sense to split a > CEC into more "piece". For example the IBM S822L has up to 2x12 cores, 9 > PCI3 slots ( 4?16 lans & 5?8 lan ). > I think that such a server is a little bit to big just to be a single NSD > server. > Note that i use for each GPFS service a dedicated node. > So if i would go for 4 NSD server, 6 protocol nodes and 2 tsm backup nodes > and at least 3 test server a total of 11 server is needed. > Inhm 4xS822L could handle this and a little bit more quite well. > > Of course blade technology could be used or 1U server. 
> > With kind regards > Hajo > > -- > Unix Systems Engineer > MetaModul GmbH > +49 177 4393994 <+49%20177%204393994> > > > -------- Urspr?ngliche Nachricht -------- > Von: Jonathan Buzzard > Datum:2017.04.24 13:14 (GMT+01:00) > An: gpfsug main discussion list > Betreff: Re: [gpfsug-discuss] Used virtualization technologies for > GPFS/Spectrum Scale > > On Mon, 2017-04-24 at 12:28 +0200, Hans-Joachim Ehlers wrote: > > @All > > > > > > does anybody uses virtualization technologies for GPFS Server ? If yes > > what kind and why have you selected your soulution. > > > > I think currently about using Linux on Power using 40G SR-IOV for > > Network and NPIV/Dedidcated FC Adater for storage. As a plus i can > > also assign only a certain amount of CPUs to GPFS. ( Lower license > > cost / You pay for what you use) > > > > > > I must admit that i am not familar how "good" KVM/ESX in respect to > > direct assignment of hardware is. Thus the question to the group > > > > For the most part GPFS is used at scale and in general all the > components are redundant. As such why you would want to allocate less > than a whole server into a production GPFS system in somewhat beyond me. > > That is you will have a bunch of NSD servers in the system and if one > crashes, well the other NSD's take over. Similar for protocol nodes, and > in general the total file system size is going to hundreds of TB > otherwise why bother with GPFS. > > I guess there is currently potential value at sticking the GUI into a > virtual machine to get redundancy. > > On the other hand if you want a test rig, then virtualization works > wonders. I have put GPFS on a single Linux box, using LV's for the disks > and mapping them into virtual machines under KVM. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss_______ > ________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: > Oy IBM Finland Ab > PL 265, 00101 Helsinki, Finland > Business ID, Y-tunnus: 0195876-3 > Registered in Finland > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Mon Apr 24 16:29:56 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 24 Apr 2017 11:29:56 -0400 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <131241.1493047796@turing-police.cc.vt.edu> On Mon, 24 Apr 2017 14:21:09 +0200, "service at metamodul.com" said: > todays hardware is so powerful that imho it might make sense to split a CEC > into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots > ( 4?16 lans & 5?8 lan ). We look at it the other way around: Today's hardware is so powerful that you can build a cluster out of a stack of fairly low-end 1U servers (we have one cluster that's built out of Dell r630s). 
And it's more robust against hardware failures than a VM based solution - although the 822 seems to allow hot-swap of PCI cards, a dead socket or DIMM will still kill all the VMs when you go to replace it. If one 1U out of 4 goes down due to a bad DIMM (which has happened to us more often than a bad PCI card) you can just power it down and replace it.... -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From service at metamodul.com Mon Apr 24 17:11:25 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 24 Apr 2017 18:11:25 +0200 (CEST) Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: References: Message-ID: <1961501377.286669.1493050285874@email.1und1.de> > Jan-Frode Myklebust hat am 24. April 2017 um 15:14 geschrieben: > I agree with Luis -- why so many nodes? Many ? IMHO it is not that much. I do not like to have one server doing more than one task. Thus a NSD Server does only serves GPFS. A Protocol server serves either NFS or SMB but not both except IBM says it would be better to run NFS/SMB on the same node. A backup server runs also on its "own" hardware. So i would need at least 4 NSD Server since if 1 fails i am losing only 25% of my "performance" and still having a 4/5 quorum. Nice in case an Update of a NSD failed. Each protocol service requires at least 2 nodes and the backup service as well. I can only say that with that approach i never had problems. I have be running into problems each time i did not followed that apporach. But of course YMMV But keep in mind that each service might requires different GPFS configuration or even slightly different hardware. Saying so i am a fan of having many GPFS Server ( NSD, Protocol , Backup a.s.o ) and i do not understand why not to use many nodes ^_^ Cheers Hajo From jonathan at buzzard.me.uk Mon Apr 24 17:24:29 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 24 Apr 2017 17:24:29 +0100 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <131241.1493047796@turing-police.cc.vt.edu> References: <131241.1493047796@turing-police.cc.vt.edu> Message-ID: <1493051069.11896.39.camel@buzzard.me.uk> On Mon, 2017-04-24 at 11:29 -0400, valdis.kletnieks at vt.edu wrote: > On Mon, 24 Apr 2017 14:21:09 +0200, "service at metamodul.com" said: > > > todays hardware is so powerful that imho it might make sense to split a CEC > > into more "piece". For example the IBM S822L has up to 2x12 cores, 9 PCI3 slots > > ( 4?16 lans & 5?8 lan ). > > We look at it the other way around: Today's hardware is so powerful that > you can build a cluster out of a stack of fairly low-end 1U servers (we > have one cluster that's built out of Dell r630s). And it's more robust > against hardware failures than a VM based solution - although the 822 seems > to allow hot-swap of PCI cards, a dead socket or DIMM will still kill all > the VMs when you go to replace it. If one 1U out of 4 goes down due to > a bad DIMM (which has happened to us more often than a bad PCI card) you > can just power it down and replace it.... Hate to say but the 822 will happily keep trucking when the CPU (assuming it has more than one) fails and similar with the DIMM's. In fact mirrored DIMM's is reasonably common on x86 machines these days, though very few people ever use it. That said CPU failures are incredibly rare in my experience. 
The only time I have ever come across a failed CPU was on a pSeries machine and then it was only because the backup was running really slow (it was running TSM) that prompted us to look closer and see what had happened. Monitoring (Zenoss) was not setup to register the event because like when does a CPU fail and the machine keep running! I am not 100% sure on the 822 put I suspect that the DIMM's and any socketed CPU's can be hot swapped in addition to the PCI card's which I have personally done on pSeries machines. However it is a stupidly over priced solution to run GPFS, because there are better or at the very least vastly cheaper ways to get the same level of reliability. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From valdis.kletnieks at vt.edu Mon Apr 24 18:58:17 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Mon, 24 Apr 2017 13:58:17 -0400 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <1493051069.11896.39.camel@buzzard.me.uk> References: <131241.1493047796@turing-police.cc.vt.edu> <1493051069.11896.39.camel@buzzard.me.uk> Message-ID: <7337.1493056697@turing-police.cc.vt.edu> On Mon, 24 Apr 2017 17:24:29 +0100, Jonathan Buzzard said: > Hate to say but the 822 will happily keep trucking when the CPU > (assuming it has more than one) fails and similar with the DIMM's. In How about when you go to replace the DIMM? You able to hot-swap the memory without anything losing its mind? (I know this is possible in the Z/series world, but those usually have at least 2-3 more zeros in the price tag). -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL: From luis.bolinches at fi.ibm.com Mon Apr 24 19:08:32 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 24 Apr 2017 21:08:32 +0300 Subject: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale In-Reply-To: <7337.1493056697@turing-police.cc.vt.edu> References: <131241.1493047796@turing-police.cc.vt.edu> <1493051069.11896.39.camel@buzzard.me.uk> <7337.1493056697@turing-police.cc.vt.edu> Message-ID: Hi 822 is an entry scale out Power machine, it has limited RAS compared with the high end ones (870/880). The 822 needs to be down for CPU / DIMM replacement: https://www.ibm.com/support/knowledgecenter/5148-21L/p8eg3/p8eg3_83x_8rx_kickoff.htm . And it is not a end user task. You can argue that, I owuld but it is the current statement and you pay for support for these kind of stuff. From: valdis.kletnieks at vt.edu To: gpfsug main discussion list Date: 24/04/2017 20:58 Subject: Re: [gpfsug-discuss] Used virtualization technologies for GPFS/Spectrum Scale Sent by: gpfsug-discuss-bounces at spectrumscale.org On Mon, 24 Apr 2017 17:24:29 +0100, Jonathan Buzzard said: > Hate to say but the 822 will happily keep trucking when the CPU > (assuming it has more than one) fails and similar with the DIMM's. In How about when you go to replace the DIMM? You able to hot-swap the memory without anything losing its mind? (I know this is possible in the Z/series world, but those usually have at least 2-3 more zeros in the price tag). [attachment "attqolcz.dat" deleted by Luis Bolinches/Finland/IBM] _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss Ellei edell? 
ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From frank.tower at outlook.com Mon Apr 24 22:12:14 2017 From: frank.tower at outlook.com (Frank Tower) Date: Mon, 24 Apr 2017 21:12:14 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , , Message-ID: >From what I've read from the Wiki: 'The NFS protocol performance is largely dependent on the base system performance of the protocol node hardware and network. This includes multiple factors the type and number of CPUs, the size of the main memory in the nodes, the type of disk drives used (HDD, SSD, etc.) and the disk configuration (RAID-level, replication etc.). In addition, NFS protocol performance can be impacted by the overall load of the node (such as number of clients accessing, snapshot creation/deletion and more) and administrative tasks (for example filesystem checks or online re-striping of disk arrays).' Nowadays, SSD is worst to invest. LROC could be an option in the future, but we need to quantify NFS/CIFS workload first. Are you using LROC with your GPFS installation ? Best, Frank. ________________________________ From: Sobey, Richard A Sent: Monday, April 24, 2017 11:11 AM To: gpfsug main discussion list; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations What?s your SSD going to help with? will you implement it as a LROC device? Otherwise I can?t see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust ; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. 
________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Apr 25 09:19:10 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 08:19:10 +0000 Subject: [gpfsug-discuss] Protocol node recommendations In-Reply-To: References: , , Message-ID: I tried it on one node but investing in what could be up to ?5000 in SSDs when we don't know the gains isn't something I can argue. Not that LROC will hurt the environment but my users may not see any benefit. My cluster is the complete opposite of busy (relative to people saying they're seeing sustained 800MB/sec throughput), I just need it stable. Richard From: Frank Tower [mailto:frank.tower at outlook.com] Sent: 24 April 2017 22:12 To: Sobey, Richard A ; gpfsug main discussion list ; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations >From what I've read from the Wiki: 'The NFS protocol performance is largely dependent on the base system performance of the protocol node hardware and network. This includes multiple factors the type and number of CPUs, the size of the main memory in the nodes, the type of disk drives used (HDD, SSD, etc.) and the disk configuration (RAID-level, replication etc.). In addition, NFS protocol performance can be impacted by the overall load of the node (such as number of clients accessing, snapshot creation/deletion and more) and administrative tasks (for example filesystem checks or online re-striping of disk arrays).' Nowadays, SSD is worst to invest. LROC could be an option in the future, but we need to quantify NFS/CIFS workload first. 
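[Editorial note: a rough sketch of applying the kind of protocol-node settings discussed in this thread; the node class and values are placeholders, not a recommendation. Start from the gpfsprotocoldefaults file Jan-Frode mentions and size for your own workload.]

    # apply only to the protocol nodes; most of these take effect after GPFS restarts on those nodes
    mmchconfig pagepool=64G,maxFilesToCache=1000000,workerThreads=512 -N cesNodes
    # check what a node is actually running with
    mmdiag --config | grep -E 'pagepool|maxFilesToCache|workerThreads'
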
Are you using LROC with your GPFS installation ? Best, Frank. ________________________________ From: Sobey, Richard A > Sent: Monday, April 24, 2017 11:11 AM To: gpfsug main discussion list; Jan-Frode Myklebust Subject: Re: [gpfsug-discuss] Protocol node recommendations What's your SSD going to help with... will you implement it as a LROC device? Otherwise I can't see the benefit to using it to boot off. From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Frank Tower Sent: 23 April 2017 22:28 To: Jan-Frode Myklebust >; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations Hi, Nice ! didn't pay attention at the revision and the spreadsheet. If someone still have a copy somewhere it could be useful, Google didn't help :( We will follow your advise and start with 3 protocol nodes equipped with 128GB memory, 2 x 12 cores (maybe E5-2680 or E5-2670). >From what I read, NFS-Ganesha mainly depend of the hardware, Linux on a SSD should be a big plus in our case. Best, Frank ________________________________ From: Jan-Frode Myklebust > Sent: Sunday, April 23, 2017 12:07:38 PM To: Frank Tower; gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations The protocol sizing tool should be available from https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/Sizing%20Guidance%20for%20Protocol%20Node/version/70a4c7c0-a5c6-4dde-b391-8f91c542dd7d , but I'm getting 404 now. I think 128GB should be enough for both protocols on same nodes, and I think your 3 node suggestion is best. Better load sharing with not dedicating subset of nodes to each protocol. -jf l?r. 22. apr. 2017 kl. 21.22 skrev Frank Tower >: Hi, Thank for the recommendations. Now we deal with the situation of: - take 3 nodes with round robin DNS that handle both protocols - take 4 nodes, split CIFS and NFS, still use round robin DNS for CIFS and NFS services. Regarding your recommendations, 256GB memory node could be a plus if we mix both protocols for such case. Is the spreadsheet publicly available or do we need to ask IBM ? Thank for your help, Frank. ________________________________ From: Jan-Frode Myklebust > Sent: Saturday, April 22, 2017 10:50 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Protocol node recommendations That's a tiny maxFilesToCache... I would start by implementing the settings from /usr/lpp/mmfs/*/gpfsprotocolldefaul* plus a 64GB pagepool for your protocoll nodes, and leave further tuning to when you see you have issues. Regarding sizing, we have a spreadsheet somewhere where you can input some workload parameters and get an idea for how many nodes you'll need. Your node config seems fine, but one node seems too few to serve 1000+ users. We support max 3000 SMB connections/node, and I believe the recommendation is 4000 NFS connections/node. -jf l?r. 22. apr. 2017 kl. 08.34 skrev Frank Tower >: Hi, We have here around 2PB GPFS (4.2.2) accessed through an HPC cluster with GPFS client on each node. 
We will have to open GPFS to all our users over CIFS and kerberized NFS with ACL support for both protocol for around +1000 users All users have different use case and needs: - some will do random I/O through a large set of opened files (~5k files) - some will do large write with 500GB-1TB files - other will arrange sequential I/O with ~10k opened files NFS and CIFS will share the same server, so I through to use SSD drive, at least 128GB memory with 2 sockets. Regarding tuning parameters, I thought at: maxFilesToCache 10000 syncIntervalStrict yes workerThreads (8*core) prefetchPct 40 (for now and update if needed) I read the wiki 'Sizing Guidance for Protocol Node', but I was wondering if someone could share his experience/best practice regarding hardware sizing and/or tuning parameters. Thank by advance, Frank _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From chair at spectrumscale.org Tue Apr 25 09:23:32 2017 From: chair at spectrumscale.org (Spectrum Scale UG Chair (Simon Thompson)) Date: Tue, 25 Apr 2017 09:23:32 +0100 Subject: [gpfsug-discuss] User group meeting May 9th/10th 2017 Message-ID: The UK user group is now just 2 weeks away! Its time to register ... https://www.eventbrite.com/e/spectrum-scalegpfs-user-group-spring-2017-regi stration-32113696932 (or https://goo.gl/tRptru) Remember user group meetings are free to attend, and this year's 2 day meeting is packed full of sessions and several of the breakout sessions are cloud-focussed looking at how Spectrum Scale can be used with cloud deployments. And as usual, we have the ever popular Sven speaking with his views from the Research topics. Thanks to our sponsors Arcastream, DDN, Ellexus, Lenovo, IBM, Mellanox, OCF and Seagate for helping make this happen! We need to finalise numbers for the evening event soon, so make sure you book your place now! Simon From S.J.Thompson at bham.ac.uk Tue Apr 25 12:20:39 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 11:20:39 +0000 Subject: [gpfsug-discuss] NFS issues Message-ID: Hi, We have recently started deploying NFS in addition our existing SMB exports on our protocol nodes. We use a RR DNS name that points to 4 VIPs for SMB services and failover seems to work fine with SMB clients. We figured we could use the same name and IPs and run Ganesha on the protocol servers, however we are seeing issues with NFS clients when IP failover occurs. In normal operation on a client, we might see several mounts from different IPs obviously due to the way the DNS RR is working, but it all works fine. In a failover situation, the IP will move to another node and some clients will carry on, others will hang IO to the mount points referred to by the IP which has moved. We can *sometimes* trigger this by manually suspending a CES node, but not always and some clients mounting from the IP moving will be fine, others won't. If we resume a node an it fails back, the clients that are hanging will usually recover fine. We can reboot a client prior to failback and it will be fine, stopping and starting the ganesha service on a protocol node will also sometimes resolve the issues. So, has anyone seen this sort of issue and any suggestions for how we could either debug more or workaround? 
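[Editorial note: one way to exercise the failover case Simon describes in a controlled way; node names are placeholders and the flag spellings are from memory, so check the mmces man page before using.]

    mmces address list               # which CES IP is on which node right now
    mmces node suspend -N ces01      # force the addresses on ces01 to move elsewhere
    mmces address list               # confirm where they landed
    # ... run client IO against one of the moved IPs and watch for the hang ...
    mmces node resume -N ces01       # fail back and see whether the hung clients recover
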
We are currently running the packages nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). At one point we were seeing it a lot, and could track it back to an underlying GPFS network issue that was causing protocol nodes to be expelled occasionally, we resolved that and the issues became less apparent, but maybe we just fixed one failure mode so see it less often. On the clients, we use -o sync,hard BTW as in the IBM docs. On a client showing the issues, we'll see in dmesg, NFS related messages like: [Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out Which explains the client hang on certain mount points. The symptoms feel very much like those logged in this Gluster/ganesha bug: https://bugzilla.redhat.com/show_bug.cgi?id=1354439 Thanks Simon From Mark.Bush at siriuscom.com Tue Apr 25 14:27:38 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 13:27:38 +0000 Subject: [gpfsug-discuss] Perfmon and GUI Message-ID: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = 
"Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From r.sobey at imperial.ac.uk Tue Apr 25 14:44:59 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 13:44:59 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> References: <321F04D4-5F3A-443F-A598-0616642C9F96@siriuscom.com> Message-ID: I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? 
Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From j.ouwehand at vumc.nl Tue Apr 25 14:51:22 2017 From: j.ouwehand at vumc.nl (Ouwehand, JJ) Date: Tue, 25 Apr 2017 13:51:22 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: <5594921EA5B3674AB44AD9276126AAF40170DD3159@sp-mx-mbx42> Hello, At first a short introduction. My name is Jaap Jan Ouwehand, I work at a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical (office, research and clinical data) business process. 
We have three large GPFS filesystems for different purposes. We also had such a situation with cNFS. A failover (IPtakeover) was technically good, only clients experienced "stale filehandles". We opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few months later, the solution appeared to be in the fsid option. An NFS filehandle is built by a combination of fsid and a hash function on the inode. After a failover, the fsid value can be different and the client has a "stale filehandle". To avoid this, the fsid value can be statically specified. See: https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum.scale.v4r22.doc/bl1adm_nfslin.htm Maybe there is also a value in Ganesha that changes after a failover. Certainly since most sessions will be re-established after a failback. Maybe you see more debug information with tcpdump. Kind regards, ? Jaap Jan Ouwehand ICT Specialist (Storage & Linux) VUmc - ICT E: jj.ouwehand at vumc.nl W: www.vumc.com -----Oorspronkelijk bericht----- Van: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson (IT Research Support) Verzonden: dinsdag 25 april 2017 13:21 Aan: gpfsug-discuss at spectrumscale.org Onderwerp: [gpfsug-discuss] NFS issues Hi, We have recently started deploying NFS in addition our existing SMB exports on our protocol nodes. We use a RR DNS name that points to 4 VIPs for SMB services and failover seems to work fine with SMB clients. We figured we could use the same name and IPs and run Ganesha on the protocol servers, however we are seeing issues with NFS clients when IP failover occurs. In normal operation on a client, we might see several mounts from different IPs obviously due to the way the DNS RR is working, but it all works fine. In a failover situation, the IP will move to another node and some clients will carry on, others will hang IO to the mount points referred to by the IP which has moved. We can *sometimes* trigger this by manually suspending a CES node, but not always and some clients mounting from the IP moving will be fine, others won't. If we resume a node an it fails back, the clients that are hanging will usually recover fine. We can reboot a client prior to failback and it will be fine, stopping and starting the ganesha service on a protocol node will also sometimes resolve the issues. So, has anyone seen this sort of issue and any suggestions for how we could either debug more or workaround? We are currently running the packages nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). At one point we were seeing it a lot, and could track it back to an underlying GPFS network issue that was causing protocol nodes to be expelled occasionally, we resolved that and the issues became less apparent, but maybe we just fixed one failure mode so see it less often. On the clients, we use -o sync,hard BTW as in the IBM docs. On a client showing the issues, we'll see in dmesg, NFS related messages like: [Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not responding, timed out Which explains the client hang on certain mount points. 
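[Editorial note: for the cNFS case Jaap Jan describes above, the static fsid ends up as an ordinary export option on the kernel-NFS servers. A minimal, hypothetical example (path, client range and fsid number are made up); the only requirement is that the value is identical on every server exporting that path.]

    # /etc/exports on every cNFS server, same fsid everywhere
    /gpfs/fs1/data  10.0.0.0/24(rw,sync,no_root_squash,fsid=745)
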
The symptoms feel very much like those logged in this Gluster/ganesha bug: https://bugzilla.redhat.com/show_bug.cgi?id=1354439 Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From S.J.Thompson at bham.ac.uk Tue Apr 25 15:06:04 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 14:06:04 +0000 Subject: [gpfsug-discuss] NFS issues Message-ID: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical >(office, research and clinical data) business process. We have three >large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We opened >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months >later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson >(IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: gpfsug-discuss at spectrumscale.org >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and failover >seems to work fine with SMB clients. We figured we could use the same >name and IPs and run Ganesha on the protocol servers, however we are >seeing issues with NFS clients when IP failover occurs. > >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it all >works fine. 
> >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by manually >suspending a CES node, but not always and some clients mounting from the >IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not >responding, timed out > >Which explains the client hang on certain mount points. > >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Mark.Bush at siriuscom.com Tue Apr 25 15:13:58 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 14:13:58 +0000 Subject: [gpfsug-discuss] Perfmon and GUI Message-ID: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. 
The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. 
This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Tue Apr 25 15:29:07 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Tue, 25 Apr 2017 14:29:07 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? 
I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From r.sobey at imperial.ac.uk Tue Apr 25 15:31:13 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 25 Apr 2017 14:31:13 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: No worries Mark. We don?t use NFS here (yet) so I can?t help there. Glad I could help. Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 15:29 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. 
What am I missing? Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Tue Apr 25 18:04:41 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Tue, 25 Apr 2017 17:04:41 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 
16.06 skrev Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk>: > Hi, > > From what I can see, Ganesha uses the Export_Id option in the config file > (which is managed by CES) for this. I did find some reference in the > Ganesha devs list that if its not set, then it would read the FSID from > the GPFS file-system, either way they should surely be consistent across > all the nodes. The posts I found were from someone with an IBM email > address, so I guess someone in the IBM teams. > > I checked a couple of my protocol nodes and they use the same Export_Id > consistently, though I guess that might not be the same as the FSID value. > > Perhaps someone from IBM could comment on if FSID is likely to the cause > of my problems? > > Thanks > > Simon > > On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Ouwehand, JJ" j.ouwehand at vumc.nl> wrote: > > >Hello, > > > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a > >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM > >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical > >(office, research and clinical data) business process. We have three > >large GPFS filesystems for different purposes. > > > >We also had such a situation with cNFS. A failover (IPtakeover) was > >technically good, only clients experienced "stale filehandles". We opened > >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months > >later, the solution appeared to be in the fsid option. > > > >An NFS filehandle is built by a combination of fsid and a hash function > >on the inode. After a failover, the fsid value can be different and the > >client has a "stale filehandle". To avoid this, the fsid value can be > >statically specified. See: > > > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum > . > >scale.v4r22.doc/bl1adm_nfslin.htm > > > >Maybe there is also a value in Ganesha that changes after a failover. > >Certainly since most sessions will be re-established after a failback. > >Maybe you see more debug information with tcpdump. > > > > > >Kind regards, > > > >Jaap Jan Ouwehand > >ICT Specialist (Storage & Linux) > >VUmc - ICT > >E: jj.ouwehand at vumc.nl > >W: www.vumc.com > > > > > > > >-----Oorspronkelijk bericht----- > >Van: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson > >(IT Research Support) > >Verzonden: dinsdag 25 april 2017 13:21 > >Aan: gpfsug-discuss at spectrumscale.org > >Onderwerp: [gpfsug-discuss] NFS issues > > > >Hi, > > > >We have recently started deploying NFS in addition our existing SMB > >exports on our protocol nodes. > > > >We use a RR DNS name that points to 4 VIPs for SMB services and failover > >seems to work fine with SMB clients. We figured we could use the same > >name and IPs and run Ganesha on the protocol servers, however we are > >seeing issues with NFS clients when IP failover occurs. > > > >In normal operation on a client, we might see several mounts from > >different IPs obviously due to the way the DNS RR is working, but it all > >works fine. > > > >In a failover situation, the IP will move to another node and some > >clients will carry on, others will hang IO to the mount points referred > >to by the IP which has moved. We can *sometimes* trigger this by manually > >suspending a CES node, but not always and some clients mounting from the > >IP moving will be fine, others won't. 
> > > >If we resume a node an it fails back, the clients that are hanging will > >usually recover fine. We can reboot a client prior to failback and it > >will be fine, stopping and starting the ganesha service on a protocol > >node will also sometimes resolve the issues. > > > >So, has anyone seen this sort of issue and any suggestions for how we > >could either debug more or workaround? > > > >We are currently running the packages > >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > > > >At one point we were seeing it a lot, and could track it back to an > >underlying GPFS network issue that was causing protocol nodes to be > >expelled occasionally, we resolved that and the issues became less > >apparent, but maybe we just fixed one failure mode so see it less often. > > > >On the clients, we use -o sync,hard BTW as in the IBM docs. > > > >On a client showing the issues, we'll see in dmesg, NFS related messages > >like: > >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not > >responding, timed out > > > >Which explains the client hang on certain mount points. > > > >The symptoms feel very much like those logged in this Gluster/ganesha bug: > >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > > > > >Thanks > > > >Simon > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From hoang.nguyen at seagate.com Tue Apr 25 18:12:19 2017 From: hoang.nguyen at seagate.com (Hoang Nguyen) Date: Tue, 25 Apr 2017 10:12:19 -0700 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: Message-ID: I have a customer with a slightly different issue but sounds somewhat related. If you stop and stop the NFS service on a CES node or update an existing export which will restart Ganesha. Some of their NFS clients do not reconnect in a very similar fashion as you described. I haven't been able to reproduce it on my test system repeatedly but using soft NFS mounts seems to help. Seems like it happens more often to clients currently running NFS IO during the outage. But I'm interested to see what you guys uncover. Thanks, Hoang On Tue, Apr 25, 2017 at 7:06 AM, Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk> wrote: > Hi, > > From what I can see, Ganesha uses the Export_Id option in the config file > (which is managed by CES) for this. I did find some reference in the > Ganesha devs list that if its not set, then it would read the FSID from > the GPFS file-system, either way they should surely be consistent across > all the nodes. The posts I found were from someone with an IBM email > address, so I guess someone in the IBM teams. > > I checked a couple of my protocol nodes and they use the same Export_Id > consistently, though I guess that might not be the same as the FSID value. > > Perhaps someone from IBM could comment on if FSID is likely to the cause > of my problems? 
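[Editorial note: to illustrate what is being discussed, a CES-managed Ganesha export block looks roughly like the sketch below. This is from memory of the Ganesha config format, not copied from a CES system, so treat the option names as assumptions; on CES you would normally inspect and change this via mmnfs export list / mmnfs export change rather than editing files.]

    EXPORT {
        Export_Id = 101;               # must be identical on every protocol node
        Path = /gpfs/fs1/projects;
        Pseudo = /gpfs/fs1/projects;
        FSAL { Name = GPFS; }
        # Filesystem_Id = 192.168;     # explicit fsid, if it is not to be derived from GPFS
    }
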
> > Thanks > > Simon > > On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Ouwehand, JJ" j.ouwehand at vumc.nl> wrote: > > >Hello, > > > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a > >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM > >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical > >(office, research and clinical data) business process. We have three > >large GPFS filesystems for different purposes. > > > >We also had such a situation with cNFS. A failover (IPtakeover) was > >technically good, only clients experienced "stale filehandles". We opened > >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months > >later, the solution appeared to be in the fsid option. > > > >An NFS filehandle is built by a combination of fsid and a hash function > >on the inode. After a failover, the fsid value can be different and the > >client has a "stale filehandle". To avoid this, the fsid value can be > >statically specified. See: > > > >https://urldefense.proofpoint.com/v2/url?u=https-3A__www.ibm.com_support_ > knowledgecenter_STXKQY-5F4.2.2_com.ibm.spectrum&d=DwICAg&c= > IGDlg0lD0b-nebmJJ0Kp8A&r=erT0ET1g1dsvTDYndRRTAAZ6Dneebt > G6F47PIUMDXFw&m=K3iXrW2N_HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s= > PIXnA0UQbneTHMRxvUcmsvZK6z5V2XU4jR_GIVaZP5Q&e= . > >scale.v4r22.doc/bl1adm_nfslin.htm > > > >Maybe there is also a value in Ganesha that changes after a failover. > >Certainly since most sessions will be re-established after a failback. > >Maybe you see more debug information with tcpdump. > > > > > >Kind regards, > > > >Jaap Jan Ouwehand > >ICT Specialist (Storage & Linux) > >VUmc - ICT > >E: jj.ouwehand at vumc.nl > >W: www.vumc.com > > > > > > > >-----Oorspronkelijk bericht----- > >Van: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson > >(IT Research Support) > >Verzonden: dinsdag 25 april 2017 13:21 > >Aan: gpfsug-discuss at spectrumscale.org > >Onderwerp: [gpfsug-discuss] NFS issues > > > >Hi, > > > >We have recently started deploying NFS in addition our existing SMB > >exports on our protocol nodes. > > > >We use a RR DNS name that points to 4 VIPs for SMB services and failover > >seems to work fine with SMB clients. We figured we could use the same > >name and IPs and run Ganesha on the protocol servers, however we are > >seeing issues with NFS clients when IP failover occurs. > > > >In normal operation on a client, we might see several mounts from > >different IPs obviously due to the way the DNS RR is working, but it all > >works fine. > > > >In a failover situation, the IP will move to another node and some > >clients will carry on, others will hang IO to the mount points referred > >to by the IP which has moved. We can *sometimes* trigger this by manually > >suspending a CES node, but not always and some clients mounting from the > >IP moving will be fine, others won't. > > > >If we resume a node an it fails back, the clients that are hanging will > >usually recover fine. We can reboot a client prior to failback and it > >will be fine, stopping and starting the ganesha service on a protocol > >node will also sometimes resolve the issues. > > > >So, has anyone seen this sort of issue and any suggestions for how we > >could either debug more or workaround? > > > >We are currently running the packages > >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). 
> > > >At one point we were seeing it a lot, and could track it back to an > >underlying GPFS network issue that was causing protocol nodes to be > >expelled occasionally, we resolved that and the issues became less > >apparent, but maybe we just fixed one failure mode so see it less often. > > > >On the clients, we use -o sync,hard BTW as in the IBM docs. > > > >On a client showing the issues, we'll see in dmesg, NFS related messages > >like: > >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not > >responding, timed out > > > >Which explains the client hang on certain mount points. > > > >The symptoms feel very much like those logged in this Gluster/ganesha bug: > >https://urldefense.proofpoint.com/v2/url?u=https- > 3A__bugzilla.redhat.com_show-5Fbug.cgi-3Fid-3D1354439&d= > DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r=erT0ET1g1dsvTDYndRRTAAZ6Dneebt > G6F47PIUMDXFw&m=K3iXrW2N_HcdrGDuKmRWFjypuPLPJDIm9VosFII > sFoI&s=KN5WKk1vLEt0Y_17nVQeDi1lK5mSQUZQ7lPtQK3FBG4&e= > > > > > >Thanks > > > >Simon > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_ > listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug.org_mailman_ > listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > https://urldefense.proofpoint.com/v2/url?u=http-3A__gpfsug. > org_mailman_listinfo_gpfsug-2Ddiscuss&d=DwICAg&c=IGDlg0lD0b-nebmJJ0Kp8A&r= > erT0ET1g1dsvTDYndRRTAAZ6DneebtG6F47PIUMDXFw&m=K3iXrW2N_ > HcdrGDuKmRWFjypuPLPJDIm9VosFIIsFoI&s=rvZX6mp5gZr7h3QuwTM2EVZaG- > d1VXwSDKDhKVyQurw&e= > -- Hoang Nguyen *? *Sr Staff Engineer Seagate Technology office: +1 (858) 751-4487 mobile: +1 (858) 284-7846 www.seagate.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Tue Apr 25 18:30:40 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Tue, 25 Apr 2017 17:30:40 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: , Message-ID: I did some digging in the mmcesfuncs to see what happens server side on fail over. Basically the server losing the IP is supposed to terminate all sessions and the receiver server sends ACK tickles. My current supposition is that for whatever reason, the losing server isn't releasing something and the client still has hold of a connection which is mostly dead. The tickle then fails to the client from the new server. This would explain why failing the IP back to the original server usually brings the client back to life. This is only my working theory at the moment as we can't reliably reproduce this. Next time it happens we plan to grab some netstat from each side. Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the server that received the IP and see if that fixes it (i.e. the receiver server didn't tickle properly). 
(Usage extracted from mmcesfuncs which is ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) for anyone interested. Then try and kill he sessions on the losing server to check if there is stuff still open and re-tickle the client. If we can get steps to workaround, I'll log a PMR. I suppose I could do that now, but given its non deterministic and we want to be 100% sure it's not us doing something wrong, I'm inclined to wait until we do some more testing. I agree with the suggestion that it's probably IO pending nodes that are affected, but don't have any data to back that up yet. We did try with a read workload on a client, but may we need either long IO blocked reads or writes (from the GPFS end). We also originally had soft as the default option, but saw issues then and the docs suggested hard, so we switched and also enabled sync (we figured maybe it was NFS client with uncommited writes), but neither have resolved the issues entirely. Difficult for me to say if they improved the issue though given its sporadic. Appreciate people's suggestions! Thanks Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode Myklebust [janfrode at tanso.net] Sent: 25 April 2017 18:04 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" on behalf of j.ouwehand at vumc.nl> wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at a >Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of IBM >Spectrum Scale, Spectrum Archive and Spectrum Protect in our critical >(office, research and clinical data) business process. We have three >large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We opened >a PMR at IBM and after testing, deliver logs, tcpdumps and a few months >later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. 
>scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] Namens Simon Thompson >(IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: gpfsug-discuss at spectrumscale.org >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and failover >seems to work fine with SMB clients. We figured we could use the same >name and IPs and run Ganesha on the protocol servers, however we are >seeing issues with NFS clients when IP failover occurs. > >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it all >works fine. > >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by manually >suspending a CES node, but not always and some clients mounting from the >IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server MYNFSSERVER.bham.ac.uk not >responding, timed out > >Which explains the client hang on certain mount points. 
> >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Greg.Lehmann at csiro.au Wed Apr 26 00:46:35 2017 From: Greg.Lehmann at csiro.au (Greg.Lehmann at csiro.au) Date: Tue, 25 Apr 2017 23:46:35 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: , Message-ID: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Are you using infiniband or Ethernet? I'm wondering if IBM have solved the gratuitous arp issue which we see with our non-protocols NFS implementation. -----Original Message----- From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon Thompson (IT Research Support) Sent: Wednesday, 26 April 2017 3:31 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I did some digging in the mmcesfuncs to see what happens server side on fail over. Basically the server losing the IP is supposed to terminate all sessions and the receiver server sends ACK tickles. My current supposition is that for whatever reason, the losing server isn't releasing something and the client still has hold of a connection which is mostly dead. The tickle then fails to the client from the new server. This would explain why failing the IP back to the original server usually brings the client back to life. This is only my working theory at the moment as we can't reliably reproduce this. Next time it happens we plan to grab some netstat from each side. Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the server that received the IP and see if that fixes it (i.e. the receiver server didn't tickle properly). (Usage extracted from mmcesfuncs which is ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) for anyone interested. Then try and kill he sessions on the losing server to check if there is stuff still open and re-tickle the client. If we can get steps to workaround, I'll log a PMR. I suppose I could do that now, but given its non deterministic and we want to be 100% sure it's not us doing something wrong, I'm inclined to wait until we do some more testing. I agree with the suggestion that it's probably IO pending nodes that are affected, but don't have any data to back that up yet. We did try with a read workload on a client, but may we need either long IO blocked reads or writes (from the GPFS end). We also originally had soft as the default option, but saw issues then and the docs suggested hard, so we switched and also enabled sync (we figured maybe it was NFS client with uncommited writes), but neither have resolved the issues entirely. Difficult for me to say if they improved the issue though given its sporadic. Appreciate people's suggestions! 
Thanks Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode Myklebust [janfrode at tanso.net] Sent: 25 April 2017 18:04 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] NFS issues I *think* I've seen this, and that we then had open TCP connection from client to NFS server according to netstat, but these connections were not visible from netstat on NFS-server side. Unfortunately I don't remember what the fix was.. -jf tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >: Hi, >From what I can see, Ganesha uses the Export_Id option in the config file (which is managed by CES) for this. I did find some reference in the Ganesha devs list that if its not set, then it would read the FSID from the GPFS file-system, either way they should surely be consistent across all the nodes. The posts I found were from someone with an IBM email address, so I guess someone in the IBM teams. I checked a couple of my protocol nodes and they use the same Export_Id consistently, though I guess that might not be the same as the FSID value. Perhaps someone from IBM could comment on if FSID is likely to the cause of my problems? Thanks Simon On 25/04/2017, 14:51, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Ouwehand, JJ" on behalf of j.ouwehand at vumc.nl> wrote: >Hello, > >At first a short introduction. My name is Jaap Jan Ouwehand, I work at >a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >critical (office, research and clinical data) business process. We have >three large GPFS filesystems for different purposes. > >We also had such a situation with cNFS. A failover (IPtakeover) was >technically good, only clients experienced "stale filehandles". We >opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >months later, the solution appeared to be in the fsid option. > >An NFS filehandle is built by a combination of fsid and a hash function >on the inode. After a failover, the fsid value can be different and the >client has a "stale filehandle". To avoid this, the fsid value can be >statically specified. See: > >https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum. >scale.v4r22.doc/bl1adm_nfslin.htm > >Maybe there is also a value in Ganesha that changes after a failover. >Certainly since most sessions will be re-established after a failback. >Maybe you see more debug information with tcpdump. > > >Kind regards, > >Jaap Jan Ouwehand >ICT Specialist (Storage & Linux) >VUmc - ICT >E: jj.ouwehand at vumc.nl >W: www.vumc.com > > > >-----Oorspronkelijk bericht----- >Van: >gpfsug-discuss-bounces at spectrumscale.orgspectrumscale.org> >[mailto:gpfsug-discuss-bounces at spectrumscale.orgbounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >Verzonden: dinsdag 25 april 2017 13:21 >Aan: >gpfsug-discuss at spectrumscale.orgg> >Onderwerp: [gpfsug-discuss] NFS issues > >Hi, > >We have recently started deploying NFS in addition our existing SMB >exports on our protocol nodes. > >We use a RR DNS name that points to 4 VIPs for SMB services and >failover seems to work fine with SMB clients. We figured we could use >the same name and IPs and run Ganesha on the protocol servers, however >we are seeing issues with NFS clients when IP failover occurs. 
> >In normal operation on a client, we might see several mounts from >different IPs obviously due to the way the DNS RR is working, but it >all works fine. > >In a failover situation, the IP will move to another node and some >clients will carry on, others will hang IO to the mount points referred >to by the IP which has moved. We can *sometimes* trigger this by >manually suspending a CES node, but not always and some clients >mounting from the IP moving will be fine, others won't. > >If we resume a node an it fails back, the clients that are hanging will >usually recover fine. We can reboot a client prior to failback and it >will be fine, stopping and starting the ganesha service on a protocol >node will also sometimes resolve the issues. > >So, has anyone seen this sort of issue and any suggestions for how we >could either debug more or workaround? > >We are currently running the packages >nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >At one point we were seeing it a lot, and could track it back to an >underlying GPFS network issue that was causing protocol nodes to be >expelled occasionally, we resolved that and the issues became less >apparent, but maybe we just fixed one failure mode so see it less often. > >On the clients, we use -o sync,hard BTW as in the IBM docs. > >On a client showing the issues, we'll see in dmesg, NFS related >messages >like: >[Wed Apr 12 16:59:53 2017] nfs: server >MYNFSSERVER.bham.ac.uk not responding, >timed out > >Which explains the client hang on certain mount points. > >The symptoms feel very much like those logged in this Gluster/ganesha bug: >https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > > >Thanks > >Simon > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From Mark.Bush at siriuscom.com Wed Apr 26 14:26:08 2017 From: Mark.Bush at siriuscom.com (Mark Bush) Date: Wed, 26 Apr 2017 13:26:08 +0000 Subject: [gpfsug-discuss] Perfmon and GUI In-Reply-To: References: <2A0DC44A-D9FF-428B-8B02-FC6EC504BD34@siriuscom.com> Message-ID: My saga has come to an end. Turns out to get perf stats for NFS you need the gpfs.pm-ganesha package - duh. I typically do manual installs of scale so I just missed this one as it was buried in /usr/lpp/mmfs/4.2.3.0/zimon_rpms/rhel7. Anyway, package installed and now I get NFS stats in the gui and from cli. From: "Sobey, Richard A" Reply-To: gpfsug main discussion list Date: Tuesday, April 25, 2017 at 9:31 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI No worries Mark. We don?t use NFS here (yet) so I can?t help there. Glad I could help. 
Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 15:29 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Perfmon and GUI Update: So SMB monitoring is now working after copying all files per Richard?s recommendation (thank you sir) and restarting pmsensors, pmcollector, and gpfsfui. Sadly, NFS monitoring isn?t. It doesn?t work from the cli either though. So clearly, something is up with that part. I continue to troubleshoot. From: Mark Bush > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 9:13 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI Interesting. Some files were indeed already there but it was missing a few NFSIO.cfg being the most notable to me. I?ve gone ahead and copied those to all my nodes (just three in this cluster) and restarted services. Still no luck. I?m going to restart the GUI service next to see if that makes a difference. Interestingly I can do things like mmperfmon query smb2 and that tends to work and give me real data so not sure where the breakdown is in the GUI. Mark From: "Sobey, Richard A" > Reply-To: gpfsug main discussion list > Date: Tuesday, April 25, 2017 at 8:44 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Perfmon and GUI I would have thought this would be fixed by now as this happened to me in 4.2.1-(0?) ? here?s what support said. Can you try? I think you?ve already got the relevant bits in your .cfg files so it should just be a case of copying the files across and restarting pmsensors and pmcollector. Again bear in mind this affected me on 4.2.1 and you?re using 4.2.3 so ymmv.. ? I spoke with development and normally these files would be copied over to /opt/IBM/zimon when using the automatic installer but since this case doesn't use the installer we have to copy them over manually. We acknowledge this should be in the docs, and the reason it is not included in pmsensors rpm is due to the fact these do not come from the zimon team. The following files can be copied over to /opt/IBM/zimon [root at node1 default]# pwd /usr/lpp/mmfs/4.2.1.0/installer/cookbooks/zimon_on_gpfs/files/default [root at node1 default]# ls CTDBDBStats.cfg CTDBStats.cfg NFSIO.cfg SMBGlobalStats.cfg SMBSensors.cfg SMBStats.cfg ZIMonCollector.cfg ? Richard From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark Bush Sent: 25 April 2017 14:28 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] Perfmon and GUI Anyone know why in the GUI when I go to look at things like nodes and select a protocol node and then pick NFS or SMB why it has the boxes where a graph is supposed to be and it has a Red circled X and says ?Performance collector did not return any data?? I?ve added the things from the link into my protocol Nodes /opt/IBM/zimon/ZIMonSensors.cfg file https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/com.ibm.spectrum.scale.v4r2.adv.doc/bl1adv_configuringthePMT.htm Also restarted both pmsensors and pmcollector on the nodes. What am I missing? 
Here?s my ZIMonSensors.cfg file [root at n3 zimon]# cat ZIMonSensors.cfg cephMon = "/opt/IBM/zimon/CephMonProxy" cephRados = "/opt/IBM/zimon/CephRadosProxy" colCandidates = "n1" colRedundancy = 1 collectors = { host = "n1" port = "4739" } config = "/opt/IBM/zimon/ZIMonSensors.cfg" ctdbstat = "" daemonize = T hostname = "" ipfixinterface = "0.0.0.0" logfile = "/var/log/zimon/ZIMonSensors.log" loglevel = "info" mmcmd = "/opt/IBM/zimon/MMCmdProxy" mmdfcmd = "/opt/IBM/zimon/MMDFProxy" mmpmon = "/opt/IBM/zimon/MmpmonSockProxy" piddir = "/var/run" release = "4.2.3-0" sensors = { name = "CPU" period = 1 }, { name = "Load" period = 1 }, { name = "Memory" period = 1 }, { name = "Network" period = 1 }, { name = "Netstat" period = 10 }, { name = "Diskstat" period = 0 }, { name = "DiskFree" period = 600 }, { name = "GPFSDisk" period = 0 }, { name = "GPFSFilesystem" period = 1 }, { name = "GPFSNSDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSPoolIO" period = 0 }, { name = "GPFSVFS" period = 1 }, { name = "GPFSIOC" period = 0 }, { name = "GPFSVIO" period = 0 }, { name = "GPFSPDDisk" period = 0 restrict = "nsdNodes" }, { name = "GPFSvFLUSH" period = 0 }, { name = "GPFSNode" period = 1 }, { name = "GPFSNodeAPI" period = 1 }, { name = "GPFSFilesystemAPI" period = 1 }, { name = "GPFSLROC" period = 0 }, { name = "GPFSCHMS" period = 0 }, { name = "GPFSAFM" period = 0 }, { name = "GPFSAFMFS" period = 0 }, { name = "GPFSAFMFSET" period = 0 }, { name = "GPFSRPCS" period = 10 }, { name = "GPFSWaiters" period = 10 }, { name = "GPFSFilesetQuota" period = 3600 }, { name = "GPFSDiskCap" period = 0 }, { name = "GPFSFileset" period = 0 restrict = "n1" }, { name = "GPFSPool" period = 0 restrict = "n1" }, { name = "Infiniband" period = 0 }, { name = "CTDBDBStats" period = 1 type = "Generic" }, { name = "CTDBStats" period = 1 type = "Generic" }, { name = "NFSIO" period = 1 type = "Generic" }, { name = "SMBGlobalStats" period = 1 type = "Generic" }, { name = "SMBStats" period = 1 type = "Generic" } smbstat = "" This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Wed Apr 26 15:20:30 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support)) Date: Wed, 26 Apr 2017 14:20:30 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: Nope, the clients are all L3 connected, so not an arp issue. Two things we have observed: 1. It triggers when one of the CES IPs moves and quickly moves back again. 
The move occurs because the NFS server goes into grace: 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 2 nodeid -1 ip 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 recovery release ip 2017-04-25 20:36:49 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE 2017-04-25 20:37:42 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:37:44 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN GRACE, duration 60 2017-04-25 20:37:44 : epoch 00040183 : : ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server recovery event 4 nodeid 2 ip We can't see in any of the logs WHY ganesha is going into grace. Any suggestions on how to debug this further? (I.e. If we can stop the grace issues, we can solve the problem mostly). 2. Our clients are using LDAP which is bound to the CES IPs. If we shutdown nslcd on the client we can get the client to recover once all the TIME_WAIT connections have gone. Maybe this was a bad choice on our side to bind to the CES IPs - we figured it would handily move the IPs for us, but I guess the mmcesfuncs isn't aware of this and so doesn't kill the connections to the IP as it goes away. So two approaches we are going to try. Reconfigure the nslcd on a couple of clients and see if they still show up the issues when fail-over occurs. Second is to work out why the NFS servers are going into grace in the first place. Simon On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Greg.Lehmann at csiro.au" wrote: >Are you using infiniband or Ethernet? I'm wondering if IBM have solved >the gratuitous arp issue which we see with our non-protocols NFS >implementation. > >-----Original Message----- >From: gpfsug-discuss-bounces at spectrumscale.org >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >Thompson (IT Research Support) >Sent: Wednesday, 26 April 2017 3:31 AM >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] NFS issues > >I did some digging in the mmcesfuncs to see what happens server side on >fail over. > >Basically the server losing the IP is supposed to terminate all sessions >and the receiver server sends ACK tickles. > >My current supposition is that for whatever reason, the losing server >isn't releasing something and the client still has hold of a connection >which is mostly dead. The tickle then fails to the client from the new >server. > >This would explain why failing the IP back to the original server usually >brings the client back to life. > >This is only my working theory at the moment as we can't reliably >reproduce this. Next time it happens we plan to grab some netstat from >each side. > >Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >server that received the IP and see if that fixes it (i.e. the receiver >server didn't tickle properly). (Usage extracted from mmcesfuncs which is >ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) >for anyone interested. > >Then try and kill he sessions on the losing server to check if there is >stuff still open and re-tickle the client. 
> >If we can get steps to workaround, I'll log a PMR. I suppose I could do >that now, but given its non deterministic and we want to be 100% sure >it's not us doing something wrong, I'm inclined to wait until we do some >more testing. > >I agree with the suggestion that it's probably IO pending nodes that are >affected, but don't have any data to back that up yet. We did try with a >read workload on a client, but may we need either long IO blocked reads >or writes (from the GPFS end). > >We also originally had soft as the default option, but saw issues then >and the docs suggested hard, so we switched and also enabled sync (we >figured maybe it was NFS client with uncommited writes), but neither have >resolved the issues entirely. Difficult for me to say if they improved >the issue though given its sporadic. > >Appreciate people's suggestions! > >Thanks > >Simon >________________________________________ >From: gpfsug-discuss-bounces at spectrumscale.org >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >Myklebust [janfrode at tanso.net] >Sent: 25 April 2017 18:04 >To: gpfsug main discussion list >Subject: Re: [gpfsug-discuss] NFS issues > >I *think* I've seen this, and that we then had open TCP connection from >client to NFS server according to netstat, but these connections were not >visible from netstat on NFS-server side. > >Unfortunately I don't remember what the fix was.. > > > > -jf > >tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >>: >Hi, > >From what I can see, Ganesha uses the Export_Id option in the config file >(which is managed by CES) for this. I did find some reference in the >Ganesha devs list that if its not set, then it would read the FSID from >the GPFS file-system, either way they should surely be consistent across >all the nodes. The posts I found were from someone with an IBM email >address, so I guess someone in the IBM teams. > >I checked a couple of my protocol nodes and they use the same Export_Id >consistently, though I guess that might not be the same as the FSID value. > >Perhaps someone from IBM could comment on if FSID is likely to the cause >of my problems? > >Thanks > >Simon > >On 25/04/2017, 14:51, >"gpfsug-discuss-bounces at spectrumscale.orgectrumscale.org> on behalf of Ouwehand, JJ" >ectrumscale.org> on behalf of >j.ouwehand at vumc.nl> wrote: > >>Hello, >> >>At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>critical (office, research and clinical data) business process. We have >>three large GPFS filesystems for different purposes. >> >>We also had such a situation with cNFS. A failover (IPtakeover) was >>technically good, only clients experienced "stale filehandles". We >>opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>months later, the solution appeared to be in the fsid option. >> >>An NFS filehandle is built by a combination of fsid and a hash function >>on the inode. After a failover, the fsid value can be different and the >>client has a "stale filehandle". To avoid this, the fsid value can be >>statically specified. See: >> >>https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>. >>scale.v4r22.doc/bl1adm_nfslin.htm >> >>Maybe there is also a value in Ganesha that changes after a failover. >>Certainly since most sessions will be re-established after a failback. 
>>Maybe you see more debug information with tcpdump. >> >> >>Kind regards, >> >>Jaap Jan Ouwehand >>ICT Specialist (Storage & Linux) >>VUmc - ICT >>E: jj.ouwehand at vumc.nl >>W: www.vumc.com >> >> >> >>-----Oorspronkelijk bericht----- >>Van: >>gpfsug-discuss-bounces at spectrumscale.org>spectrumscale.org> >>[mailto:gpfsug-discuss-bounces at spectrumscale.org>bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>Verzonden: dinsdag 25 april 2017 13:21 >>Aan: >>gpfsug-discuss at spectrumscale.org>g> >>Onderwerp: [gpfsug-discuss] NFS issues >> >>Hi, >> >>We have recently started deploying NFS in addition our existing SMB >>exports on our protocol nodes. >> >>We use a RR DNS name that points to 4 VIPs for SMB services and >>failover seems to work fine with SMB clients. We figured we could use >>the same name and IPs and run Ganesha on the protocol servers, however >>we are seeing issues with NFS clients when IP failover occurs. >> >>In normal operation on a client, we might see several mounts from >>different IPs obviously due to the way the DNS RR is working, but it >>all works fine. >> >>In a failover situation, the IP will move to another node and some >>clients will carry on, others will hang IO to the mount points referred >>to by the IP which has moved. We can *sometimes* trigger this by >>manually suspending a CES node, but not always and some clients >>mounting from the IP moving will be fine, others won't. >> >>If we resume a node an it fails back, the clients that are hanging will >>usually recover fine. We can reboot a client prior to failback and it >>will be fine, stopping and starting the ganesha service on a protocol >>node will also sometimes resolve the issues. >> >>So, has anyone seen this sort of issue and any suggestions for how we >>could either debug more or workaround? >> >>We are currently running the packages >>nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >> >>At one point we were seeing it a lot, and could track it back to an >>underlying GPFS network issue that was causing protocol nodes to be >>expelled occasionally, we resolved that and the issues became less >>apparent, but maybe we just fixed one failure mode so see it less often. >> >>On the clients, we use -o sync,hard BTW as in the IBM docs. >> >>On a client showing the issues, we'll see in dmesg, NFS related >>messages >>like: >>[Wed Apr 12 16:59:53 2017] nfs: server >>MYNFSSERVER.bham.ac.uk not responding, >>timed out >> >>Which explains the client hang on certain mount points. 
>> >>The symptoms feel very much like those logged in this Gluster/ganesha >>bug: >>https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >> >> >>Thanks >> >>Simon >> >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>_______________________________________________ >>gpfsug-discuss mailing list >>gpfsug-discuss at spectrumscale.org >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss >_______________________________________________ >gpfsug-discuss mailing list >gpfsug-discuss at spectrumscale.org >http://gpfsug.org/mailman/listinfo/gpfsug-discuss From janfrode at tanso.net Wed Apr 26 15:27:03 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Wed, 26 Apr 2017 14:27:03 +0000 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: Would it help to lower the grace time? mmnfs configuration change LEASE_LIFETIME=10 mmnfs configuration change GRACE_PERIOD=10 -jf ons. 26. apr. 2017 kl. 16.20 skrev Simon Thompson (IT Research Support) < S.J.Thompson at bham.ac.uk>: > Nope, the clients are all L3 connected, so not an arp issue. > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. > The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. 
> Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > > >Are you using infiniband or Ethernet? I'm wondering if IBM have solved > >the gratuitous arp issue which we see with our non-protocols NFS > >implementation. > > > >-----Original Message----- > >From: gpfsug-discuss-bounces at spectrumscale.org > >[mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon > >Thompson (IT Research Support) > >Sent: Wednesday, 26 April 2017 3:31 AM > >To: gpfsug main discussion list > >Subject: Re: [gpfsug-discuss] NFS issues > > > >I did some digging in the mmcesfuncs to see what happens server side on > >fail over. > > > >Basically the server losing the IP is supposed to terminate all sessions > >and the receiver server sends ACK tickles. > > > >My current supposition is that for whatever reason, the losing server > >isn't releasing something and the client still has hold of a connection > >which is mostly dead. The tickle then fails to the client from the new > >server. > > > >This would explain why failing the IP back to the original server usually > >brings the client back to life. > > > >This is only my working theory at the moment as we can't reliably > >reproduce this. Next time it happens we plan to grab some netstat from > >each side. > > > >Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the > >server that received the IP and see if that fixes it (i.e. the receiver > >server didn't tickle properly). (Usage extracted from mmcesfuncs which is > >ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) > >for anyone interested. > > > >Then try and kill he sessions on the losing server to check if there is > >stuff still open and re-tickle the client. > > > >If we can get steps to workaround, I'll log a PMR. I suppose I could do > >that now, but given its non deterministic and we want to be 100% sure > >it's not us doing something wrong, I'm inclined to wait until we do some > >more testing. > > > >I agree with the suggestion that it's probably IO pending nodes that are > >affected, but don't have any data to back that up yet. We did try with a > >read workload on a client, but may we need either long IO blocked reads > >or writes (from the GPFS end). > > > >We also originally had soft as the default option, but saw issues then > >and the docs suggested hard, so we switched and also enabled sync (we > >figured maybe it was NFS client with uncommited writes), but neither have > >resolved the issues entirely. Difficult for me to say if they improved > >the issue though given its sporadic. > > > >Appreciate people's suggestions! > > > >Thanks > > > >Simon > >________________________________________ > >From: gpfsug-discuss-bounces at spectrumscale.org > >[gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode > >Myklebust [janfrode at tanso.net] > >Sent: 25 April 2017 18:04 > >To: gpfsug main discussion list > >Subject: Re: [gpfsug-discuss] NFS issues > > > >I *think* I've seen this, and that we then had open TCP connection from > >client to NFS server according to netstat, but these connections were not > >visible from netstat on NFS-server side. > > > >Unfortunately I don't remember what the fix was.. > > > > > > > > -jf > > > >tir. 25. apr. 2017 kl. 
16.06 skrev Simon Thompson (IT Research Support) > >>: > >Hi, > > > >From what I can see, Ganesha uses the Export_Id option in the config file > >(which is managed by CES) for this. I did find some reference in the > >Ganesha devs list that if its not set, then it would read the FSID from > >the GPFS file-system, either way they should surely be consistent across > >all the nodes. The posts I found were from someone with an IBM email > >address, so I guess someone in the IBM teams. > > > >I checked a couple of my protocol nodes and they use the same Export_Id > >consistently, though I guess that might not be the same as the FSID value. > > > >Perhaps someone from IBM could comment on if FSID is likely to the cause > >of my problems? > > > >Thanks > > > >Simon > > > >On 25/04/2017, 14:51, > >"gpfsug-discuss-bounces at spectrumscale.org gpfsug-discuss-bounces at sp > >ectrumscale.org> on behalf of Ouwehand, JJ" > > gpfsug-discuss-bounces at sp > >ectrumscale.org> on behalf of > >j.ouwehand at vumc.nl> wrote: > > > >>Hello, > >> > >>At first a short introduction. My name is Jaap Jan Ouwehand, I work at > >>a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of > >>IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our > >>critical (office, research and clinical data) business process. We have > >>three large GPFS filesystems for different purposes. > >> > >>We also had such a situation with cNFS. A failover (IPtakeover) was > >>technically good, only clients experienced "stale filehandles". We > >>opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few > >>months later, the solution appeared to be in the fsid option. > >> > >>An NFS filehandle is built by a combination of fsid and a hash function > >>on the inode. After a failover, the fsid value can be different and the > >>client has a "stale filehandle". To avoid this, the fsid value can be > >>statically specified. See: > >> > >> > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum > >>. > >>scale.v4r22.doc/bl1adm_nfslin.htm > >> > >>Maybe there is also a value in Ganesha that changes after a failover. > >>Certainly since most sessions will be re-established after a failback. > >>Maybe you see more debug information with tcpdump. > >> > >> > >>Kind regards, > >> > >>Jaap Jan Ouwehand > >>ICT Specialist (Storage & Linux) > >>VUmc - ICT > >>E: jj.ouwehand at vumc.nl > >>W: www.vumc.com > >> > >> > >> > >>-----Oorspronkelijk bericht----- > >>Van: > >>gpfsug-discuss-bounces at spectrumscale.org >>spectrumscale.org> > >>[mailto:gpfsug-discuss-bounces at spectrumscale.org >>bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) > >>Verzonden: dinsdag 25 april 2017 13:21 > >>Aan: > >>gpfsug-discuss at spectrumscale.org >>g> > >>Onderwerp: [gpfsug-discuss] NFS issues > >> > >>Hi, > >> > >>We have recently started deploying NFS in addition our existing SMB > >>exports on our protocol nodes. > >> > >>We use a RR DNS name that points to 4 VIPs for SMB services and > >>failover seems to work fine with SMB clients. We figured we could use > >>the same name and IPs and run Ganesha on the protocol servers, however > >>we are seeing issues with NFS clients when IP failover occurs. > >> > >>In normal operation on a client, we might see several mounts from > >>different IPs obviously due to the way the DNS RR is working, but it > >>all works fine. 
> >> > >>In a failover situation, the IP will move to another node and some > >>clients will carry on, others will hang IO to the mount points referred > >>to by the IP which has moved. We can *sometimes* trigger this by > >>manually suspending a CES node, but not always and some clients > >>mounting from the IP moving will be fine, others won't. > >> > >>If we resume a node an it fails back, the clients that are hanging will > >>usually recover fine. We can reboot a client prior to failback and it > >>will be fine, stopping and starting the ganesha service on a protocol > >>node will also sometimes resolve the issues. > >> > >>So, has anyone seen this sort of issue and any suggestions for how we > >>could either debug more or workaround? > >> > >>We are currently running the packages > >>nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). > >> > >>At one point we were seeing it a lot, and could track it back to an > >>underlying GPFS network issue that was causing protocol nodes to be > >>expelled occasionally, we resolved that and the issues became less > >>apparent, but maybe we just fixed one failure mode so see it less often. > >> > >>On the clients, we use -o sync,hard BTW as in the IBM docs. > >> > >>On a client showing the issues, we'll see in dmesg, NFS related > >>messages > >>like: > >>[Wed Apr 12 16:59:53 2017] nfs: server > >>MYNFSSERVER.bham.ac.uk not responding, > >>timed out > >> > >>Which explains the client hang on certain mount points. > >> > >>The symptoms feel very much like those logged in this Gluster/ganesha > >>bug: > >>https://bugzilla.redhat.com/show_bug.cgi?id=1354439 > >> > >> > >>Thanks > >> > >>Simon > >> > >>_______________________________________________ > >>gpfsug-discuss mailing list > >>gpfsug-discuss at spectrumscale.org > >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >>_______________________________________________ > >>gpfsug-discuss mailing list > >>gpfsug-discuss at spectrumscale.org > >>http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > >_______________________________________________ > >gpfsug-discuss mailing list > >gpfsug-discuss at spectrumscale.org > >http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From peserocka at gmail.com Wed Apr 26 18:53:51 2017 From: peserocka at gmail.com (Peter Serocka) Date: Wed, 26 Apr 2017 19:53:51 +0200 Subject: [gpfsug-discuss] NFS issues In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: > On 2017 Apr 26 Wed, at 16:20, Simon Thompson (IT Research Support) wrote: > > Nope, the clients are all L3 connected, so not an arp issue. ...not on the client, but the server-facing L3 switch still need to manage its ARP table, and might miss the IP moving to a new MAC. Cisco switches have a default ARP cache timeout of 4 hours, fwiw. Can your network team provide you the ARP status from the switch when you see a fail-over being stuck? ? 
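If it helps, a rough sketch of what to compare - the CES IP here is made up and the exact switch syntax will vary by vendor:

    # Cisco-style switch CLI: which MAC does the CES IP currently resolve to?
    show ip arp 10.10.0.15
    # on a Linux host in the same segment as the CES nodes:
    ip neigh show | grep 10.10.0.15
    # and which CES node should be holding that IP right now:
    mmces address list
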
Peter > > Two things we have observed: > > 1. It triggers when one of the CES IPs moves and quickly moves back again. > The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. > Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > >> Are you using infiniband or Ethernet? I'm wondering if IBM have solved >> the gratuitous arp issue which we see with our non-protocols NFS >> implementation. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >> Thompson (IT Research Support) >> Sent: Wednesday, 26 April 2017 3:31 AM >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I did some digging in the mmcesfuncs to see what happens server side on >> fail over. >> >> Basically the server losing the IP is supposed to terminate all sessions >> and the receiver server sends ACK tickles. >> >> My current supposition is that for whatever reason, the losing server >> isn't releasing something and the client still has hold of a connection >> which is mostly dead. The tickle then fails to the client from the new >> server. >> >> This would explain why failing the IP back to the original server usually >> brings the client back to life. >> >> This is only my working theory at the moment as we can't reliably >> reproduce this. Next time it happens we plan to grab some netstat from >> each side. >> >> Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >> server that received the IP and see if that fixes it (i.e. 
the receiver >> server didn't tickle properly). (Usage extracted from mmcesfuncs which is >> ksh of course). ... CesIPPort is colon separated IP:portnumber (of NFSd) >> for anyone interested. >> >> Then try and kill he sessions on the losing server to check if there is >> stuff still open and re-tickle the client. >> >> If we can get steps to workaround, I'll log a PMR. I suppose I could do >> that now, but given its non deterministic and we want to be 100% sure >> it's not us doing something wrong, I'm inclined to wait until we do some >> more testing. >> >> I agree with the suggestion that it's probably IO pending nodes that are >> affected, but don't have any data to back that up yet. We did try with a >> read workload on a client, but may we need either long IO blocked reads >> or writes (from the GPFS end). >> >> We also originally had soft as the default option, but saw issues then >> and the docs suggested hard, so we switched and also enabled sync (we >> figured maybe it was NFS client with uncommited writes), but neither have >> resolved the issues entirely. Difficult for me to say if they improved >> the issue though given its sporadic. >> >> Appreciate people's suggestions! >> >> Thanks >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >> Myklebust [janfrode at tanso.net] >> Sent: 25 April 2017 18:04 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I *think* I've seen this, and that we then had open TCP connection from >> client to NFS server according to netstat, but these connections were not >> visible from netstat on NFS-server side. >> >> Unfortunately I don't remember what the fix was.. >> >> >> >> -jf >> >> tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >> >: >> Hi, >> >> From what I can see, Ganesha uses the Export_Id option in the config file >> (which is managed by CES) for this. I did find some reference in the >> Ganesha devs list that if its not set, then it would read the FSID from >> the GPFS file-system, either way they should surely be consistent across >> all the nodes. The posts I found were from someone with an IBM email >> address, so I guess someone in the IBM teams. >> >> I checked a couple of my protocol nodes and they use the same Export_Id >> consistently, though I guess that might not be the same as the FSID value. >> >> Perhaps someone from IBM could comment on if FSID is likely to the cause >> of my problems? >> >> Thanks >> >> Simon >> >> On 25/04/2017, 14:51, >> "gpfsug-discuss-bounces at spectrumscale.org> ectrumscale.org> on behalf of Ouwehand, JJ" >> > ectrumscale.org> on behalf of >> j.ouwehand at vumc.nl> wrote: >> >>> Hello, >>> >>> At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>> a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>> IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>> critical (office, research and clinical data) business process. We have >>> three large GPFS filesystems for different purposes. >>> >>> We also had such a situation with cNFS. A failover (IPtakeover) was >>> technically good, only clients experienced "stale filehandles". We >>> opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>> months later, the solution appeared to be in the fsid option. >>> >>> An NFS filehandle is built by a combination of fsid and a hash function >>> on the inode. 
After a failover, the fsid value can be different and the >>> client has a "stale filehandle". To avoid this, the fsid value can be >>> statically specified. See: >>> >>> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>> . >>> scale.v4r22.doc/bl1adm_nfslin.htm >>> >>> Maybe there is also a value in Ganesha that changes after a failover. >>> Certainly since most sessions will be re-established after a failback. >>> Maybe you see more debug information with tcpdump. >>> >>> >>> Kind regards, >>> >>> Jaap Jan Ouwehand >>> ICT Specialist (Storage & Linux) >>> VUmc - ICT >>> E: jj.ouwehand at vumc.nl >>> W: www.vumc.com >>> >>> >>> >>> -----Oorspronkelijk bericht----- >>> Van: >>> gpfsug-discuss-bounces at spectrumscale.org>> spectrumscale.org> >>> [mailto:gpfsug-discuss-bounces at spectrumscale.org>> bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>> Verzonden: dinsdag 25 april 2017 13:21 >>> Aan: >>> gpfsug-discuss at spectrumscale.org>> g> >>> Onderwerp: [gpfsug-discuss] NFS issues >>> >>> Hi, >>> >>> We have recently started deploying NFS in addition our existing SMB >>> exports on our protocol nodes. >>> >>> We use a RR DNS name that points to 4 VIPs for SMB services and >>> failover seems to work fine with SMB clients. We figured we could use >>> the same name and IPs and run Ganesha on the protocol servers, however >>> we are seeing issues with NFS clients when IP failover occurs. >>> >>> In normal operation on a client, we might see several mounts from >>> different IPs obviously due to the way the DNS RR is working, but it >>> all works fine. >>> >>> In a failover situation, the IP will move to another node and some >>> clients will carry on, others will hang IO to the mount points referred >>> to by the IP which has moved. We can *sometimes* trigger this by >>> manually suspending a CES node, but not always and some clients >>> mounting from the IP moving will be fine, others won't. >>> >>> If we resume a node an it fails back, the clients that are hanging will >>> usually recover fine. We can reboot a client prior to failback and it >>> will be fine, stopping and starting the ganesha service on a protocol >>> node will also sometimes resolve the issues. >>> >>> So, has anyone seen this sort of issue and any suggestions for how we >>> could either debug more or workaround? >>> >>> We are currently running the packages >>> nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >>> >>> At one point we were seeing it a lot, and could track it back to an >>> underlying GPFS network issue that was causing protocol nodes to be >>> expelled occasionally, we resolved that and the issues became less >>> apparent, but maybe we just fixed one failure mode so see it less often. >>> >>> On the clients, we use -o sync,hard BTW as in the IBM docs. >>> >>> On a client showing the issues, we'll see in dmesg, NFS related >>> messages >>> like: >>> [Wed Apr 12 16:59:53 2017] nfs: server >>> MYNFSSERVER.bham.ac.uk not responding, >>> timed out >>> >>> Which explains the client hang on certain mount points. 
>>> >>> The symptoms feel very much like those logged in this Gluster/ganesha >>> bug: >>> https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >>> >>> >>> Thanks >>> >>> Simon >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From S.J.Thompson at bham.ac.uk Wed Apr 26 19:00:06 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (IT Research Support))
Date: Wed, 26 Apr 2017 18:00:06 +0000
Subject: [gpfsug-discuss] NFS issues
In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> , Message-ID:

We have no issues with SMB clients accessing over L3, so I'm pretty sure it's not ARP. And some of the boxes on the other side of the L3 gateway don't see the issues. We don't use Cisco kit.

I posted in a different update that we think it's related to connections to other ports on the same IP which get left open when the IP quickly gets moved away and back again.

Simon
________________________________________
From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Peter Serocka [peserocka at gmail.com]
Sent: 26 April 2017 18:53
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] NFS issues

> On 2017 Apr 26 Wed, at 16:20, Simon Thompson (IT Research Support) wrote:
>
> Nope, the clients are all L3 connected, so not an arp issue.

...not on the client, but the server-facing L3 switch still needs to manage its ARP table, and might miss the IP moving to a new MAC. Cisco switches have a default ARP cache timeout of 4 hours, fwiw.

Can your network team provide you with the ARP status from the switch when you see a fail-over getting stuck?

-- Peter

>
> Two things we have observed:
>
> 1. It triggers when one of the CES IPs moves and quickly moves back again.
> The move occurs because the NFS server goes into grace: > > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 2 nodeid -1 ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_release_v4_client :STATE :EVENT :NFS Server V4 > recovery release ip > 2017-04-25 20:36:49 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs_in_grace :STATE :EVENT :NFS Server Now IN GRACE > 2017-04-25 20:37:42 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server Now IN > GRACE, duration 60 > 2017-04-25 20:37:44 : epoch 00040183 : : > ganesha.nfsd-1261[dbus] nfs4_start_grace :STATE :EVENT :NFS Server > recovery event 4 nodeid 2 ip > > > > We can't see in any of the logs WHY ganesha is going into grace. Any > suggestions on how to debug this further? (I.e. If we can stop the grace > issues, we can solve the problem mostly). > > > 2. Our clients are using LDAP which is bound to the CES IPs. If we > shutdown nslcd on the client we can get the client to recover once all the > TIME_WAIT connections have gone. Maybe this was a bad choice on our side > to bind to the CES IPs - we figured it would handily move the IPs for us, > but I guess the mmcesfuncs isn't aware of this and so doesn't kill the > connections to the IP as it goes away. > > > So two approaches we are going to try. Reconfigure the nslcd on a couple > of clients and see if they still show up the issues when fail-over occurs. > Second is to work out why the NFS servers are going into grace in the > first place. > > Simon > > On 26/04/2017, 00:46, "gpfsug-discuss-bounces at spectrumscale.org on behalf > of Greg.Lehmann at csiro.au" behalf of Greg.Lehmann at csiro.au> wrote: > >> Are you using infiniband or Ethernet? I'm wondering if IBM have solved >> the gratuitous arp issue which we see with our non-protocols NFS >> implementation. >> >> -----Original Message----- >> From: gpfsug-discuss-bounces at spectrumscale.org >> [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Simon >> Thompson (IT Research Support) >> Sent: Wednesday, 26 April 2017 3:31 AM >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I did some digging in the mmcesfuncs to see what happens server side on >> fail over. >> >> Basically the server losing the IP is supposed to terminate all sessions >> and the receiver server sends ACK tickles. >> >> My current supposition is that for whatever reason, the losing server >> isn't releasing something and the client still has hold of a connection >> which is mostly dead. The tickle then fails to the client from the new >> server. >> >> This would explain why failing the IP back to the original server usually >> brings the client back to life. >> >> This is only my working theory at the moment as we can't reliably >> reproduce this. Next time it happens we plan to grab some netstat from >> each side. >> >> Then we plan to issue "mmcmi tcpack $cesIpPort $clientIpPort" on the >> server that received the IP and see if that fixes it (i.e. the receiver >> server didn't tickle properly). (Usage extracted from mmcesfuncs which is >> ksh of course). ... 
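(For concreteness, a minimal sketch of the lingering-session check and re-tickle described above. Every address, port and node name below is a placeholder rather than a value from this thread, and the mmcmi tcpack syntax is simply the usage quoted from mmcesfuncs, not anything from public documentation:)

    # Run on the CES node that has just RECEIVED the floating IP.
    CES_IP=192.0.2.10      # placeholder: the CES address that moved
    CLIENT_IP=192.0.2.50   # placeholder: a client whose mounts are hanging

    # TCP sessions to the CES IP that survived the failover; anything still
    # ESTABLISHED or CLOSE_WAIT here is a candidate stuck session.
    ss -tn | grep "$CES_IP"

    # Compare with the node that LOST the IP, where half-dead sessions may
    # still be held open (and, per Jan-Frode, may not even be visible).
    ssh old-ces-node "ss -tn | grep $CES_IP"

    # Re-tickle the client from the receiving node; both arguments are
    # IP:port pairs, 2049 being the NFS port and the client port taken
    # from the ss output above (left commented out here on purpose).
    # mmcmi tcpack ${CES_IP}:2049 ${CLIENT_IP}:54321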
CesIPPort is colon separated IP:portnumber (of NFSd) >> for anyone interested. >> >> Then try and kill he sessions on the losing server to check if there is >> stuff still open and re-tickle the client. >> >> If we can get steps to workaround, I'll log a PMR. I suppose I could do >> that now, but given its non deterministic and we want to be 100% sure >> it's not us doing something wrong, I'm inclined to wait until we do some >> more testing. >> >> I agree with the suggestion that it's probably IO pending nodes that are >> affected, but don't have any data to back that up yet. We did try with a >> read workload on a client, but may we need either long IO blocked reads >> or writes (from the GPFS end). >> >> We also originally had soft as the default option, but saw issues then >> and the docs suggested hard, so we switched and also enabled sync (we >> figured maybe it was NFS client with uncommited writes), but neither have >> resolved the issues entirely. Difficult for me to say if they improved >> the issue though given its sporadic. >> >> Appreciate people's suggestions! >> >> Thanks >> >> Simon >> ________________________________________ >> From: gpfsug-discuss-bounces at spectrumscale.org >> [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jan-Frode >> Myklebust [janfrode at tanso.net] >> Sent: 25 April 2017 18:04 >> To: gpfsug main discussion list >> Subject: Re: [gpfsug-discuss] NFS issues >> >> I *think* I've seen this, and that we then had open TCP connection from >> client to NFS server according to netstat, but these connections were not >> visible from netstat on NFS-server side. >> >> Unfortunately I don't remember what the fix was.. >> >> >> >> -jf >> >> tir. 25. apr. 2017 kl. 16.06 skrev Simon Thompson (IT Research Support) >> >: >> Hi, >> >> From what I can see, Ganesha uses the Export_Id option in the config file >> (which is managed by CES) for this. I did find some reference in the >> Ganesha devs list that if its not set, then it would read the FSID from >> the GPFS file-system, either way they should surely be consistent across >> all the nodes. The posts I found were from someone with an IBM email >> address, so I guess someone in the IBM teams. >> >> I checked a couple of my protocol nodes and they use the same Export_Id >> consistently, though I guess that might not be the same as the FSID value. >> >> Perhaps someone from IBM could comment on if FSID is likely to the cause >> of my problems? >> >> Thanks >> >> Simon >> >> On 25/04/2017, 14:51, >> "gpfsug-discuss-bounces at spectrumscale.org> ectrumscale.org> on behalf of Ouwehand, JJ" >> > ectrumscale.org> on behalf of >> j.ouwehand at vumc.nl> wrote: >> >>> Hello, >>> >>> At first a short introduction. My name is Jaap Jan Ouwehand, I work at >>> a Dutch hospital "VU Medical Center" in Amsterdam. We make daily use of >>> IBM Spectrum Scale, Spectrum Archive and Spectrum Protect in our >>> critical (office, research and clinical data) business process. We have >>> three large GPFS filesystems for different purposes. >>> >>> We also had such a situation with cNFS. A failover (IPtakeover) was >>> technically good, only clients experienced "stale filehandles". We >>> opened a PMR at IBM and after testing, deliver logs, tcpdumps and a few >>> months later, the solution appeared to be in the fsid option. >>> >>> An NFS filehandle is built by a combination of fsid and a hash function >>> on the inode. After a failover, the fsid value can be different and the >>> client has a "stale filehandle". 
To avoid this, the fsid value can be >>> statically specified. See: >>> >>> https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/com.ibm.spectrum >>> . >>> scale.v4r22.doc/bl1adm_nfslin.htm >>> >>> Maybe there is also a value in Ganesha that changes after a failover. >>> Certainly since most sessions will be re-established after a failback. >>> Maybe you see more debug information with tcpdump. >>> >>> >>> Kind regards, >>> >>> Jaap Jan Ouwehand >>> ICT Specialist (Storage & Linux) >>> VUmc - ICT >>> E: jj.ouwehand at vumc.nl >>> W: www.vumc.com >>> >>> >>> >>> -----Oorspronkelijk bericht----- >>> Van: >>> gpfsug-discuss-bounces at spectrumscale.org>> spectrumscale.org> >>> [mailto:gpfsug-discuss-bounces at spectrumscale.org>> bounces at spectrumscale.org>] Namens Simon Thompson (IT Research Support) >>> Verzonden: dinsdag 25 april 2017 13:21 >>> Aan: >>> gpfsug-discuss at spectrumscale.org>> g> >>> Onderwerp: [gpfsug-discuss] NFS issues >>> >>> Hi, >>> >>> We have recently started deploying NFS in addition our existing SMB >>> exports on our protocol nodes. >>> >>> We use a RR DNS name that points to 4 VIPs for SMB services and >>> failover seems to work fine with SMB clients. We figured we could use >>> the same name and IPs and run Ganesha on the protocol servers, however >>> we are seeing issues with NFS clients when IP failover occurs. >>> >>> In normal operation on a client, we might see several mounts from >>> different IPs obviously due to the way the DNS RR is working, but it >>> all works fine. >>> >>> In a failover situation, the IP will move to another node and some >>> clients will carry on, others will hang IO to the mount points referred >>> to by the IP which has moved. We can *sometimes* trigger this by >>> manually suspending a CES node, but not always and some clients >>> mounting from the IP moving will be fine, others won't. >>> >>> If we resume a node an it fails back, the clients that are hanging will >>> usually recover fine. We can reboot a client prior to failback and it >>> will be fine, stopping and starting the ganesha service on a protocol >>> node will also sometimes resolve the issues. >>> >>> So, has anyone seen this sort of issue and any suggestions for how we >>> could either debug more or workaround? >>> >>> We are currently running the packages >>> nfs-ganesha-2.3.2-0.ibm32_1.el7.x86_64 (4.2.2-2 release ones). >>> >>> At one point we were seeing it a lot, and could track it back to an >>> underlying GPFS network issue that was causing protocol nodes to be >>> expelled occasionally, we resolved that and the issues became less >>> apparent, but maybe we just fixed one failure mode so see it less often. >>> >>> On the clients, we use -o sync,hard BTW as in the IBM docs. >>> >>> On a client showing the issues, we'll see in dmesg, NFS related >>> messages >>> like: >>> [Wed Apr 12 16:59:53 2017] nfs: server >>> MYNFSSERVER.bham.ac.uk not responding, >>> timed out >>> >>> Which explains the client hang on certain mount points. 
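(To make the static-fsid suggestion quoted above concrete, here is a hedged sketch for the kernel-NFS/cNFS case that the IBM link describes. The export path and fsid value are made up for illustration:)

    # /etc/exports on every NFS server in the failover group
    /gpfs/fs1  *(rw,sync,no_root_squash,fsid=745)

    # With the same fsid pinned on all servers, the filehandles handed out
    # before and after an IP takeover match, so clients do not see ESTALE.

For CES/Ganesha the export definitions are generated by mmnfs, so any equivalent setting (Ganesha's EXPORT block has a Filesystem_Id parameter, if memory serves) would need to go through the CES tooling rather than a hand edit of the config files.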
>>> >>> The symptoms feel very much like those logged in this Gluster/ganesha >>> bug: >>> https://bugzilla.redhat.com/show_bug.cgi?id=1354439 >>> >>> >>> Thanks >>> >>> Simon >>> >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >>> _______________________________________________ >>> gpfsug-discuss mailing list >>> gpfsug-discuss at spectrumscale.org >>> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From valdis.kletnieks at vt.edu Thu Apr 27 00:44:44 2017
From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu)
Date: Wed, 26 Apr 2017 19:44:44 -0400
Subject: [gpfsug-discuss] NFS issues
In-Reply-To: References: <540d5b070cc8438ebe73df14a1ab619b@exch1-cdc.nexus.csiro.au> Message-ID: <52226.1493250284@turing-police.cc.vt.edu>

On Wed, 26 Apr 2017 14:20:30 -0000, "Simon Thompson (IT Research Support)" said:

> We can't see in any of the logs WHY ganesha is going into grace. Any
> suggestions on how to debug this further? (I.e. If we can stop the grace
> issues, we can solve the problem mostly).

After over 3 decades of experience with 'exportfs' being totally safe to run in real time with both userspace and kernel NFSD implementations, it came as quite a surprise when we did 'mmnfs export change --nfsadd='... and it bounced the NFS server on all 4 protocol nodes. At the same time.

Fortunately for us, the set of client nodes only changes once every 2-3 months.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available Type: application/pgp-signature Size: 486 bytes Desc: not available URL:

From secretary at gpfsug.org Thu Apr 27 09:29:41 2017
From: secretary at gpfsug.org (Secretary GPFS UG)
Date: Thu, 27 Apr 2017 09:29:41 +0100
Subject: [gpfsug-discuss] Meet other Spectrum Scale users in May
Message-ID: <1f483faa9cb61dcdc80afb187e908745@webmail.gpfsug.org>

Dear Members,

Please join us and other Spectrum Scale users for 2 days of great talks and networking!

WHEN: 9-10th May 2017
WHERE: Macdonald Manchester Hotel & Spa, Manchester, UK (right by Manchester Piccadilly train station)
WHO? The event is free to attend, is open to members from all industries and welcomes users with a little or a lot of experience using Spectrum Scale.

The SSUG brings together the Spectrum Scale User Community, including Spectrum Scale developers and architects, to share knowledge, experiences and future plans. Topics include transparent cloud tiering, AFM, automation and security best practices, Docker and HDFS support, problem determination, and an update on Elastic Storage Server (ESS).
Our popular forum includes interactive problem solving, a best practices discussion and networking. We're very excited to welcome back Doris Conti, the Director for Spectrum Scale (GPFS) and HPC SW Product Development at IBM.

The May meeting is sponsored by IBM, DDN, Lenovo, Mellanox, Seagate, Arcastream, Ellexus, and OCF. It is an excellent opportunity to learn more and get your questions answered.

Register your place today at the Eventbrite page https://goo.gl/tRptru [1]

We hope to see you there!

--
Claire O'Toole
Spectrum Scale/GPFS User Group Secretary
+44 (0)7508 033896
www.spectrumscaleug.org

Links:
------
[1] https://goo.gl/tRptru

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From robert at strubi.ox.ac.uk Thu Apr 27 12:46:09 2017
From: robert at strubi.ox.ac.uk (Robert Esnouf)
Date: Thu, 27 Apr 2017 12:46:09 +0100 (BST)
Subject: [gpfsug-discuss] Two high-performance research computing posts in Oxford University Medical Sciences
Message-ID: <201704271146.061978@mail.strubi.ox.ac.uk>

Dear All,

I hope that it is allowed to put job postings on this discussion list... sorry if I've broken a rule, but it does mention Spectrum Scale! I'd like to advertise the availability of two exciting and challenging new opportunities to work in research computing/high-performance computing at Oxford University within the Nuffield Department of Medicine.

The first is a Grade 8 position to expand the current Research Computing Core team at the Wellcome Trust Centre for Human Genetics. The Core now runs a cluster of ~3800 high-memory compute cores, a further ~700 cores outside the cluster, a (growing) smattering of GPU-enabled and KNL nodes, 4PB of high-performance Spectrum Scale (GPFS) storage and about 4PB of lower grade (mostly XFS) storage. The facility has an FDR InfiniBand fabric providing access to storage at up to 20GB/s and supporting MPI workloads. We mainly support the statistical genetics work of the Centre and other departments around Oxford, the work of the sequencing and bioinformatics cores and electron microscopy, but the workload is varied and interesting! Further significant update and expansion of this facility will occur during 2017 and beyond, which means that we are expanding the team.

http://www.well.ox.ac.uk/home
http://www.well.ox.ac.uk/research-8
https://www.recruit.ox.ac.uk/pls/hrisliverecruit/erq_jobspec_version_4.display_form?p_company=10&p_internal_external=E&p_display_in_irish=N&p_process_type=&p_applicant_no=&p_form_profile_detail=&p_display_apply_ind=Y&p_refresh_search=Y&p_recruitment_id=126748

The second is a Grade 9 post at the newly opened Big Data Institute next door to the WTCHG - to work with me to establish a brand new Research Computing facility. The Big Data Institute Building has 32 shiny new racks ready to be filled with up to 320kW of IT load - and we won't stop there! The current plans envisage a virtualized infrastructure for secure access, a high-performance cluster supporting traditional workloads and containers, high-performance filesystem storage, a hyperconverged infrastructure (supporting OpenStack, project VMs, containers and distributed computing platforms such as Apache Spark), a significant GPU-based artificial intelligence/deep learning platform and a large, multisite object store for managing research data in the long term.
https://www.bdi.ox.ac.uk/ https://www.ndm.ox.ac.uk/current-job-vacancies/vacancy/128486-BDI-Research-Computing-Manager https://www.recruit.ox.ac.uk/pls/hrisliverecruit/erq_jobspec_version_4.display_form?p_company=10&p_internal_external=E&p_display_in_irish=N&p_process_type=&p_applicant_no=&p_form_profile_detail=&p_display_apply_ind=Y&p_refresh_search=Y&p_recruitment_id=128486 It is expected that the Wellcome Trust Centre and Big Data Institute facilities will develop independently for now, but in a complementary and supportive fashion given the overlap in science and technology that is likely to exist. The Research Computing support teams will therefore work extremely closely together to address the challenges facing computing in the medical sciences. If either (or both) of these vacancies seem interesting then please feel free to contact the Head of the Research Computing Core at the WTCHG (me) or the Director of Research Computing at the BDI (me). Deadline for the WTCHG post is 31st May and for the BDI post is 24th May. Please feel free to circulate this email to anyone who might be interested and apologies for any cross postings! Regards, Robert -- Dr Robert Esnouf University Research Lecturer, Director of Research Computing BDI, Head of Research Computing Core WTCHG, NDM Research Computing Strategy Officer Main office: Room 10/028, Wellcome Trust Centre for Human Genetics, Old Road Campus, Roosevelt Drive, Oxford OX3 7BN, UK Emails: robert at strubi.ox.ac.uk / robert at well.ox.ac.uk / robert.esnouf at bdi.ox.ac.uk Tel: (+44) - 1865 - 287783