From MDIETZ at de.ibm.com Wed Feb 1 09:04:14 2017
From: MDIETZ at de.ibm.com (Mathias Dietz)
Date: Wed, 1 Feb 2017 10:04:14 +0100
Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes
In-Reply-To: References: Message-ID:

>I'll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i think

Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime, until the fix is available, we will use the PMR to propose/discuss potential workarounds.

Mit freundlichen Grüßen / Kind regards

Mathias Dietz
Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer
----------------------------------------------------------------------------------------------------------
IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com
----------------------------------------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart, HRB 243294

From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org

Yeah... depending on the #nodes you're affected or not. So if your remote ces cluster is small enough in terms of the #nodes ... you'll never hit this issue

Sent from IBM Verse

Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ---
From: "Simon Thompson (Research Computing - IT Services)" To: "gpfsug main discussion list" Date: Tue. 31.01.2017 21:07 Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

We use multicluster for our environment: storage systems in a separate cluster, HPC nodes in a separate cluster from the protocol nodes. According to the docs this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken.

Simon
________________________________________
From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Yeah, I searched around for places where `tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it's only in CES. I suspect there just haven't been that many people exporting CES out of an HPC cluster environment.

~jonathon

From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

I'll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i think

Sent from IBM Verse

Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ---
From: "Jonathon A Anderson" To: "gpfsug main discussion list" Date: Tue.
31.01.2017 17:32 Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
________________________________

No, I'm having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don't have "protocol node" support, so they've pushed back on supporting this as an overall CES-rooted effort.

I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR?

Thanks.

~jonathon

From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

ok.. so obviously ... it seems that we have several issues.. the 3983 characters is obviously a defect. have you already raised a PMR? if so, can you send me the number?

From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org
________________________________

The tail isn't the issue; that's my addition, so that I didn't have to paste the hundred or so line nodelist into the thread. The actual command is

tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile

But you can see in my tailed output that the last hostname listed is cut off halfway through the hostname. Less obvious in the example, but true, is the fact that it's only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster.

[root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l
120
[root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l
403

Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters.

[root at sgate2 ~]# tsctl shownodes up | wc -c
3983

Again, I'm convinced this is a bug not only because the command doesn't actually produce a list of all of the up nodes in our cluster, but because the last name listed is incomplete.

[root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1
shas0260-opa.rc.int.col[root at sgate2 ~]#

I'd continue my investigation within tsctl itself but, alas, it's a binary with no source code available to me. :)

I'm trying to get this opened as a bug / PMR; but I'm still working through the DDN support infrastructure. Thanks for reporting it, though.

For the record:

[root at sgate2 ~]# rpm -qa | grep -i gpfs
gpfs.base-4.2.1-2.x86_64
gpfs.msg.en_US-4.2.1-2.noarch
gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64
gpfs.gskit-8.0.50-57.x86_64
gpfs.gpl-4.2.1-2.noarch
nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64
gpfs.ext-4.2.1-2.x86_64
gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64
gpfs.docs-4.2.1-2.noarch

~jonathon

From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I'll open a PMR .. and I recommend you to do the same thing.. ;-)
the reason seems simple.. it is the "| tail" at the end of the command.. which truncates the output to the last 10 items... should be easy to fix..
cheers olaf

From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org
________________________________

In trying to figure this out on my own, I'm relatively certain I've found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm?

Here are the details of my investigation:

## GPFS is up on sgate2
[root at sgate2 ~]# mmgetstate
Node number  Node name   GPFS state
------------------------------------------
       414   sgate2-opa  active

## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down
[root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa
mmces address move: GPFS is down on this node.
mmces address move: Command failed. Examine previous error messages to determine cause.

## the "GPFS is down on this node" message is defined as code 109 in mmglobfuncs
[root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs
109 ) msgTxt=\
"%s: GPFS is down on this node."

## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as "down" by getDownCesNodeList
[root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress
downNodeList=$(getDownCesNodeList)
for downNode in $downNodeList
do
  if [[ $toNodeName == $downNode ]]
  then
    printErrorMsg 109 "$mmcmd"

## getDownCesNodeList is the set of CES nodes that do not appear among the GPFS cluster nodes listed by `tsctl shownodes up`
[root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs
function getDownCesNodeList
{
  typeset sourceFile="mmcesfuncs.sh"
  [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x
  $mmTRACE_ENTER "$*"
  typeset upnodefile=${cmdTmpDir}upnodefile
  typeset downNodeList

  # get all CES nodes
  $sort -o $nodefile $mmfsCesNodes.dae
  $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile
  downNodeList=$($comm -23 $nodefile $upnodefile)
  print -- $downNodeList
}
#----- end of function getDownCesNodeList --------------------

## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated
[root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail
shas0251-opa.rc.int.colorado.edu
shas0252-opa.rc.int.colorado.edu
shas0253-opa.rc.int.colorado.edu
shas0254-opa.rc.int.colorado.edu
shas0255-opa.rc.int.colorado.edu
shas0256-opa.rc.int.colorado.edu
shas0257-opa.rc.int.colorado.edu
shas0258-opa.rc.int.colorado.edu
shas0259-opa.rc.int.colorado.edu
shas0260-opa.rc.int.col[root at sgate2 ~]#

## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`.

On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote:

I think I'm having the same issue described here:

http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html

Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804)

We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS.
Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
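A compact way to re-run the truncation check from the investigation above, as a diagnostic sketch only (the '-opa' hostname filter is specific to the cluster shown in these posts), is to compare the node count GPFS itself reports with what tsctl actually returns:

---
# Nodes the cluster knows about (site-specific hostname pattern)
mmlscluster | grep '\-opa' | wc -l

# Nodes tsctl claims are up, plus the raw size of its output
tsctl shownodes up | tr ',' '\n' | wc -l
tsctl shownodes up | wc -c

# If the byte count sits at a hard ceiling (3983 in the posts above) and
# the last entry is cut off mid-hostname, the node list is truncated before
# getDownCesNodeList in /usr/lpp/mmfs/bin/mmcesfuncs ever sees it, so CES
# will wrongly treat the missing nodes as down.
tsctl shownodes up | tr ',' '\n' | tail -n 1
---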
URL: 
From olaf.weiser at de.ibm.com Wed Feb 1 09:28:25 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Wed, 1 Feb 2017 09:28:25 +0000
Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes
In-Reply-To: Message-ID:

Pmr opened... send the # directly to u

Sent from IBM Verse

Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ---
From: "Mathias Dietz" To: "gpfsug main discussion list" Date: Wed. 01.02.2017 10:05 Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime, until the fix is available, we will use the PMR to propose/discuss potential workarounds.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: From dave.bond at diamond.ac.uk Thu Feb 2 10:08:06 2017 From: dave.bond at diamond.ac.uk (dave.bond at diamond.ac.uk) Date: Thu, 2 Feb 2017 10:08:06 +0000 Subject: [gpfsug-discuss] GPFS meta data performance monitoring Message-ID: Hello Mailing list, Beyond mmpmon how are people monitoring their metadata performance? There are two parts I imagine to this question, the first being how do you get a detailed snapshot view of performance read and write etc. Then the second is does anyone collate this information for historical graphing, if so thoughts and ideas are very welcome. mmpmon is certainly useful but I would like to dig a little deeper, ideally without turning anything on that could impact stability or performance of a production file system. Dave (Diamond Light Source) -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom From olaf.weiser at de.ibm.com Thu Feb 2 15:55:44 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 2 Feb 2017 16:55:44 +0100 Subject: [gpfsug-discuss] GPFS meta data performance monitoring In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Thu Feb 2 17:03:51 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 2 Feb 2017 12:03:51 -0500 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears Message-ID: Is there a way to accomplish this so the rest of cluster knows its down? My state now: [root at cl001 ~]# mmgetstate -aL cl004.cl.arc.internal: mmremote: determineMode: Missing file /var/mmfs/gen/mmsdrfs. cl004.cl.arc.internal: mmremote: This node does not belong to a GPFS cluster. mmdsh: cl004.cl.arc.internal remote shell process had return code 1. Node number Node name Quorum Nodes up Total nodes GPFS state Remarks ------------------------------------------------------------------------------------ 1 cl001 5 7 8 active quorum node 2 cl002 5 7 8 active quorum node 3 cl003 5 7 8 active quorum node 4 cl004 0 0 8 unknown quorum node 5 cl005 5 7 8 active quorum node 6 cl006 5 7 8 active quorum node 7 cl007 5 7 8 active quorum node 8 cl008 5 7 8 active quorum node cl004 we think has an internal raid controller blowout -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Feb 2 17:28:22 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 2 Feb 2017 18:28:22 +0100 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... 
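Coming back to the metadata-monitoring question above: short of turning anything new on, mmpmon itself can be polled for metadata-related counters. A minimal polling sketch (the interval and repeat count are arbitrary, and mmpmon must run as root on a cluster node):

---
# fs_io_s reports cumulative opens (_oc_), closes (_cc_), readdirs (_dir_)
# and inode updates (_iu_) per mounted file system; -p gives parseable
# output, here sampled 6 times at 10-second intervals. Diffing successive
# samples gives per-interval metadata operation rates.
echo fs_io_s | /usr/lpp/mmfs/bin/mmpmon -p -r 6 -d 10000
---

For historical graphing, feeding those deltas into whatever time-series tooling is already in place is one option; if the 4.2 performance monitoring collector (ZIMon) is enabled, mmperfmon query may also be worth a look, but that is a site-specific choice.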
URL: 
From jonathon.anderson at colorado.edu Thu Feb 2 17:44:45 2017
From: jonathon.anderson at colorado.edu (Jonathon A Anderson)
Date: Thu, 2 Feb 2017 17:44:45 +0000
Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes
In-Reply-To: References: Message-ID: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu>

Any chance I can get that PMR# also, so I can reference it in my DDN case?

~jonathon

From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Pmr opened... send the # directly to u

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From olaf.weiser at de.ibm.com Thu Feb 2 18:02:22 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 2 Feb 2017 19:02:22 +0100
Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes
In-Reply-To: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu>
References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu>
Message-ID:

An HTML attachment was scrubbed...
URL: 
From valdis.kletnieks at vt.edu Thu Feb 2 19:28:05 2017
From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu)
Date: Thu, 02 Feb 2017 14:28:05 -0500
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears
In-Reply-To: References: Message-ID: <15501.1486063685@turing-police.cc.vt.edu>

On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said:

> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's why you
> see a message like this..
> have you reinstalled that node / any backup/restore thing ?

The internal RAID controller died a horrid death and basically took all the OS partitions with it. So the node was just sort of limping along, where the mmfsd process was still coping because it wasn't doing any I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work because that requires accessing stuff in /var.

At that point, it starts getting tempting to just use ipmitool from another node to power the comatose one down - but that often causes a cascade of other issues while things are stuck waiting for timeouts.

From aaron.s.knister at nasa.gov Thu Feb 2 19:33:41 2017
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Thu, 2 Feb 2017 14:33:41 -0500
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears
In-Reply-To: <15501.1486063685@turing-police.cc.vt.edu> References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID:

You could forcibly expel the node (one of my favorite GPFS commands):

mmexpelnode -N $nodename

and then power it off after the expulsion is complete and then do

mmexpelnode -r -N $nodename

which will allow it to join the cluster next time you try and start up GPFS on it. You'll still likely have to go through recovery but you'll skip the part where GPFS wonders where the node went prior to it expelling it.

-Aaron

On 2/2/17 2:28 PM, valdis.kletnieks at vt.edu wrote:
> On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said:
>
>> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's why you
>> see a message like this..
>> have you reinstalled that node / any backup/restore thing ?
>
> The internal RAID controller died a horrid death and basically took
> all the OS partitions with it. So the node was just sort of limping along,
> where the mmfsd process was still coping because it wasn't doing any
> I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work
> because that requires accessing stuff in /var.
>
> At that point, it starts getting tempting to just use ipmitool from
> another node to power the comatose one down - but that often causes
> a cascade of other issues while things are stuck waiting for timeouts.
>
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From olaf.weiser at de.ibm.com Thu Feb 2 21:28:01 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 2 Feb 2017 22:28:01 +0100
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears
In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID:

An HTML attachment was scrubbed...
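The expel-first sequence above can be wrapped into a small script; a rough sketch, where the node name and BMC address are placeholders and working IPMI credentials are assumed:

---
node=badnode.example.com      # hypothetical GPFS node name
bmc=badnode-ipmi.example.com  # hypothetical BMC/IPMI address

# Expel the unresponsive node so the rest of the cluster stops waiting on it
mmexpelnode -N $node

# Power it off out-of-band once the expel has completed
ipmitool -I lanplus -H $bmc -U admin -P secret chassis power off

# After the hardware is repaired, clear the expel so the node can rejoin
mmexpelnode -r -N $node
---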
URL: 
From andreas.mattsson at maxiv.lu.se Fri Feb 3 12:46:30 2017
From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson)
Date: Fri, 3 Feb 2017 12:46:30 +0000
Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES
Message-ID:

I'm having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1. The NFS clients are up-to-date CentOS and Debian machines. All Scale servers and NFS clients have correct date and time via NTP.

Creating a file, for instance 'touch file00', gives a correct timestamp.
Moving the file, 'mv file00 file01', gives a correct timestamp.
Copying the file, 'cp file01 file02', gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar.

This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. Doing the same operation over NFS to other NFS servers works correctly; it is only when operating on the NFS share from the Spectrum Scale CES that the issue occurs.

Has anyone seen this before?

Regards,
Andreas Mattsson
_____________________________________________

Andreas Mattsson
Systems Engineer

MAX IV Laboratory
Lund University
P.O. Box 118, SE-221 00 Lund, Sweden
Visiting address: Fotongatan 2, 225 94 Lund
Mobile: +46 706 64 95 44
andreas.mattsson at maxiv.se
www.maxiv.se

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 5610 bytes
Desc: image001.png
URL: 
From ulmer at ulmer.org Fri Feb 3 13:05:37 2017
From: ulmer at ulmer.org (Stephen Ulmer)
Date: Fri, 3 Feb 2017 08:05:37 -0500
Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES
In-Reply-To: References: Message-ID:

That's a cool one. :)

What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03')?

--
Stephen

> On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote:
>
> I'm having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1
> The NFS clients are up to date Centos and Debian machines.
> All Scale servers and NFS clients have correct date and time via NTP.
>
> Creating a file, for instance 'touch file00', gives correct timestamp.
> Moving the file, 'mv file00 file01', gives correct timestamp
> Copying the file, 'cp file01 file02', gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar.
>
> This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp.
> Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs.
>
> Have anyone seen this before?
>
> Regards,
> Andreas Mattsson
> _____________________________________________
>
>
> Andreas Mattsson
> Systems Engineer
>
> MAX IV Laboratory
> Lund University
> P.O. Box 118, SE-221 00 Lund, Sweden
> Visiting address: Fotongatan 2, 225 94 Lund
> Mobile: +46 706 64 95 44
> andreas.mattsson at maxiv.se
> www.maxiv.se
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
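One way to narrow down where the bogus timestamp is coming from, sketched here with placeholder mount paths, is to stat the same file through the NFS mount and directly on a protocol node with the native GPFS mount:

---
# On the NFS client (placeholder path)
stat --format='%y  %n' /mnt/nfs/file02

# On a CES/protocol node, against the native GPFS path (placeholder)
stat --format='%y  %n' /gpfs/fs0/export/file02

# If the native mount shows a sane mtime, the bad value is being produced
# or cached on the NFS side (ganesha attribute handling or the client's
# attribute cache); re-checking after the attribute cache expires, or
# mounting the client with -o noac for a test, helps tell those apart.
---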
URL: 
From andreas.mattsson at maxiv.lu.se Fri Feb 3 13:19:37 2017
From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson)
Date: Fri, 3 Feb 2017 13:19:37 +0000
Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES
In-Reply-To: References: Message-ID:

That works.

'touch test100'

Feb 3 14:16 test100

'cp test100 test101'

Feb 3 14:16 test100
Apr 21 2027 test101

'touch -r test100 test101'

Feb 3 14:16 test100
Feb 3 14:16 test101

/Andreas

That's a cool one. :)

What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03')?

--
Stephen

On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote:

I'm having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1
The NFS clients are up to date Centos and Debian machines.
All Scale servers and NFS clients have correct date and time via NTP.

Creating a file, for instance 'touch file00', gives correct timestamp.
Moving the file, 'mv file00 file01', gives correct timestamp
Copying the file, 'cp file01 file02', gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar.

This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp.
Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs.

Have anyone seen this before?

Regards,
Andreas Mattsson
_____________________________________________

Andreas Mattsson
Systems Engineer

MAX IV Laboratory
Lund University
P.O. Box 118, SE-221 00 Lund, Sweden
Visiting address: Fotongatan 2, 225 94 Lund
Mobile: +46 706 64 95 44
andreas.mattsson at maxiv.se
www.maxiv.se

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From ulmer at ulmer.org Fri Feb 3 13:35:21 2017
From: ulmer at ulmer.org (Stephen Ulmer)
Date: Fri, 3 Feb 2017 08:35:21 -0500
Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES
In-Reply-To: References: Message-ID:

Does the cp actually complete? As in, does it copy all of the blocks? What's the exit code?

A cp'd file should have "new" metadata. That is, it should have its own dates, owners, etc. (not necessarily copied from the source file).

I ran 'strace cp foo1 foo2', and it was pretty instructive; maybe that would get you more info. On CentOS strace is in its own package, YMMV.

--
Stephen

> On Feb 3, 2017, at 8:19 AM, Andreas Mattsson > wrote:
>
> That works.
>
> 'touch test100'
>
> Feb 3 14:16 test100
>
> 'cp test100 test101'
>
> Feb 3 14:16 test100
> Apr 21 2027 test101
>
> 'touch -r test100 test101'
>
> Feb 3 14:16 test100
> Feb 3 14:16 test101
>
> /Andreas
>
>
> That's a cool one. :)
>
> What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03')?
>
> --
> Stephen
>
>
>
> On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote:
>
> I'm having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1
> The NFS clients are up to date Centos and Debian machines.
> All Scale servers and NFS clients have correct date and time via NTP.
>
> Creating a file, for instance 'touch file00', gives correct timestamp.
> Moving the file, 'mv file00 file01', gives correct timestamp
> Copying the file, 'cp file01 file02', gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar.
>
> This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp.
> Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs.
>
> Have anyone seen this before?
>
> Regards,
> Andreas Mattsson
> _____________________________________________
>
>
> Andreas Mattsson
> Systems Engineer
>
> MAX IV Laboratory
> Lund University
> P.O. Box 118, SE-221 00 Lund, Sweden
> Visiting address: Fotongatan 2, 225 94 Lund
> Mobile: +46 706 64 95 44
> andreas.mattsson at maxiv.se
> www.maxiv.se
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
-------------- next part --------------
An HTML attachment was scrubbed...
URL: 
From eric.wonderley at vt.edu Fri Feb 3 13:46:49 2017
From: eric.wonderley at vt.edu (J. Eric Wonderley)
Date: Fri, 3 Feb 2017 08:46:49 -0500
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears
In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID:

Well, we got it into the down state using mmsdrrestore -p to recover stuff into /var/mmfs/gen on cl004. Anyhow, we ended up with unknown for cl004 when it powered off. Short of removing the node, unknown is the state you get. Unknown seems stable for a hopefully short outage of cl004.

Thanks

On Thu, Feb 2, 2017 at 4:28 PM, Olaf Weiser wrote:

> many ways lead to Rome .. and I agree .. mmexpelnode is a nice command ..
> another approach...
> power it off .. (not reachable by ping) .. mmdelnode ... power on/boot ...
> mmaddnode ..
>
>
>
> From: Aaron Knister
> To:
> Date: 02/02/2017 08:37 PM
> Subject: Re: [gpfsug-discuss] proper gpfs shutdown when node disappears
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> You could forcibly expel the node (one of my favorite GPFS commands):
>
> mmexpelnode -N $nodename
>
> and then power it off after the expulsion is complete and then do
>
> mmexpelnode -r -N $nodename
>
> which will allow it to join the cluster next time you try and start up
> GPFS on it. You'll still likely have to go through recovery but you'll
> skip the part where GPFS wonders where the node went prior to it
> expelling it.
>
> -Aaron
>
> On 2/2/17 2:28 PM, valdis.kletnieks at vt.edu wrote:
> > On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said:
> >
> >> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's why you
> >> see a message like this..
> >> have you reinstalled that node / any backup/restore thing ?
> >
> > The internal RAID controller died a horrid death and basically took
> > all the OS partitions with it. So the node was just sort of limping along,
> > where the mmfsd process was still coping because it wasn't doing any
> > I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work
> > because that requires accessing stuff in /var.
> > > > At that point, it starts getting tempting to just use ipmitool from > > another node to power the comatose one down - but that often causes > > a cascade of other issues while things are stuck waiting for timeouts. > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Feb 3 14:06:58 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 3 Feb 2017 15:06:58 +0100 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID: An HTML attachment was scrubbed... URL: From service at metamodul.com Fri Feb 3 16:13:35 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Fri, 3 Feb 2017 17:13:35 +0100 (CET) Subject: [gpfsug-discuss] Mount of file set Message-ID: <738987264.170895.1486138416028@email.1und1.de> An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Feb 3 20:03:18 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 3 Feb 2017 20:03:18 +0000 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? Message-ID: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> I can?t seem to find some of these on fix central, have they been pulled? Specifically, I want: Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux https://www-945.ibm.com/support/fixcentral/swg/selectFixes?product=ibm%2FStorageSoftware%2FIBM+Spectrum+Scale&fixids=Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux&source=myna&myns=s033&mynp=OCSTXKQY&mync=E&cm_sp=s033-_-OCSTXKQY-_-E&function=fixId&parent=Software%20defined%20storage Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: IBM My Notifications Date: Monday, January 30, 2017 at 10:49 AM To: "Oesterlin, Robert" Subject: [EXTERNAL] IBM My notifications - Storage [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/headset.png] Check out the IBM Support beta [BM] [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/megaphone-m.png] Here are your weekly updates from IBM My Notifications. Contents: IBM Spectrum Scale IBM Spectrum Scale Spectrum_Scale_Protocols_Advanced-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. 
Spectrum_Scale_Standard-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. The pre-built SELinux policy within RHEL7.x conflicts with IBM Spectrum Scale NFS Ganesha [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] Ganesha running on CES nodes with seLinux in enforcing mode and selinux-policy-targeted-3.13.1-60.el7_2.7 installed causes the start of ganesha to fail and thus all CES nodes get UNHEALTHY. See https://bugzilla.redhat.com/show_bug.cgi?id=1383784 Note: IBM Spectrum Scale does not support CES with seLinux in enforcing mode Spectrum_Scale_Protocols_Data_Management-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Data_Management-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Advanced-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Data_Management-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. 
Spectrum_Scale_Data_Management-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-x86_64-Windows [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-x86_64-Windows [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/information.png] Your support Notifications display in English by default. Machine translation based on your IBM profile language setting is added if you specify this option in Delivery preferences within My Notifications. (Note: Not all languages are available at this time, and the English version always takes precedence over the machine translated version.) Manage your My Notifications subscriptions, or send questions and comments. Subscribe or Unsubscribe | Feedback Follow us on Twitter. 
To ensure proper delivery please add mynotify at stg.events.ihost.com to your address book. You received this email because you are subscribed to IBM My Notifications as: oester at gmail.com Please do not reply to this message as it is generated by an automated service machine. ?International Business Machines Corporation 2017. All rights reserved. -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Feb 3 19:57:29 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 3 Feb 2017 20:57:29 +0100 Subject: [gpfsug-discuss] Mount of file set In-Reply-To: <738987264.170895.1486138416028@email.1und1.de> References: <738987264.170895.1486138416028@email.1und1.de> Message-ID: An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Sun Feb 5 14:02:57 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Sun, 5 Feb 2017 14:02:57 +0000 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 912 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 1463 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 6365 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 2881 bytes Desc: not available URL: From martin at uni-mainz.de Mon Feb 6 11:15:31 2017 From: martin at uni-mainz.de (Christoph Martin) Date: Mon, 6 Feb 2017 12:15:31 +0100 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central?
In-Reply-To: References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: I have already updated two GPFS installations with 4.2.2.2 with a download from Jan, 31. What issues with Ganesha do I have to expect until the fixed version is available? How can I see that the downloads have changed and are fixed? The information on the download site was: > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux-install (537.58 MB) > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux-install.md5 (97 bytes) > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux.readme.html (24.59 KB) Christoph Am 05.02.2017 um 15:02 schrieb Achim Rehor: > Yes, they have been pulled, all protocol 4.2.2.2 packages. there wsa an > issue with ganesha > > It was expected to see them back before the weekend, which is obviously > not the case. > So, i guess, a little patience is needed. -- ============================================================================ Christoph Martin, Leiter Unix-Systeme Zentrum f?r Datenverarbeitung, Uni-Mainz, Germany Anselm Franz von Bentzel-Weg 12, 55128 Mainz Telefon: +49(6131)3926337 Instant-Messaging: Jabber: martin at jabber.uni-mainz.de (Siehe http://www.zdv.uni-mainz.de/4010.php) -------------- next part -------------- A non-text attachment was scrubbed... Name: martin.vcf Type: text/x-vcard Size: 421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: OpenPGP digital signature URL: From bbanister at jumptrading.com Mon Feb 6 14:54:11 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 6 Feb 2017 14:54:11 +0000 Subject: [gpfsug-discuss] Mount of file set In-Reply-To: References: <738987264.170895.1486138416028@email.1und1.de> Message-ID: Is there an RFE for this yet that we can all vote up? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Olaf Weiser Sent: Friday, February 03, 2017 1:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Mount of file set Hi Ha-Jo, we do the same here .. so no news so far as I know... gruss vom laff From: Hans-Joachim Ehlers > To: gpfsug main discussion list > Date: 02/03/2017 05:14 PM Subject: [gpfsug-discuss] Mount of file set Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Moin Moin, is it nowaday possible to mount directly a GPFS Fileset ? In the old day i mounted the whole GPFS to a Mount point with 000 rights and did a Sub Mount of the needed Fileset. It works but it is ugly. -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. 
The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Tue Feb 7 18:01:41 2017 From: mweil at wustl.edu (Matt Weil) Date: Tue, 7 Feb 2017 12:01:41 -0600 Subject: [gpfsug-discuss] stuck GracePeriodThread Message-ID: running cnfs # rpm -qa | grep gpfs gpfs.gpl-4.1.1-7.noarch gpfs.base-4.1.1-7.x86_64 gpfs.docs-4.1.1-7.noarch gpfs.gplbin-3.10.0-327.18.2.el7.x86_64-4.1.1-7.x86_64 pcp-pmda-gpfs-3.10.6-2.el7.x86_64 gpfs.ext-4.1.1-7.x86_64 gpfs.gskit-8.0.50-47.x86_64 gpfs.msg.en_US-4.1.1-7.noarch === mmdiag: waiters === 0x7F95F0008CF0 ( 19022) waiting 89.838355000 seconds, GracePeriodThread: delaying for 40.161645000 more seconds, reason: delayed do these cause issues and is there any other way besides stopping and restarting mmfsd to get rid of them. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From TROPPENS at de.ibm.com Wed Feb 8 08:36:45 2017 From: TROPPENS at de.ibm.com (Ulf Troppens) Date: Wed, 8 Feb 2017 09:36:45 +0100 Subject: [gpfsug-discuss] Spectrum Scale User Meeting - March 8+9 , 2017 - Ehningen, Germany Message-ID: There is an IBM organized Spectrum Scale User Meeting in Germany. Though, agenda and spirit are very close to user group organized events. Conference language is German. This is a two-day event. There is an introduction day for Spectrum Scale beginners a day before on March 7. See here for agenda and registration: https://www.spectrumscale.org/spectrum-scale-user-meeting-march-89-2027-ehningen-germany/ -- IBM Spectrum Scale Development - Client Engagements & Solutions Delivery Consulting IT Specialist Author "Storage Networks Explained" IBM Deutschland Research & Development GmbH Vorsitzende des Aufsichtsrats: Martina Koederitz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Wed Feb 8 08:48:06 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Wed, 8 Feb 2017 09:48:06 +0100 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: An HTML attachment was scrubbed... URL: From p.childs at qmul.ac.uk Thu Feb 9 14:30:18 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 9 Feb 2017 14:30:18 +0000 Subject: [gpfsug-discuss] AFM OpenFiles Message-ID: We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. 
in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs From Mark.Bush at siriuscom.com Thu Feb 9 14:40:03 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Thu, 9 Feb 2017 14:40:03 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> Message-ID: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Has any headway been made on this issue? I just ran into it as well. The CES ip addresses just disappeared from my two protocol nodes (4.2.2.0). From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Thursday, February 2, 2017 at 12:02 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes pls contact me directly olaf.weiser at de.ibm.com Mit freundlichen Gr??en / Kind regards Olaf Weiser EMEA Storage Competence Center Mainz, German / IBM Systems, Storage Platform, ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland IBM Allee 1 71139 Ehningen Phone: +49-170-579-44-66 E-Mail: olaf.weiser at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 From: Jonathon A Anderson To: gpfsug main discussion list Date: 02/02/2017 06:45 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Any chance I can get that PMR# also, so I can reference it in my DDN case? ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Pmr opened... send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Mathias Dietz" An: "gpfsug main discussion list" Datum: Mi. 01.02.2017 10:05 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. 
Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. 
~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? 
Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
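A few commands that may help pin down where the addresses actually end up after a reboot. This is only a sketch: the interface name bond0 and the 10.225.71.x range are the ones used in this thread, and the exact output layout can vary between releases.

---
# Which node, if any, currently holds each CES address according to GPFS
mmces address list

# CES view of the protocol nodes and their service state
mmces node list
mmces state show -a

# Whether the address was actually plumbed on the expected interface on this node
ip -4 addr show bond0 | grep '10\.225\.71\.'
---

If the addresses stay unassigned in mmces address list while the interface and the services look healthy, the node-selection logic is a more likely culprit than the network itself.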
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. 
If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Thu Feb 9 15:10:58 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Thu, 9 Feb 2017 20:40:58 +0530 Subject: [gpfsug-discuss] AFM OpenFiles In-Reply-To: References: Message-ID: What is the version of GPFS ? There was an issue fixed in Spectrum Scale 4.2.2 for file count(file_nr) leak. This issue mostly happens on Linux kernel version >= 3.6. ~Venkat (vpuvvada at in.ibm.com) From: Peter Childs To: gpfsug main discussion list Date: 02/09/2017 08:00 PM Subject: [gpfsug-discuss] AFM OpenFiles Sent by: gpfsug-discuss-bounces at spectrumscale.org We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From p.childs at qmul.ac.uk Thu Feb 9 15:34:25 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 9 Feb 2017 15:34:25 +0000 Subject: [gpfsug-discuss] AFM OpenFiles In-Reply-To: References: , Message-ID: 4.2.1.1 or CentOs 7. So that might account for it. Thanks Peter Childs ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Venkateswara R Puvvada Sent: Thursday, February 9, 2017 3:10:58 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM OpenFiles What is the version of GPFS ? There was an issue fixed in Spectrum Scale 4.2.2 for file count(file_nr) leak. This issue mostly happens on Linux kernel version >= 3.6. ~Venkat (vpuvvada at in.ibm.com) From: Peter Childs To: gpfsug main discussion list Date: 02/09/2017 08:00 PM Subject: [gpfsug-discuss] AFM OpenFiles Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. 
in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From olaf.weiser at de.ibm.com Thu Feb 9 15:34:55 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 9 Feb 2017 16:34:55 +0100 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Message-ID: An HTML attachment was scrubbed... URL: From jonathon.anderson at colorado.edu Thu Feb 9 17:32:55 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Thu, 9 Feb 2017 17:32:55 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Message-ID: I was thinking that whether or not CES knows your nodes are up or not is dependent on how recently they were added to the cluster; but I?m starting to wonder if it?s dependent on the order in which nodes are brought up. Presumably you are running your CES nodes in a GPFS cluster with a large number of nodes? What happens if you bring your CES nodes up earlier (e.g., before your compute nodes)? ~jonathon From: on behalf of "Mark.Bush at siriuscom.com" Reply-To: gpfsug main discussion list Date: Thursday, February 9, 2017 at 7:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Has any headway been made on this issue? I just ran into it as well. The CES ip addresses just disappeared from my two protocol nodes (4.2.2.0). From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Thursday, February 2, 2017 at 12:02 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes pls contact me directly olaf.weiser at de.ibm.com Mit freundlichen Gr??en / Kind regards Olaf Weiser EMEA Storage Competence Center Mainz, German / IBM Systems, Storage Platform, ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland IBM Allee 1 71139 Ehningen Phone: +49-170-579-44-66 E-Mail: olaf.weiser at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert Janzen, Dr. 
Christian Keller, Ivo Koerner, Markus Koerner Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 From: Jonathon A Anderson To: gpfsug main discussion list Date: 02/02/2017 06:45 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Any chance I can get that PMR# also, so I can reference it in my DDN case? ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Pmr opened... send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Mathias Dietz" An: "gpfsug main discussion list" Datum: Mi. 01.02.2017 10:05 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. 
Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. 
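One way to cross-check the truncation without access to the tsctl source is to compare the number of active nodes reported by mmgetstate with the number of names that survive in the tsctl output. A rough sketch, assuming the usual three-column mmgetstate -a layout (node number, node name, GPFS state); column positions may differ on other releases.

---
# Nodes the daemon reports as active (header and separator lines never have "active" in column 3)
mmgetstate -a | awk '$3 == "active"' | wc -l

# Nodes that survive in the tsctl output before it is cut off
tsctl shownodes up | tr ',' '\n' | wc -l
---

If the first number is close to the 403 from mmlscluster while the second stays pinned around 120, that is one more data point for a fixed-size output buffer in tsctl.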
Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? 
by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
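As an aside for anyone who lands on this thread later, a quick way to see what CES itself currently believes (a plain sketch; confirm the flags against the documentation for your level) is something like:
---
# CES-enabled nodes and their attributes
mmces node list
mmlscluster --ces

# where the CES addresses are currently assigned, if anywhere
mmces address list

# watch the network monitor retry the assignment
tail -f /var/adm/ras/mmfs.log.latest | grep mmcesnetworkmonitor
---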
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. 
If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Fri Feb 10 16:33:26 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Fri, 10 Feb 2017 16:33:26 +0000 Subject: [gpfsug-discuss] Reverting to older versions Message-ID: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> Is there a documented way to go down a level of GPFS code?. For example since 4.2.2.x has broken my protocol nodes, is there a straight forward way to revert back to 4.2.1.x?. Can I just stop my cluster remove RPMS and add older version RPMS? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From S.J.Thompson at bham.ac.uk Fri Feb 10 16:51:43 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 10 Feb 2017 16:51:43 +0000 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> References: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> Message-ID: Is it the 4.2.2 code or the protocol packages that broke? We found the 4.2.2.0 SMB packages don't work for us. We just reverted to the older SMB packages. Support have advised us to try the 4.2.2.1 packages, but it means a service break to upgrade protocol packages so we are trying to schedule in. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Mark.Bush at siriuscom.com [Mark.Bush at siriuscom.com] Sent: 10 February 2017 16:33 To: gpfsug main discussion list Subject: [gpfsug-discuss] Reverting to older versions Is there a documented way to go down a level of GPFS code?. For example since 4.2.2.x has broken my protocol nodes, is there a straight forward way to revert back to 4.2.1.x?. Can I just stop my cluster remove RPMS and add older version RPMS? [id:image001.png at 01D2709D.6EF65720] Mark R. 
Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From olaf.weiser at de.ibm.com Fri Feb 10 16:57:23 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Feb 2017 17:57:23 +0100 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> References: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 8745 bytes Desc: not available URL: From duersch at us.ibm.com Fri Feb 10 17:05:23 2017 From: duersch at us.ibm.com (Steve Duersch) Date: Fri, 10 Feb 2017 17:05:23 +0000 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Fri Feb 10 17:08:48 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Fri, 10 Feb 2017 17:08:48 +0000 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: References: Message-ID: Excellent. Thanks to all. From: on behalf of Steve Duersch Reply-To: gpfsug main discussion list Date: Friday, February 10, 2017 at 11:05 AM To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Reverting to older versions See chapter 12 of the Concepts, Planning, and Installation guide. There is a section on reverting to a previous version. https://www.ibm.com/support/knowledgecenter/STXKQY/ibmspectrumscale_content.html Steve Duersch Spectrum Scale 845-433-7902 IBM Poughkeepsie, New York ----- Original message ----- From: gpfsug-discuss-request at spectrumscale.org Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: gpfsug-discuss Digest, Vol 61, Issue 18 Date: Fri, Feb 10, 2017 11:52 AM Message: 1 Date: Fri, 10 Feb 2017 16:33:26 +0000 From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Subject: [gpfsug-discuss] Reverting to older versions Message-ID: <484E02BE-463F-499D-90B8-47E6F10753E3 at siriuscom.com> Content-Type: text/plain; charset="utf-8" Is there a documented way to go down a level of GPFS code?. For example since 4.2.2.x has broken my protocol nodes, is there a straight forward way to revert back to 4.2.1.x?. 
Can I just stop my cluster remove RPMS and add older version RPMS? This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Fri Feb 10 21:56:55 2017 From: zgiles at gmail.com (Zachary Giles) Date: Fri, 10 Feb 2017 16:56:55 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression Message-ID: Hello All, I've been seeing some less than desirable behavior with mmap and compression in GPFS. Curious if others see similar or have any ideas if this is accurate.. The guys here want me to open an IBM ticket, but I figured I'd see if anyone has had this experience before. We have an internally developed app that runs on our cluster referencing data sitting in GPFS. It is using mmap to access the files due to a library we're using that requires it. If we run the app against some data on GPFS, it performs well.. finishing in a few minutes time -- Great. However, if we compress the file (in GPFS), the app is still running after 2 days time. stracing the app shows that is polling on a file descriptor, forever.. as if a data block is still pending. I know mmap is supported with compression according to the manual (with some stipulations), and that performance is expected to be much less since it's more large-block oriented due to decompressed in groups.. no problem. But it seems like some data should get returned. I'm surprised to find that a very small amount of data is sitting in the buffers (mmfsadm dump buffers) in reference to the inodes. The decompression thread is running continuously, while the app is still polling for data from memory and sleeping, retrying, sleeping, repeat. What I believe is happening is that the 4k pages are being pulled out of large decompression groups from an mmap read request, put in the buffer, then the compression group data is thrown away since it has the result it wants, only to need another piece of data that would have been in that group slightly later, which is recalled, put in the buffer.. etc. Thus an infinite slowdown. Perhaps also the data is expiring out of the buffer before the app has a chance to read it. I can't tell. In any case, the app makes zero progress. I tried without our app, using fio.. mmap on an uncompressed file with 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not impressive). However, on a compressed file it is only 20KB/s max. ( far less impressive ). Reading a file using aio etc is over 3GB/s on a single thread without even trying. What do you think? Anyone see anything like this? Perhaps there are some tunings to waste a bit more memory on cached blocks rather than make decompression recycle? 
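For reference, the fio comparison mentioned a few sentences above can be expressed roughly like this; the file path and size are placeholders, and the compressed case assumes the test file was compressed beforehand (for example with mmchattr --compression yes):
---
# single thread, queue depth 1, 4k random reads through mmap
fio --name=mmap-randread --ioengine=mmap --rw=randread --bs=4k \
    --iodepth=1 --numjobs=1 --filename=/gpfs/fs0/testfile --size=20g
---
Running the same job once against an uncompressed file and once against a compressed copy of the same data is what produces the ~76MB/s versus ~20KB/s gap described here.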
I've searched back the archives a bit. There's a May 2013 thread about slowness as well. I think we're seeing much much less than that. Our page pools are of decent size. Its not just slowness, it's as if the app never gets a block back at all. ( We could handle slowness .. ) Thanks. Open to ideas.. -Zach Giles From mweil at wustl.edu Sat Feb 11 18:32:54 2017 From: mweil at wustl.edu (Matt Weil) Date: Sat, 11 Feb 2017 12:32:54 -0600 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later In-Reply-To: References: Message-ID: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> https://access.redhat.com/solutions/2437991 I ran into this issue the other day even with the echo "4096" > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that larger to get to the 2M IO size. So I never really seen this issue until the other day. I may have triggered it myself because I was adding new storage. Was wondering what version of GPFS fixes this. I really do not want to step back to and older kernel version. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From leoluan at us.ibm.com Sat Feb 11 22:23:24 2017 From: leoluan at us.ibm.com (Leo Luan) Date: Sat, 11 Feb 2017 22:23:24 +0000 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression Message-ID: An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sun Feb 12 17:30:38 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sun, 12 Feb 2017 18:30:38 +0100 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later In-Reply-To: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> References: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> Message-ID: The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -jf On Sat, Feb 11, 2017 at 7:32 PM, Matt Weil wrote: > https://access.redhat.com/solutions/2437991 > > I ran into this issue the other day even with the echo "4096" > > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that > larger to get to the 2M IO size. So I never really seen this issue > until the other day. I may have triggered it myself because I was > adding new storage. > > Was wondering what version of GPFS fixes this. I really do not want to > step back to and older kernel version. > > Thanks > Matt > > ________________________________ > The materials in this message are private and may contain Protected > Healthcare Information or other information of a sensitive nature. If you > are not the intended recipient, be advised that any unauthorized use, > disclosure, copying or the taking of any action in reliance on the contents > of this information is strictly prohibited. 
If you have received this email > in error, please immediately notify the sender via telephone or return mail. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Feb 13 15:46:27 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 13 Feb 2017 09:46:27 -0600 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later In-Reply-To: References: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> Message-ID: excellent Thanks. On 2/12/17 11:30 AM, Jan-Frode Myklebust wrote: The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -jf On Sat, Feb 11, 2017 at 7:32 PM, Matt Weil > wrote: https://access.redhat.com/solutions/2437991 I ran into this issue the other day even with the echo "4096" > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that larger to get to the 2M IO size. So I never really seen this issue until the other day. I may have triggered it myself because I was adding new storage. Was wondering what version of GPFS fixes this. I really do not want to step back to and older kernel version. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Feb 13 15:49:07 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 13 Feb 2017 15:49:07 +0000 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later Message-ID: Alas, I ran into this as well ? only seems to impact some my older JBOD storage. The fix is vague, should I be worried about this turning up later, or will it happen right away? 
(if it does) Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Matt Weil Reply-To: gpfsug main discussion list Date: Monday, February 13, 2017 at 9:46 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Feb 13 17:00:10 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 13 Feb 2017 17:00:10 +0000 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later Message-ID: <34F66C99-B56D-4742-8C40-B6377B914FC0@nuance.com> See this technote for an alternative fix and details: http://www-01.ibm.com/support/docview.wss?uid=isg3T1024840&acss=danl_4184_web Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Matt Weil Reply-To: gpfsug main discussion list Date: Monday, February 13, 2017 at 9:46 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Feb 13 17:27:55 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 13 Feb 2017 12:27:55 -0500 Subject: [gpfsug-discuss] mmbackup examples using policy Message-ID: Anyone have any examples of this? I have a filesystem that has 2 pools and several filesets and would like daily progressive incremental backups of its contents. I found some stuff here(nothing real close to what I wanted however): /usr/lpp/mmfs/samples/ilm I have the tsm client installed on the server nsds. Thanks much -------------- next part -------------- An HTML attachment was scrubbed... URL: From usa-principal at gpfsug.org Tue Feb 14 06:07:05 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Mon, 13 Feb 2017 22:07:05 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: References: Message-ID: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> Just a follow up reminder to save the date, April 4-5, for a two-day Spectrum Scale Users Group event hosted by NERSC in Berkeley, California. We are working on the registration form and agenda and hope to be able to share more details soon. Best, Kristy & Bob On , usa-principal-gpfsug.org wrote: > Hello all and happy new year (depending upon where you are right now > :-) ). > > We'll have more details in 2017, but for now please save the date for > a two-day users group meeting at NERSC in Berkeley, California. 
> > April 4-5, 2017 > National Energy Research Scientific Computing Center (nersc.gov) > Berkeley, California > > We look forward to offering our first two-day event in the US. > > Best, > Kristy & Bob From zgiles at gmail.com Tue Feb 14 16:10:13 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 14 Feb 2017 11:10:13 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: Message-ID: Hi Leo, I agree with your view on compression and what it should be used for, in general. The read bandwidth amplification is definitely something we're seeing. Just a little more background on the files: The files themselves are not "cold" (archive), however, they are very lightly used. The data set is thousands of files that are each 100-200GB, totaling about a PB. the read pattern is a few GB from about 20% of the files once a month. So the total read is only several TB out of a PB every month. ( approximate ). We can get a compression of about 5:1 using GPFS with these files, so we can gain back 800TB with compression. The total run time of the app (reading all those all chunks, when uncompressed) is maybe an hour total. Although leaving the files uncompressed would let the app work, there's a huge gain to be had if we can make compression work by saving ~800TB As it's such a small amount of data read each time, and also not too predictable (it's semi-random historical), and as the length of the job is short enough, it's hard to justify decompressing large chunks of the system to run 1 job. I would have to decompress 200TB to read 10TB, recompress them, and decompress a different (overlapping) 200TB next month. The compression / decompression of sizable portions of the data takes days. I think there maybe more of an issue that just performance though.. The decompression thread is running, internal file metadata is read fine, most of the file is read fine. Just at times it gets stuck.. the decompression thread is running in GPFS, the app is polling, it just never comes back with the block. I feel like there's a race condition here where a block is read, available for the app, but thrown away before the app can read it, only to be decompressed again. It's strange how some block positions are slow (expected) and others just never come back (it will poll for days on a certain address). However, reading the file in-order is fine. Is this a block caching issue? Can we tune up the amount of blocks kept? I think with mmap the blocks are not kept in page pool, correct? -Zach On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan wrote: > Hi Zachary, > > When a compressed file is mmapped, each 4K read in your tests causes the > accessed part of the file to be decompressed (in the granularity of 10 GPFS > blocks). For usual file sizes, the parts being accessed will be > decompressed and IOs speed will be normal except for the first 4K IO in each > 10-GPFS-block group. For very large files, a large percentage of small > random IOs may keep getting amplified to 10-block decompression IO for a > long time. This is probably what happened in your mmap application run. > > The suggestion is to not compress files until they have become cold (not > likely to be accessed any time soon) and avoid compressing very large files > that may be accessed through mmap later. The product already has a built-in > protection preventing compression of files that are mmapped at compression > time. 
You can add an exclude rule in the compression policy run for files > that are identified to have mmap performance issues (in case they get > mmapped after being compressed in a periodical policy run). > > Leo Luan > > From: Zachary Giles > To: gpfsug main discussion list > Date: 02/10/2017 01:57 PM > Subject: [gpfsug-discuss] Questions about mmap GPFS and compression > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ________________________________ > > > > Hello All, > > I've been seeing some less than desirable behavior with mmap and > compression in GPFS. Curious if others see similar or have any ideas > if this is accurate.. > The guys here want me to open an IBM ticket, but I figured I'd see if > anyone has had this experience before. > > We have an internally developed app that runs on our cluster > referencing data sitting in GPFS. It is using mmap to access the files > due to a library we're using that requires it. > > If we run the app against some data on GPFS, it performs well.. > finishing in a few minutes time -- Great. However, if we compress the > file (in GPFS), the app is still running after 2 days time. > stracing the app shows that is polling on a file descriptor, forever.. > as if a data block is still pending. > > I know mmap is supported with compression according to the manual > (with some stipulations), and that performance is expected to be much > less since it's more large-block oriented due to decompressed in > groups.. no problem. But it seems like some data should get returned. > > I'm surprised to find that a very small amount of data is sitting in > the buffers (mmfsadm dump buffers) in reference to the inodes. The > decompression thread is running continuously, while the app is still > polling for data from memory and sleeping, retrying, sleeping, repeat. > > What I believe is happening is that the 4k pages are being pulled out > of large decompression groups from an mmap read request, put in the > buffer, then the compression group data is thrown away since it has > the result it wants, only to need another piece of data that would > have been in that group slightly later, which is recalled, put in the > buffer.. etc. Thus an infinite slowdown. Perhaps also the data is > expiring out of the buffer before the app has a chance to read it. I > can't tell. In any case, the app makes zero progress. > > I tried without our app, using fio.. mmap on an uncompressed file with > 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not > impressive). However, on a compressed file it is only 20KB/s max. ( > far less impressive ). Reading a file using aio etc is over 3GB/s on a > single thread without even trying. > > What do you think? > Anyone see anything like this? Perhaps there are some tunings to waste > a bit more memory on cached blocks rather than make decompression > recycle? > > I've searched back the archives a bit. There's a May 2013 thread about > slowness as well. I think we're seeing much much less than that. Our > page pools are of decent size. Its not just slowness, it's as if the > app never gets a block back at all. ( We could handle slowness .. ) > > Thanks. Open to ideas.. 
> > -Zach Giles > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Zach Giles zgiles at gmail.com From zgiles at gmail.com Tue Feb 14 16:25:09 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 14 Feb 2017 11:25:09 -0500 Subject: [gpfsug-discuss] read replica fastest tuning for short distance Message-ID: Hello all, ( Making good use of the mailing list recently.. :) ) I have two datacenters that are fairly close to each other (about 0.5ms away by-the-wire) and have a fairly small pipe between them ( single 40Gbit ). There is a stretched filesystem between the datacenters, two failure groups, and replicas=2 on all data and metadata. I'm trying to ensure that clients on each side only read their local replica instead of filling the pipe with reads from the other side. While readreplica=local would make sense, text suggests that it mostly checks to see if you're in the same subnet to check for local reads. This won't work for me since there are many many subnets on each side. The newer option of readreplica=fastest looks like a good idea, except that the latency of the connection between the datacenters is so small compared to the disk latency that reads often come from the wrong side. I've tried tuning fastestPolicyCmpThreshold down to 5 and fastestPolicyMinDiffPercent down to 10, but I still see reads from both sides. Does anyone have any pointers for tuning read replica using fastest on close-by multidatacenter installs to help ensure reads are only from one side? Any numbers that have been shown to work? I haven't been able to find a way to inspect the GPFS read latencies that it is using to make the decision. I looked in the dumps, but don't seem to see anything. Anyone know if it's possible and where they are? Thanks -Zach -- Zach Giles zgiles at gmail.com From usa-principal at gpfsug.org Tue Feb 14 19:29:04 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Tue, 14 Feb 2017 11:29:04 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> References: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> Message-ID: <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> I should have also asked for anyone interested in giving a talk, as usual, the users group meeting is not meant to be used as a sales and marketing platform, but user experiences are always welcome. If you're interested, or have an idea for a talk, please let us know so we can include it in the agenda. Thanks, Kristy & Bob On , usa-principal-gpfsug.org wrote: > Just a follow up reminder to save the date, April 4-5, for a two-day > Spectrum Scale Users Group event hosted by NERSC in Berkeley, > California. > > We are working on the registration form and agenda and hope to be able > to share more details soon. > > Best, > Kristy & Bob > > > On , usa-principal-gpfsug.org wrote: >> Hello all and happy new year (depending upon where you are right now >> :-) ). >> >> We'll have more details in 2017, but for now please save the date for >> a two-day users group meeting at NERSC in Berkeley, California. >> >> April 4-5, 2017 >> National Energy Research Scientific Computing Center (nersc.gov) >> Berkeley, California >> >> We look forward to offering our first two-day event in the US. 
>> >> Best, >> Kristy & Bob From mweil at wustl.edu Tue Feb 14 20:17:36 2017 From: mweil at wustl.edu (Matt Weil) Date: Tue, 14 Feb 2017 14:17:36 -0600 Subject: [gpfsug-discuss] GUI access Message-ID: Hello all, Some how we misplaced the password for our dev instance. Is there any way to reset it? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From r.sobey at imperial.ac.uk Tue Feb 14 20:31:16 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 14 Feb 2017 20:31:16 +0000 Subject: [gpfsug-discuss] GUI access In-Reply-To: References: Message-ID: Hi Matt This is what I got from support a few months ago when I had a problem with our "admin" user disappearing. "We have occasionally seen this issue in the past where it has been resolved by : /usr/lpp/mmfs/gui/cli/mkuser admin -p Passw0rd -g Administrator,SecurityAdmin This creates a new user named "admin" with the password "Passw0rd" " I was running 4.2.1-0 at the time iirc. ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Matt Weil Sent: 14 February 2017 20:17 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] GUI access Hello all, Some how we misplaced the password for our dev instance. Is there any way to reset it? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sfadden at us.ibm.com Tue Feb 14 21:02:06 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Tue, 14 Feb 2017 21:02:06 +0000 Subject: [gpfsug-discuss] GUI access In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From leoluan at us.ibm.com Wed Feb 15 00:14:12 2017 From: leoluan at us.ibm.com (Leo Luan) Date: Wed, 15 Feb 2017 00:14:12 +0000 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Wed Feb 15 13:17:40 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Wed, 15 Feb 2017 08:17:40 -0500 Subject: [gpfsug-discuss] Fw: mmbackup examples using policy In-Reply-To: References: Message-ID: Hi Steven: Yes that is more or less what we want to do. We have tivoli here for backup so I'm somewhat familiar with inclexcl files. The filesystem I want to backup is a shared home. Right now I do have a policy...mmlspolicy home -L does return a policy. 
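To make the overall shape of a daily run concrete before getting into the exclude rules below, a bare-bones progressive incremental of the home file system could look roughly like the following; the node class, TSM server name and log path are illustrative assumptions rather than anything taken from this thread:
---
# run from cron on one node that has the TSM client configured
0 1 * * * /usr/lpp/mmfs/bin/mmbackup home -t incremental -N nsdnodes --tsm-servers TSMSERVER1 >/var/log/mmbackup.home.log 2>&1
---
Scheduling is usually just cron, as sketched here, rather than dsmcad, which ties into the scheduling question further down in this message.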
So if I did not want to backup core and cache files I could create a backup policy using /var/mmfs/mmbackup/.mmbackupRules.home and place in it?: EXCLUDE "/gpfs/home/.../core" EXCLUDE "/igpfs/home/.../.opera/cache4" EXCLUDE "/gpfs/home/.../.netscape/cache/.../*" EXCLUDE "/gpfs/home/.../.mozilla/default/.../Cache" EXCLUDE "/gpfs/home/.../.mozilla/.../Cache/*" EXCLUDE "/gpfs/home/.../.mozilla/.../Cache" EXCLUDE "/gpfs/home/.../.cache/mozilla/*" EXCLUDE.DIR "/gpfs/home/.../.mozilla/firefox/.../Cache" I did a test run of mmbackup and I noticed I got a template put in that location: [root at cl002 ~]# ll -al /var/mmfs/mmbackup/ total 12 drwxr-xr-x 2 root root 4096 Feb 15 07:43 . drwxr-xr-x 10 root root 4096 Jan 4 10:42 .. -r-------- 1 root root 1177 Feb 15 07:43 .mmbackupRules.home So I can copy this off into /var/mmfs/etc for example and to use next time with my edits. What is normally used to schedule the mmbackup? Cronjob? dsmcad? Thanks much. On Tue, Feb 14, 2017 at 11:21 AM, Steven Berman wrote: > Eric, > What specifically do you wish to accomplish? It sounds to me like > you want to use mmbackup to do incremental backup of parts or all of your > file system. But your question did not specify what specifically other > than "whole file system incremental" you want to accomplish. Mmbackup by > default, with "-t incremental" will back up the whole file system, > including all filesets of either variety, and without regard to storage > pools. If you wish to back up only a sub-tree of the file system, it must > be in an independent fileset (--inode-space=new) and the current product > supports doing the backup of just that fileset. If you want to backup > parts of the file system but exclude things in certain storage pools, from > anywhere in the tree, you can either use "include exclude rules" in your > Spectrum Protect (formerly TSM) configuration file, or you can hand-edit > the policy rules for mmbackup which can be copied from /var/mmfs/mmbackup/.mmbackupRules.<file system name> (only persistent during mmbackup execution). Copy that > file to a new location, hand-edit and run mmbackup next time with -P <new policy rules file>. Is there something else you want to accomplish? > > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/ > com.ibm.spectrum.scale.v4r22.doc/bl1adv_semaprul.htm > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/ > com.ibm.spectrum.scale.v4r22.doc/bl1adm_backupusingmmbackup.htm > > Steven Berman Spectrum Scale / HPC General Parallel File > System Dev. > Pittsburgh, PA (412) 667-6993 Tie-Line 989-6993 > sberman at us.ibm.com > ----Every once in a while, it is a good idea to call out, "Computer, end > program!" just to check. --David Noelle > ----All Your Base Are Belong To Us. --CATS > > > > > > From: "J. Eric Wonderley" > To: gpfsug main discussion list > > Date: 02/13/2017 10:28 AM > Subject: [gpfsug-discuss] mmbackup examples using policy > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Anyone have any examples of this? I have a filesystem that has 2 pools > and several filesets and would like daily progressive incremental backups > of its contents. > > I found some stuff here(nothing real close to what I wanted however): > /usr/lpp/mmfs/samples/ilm > > I have the tsm client installed on the server nsds. 
> > Thanks much_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Wed Feb 15 16:43:37 2017 From: zgiles at gmail.com (Zachary Giles) Date: Wed, 15 Feb 2017 11:43:37 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: Message-ID: Just checked, we are definitely using PROT_READ, and the users only have read permission to the files, so it should be purely read. I guess that furthers the concern since we shouldn't be seeing the IO overhead as you mentioned. We also use madvise.. not sure if that helps or hurts. On Tue, Feb 14, 2017 at 7:14 PM, Leo Luan wrote: > Does your internally developed application do only reads during in its > monthly run? If so, can you change it to use PROT_READ flag during the > mmap call? That way you will not get the 10-block decompression IO overhead > and your files will remain compressed. The decompression happens upon > pagein's only if the mmap call includes the PROT_WRITE flag (or upon actual > writes for non-mmap IOs). > > Leo > > > ----- Original message ----- > From: Zachary Giles > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Questions about mmap GPFS and compression > Date: Tue, Feb 14, 2017 8:10 AM > > Hi Leo, > > I agree with your view on compression and what it should be used for, > in general. The read bandwidth amplification is definitely something > we're seeing. > > Just a little more background on the files: > The files themselves are not "cold" (archive), however, they are very > lightly used. The data set is thousands of files that are each > 100-200GB, totaling about a PB. the read pattern is a few GB from > about 20% of the files once a month. So the total read is only several > TB out of a PB every month. ( approximate ). We can get a compression > of about 5:1 using GPFS with these files, so we can gain back 800TB > with compression. The total run time of the app (reading all those all > chunks, when uncompressed) is maybe an hour total. > > Although leaving the files uncompressed would let the app work, > there's a huge gain to be had if we can make compression work by > saving ~800TB As it's such a small amount of data read each time, and > also not too predictable (it's semi-random historical), and as the > length of the job is short enough, it's hard to justify decompressing > large chunks of the system to run 1 job. I would have to decompress > 200TB to read 10TB, recompress them, and decompress a different > (overlapping) 200TB next month. The compression / decompression of > sizable portions of the data takes days. > > I think there maybe more of an issue that just performance though.. > The decompression thread is running, internal file metadata is read > fine, most of the file is read fine. Just at times it gets stuck.. the > decompression thread is running in GPFS, the app is polling, it just > never comes back with the block. I feel like there's a race condition > here where a block is read, available for the app, but thrown away > before the app can read it, only to be decompressed again. > It's strange how some block positions are slow (expected) and others > just never come back (it will poll for days on a certain address). > However, reading the file in-order is fine. 
> > Is this a block caching issue? Can we tune up the amount of blocks kept? > I think with mmap the blocks are not kept in page pool, correct? > > -Zach > > On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan wrote: >> Hi Zachary, >> >> When a compressed file is mmapped, each 4K read in your tests causes the >> accessed part of the file to be decompressed (in the granularity of 10 >> GPFS >> blocks). For usual file sizes, the parts being accessed will be >> decompressed and IOs speed will be normal except for the first 4K IO in >> each >> 10-GPFS-block group. For very large files, a large percentage of small >> random IOs may keep getting amplified to 10-block decompression IO for a >> long time. This is probably what happened in your mmap application run. >> >> The suggestion is to not compress files until they have become cold (not >> likely to be accessed any time soon) and avoid compressing very large >> files >> that may be accessed through mmap later. The product already has a >> built-in >> protection preventing compression of files that are mmapped at compression >> time. You can add an exclude rule in the compression policy run for files >> that are identified to have mmap performance issues (in case they get >> mmapped after being compressed in a periodical policy run). >> >> Leo Luan >> >> From: Zachary Giles >> To: gpfsug main discussion list >> Date: 02/10/2017 01:57 PM >> Subject: [gpfsug-discuss] Questions about mmap GPFS and compression >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> ________________________________ >> >> >> >> Hello All, >> >> I've been seeing some less than desirable behavior with mmap and >> compression in GPFS. Curious if others see similar or have any ideas >> if this is accurate.. >> The guys here want me to open an IBM ticket, but I figured I'd see if >> anyone has had this experience before. >> >> We have an internally developed app that runs on our cluster >> referencing data sitting in GPFS. It is using mmap to access the files >> due to a library we're using that requires it. >> >> If we run the app against some data on GPFS, it performs well.. >> finishing in a few minutes time -- Great. However, if we compress the >> file (in GPFS), the app is still running after 2 days time. >> stracing the app shows that is polling on a file descriptor, forever.. >> as if a data block is still pending. >> >> I know mmap is supported with compression according to the manual >> (with some stipulations), and that performance is expected to be much >> less since it's more large-block oriented due to decompressed in >> groups.. no problem. But it seems like some data should get returned. >> >> I'm surprised to find that a very small amount of data is sitting in >> the buffers (mmfsadm dump buffers) in reference to the inodes. The >> decompression thread is running continuously, while the app is still >> polling for data from memory and sleeping, retrying, sleeping, repeat. >> >> What I believe is happening is that the 4k pages are being pulled out >> of large decompression groups from an mmap read request, put in the >> buffer, then the compression group data is thrown away since it has >> the result it wants, only to need another piece of data that would >> have been in that group slightly later, which is recalled, put in the >> buffer.. etc. Thus an infinite slowdown. Perhaps also the data is >> expiring out of the buffer before the app has a chance to read it. I >> can't tell. In any case, the app makes zero progress. 
>> >> I tried without our app, using fio.. mmap on an uncompressed file with >> 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not >> impressive). However, on a compressed file it is only 20KB/s max. ( >> far less impressive ). Reading a file using aio etc is over 3GB/s on a >> single thread without even trying. >> >> What do you think? >> Anyone see anything like this? Perhaps there are some tunings to waste >> a bit more memory on cached blocks rather than make decompression >> recycle? >> >> I've searched back the archives a bit. There's a May 2013 thread about >> slowness as well. I think we're seeing much much less than that. Our >> page pools are of decent size. Its not just slowness, it's as if the >> app never gets a block back at all. ( We could handle slowness .. ) >> >> Thanks. Open to ideas.. >> >> -Zach Giles >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > -- > Zach Giles > zgiles at gmail.com > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Zach Giles zgiles at gmail.com From aaron.s.knister at nasa.gov Fri Feb 17 15:52:19 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 10:52:19 -0500 Subject: [gpfsug-discuss] bizarre performance behavior Message-ID: This is a good one. I've got an NSD server with 4x 16GB fibre connections coming in and 1x FDR10 and 1x QDR connection going out to the clients. I was having a really hard time getting anything resembling sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for reads). The back-end is a DDN SFA12K and I *know* it can do better than that. I don't remember quite how I figured this out but simply by running "openssl speed -multi 16" on the nsd server to drive up the load I saw an almost 4x performance jump which is pretty much goes against every sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to quadruple your i/o performance"). This feels like some type of C-states frequency scaling shenanigans that I haven't quite ironed down yet. I booted the box with the following kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which didn't seem to make much of a difference. I also tried setting the frequency governer to userspace and setting the minimum frequency to 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have to run something to drive up the CPU load and then performance improves. I'm wondering if this could be an issue with the C1E state? I'm curious if anyone has seen anything like this. The node is a dx360 M4 (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Fri Feb 17 16:43:34 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 17 Feb 2017 16:43:34 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: Maybe its related to interrupt handlers somehow? 
You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] Sent: 17 February 2017 15:52 To: gpfsug main discussion list Subject: [gpfsug-discuss] bizarre performance behavior This is a good one. I've got an NSD server with 4x 16GB fibre connections coming in and 1x FDR10 and 1x QDR connection going out to the clients. I was having a really hard time getting anything resembling sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for reads). The back-end is a DDN SFA12K and I *know* it can do better than that. I don't remember quite how I figured this out but simply by running "openssl speed -multi 16" on the nsd server to drive up the load I saw an almost 4x performance jump which is pretty much goes against every sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to quadruple your i/o performance"). This feels like some type of C-states frequency scaling shenanigans that I haven't quite ironed down yet. I booted the box with the following kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which didn't seem to make much of a difference. I also tried setting the frequency governer to userspace and setting the minimum frequency to 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have to run something to drive up the CPU load and then performance improves. I'm wondering if this could be an issue with the C1E state? I'm curious if anyone has seen anything like this. The node is a dx360 M4 (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Fri Feb 17 16:53:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 11:53:00 -0500 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: <104dc3f8-a91c-d9ae-3a86-88136c46de39@nasa.gov> Well, disabling the C1E state seems to have done the trick. I removed the kernel parameters I mentioned and set the cpu governer back to ondemand with a minimum of 1.2ghz. I'm now getting 6.2GB/s of reads which I believe is pretty darned close to theoretical peak performance. -Aaron On 2/17/17 10:52 AM, Aaron Knister wrote: > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). 
> > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Fri Feb 17 17:13:08 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 12:13:08 -0500 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Robert.Oesterlin at nuance.com Fri Feb 17 17:26:29 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 17 Feb 2017 17:26:29 +0000 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages Message-ID: Any way to suppress these? I get them every time mmpmon is run: Feb 17 11:54:02 nrg5-gpfs01 mmfs[10375]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Feb 17 11:55:01 nrg5-gpfs01 mmfs[13668]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Feb 17 11:56:02 nrg5-gpfs01 mmfs[17318]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From syi at ca.ibm.com Fri Feb 17 17:54:39 2017 From: syi at ca.ibm.com (Yi Sun) Date: Fri, 17 Feb 2017 12:54:39 -0500 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages In-Reply-To: References: Message-ID: It may relate to CommandAudit http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1xx_soc.htm Yi Sun > ------------------------------ > > Message: 5 > Date: Fri, 17 Feb 2017 17:26:29 +0000 > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages > Message-ID: > Content-Type: text/plain; charset="utf-8" > > Any way to suppress these? I get them every time mmpmon is run: > > Feb 17 11:54:02 nrg5-gpfs01 mmfs[10375]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > Feb 17 11:55:01 nrg5-gpfs01 mmfs[13668]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > Feb 17 11:56:02 nrg5-gpfs01 mmfs[17318]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Feb 17 17:58:28 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 17 Feb 2017 17:58:28 +0000 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages Message-ID: <3E007FA1-7152-45FB-B78E-2C92A34B7727@nuance.com> Bingo, that was it. I wish I could control it in a more fine-grained manner. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Yi Sun Reply-To: gpfsug main discussion list Date: Friday, February 17, 2017 at 11:54 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmpmon messages in /var/log/messages It may relate to CommandAudit http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1xx_soc.htm -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From janfrode at tanso.net Fri Feb 17 18:29:46 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 17 Feb 2017 18:29:46 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Message-ID: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, > I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 20 15:35:09 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 15:35:09 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM Message-ID: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Feb 20 15:40:39 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 20 Feb 2017 15:40:39 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: Message-ID: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. 
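Even with the patched build, the invocation looks just like stock rsync; as I recall, the patch mainly adds handling for the GPFS-specific metadata on top of the usual options. A minimal sketch (the paths are made up here, and the -A/-X flags assume ACL and extended-attribute support on both ends):

# copy one fileset tree, preserving hard links, ACLs and extended attributes
rsync -aHAX --numeric-ids --delete /gpfs/oldfs/fileset01/ /gpfs/newfs/fileset01/

A dry run with -n first is cheap insurance before pointing it at real data.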
Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Mon Feb 20 15:47:57 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 17:47:57 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Mon Feb 20 15:55:47 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 15:55:47 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D28B5F.82432C40] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: From orichards at pixitmedia.com Mon Feb 20 16:00:50 2017 From: orichards at pixitmedia.com (Orlando Richards) Date: Mon, 20 Feb 2017 16:00:50 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Woo! Still going strong! Lovely to hear it still being useful - thanks Kevin :) -- *Orlando Richards* VP Product Development, Pixit Media 07930742808|orichards at pixitmedia.com www.pixitmedia.com |Tw:@pixitmedia On 20/02/2017 15:40, Buterbaugh, Kevin L wrote: > Hi Mark, > > Are you referring to this? > > http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html > > It?s not magical, but it?s pretty good! ;-) Seriously, we use it any > time we want to move stuff around in our GPFS filesystems. > > Kevin > >> On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com >> wrote: >> >> I have a client that has around 200 filesets (must be a good reason >> for it) and they need to migrate data but it?s really looking like >> this might bring AFM to its knees. At one point, I had heard of some >> magical version of RSYNC that IBM developed that could do something >> like this. Anyone have any details on such a tool and is it >> available. Or is there some other way I might do this? >> >> *Mark R. Bush*| *Storage Architect* >> Mobile: 210-237-8415 >> Twitter:@bushmr | LinkedIn:/markreedbush >> >> 10100 Reunion Place, Suite 500, San Antonio, TX 78216 >> www.siriuscom.com >> |mark.bush at siriuscom.com >> > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu > - (615)875-9633 > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Mon Feb 20 16:04:26 2017 From: zgiles at gmail.com (Zachary Giles) Date: Mon, 20 Feb 2017 11:04:26 -0500 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Message-ID: Hey Mark, I'm curious about the idea behind 200 filesets bring AFM to its knees. Any specific part you're concerned about? -Zach On Mon, Feb 20, 2017 at 11:00 AM, Orlando Richards wrote: > Woo! Still going strong! 
Lovely to hear it still being useful - thanks > Kevin :) > > > -- > *Orlando Richards* > VP Product Development, Pixit Media > 07930742808 | orichards at pixitmedia.com > www.pixitmedia.com | Tw:@pixitmedia > > > > On 20/02/2017 15:40, Buterbaugh, Kevin L wrote: > > Hi Mark, > > Are you referring to this? > > http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012- > October/000169.html > > It?s not magical, but it?s pretty good! ;-) Seriously, we use it any > time we want to move stuff around in our GPFS filesystems. > > Kevin > > On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: > > I have a client that has around 200 filesets (must be a good reason for > it) and they need to migrate data but it?s really looking like this might > bring AFM to its knees. At one point, I had heard of some magical version > of RSYNC that IBM developed that could do something like this. Anyone have > any details on such a tool and is it available. Or is there some other way > I might do this? > > > > > *Mark R. Bush*| *Storage Architect* > Mobile: 210-237-8415 <(210)%20237-8415> > Twitter: @bushmr | LinkedIn: /markreedbush > > 10100 Reunion Place, Suite 500, San Antonio, TX 78216 > www.siriuscom.com |mark.bush at siriuscom.com > > > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 <(615)%20875-9633> > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other person. > Please notify the sender immediately and delete this email from your > computer system. Any opinions expressed are not necessarily those of the > company from which this email was sent and, whilst to the best of our > knowledge no viruses or defects exist, no responsibility can be accepted > for any loss or damage arising from its receipt or subsequent use of this > email. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- Zach Giles zgiles at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Mon Feb 20 16:05:27 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 18:05:27 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Message-ID: Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Not sure. It?s a 3.5 based cluster currently. 
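I'll check. If I'm reading the docs right, something like this should report the ACL semantics in effect (gpfs0 is just a placeholder for our device name; I'd expect posix, nfs4, or all to come back):

mmlsfs gpfs0 -k

That should tell us whether plain rsync can carry the permissions over or whether we need something NFSv4-ACL aware.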
From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Mon Feb 20 16:35:03 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Mon, 20 Feb 2017 17:35:03 +0100 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu><4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Message-ID: An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 20 16:54:23 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 16:54:23 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Message-ID: <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> Regular rsync apparently takes one week to sync up. I?m just the messenger getting more info from my client soon. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 10:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D28B67.B160D010] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image002.gif at 01D28B67.B160D010] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. 
Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 1853 bytes Desc: image002.gif URL: From YARD at il.ibm.com Mon Feb 20 17:03:29 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 19:03:29 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu><05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> Message-ID: Hi Split rsync into the directory level so u can run parallel rsync session , this way you maximize the network usage. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 06:54 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Regular rsync apparently takes one week to sync up. I?m just the messenger getting more info from my client soon. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 10:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. 
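Splitting the copy at the fileset or top-level directory level and running several rsync sessions side by side is simple to script. A rough sketch with made-up paths (it assumes the top level holds only directories with no spaces in their names; xargs -P keeps 8 transfers in flight):

# one rsync per top-level directory, up to 8 in parallel
ls /gpfs/oldfs | xargs -P 8 -I{} rsync -aHAX --numeric-ids /gpfs/oldfs/{}/ /gpfs/newfs/{}/

How far -P can be pushed depends on how much of the network and disk a single rsync already saturates.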
Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1853 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Feb 21 13:53:21 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 21 Feb 2017 13:53:21 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: Message-ID: Hey, we?ve got 400+ filesets and still adding more ? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark.Bush at siriuscom.com Sent: 20 February 2017 15:35 To: gpfsug main discussion list Subject: [gpfsug-discuss] 200 filesets and AFM I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
From jonathon.anderson at colorado.edu Tue Feb 21 21:39:48 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Tue, 21 Feb 2017 21:39:48 +0000 Subject: [gpfsug-discuss] Bad performance with GPFS system monitoring (mmsysmon) in GPFS 4.2.1.1 Message-ID:

This thread happened before I joined gpfsug-discuss; but be advised that we also experienced severe (1.5x-3x) performance degradation in user applications when running mmsysmon. In particular, we're running a Haswell+OPA system. The issue appears to only happen when the user application is simultaneously using all available cores *and* communicating over the network. Synthetic CPU tests with HPL did not expose the issue, nor did OSU micro-benchmarks that were designed to maximize the network without necessarily using all CPUs.

I've stopped mmsysmon by hand[^1] for now; but I haven't yet gone so far as to remove the config file to prevent it from starting in the future. We intend to run further tests; but I wanted to share our experiences so far (as this took us way longer than I wish it had to diagnose).

~jonathon

From dod2014 at med.cornell.edu Wed Feb 22 15:57:46 2017 From: dod2014 at med.cornell.edu (Douglas Duckworth) Date: Wed, 22 Feb 2017 10:57:46 -0500 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node Message-ID:

Hello! I am an HPC admin at Weill Cornell Medicine on the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time, so it's never boring. Yesterday we deployed a server that's intended to create an atomic-level image of a ribosome. Pretty serious science!

We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running CentOS 6 and 7, while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Penguin Computing, with GTX 1080s, as well as a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD...

Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example:

sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155

When run locally the desired change becomes permanent and we see RDMA active after restarting the GPFS service on the node. However, mmchconfig still tries to run against all nodes in the cluster! I kill it, of course, at the known_hosts step. In addition I tried:

sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost

However, the same result. When using a capital "I", mmchconfig does attempt ssh to all nodes, yet the change does not persist after restarting GPFS. So far I have consulted the following documentation:

http://ibm.co/2mcjK3P
http://ibm.co/2lFSInH

Could anyone please help? We're using GPFS client version 4.1.1-3 on CentOS 6 nodes as well as 4.2.1-2 on those which are running CentOS 7. Thanks so much!
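In case it's useful, a quick way to double-check what a node actually picked up (a sketch; it assumes the default log location under /var/adm/ras):

# what the cluster configuration says, including any per-node overrides
mmlsconfig verbsPorts

# what the daemon actually initialized at startup
grep "VERBS RDMA" /var/adm/ras/mmfs.log.latest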
Best Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Wed Feb 22 16:12:15 2017 From: david_johnson at brown.edu (David D. Johnson) Date: Wed, 22 Feb 2017 11:12:15 -0500 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node In-Reply-To: References: Message-ID: I have a feeling that this is how mmchconfig is supposed to work. You?ve asked it to change the configuration of one node, but the database of configuration settings needs to be propagated to the entire cluster whenever a change is made. You?ll find a section in the mmlsconfig output specific to the node(s) that have been changed [node155] ?. At this point your configuration may be out of sync on any number of nodes. ? ddj Dave Johnson Brown University CCV/CIS > On Feb 22, 2017, at 10:57 AM, Douglas Duckworth wrote: > > Hello! > > I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! > > We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... > > Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: > > sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 > > When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. > > In addition I tried: > > sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost > > However the same result. > > When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. > > So far I consulted the following documentation: > > http://ibm.co/2mcjK3P > http://ibm.co/2lFSInH > > Could anyone please help? > > We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. > > Thanks so much! > > Best > Doug > > > Thanks, > > Douglas Duckworth, MSc, LFCS > HPC System Administrator > Scientific Computing Unit > Physiology and Biophysics > Weill Cornell Medicine > E: doug at med.cornell.edu > O: 212-746-6305 > F: 212-746-8690 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bbanister at jumptrading.com Wed Feb 22 16:17:09 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 22 Feb 2017 16:17:09 +0000 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node In-Reply-To: References: Message-ID: I agree with this assessment. I would also recommend looking into user defined node classes so that your mmlsconfig output is more easily readable, otherwise each node will be listed in the mmlsconfig output. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of David D. Johnson Sent: Wednesday, February 22, 2017 10:12 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing verbsPorts On Single Node I have a feeling that this is how mmchconfig is supposed to work. You?ve asked it to change the configuration of one node, but the database of configuration settings needs to be propagated to the entire cluster whenever a change is made. You?ll find a section in the mmlsconfig output specific to the node(s) that have been changed [node155] ?. At this point your configuration may be out of sync on any number of nodes. ? ddj Dave Johnson Brown University CCV/CIS On Feb 22, 2017, at 10:57 AM, Douglas Duckworth > wrote: Hello! I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. In addition I tried: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost However the same result. When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. So far I consulted the following documentation: http://ibm.co/2mcjK3P http://ibm.co/2lFSInH Could anyone please help? We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. Thanks so much! 
Best Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Feb 23 15:46:20 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 23 Feb 2017 15:46:20 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Message-ID: For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" Reply-To: "dW-notify at us.ibm.com" Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28DB9.AEDC8740] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
From aaron.s.knister at nasa.gov Thu Feb 23 17:03:18 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 23 Feb 2017 12:03:18 -0500 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov>

On a particularly heavily loaded NSD server I'm seeing a lot of these messages:

0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas'
0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas'
0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas'
0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas'
0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas'

I've tried tweaking verbsRdmasPerConnection but the issue seems to persist. Has anyone encountered this, and if so, how'd you fix it?

-Aaron

-- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776

From oehmes at gmail.com Thu Feb 23 17:12:40 2017 From: oehmes at gmail.com (Sven Oehme) Date: Thu, 23 Feb 2017 17:12:40 +0000 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID:

All this waiter shows is that you have more in flight than the node or connection can currently serve. The reasons for that can be misconfiguration, or simply running out of resources on the node, not the connection. With the latest code you shouldn't see this anymore for node limits, as the system automatically adjusts the maximum number of RDMAs according to the node's capabilities; you should see messages in your mmfslog like:

2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes verbsRdmaUseCompVectors=yes
2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so (version >= 1.1) loaded and initialized.
2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased from* 3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes.* 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE we want to eliminate all this configurable limits eventually, but this takes time, but as you can see above, we make progress on each release :-) Sven On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister wrote: > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From usa-principal at gpfsug.org Thu Feb 23 21:54:01 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Thu, 23 Feb 2017 13:54:01 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> References: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> Message-ID: <06d616c6d0da5b6aabae1f8d4bbc0b84@webmail.gpfsug.org> Hello, Information, including the registration form, for the April 4-5 User Group Meeting at NERSC (Berkeley, CA) is now available. Please register as early as possible so we can make final decisions about room selection and a science facility tour. The agenda is still be being finalized and we will continue to update the online agenda as details get settled. *We still have room for 2-3 20-minute user talks, if you are interested, please let us know.* Details, and a link to the registration form can be found here: https://www.nersc.gov/research-and-development/data-analytics/spectrum-user-group-meeting/ Looking forward to seeing you in April. Cheers, Kristy & Bob On , usa-principal-gpfsug.org wrote: > I should have also asked for anyone interested in giving a talk, as > usual, the users group meeting is not meant to be used as a sales and > marketing platform, but user experiences are always welcome. > > If you're interested, or have an idea for a talk, please let us know > so we can include it in the agenda. > > Thanks, > Kristy & Bob > > > On , usa-principal-gpfsug.org wrote: >> Just a follow up reminder to save the date, April 4-5, for a two-day >> Spectrum Scale Users Group event hosted by NERSC in Berkeley, >> California. >> >> We are working on the registration form and agenda and hope to be able >> to share more details soon. >> >> Best, >> Kristy & Bob >> >> >> On , usa-principal-gpfsug.org wrote: >>> Hello all and happy new year (depending upon where you are right now >>> :-) ). >>> >>> We'll have more details in 2017, but for now please save the date for >>> a two-day users group meeting at NERSC in Berkeley, California. >>> >>> April 4-5, 2017 >>> National Energy Research Scientific Computing Center (nersc.gov) >>> Berkeley, California >>> >>> We look forward to offering our first two-day event in the US. >>> >>> Best, >>> Kristy & Bob From willi.engeli at id.ethz.ch Fri Feb 24 12:39:03 2017 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Fri, 24 Feb 2017 12:39:03 +0000 Subject: [gpfsug-discuss] Performance Tests using Bonnie++ forces expell of the client running the test Message-ID: Dear all, Does one of you know if Bonnie++ io Test is compatible with GPFS and if, what could force expell of the client from the cluster? Thanks Willi -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From Achim.Rehor at de.ibm.com Fri Feb 24 13:24:50 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Fri, 24 Feb 2017 14:24:50 +0100 Subject: [gpfsug-discuss] Performance Tests using Bonnie++ forces expell of the client running the test In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
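When a client is expelled in the middle of a heavy I/O run such as Bonnie++, the reason (missed disk lease renewal, lost TCP connection, an overloaded mmfsd, and so on) is normally recorded in the GPFS log on both the expelled node and the cluster manager, which is the node that takes the expel decision. A small sketch for gathering that evidence, assuming default log locations:

# On the expelled client and on the cluster manager, look for the expel reason
grep -i expel /var/adm/ras/mmfs.log.latest

# Which node is the cluster manager?
/usr/lpp/mmfs/bin/mmlsmgr -c

# Network health as the daemon sees it around the test window
/usr/lpp/mmfs/bin/mmdiag --network

Bonnie++ itself only issues ordinary POSIX I/O, so an expel during the test is usually a symptom of the client starving its own daemon (memory or network) under load rather than an incompatibility with the benchmark.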
Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: From bbanister at jumptrading.com Fri Feb 24 14:08:19 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 14:08:19 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E75.28281900] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From Paul.Sanchez at deshaw.com Fri Feb 24 15:15:59 2017 From: Paul.Sanchez at deshaw.com (Sanchez, Paul) Date: Fri, 24 Feb 2017 15:15:59 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? 
Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E86.6F1F9BB0] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From bbanister at jumptrading.com Fri Feb 24 15:25:14 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> I just got word that you only need to update the active file system manager node? 
I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E7F.E769D830] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From jfosburg at mdanderson.org Fri Feb 24 15:29:41 2017 From: jfosburg at mdanderson.org (Fosburgh,Jonathan) Date: Fri, 24 Feb 2017 15:29:41 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> References: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> Message-ID: <1487950179.11933.2.camel@mdanderson.org> FWIW, my contact said to do everything, even client only clusters. -- Jonathan Fosburgh Principal Application Systems Analyst Storage Team IT Operations jfosburg at mdanderson.org (713) 745-9346 -----Original Message----- Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption To: gpfsug main discussion list > Reply-to: gpfsug main discussion list From: Bryan Banister > I just got word that you only need to update the active file system manager node? I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. 
Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:1487950179.36938.0.camel at mdanderson.org] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The information contained in this e-mail message may be privileged, confidential, and/or protected from disclosure. This e-mail message may contain protected health information (PHI); dissemination of PHI should comply with applicable federal and state laws. If you are not the intended recipient, or an authorized representative of the intended recipient, any further review, disclosure, use, dissemination, distribution, or copying of this message or any attachment (or the information contained therein) is strictly prohibited. If you think that you have received this e-mail message in error, please notify the sender by return e-mail and delete all references to it and its contents from your systems. 
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From bbanister at jumptrading.com Fri Feb 24 16:21:07 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 16:21:07 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: <1487950179.11933.2.camel@mdanderson.org> References: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> <1487950179.11933.2.camel@mdanderson.org> Message-ID: Here is the latest I got from IBM: The fix only needs to be installed on the file system manager nodes. About how to know if your cluster is affected already, you can check if there was any MMFS_FSSTRUCT error in the system logs. If you encounter any lookup failure, funny ls cmd outputs. Or if any cmd would give some replica mismatch error or warning. If you encountered the following kind of Assertion failure you hit the bug. Thu Jul 21 03:26:32.373 2016: [X] *** Assert exp(prevIndEntryP->nextP->dataBlockNum > dataBlockNum) in line 4552 of file /project/sprelbmd/build/rbmd1629a/src/avs/fs/mmfs/ts/log/repUpdate.C Thu Jul 21 03:26:32.374 2016: [E] *** Traceback: Thu Jul 21 03:26:32.375 2016: [E] 2:0x7FE6E141AB36 logAssertFailed + 0x2D6 at Logger.C:546 Thu Jul 21 03:26:32.376 2016: [E] 3:0x7FE6E13FCD25 InodeRecoveryList::addInodeAndIndBlock(long long, unsigned int, RepDiskAddr const&, InodeRecoveryList::FlagsToSet, long long, RepDiskAddr const&) + 0x355 at repUpdate.C:4552 Thu Jul 21 03:26:32.377 2016: [E] 4:0x7FE6E1066879 RecoverDirEntry(StripeGroup*, LogRecovery*, LogFile*, LogRecordType, long long, int, unsigned int*, char*, int*, RepDiskAddr) + 0x1089 at direct.C:2312 Thu Jul 21 03:26:32.378 2016: [E] 5:0x7FE6E13F8741 LogRecovery::recoverOneObject(long long) + 0x1E1 at recoverlog.C:362 Thu Jul 21 03:26:32.379 2016: [E] 6:0x7FE6E0F29B25 MultiThreadWork::doNextStep() + 0xC5 at workthread.C:533 Thu Jul 21 03:26:32.380 2016: [E] 7:0x7FE6E0F29FBB MultiThreadWork::helperThreadBody(void*) + 0xCB at workthread.C:455 Thu Jul 21 03:26:32.381 2016: [E] 8:0x7FE6E0F5FB26 Thread::callBody(Thread*) + 0x46 at thread.C:393 Thu Jul 21 03:26:32.382 2016: [E] 9:0x7FE6E0F4DD12 Thread::callBodyWrapper(Thread*) + 0xA2 at mastdep.C:1077 Thu Jul 21 03:26:32.383 2016: [E] 10:0x7FE6E0667851 start_thread + 0xD1 at mastdep.C:1077 Thu Jul 21 03:26:32.384 2016: [E] 11:0x7FE6DF7BE90D clone + 0x6D at mastdep.C:1077 Hope that helps, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Fosburgh,Jonathan Sent: Friday, February 24, 2017 9:30 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption FWIW, my contact said to do everything, even client only clusters. 
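Bryan's note above gives two concrete symptoms to look for: MMFS_FSSTRUCT errors and the specific log-recovery assert. A minimal sketch of that check, assuming syslog lands in /var/log/messages and the default GPFS log directory, run on (or fanned out to) every node that may have acted as file system manager:

# FSSTRUCT events are raised to syslog on the node that detects them
grep MMFS_FSSTRUCT /var/log/messages

# The assert signature from the traceback above, searched in the retained GPFS logs
grep -l "Assert exp(prevIndEntryP->nextP->dataBlockNum" /var/adm/ras/mmfs.log.* 2>/dev/null

# Fan the same check out across the cluster if mmdsh is available
/usr/lpp/mmfs/bin/mmdsh -N all "grep -c MMFS_FSSTRUCT /var/log/messages"

These greps only tell you whether the known symptoms are present; if anything turns up, the next step is usually a conversation with support and possibly an offline mmfsck, not a do-it-yourself repair.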
-- Jonathan Fosburgh Principal Application Systems Analyst Storage Team IT Operations jfosburg at mdanderson.org (713) 745-9346 -----Original Message----- Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption To: gpfsug main discussion list > Reply-to: gpfsug main discussion list > From: Bryan Banister > I just got word that you only need to update the active file system manager node? I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E87.B52EFB90] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The information contained in this e-mail message may be privileged, confidential, and/or protected from disclosure. This e-mail message may contain protected health information (PHI); dissemination of PHI should comply with applicable federal and state laws. If you are not the intended recipient, or an authorized representative of the intended recipient, any further review, disclosure, use, dissemination, distribution, or copying of this message or any attachment (or the information contained therein) is strictly prohibited. If you think that you have received this e-mail message in error, please notify the sender by return e-mail and delete all references to it and its contents from your systems. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From SAnderson at convergeone.com Fri Feb 24 16:58:34 2017 From: SAnderson at convergeone.com (Shaun Anderson) Date: Fri, 24 Feb 2017 16:58:34 +0000 Subject: [gpfsug-discuss] NFS Permission matchup to mmnfs command Message-ID: <1487955513211.95497@convergeone.com> I have a customer currently using native NFS and we are going to move them over the CES. I'm looking at the mmnfs command and trying to map the nfs export arguments with the CES arguments. 
My customer has these currently: no_wdelay, nohide, rw, sync, no_root_squash, no_all_squash I have this so far: mmnfs export add /gpfs/ltfsee/ --client XX.XX.XX.XX ( Access_Type=RW, Squash=no_root_squash,noidsquash, NFS_COMMIT=true ) So the only arguments that don't appear accounted for is the 'nohide' parameter. Does this look right? SHAUN ANDERSON STORAGE ARCHITECT O 208.577.2112 M 214.263.7014 NOTICE: This email message and any attachments here to may contain confidential information. Any unauthorized review, use, disclosure, or distribution of such information is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy the original message and all copies of it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Feb 24 19:31:08 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 24 Feb 2017 14:31:08 -0500 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Fri Feb 24 19:39:30 2017 From: oehmes at gmail.com (Sven Oehme) Date: Fri, 24 Feb 2017 19:39:30 +0000 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: its more likely you run out of verbsRdmasPerNode which is the top limit across all connections for a given node. Sven On Fri, Feb 24, 2017 at 11:31 AM Aaron Knister wrote: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
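To see which ceiling the waiters are actually hitting on an NSD server, the limits and the waiter counts can be pulled from mmdiag. The parameter names below are the ones quoted in the log excerpt above; the node names and the value 64 are only placeholders:

# Current verbs RDMA limits as the daemon sees them
/usr/lpp/mmfs/bin/mmdiag --config | grep -i verbsRdmas

# How many waiters are currently queued on this condition?
/usr/lpp/mmfs/bin/mmdiag --waiters | grep -c "conn rdmas < conn maxrdmas"

# Example of raising the per-connection limit on selected NSD servers; verbs
# settings generally only take effect after mmfsd is restarted on those nodes
/usr/lpp/mmfs/bin/mmchconfig verbsRdmasPerConnection=64 -N nsdserver1,nsdserver2

If the per-node total (verbsRdmasPerNode) is the limit being hit, as Sven suggests, raising only the per-connection value will not help; on code levels with verbsRdmasPerNodeOptimize enabled the node limit is already sized automatically.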
URL: From Wei1.Guo at UTSouthwestern.edu Fri Feb 24 23:10:07 2017 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Fri, 24 Feb 2017 23:10:07 +0000 Subject: [gpfsug-discuss] Hardening sudo wrapper? In-Reply-To: References: Message-ID: <1487977807260.32706@UTSouthwestern.edu> As per the knowledge page suggested (https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_configsudo.htm), a sudo wapper can work around with PermitRootLogin no. However, giving sudo right to a gpfsadmin account with /usr/bin/scp could be dangerous in the case of this gpfsadmin account been compromised. eg. [gpfsadmin at adminNode ~] $ sudo /usr/bin/scp `/bin/echo /dev/random` /path/to/any_important_files.txt Is it possible to remove scp from the sudoers commands? Instead of the recommended here, # Allow members of the gpfs group to run all commands but only selected commands without a password: %gpfsadmin ALL=(ALL) PASSWD: ALL, NOPASSWD: /usr/lpp/mmfs/bin/mmremote, /usr/bin/scp, /bin/echo, /usr/lpp/mmfs/bin/mmsdrrestore We would like to have this line like this: # Disabled command alias Cmnd_alias MMDELCMDS = /usr/lpp/mmfs/bin/mmdeldisk, /usr/lpp/mmfs/bin/mmdelfileset, /usr/lpp/mmfs/bin/mmdelfs, /usr/lpp/mmfs/bin/mmdelnsd, /usr/lpp/mmfs/bin/mmdelsnapshot %gpfsadmin ALL=(root : gpfsadmin) NOPASSWD: /bin/echo, /usr/lpp/mmfs/bin/?, !MMDELCMDS In this case, we limit the gpfsadmin group user to run only selected mm commands, also not including /usr/bin/scp. In the event of system breach, by loosing gpfsadmin group user account, scp will overwrite system config / user data. From my initial test, this seems to be OK for basic admin commands (such as mmstartup, mmshutdown, mmrepquota, mmchfs), but it did not pass the mmcommon test scpwrap command. ?[gpfsadmin at adminNode ~]$ sudo /usr/lpp/mmfs/bin/mmcommon test scpwrap node1 sudo: no tty present and no askpass program specified lost connection mmcommon: Remote copy file command "/usr/lpp/mmfs/bin/scpwrap" failed (push operation). Return code is 1. mmcommon test scpwrap: Command failed. Examine previous error messages to determine cause. [gpfsadmin at adminNode ~]$ sudo /usr/lpp/mmfs/bin/mmcommon test sshwrap node1 mmcommon test sshwrap: Command successfully completed It is unclear to me now that what exactly does the scp do in the sudo wrapper in the GPFS 4.2.0 version as per Yuri Volobuev's note GPFS and Remote Shell (https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20and%20Remote%20Shell). Will the mmsdrrestore still use scp or rcp to copy the cluster configuration file mmsdrfs around from the central node? Or it uses RPC to synchronize? Are we OK to drop scp/rcp and limit the commands to run? Is there any risk, security wise and performance wise? Can we limit the gpfsadmin account to a very very small level of privilege? I have send this message to gpfs at us.ibm.com and posted at developer works, but I think the answer could benefit other users. 
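One way to sanity-check a locked-down sudoers entry of the kind described above before it goes anywhere near /etc/sudoers is to validate a candidate drop-in with visudo and then inspect what the group can actually run. The group name, alias name, user name and temporary file below are illustrative only, and whether scp can really be dropped depends on the code level (mmsdrrestore and configuration propagation may still want it), so this belongs on a lab cluster first:

# Write a candidate rule set and check the syntax without touching /etc/sudoers
cat > /tmp/gpfsadmin.sudoers <<'EOF'
Cmnd_Alias GPFS_DEL_CMDS = /usr/lpp/mmfs/bin/mmdeldisk, /usr/lpp/mmfs/bin/mmdelfileset, \
                           /usr/lpp/mmfs/bin/mmdelfs, /usr/lpp/mmfs/bin/mmdelnsd, \
                           /usr/lpp/mmfs/bin/mmdelsnapshot
%gpfsadmin ALL=(root) NOPASSWD: /bin/echo, /usr/lpp/mmfs/bin/, !GPFS_DEL_CMDS
EOF
visudo -cf /tmp/gpfsadmin.sudoers

# What a member of the group would be allowed to run once the rules are installed
sudo -l -U gpfsadmin_user

# Re-run the wrapper self-tests after any sudoers change
sudo /usr/lpp/mmfs/bin/mmcommon test sshwrap node1
sudo /usr/lpp/mmfs/bin/mmcommon test scpwrap node1

Keep in mind that sudo command negation is advisory at best: a user who may run everything in a directory can usually copy a forbidden binary elsewhere, so the directory whitelist plus !alias pattern prevents accidents more than it stops a determined attacker.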
Thanks Wei Guo ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Friday, February 24, 2017 1:39 PM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 61, Issue 46 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. NFS Permission matchup to mmnfs command (Shaun Anderson) 2. Re: waiting for conn rdmas < conn maxrdmas (Aaron Knister) 3. Re: waiting for conn rdmas < conn maxrdmas (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Fri, 24 Feb 2017 16:58:34 +0000 From: Shaun Anderson To: gpfsug main discussion list Subject: [gpfsug-discuss] NFS Permission matchup to mmnfs command Message-ID: <1487955513211.95497 at convergeone.com> Content-Type: text/plain; charset="iso-8859-1" I have a customer currently using native NFS and we are going to move them over the CES. I'm looking at the mmnfs command and trying to map the nfs export arguments with the CES arguments. My customer has these currently: no_wdelay, nohide, rw, sync, no_root_squash, no_all_squash I have this so far: mmnfs export add /gpfs/ltfsee/ --client XX.XX.XX.XX ( Access_Type=RW, Squash=no_root_squash,noidsquash, NFS_COMMIT=true ) So the only arguments that don't appear accounted for is the 'nohide' parameter. Does this look right? SHAUN ANDERSON STORAGE ARCHITECT O 208.577.2112 M 214.263.7014 NOTICE: This email message and any attachments here to may contain confidential information. Any unauthorized review, use, disclosure, or distribution of such information is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy the original message and all copies of it. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Fri, 24 Feb 2017 14:31:08 -0500 From: Aaron Knister To: Subject: Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: Content-Type: text/plain; charset="windows-1252"; format=flowed Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ------------------------------ Message: 3 Date: Fri, 24 Feb 2017 19:39:30 +0000 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: Content-Type: text/plain; charset="utf-8" its more likely you run out of verbsRdmasPerNode which is the top limit across all connections for a given node. Sven On Fri, Feb 24, 2017 at 11:31 AM Aaron Knister wrote: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 61, Issue 46 ********************************************** ________________________________ UT Southwestern Medical Center The future of medicine, today. From service at metamodul.com Mon Feb 27 10:22:48 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 27 Feb 2017 11:22:48 +0100 (CET) Subject: [gpfsug-discuss] Q: backup with dsmc & .snapshots directory Message-ID: <459383319.282012.1488190969081@email.1und1.de> An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Feb 27 11:13:59 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 27 Feb 2017 11:13:59 +0000 Subject: [gpfsug-discuss] Q: backup with dsmc & .snapshots directory In-Reply-To: <459383319.282012.1488190969081@email.1und1.de> Message-ID: I usually exclude them. Otherwise you will end up with lots of data on the TSM backend. -- Cheers > On 27 Feb 2017, at 12.23, Hans-Joachim Ehlers wrote: > > Hi, > > short question: if we are using the native TSM dsmc Client, should we exclude the "./.snapshots/." directory from the backup or is it best practise to backup the .snapshots as well. > > Note: We DO NOT use a dedicated .snapshots directory for backups right now. The snapshots directory is created by a policy which is not adapted for TSM so the snapshot creation and deletion is not synchronized with TSM. In the near future we might use dedicated .snapshots for the backup. > > tia > > Hajo > > - > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 > Mail: service at metamodul.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 11:30:15 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 11:30:15 +0000 Subject: [gpfsug-discuss] Tracking deleted files Message-ID: Hi, Is there a way to track files which have been deleted easily? I'm assuming that we can't easily use a policy scan as they files are no longer in the file-system unless we do some sort of diff? I'm assuming there must be a way of doing this as mmbackup must track deleted files to notify TSM of expired objects. Basically I want a list of new files, changed files and deleted files since a certain time. I'm assuming the first two will be relatively simple with a policyscan, but the latter I'm not sure about. Thanks Simon From jtucker at pixitmedia.com Mon Feb 27 11:59:44 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 27 Feb 2017 11:59:44 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Hi Simon I presented exactly this (albeit briefly) at the 2016 UG. 
See the snapdiff section of the presentation at: http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf We can track creations, modifications, deletions and moves (from, to) for files and directories between one point in time and another. The selections can be returned via a manner of your choice. If anyone wants to know more, hit me up directly. Incidentally - I will be at BVE this week (http://www.bvexpo.com/) showing new things driven by the Python API and GPFS - so if anyone is in the area and wants to chat about technicals in person rather than on mail, drop me a line and we can sort that out. Best, Jez On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT Services) wrote: > Hi, > > Is there a way to track files which have been deleted easily? I'm assuming > that we can't easily use a policy scan as they files are no longer in the > file-system unless we do some sort of diff? > > I'm assuming there must be a way of doing this as mmbackup must track > deleted files to notify TSM of expired objects. > > Basically I want a list of new files, changed files and deleted files > since a certain time. I'm assuming the first two will be relatively simple > with a policyscan, but the latter I'm not sure about. > > Thanks > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Mon Feb 27 12:00:54 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 27 Feb 2017 13:00:54 +0100 (CET) Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: <783766399.287097.1488196854922@email.1und1.de> An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 12:39:02 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 12:39:02 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Yeah but that uses snapshots, which is pretty heavy-weight for what I want to do, particularly given mmbackup seems to have a way of tracking deletes... Simon From: > on behalf of Jez Tucker > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Monday, 27 February 2017 at 11:59 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files Hi Simon I presented exactly this (albeit briefly) at the 2016 UG. See the snapdiff section of the presentation at: http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf We can track creations, modifications, deletions and moves (from, to) for files and directories between one point in time and another. The selections can be returned via a manner of your choice. 
If anyone wants to know more, hit me up directly. Incidentally - I will be at BVE this week (http://www.bvexpo.com/) showing new things driven by the Python API and GPFS - so if anyone is in the area and wants to chat about technicals in person rather than on mail, drop me a line and we can sort that out. Best, Jez On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT Services) > wrote: Hi, Is there a way to track files which have been deleted easily? I'm assuming that we can't easily use a policy scan as they files are no longer in the file-system unless we do some sort of diff? I'm assuming there must be a way of doing this as mmbackup must track deleted files to notify TSM of expired objects. Basically I want a list of new files, changed files and deleted files since a certain time. I'm assuming the first two will be relatively simple with a policyscan, but the latter I'm not sure about. Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- [http://www.pixitmedia.com/sig/pxone_pt1.png][http://www.pixitmedia.com/sig/pxone_pt2.png][http://www.pixitmedia.com/sig/pxone_pt3.png][http://www.pixitmedia.com/sig/pxone_pt4.png] [http://pixitmedia.com/sig/BVE-Banner4.png] This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtucker at pixitmedia.com Mon Feb 27 13:11:59 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 27 Feb 2017 13:11:59 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Hi Whilst it does use snapshots, I'd argue that snapshot creation is pretty lightweight - and always consistent. Your alternative via the mmbackup 'tracking' route is to parse out the mmbackup shadow file. AFAIK to do this /properly in a timely fashion/ you'd need to do this as an inline post process after the scan phase of mmbackup has run, else you're instead looking at the outdated view of the shadow file post previous mmbackup run. mmbackup does not 'track' file changes, it performs a comparison pass between the filesystem contents and what TSM _believes_ is the known state of the file system during each run. If a change is made oob of TSM then you need to re-generate the show file to regain total consistency. Sensibly you should be running any mmbackup process from a snapshot to perform consistent backups without dsmc errors. So all things being equal, using snapshots for exact consistency and not having to regenerate (very heavyweight) or parse out a shadow file periodically is a lighter weight, smoother and reliably consistent workflow. YMMV with either approach depending on your management of TSM and your interpretation of 'consistent view' vs 'good enough'. 
Jez On Mon, 27 Feb 2017 at 12:39, Simon Thompson (Research Computing - IT Services) wrote: > Yeah but that uses snapshots, which is pretty heavy-weight for what I want > to do, particularly given mmbackup seems to have a way of tracking > deletes... > > Simon > > From: on behalf of Jez Tucker < > jtucker at pixitmedia.com> > Reply-To: "gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > Date: Monday, 27 February 2017 at 11:59 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files > > Hi Simon > > I presented exactly this (albeit briefly) at the 2016 UG. > > See the snapdiff section of the presentation at: > > > http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf > > We can track creations, modifications, deletions and moves (from, to) for > files and directories between one point in time and another. > > The selections can be returned via a manner of your choice. > > If anyone wants to know more, hit me up directly. > > Incidentally - I will be at BVE this week (http://www.bvexpo.com/) > showing new things driven by the Python API and GPFS - so if anyone is in > the area and wants to chat about technicals in person rather than on mail, > drop me a line and we can sort that out. > > Best, > > Jez > > > On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT > Services) wrote: > > Hi, > > Is there a way to track files which have been deleted easily? I'm assuming > that we can't easily use a policy scan as they files are no longer in the > file-system unless we do some sort of diff? > > I'm assuming there must be a way of doing this as mmbackup must track > deleted files to notify TSM of expired objects. > > Basically I want a list of new files, changed files and deleted files > since a certain time. I'm assuming the first two will be relatively simple > with a policyscan, but the latter I'm not sure about. > > Thanks > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other person. > Please notify the sender immediately and delete this email from your > computer system. Any opinions expressed are not necessarily those of the > company from which this email was sent and, whilst to the best of our > knowledge no viruses or defects exist, no responsibility can be accepted > for any loss or damage arising from its receipt or subsequent use of this > email. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. 
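For the "some sort of diff" idea Simon raises at the top of this thread, a minimal sketch of the two-policy-scan approach (the file system name gpfs and the paths below are placeholders; with -I defer the list rule should write its records, roughly the inode number plus the path, to <prefix>.list.<listname> instead of handing them to an external script):

  /* /tmp/listall.pol: list every file; narrow it with a WHERE clause if needed */
  RULE EXTERNAL LIST 'allfiles' EXEC ''
  RULE 'listall' LIST 'allfiles'

  # one scan per day; today's records land in /tmp/scan.YYYY-MM-DD.list.allfiles
  mmapplypolicy gpfs -P /tmp/listall.pol -f /tmp/scan.$(date +%F) -I defer

  # deleted files are the records present in the older list but missing from the newer one
  sort /tmp/scan.2017-02-26.list.allfiles > /tmp/old.sorted
  sort /tmp/scan.2017-02-27.list.allfiles > /tmp/new.sorted
  comm -23 /tmp/old.sorted /tmp/new.sorted > /tmp/deleted.since.yesterday

Sorting and using comm rather than plain diff is essentially the set-difference approach Marc describes later in this thread; note that a renamed file shows up here as a deletion plus a new file.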
-------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Feb 27 13:25:21 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 27 Feb 2017 13:25:21 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: <1488201921.4074.114.camel@buzzard.me.uk> On Mon, 2017-02-27 at 12:39 +0000, Simon Thompson (Research Computing - IT Services) wrote: > Yeah but that uses snapshots, which is pretty heavy-weight for what I > want to do, particularly given mmbackup seems to have a way of > tracking deletes... > It has been discussed in the past, but the way to track stuff is to enable HSM and then hook into the DSMAPI. That way you can see all the file creates and deletes "live". I can't however find a reference to it now. I have a feeling it was in the IBM GPFS forum however. It would however require you to get your hands dirty writing code. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From luis.bolinches at fi.ibm.com Mon Feb 27 13:25:15 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 27 Feb 2017 13:25:15 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 13:32:42 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 13:32:42 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: <1488201921.4074.114.camel@buzzard.me.uk> References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: >It has been discussed in the past, but the way to track stuff is to >enable HSM and then hook into the DSMAPI. That way you can see all the >file creates and deletes "live". Won't work, I already have a "real" HSM client attached to DMAPI (dsmrecalld). I'm not actually wanting to backup for this use case, we already have mmbackup running to do those things, but it was a list of deleted files that I was after (I just thought it might be easy given mmbackup is tracking it already). Simon From oehmes at gmail.com Mon Feb 27 13:37:46 2017 From: oehmes at gmail.com (Sven Oehme) Date: Mon, 27 Feb 2017 13:37:46 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: <1488201921.4074.114.camel@buzzard.me.uk> References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: a couple of years ago tridge demonstrated things you can do with DMAPI interface and even delivered some non supported example code to demonstrate it : https://www.samba.org/~tridge/hacksm/ keep in mind that the DMAPI interface has some severe limitations in terms of scaling, it can only run on one node and can have only one subscriber. we are working on a more scalable and supported solution to accomplish what is asks for (track operations, not just delete) , stay tuned in one of the next user group meetings where i will present (Germany and/or London). Sven On Mon, Feb 27, 2017 at 5:25 AM Jonathan Buzzard wrote: > On Mon, 2017-02-27 at 12:39 +0000, Simon Thompson (Research Computing - > IT Services) wrote: > > Yeah but that uses snapshots, which is pretty heavy-weight for what I > > want to do, particularly given mmbackup seems to have a way of > > tracking deletes... > > > > It has been discussed in the past, but the way to track stuff is to > enable HSM and then hook into the DSMAPI. That way you can see all the > file creates and deletes "live". 
> > I can't however find a reference to it now. I have a feeling it was in > the IBM GPFS forum however. > > It would however require you to get your hands dirty writing code. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 13:41:47 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 13:41:47 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: Manchester ... The UK meeting is most likely going to be in Manchester ... 9th/10th May if you wanted to pencil something in (we're just waiting for final confirmation of the venue being booked). Simon From: > on behalf of Sven Oehme > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Monday, 27 February 2017 at 13:37 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files we are working on a more scalable and supported solution to accomplish what is asks for (track operations, not just delete) , stay tuned in one of the next user group meetings where i will present (Germany and/or London). -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Mon Feb 27 13:55:26 2017 From: stef.coene at docum.org (Stef Coene) Date: Mon, 27 Feb 2017 14:55:26 +0100 Subject: [gpfsug-discuss] Policy question Message-ID: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> Hi, I have a file system with 2 pools: V500001 and NAS01. I want to use pool V500001 as the default and migrate the oldest files to the pool NAS01 when the pool V500001 fills up. Whatever rule combination I tried, I can not get this working. This is the currently defined policy (created by the GUI): RULE 'Migration' MIGRATE FROM POOL 'V500001' THRESHOLD(95,85) WEIGHT(100000 - DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) TO POOL 'NAS01' RULE 'Default to V5000' SET POOL 'V500001' And also, how can I monitor the migration processes? Stef From makaplan at us.ibm.com Mon Feb 27 16:00:24 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 27 Feb 2017 11:00:24 -0500 Subject: [gpfsug-discuss] Policy questions In-Reply-To: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> References: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> Message-ID: I think you have the sign wrong on your weight. A simple way of ordering the files oldest first is WEIGHT(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) adding 100,000 does nothing to change the order. WEIGHT can be any numeric SQL expression. So come to think of it WEIGHT( - DAYS(ACCESS_TIME) ) is even simpler and will yield the same ordering Also, you must run or schedule the mmapplypolicy command to run to actually do the migration. It doesn't happen until the mmapplypolicy command is running. You can run mmapplypolicy periodically (e.g. 
with crontab) or on demand with mmaddcallback (GPFS events facility) This is all covered in the very fine official Spectrum Scale documentation and/or some of the supplemental IBM red books, all available for free downloads from ibm.com --marc of GPFS From: Stef Coene To: gpfsug main discussion list Date: 02/27/2017 08:55 AM Subject: [gpfsug-discuss] Policy question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, I have a file system with 2 pools: V500001 and NAS01. I want to use pool V500001 as the default and migrate the oldest files to the pool NAS01 when the pool V500001 fills up. Whatever rule combination I tried, I can not get this working. This is the currently defined policy (created by the GUI): RULE 'Migration' MIGRATE FROM POOL 'V500001' THRESHOLD(95,85) WEIGHT(100000 - DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) TO POOL 'NAS01' RULE 'Default to V5000' SET POOL 'V500001' And also, how can I monitor the migration processes? Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 27 19:40:57 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 19:40:57 +0000 Subject: [gpfsug-discuss] SMB and AD authentication Message-ID: For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
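Putting Marc's weight expression back into Stef's rules, and wiring in the run itself, would look roughly like this (the file system name fs01 and the path /tmp/migrate.pol are placeholders; the pool names are Stef's):

  /* /tmp/migrate.pol: drain V500001 to NAS01, least recently accessed files first */
  RULE 'Migration'
    MIGRATE FROM POOL 'V500001' THRESHOLD(95,85)
    WEIGHT(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME))
    TO POOL 'NAS01'
  RULE 'Default to V5000' SET POOL 'V500001'

  mmchpolicy fs01 /tmp/migrate.pol                      # install the rules (the SET POOL placement rule needs this)
  mmapplypolicy fs01 -P /tmp/migrate.pol -I test -L 2   # dry run; -L 2 or higher shows which files each rule would pick
  mmapplypolicy fs01 -P /tmp/migrate.pol -L 1           # real run; schedule via cron or an mmaddcallback lowDiskSpace callback
  mmdf fs01                                             # per-pool capacity, to watch V500001 drain while a run is active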
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From YARD at il.ibm.com Mon Feb 27 19:46:07 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 27 Feb 2017 21:46:07 +0200 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 8745 bytes Desc: not available URL: From laurence at qsplace.co.uk Mon Feb 27 19:46:59 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 27 Feb 2017 19:46:59 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Do you have UID/GID for the user in your AD schema? or the rfc 2307 extended schema? AFAIK it uses winbinds IDMAP so requires rfc 2307 attributes rather than using the windows SID and working the UID/GID using autorid etc. -- Lauz On 27 February 2017 19:40:57 GMT+00:00, "Mark.Bush at siriuscom.com" wrote: >For some reason, I just can?t seem to get this to work. I have >configured my protocol nodes to authenticate to AD using the following > >mmuserauth service create --type ad --data-access-method file --servers >192.168.88.3 --user-name administrator --netbios-name scale >--idmap-role master --password ********* --idmap-range-size 1000000 >--idmap-range 10000000-299999999 --enable-nfs-kerberos >--unixmap-domains 'sirius(10000-20000)' > > >All goes well, I see the nodes in AD and all of the wbinfo commands >show good (id Sirius\\administrator doesn?t work though), but when I >try to mount an SMB share (after doing all the necessary mmsmb export >stuff) I get permission denied. I?m curious if I missed a step >(followed the docs pretty much to the letter). I?m trying >Administrator, mark.bush, and a dummy aduser I created. None seem to >gain access to the share. > >Protocol gurus help! Any ideas are appreciated. > > >[id:image001.png at 01D2709D.6EF65720] >Mark R. Bush| Storage Architect >Mobile: 210-237-8415 >Twitter: @bushmr | LinkedIn: >/markreedbush >10100 Reunion Place, Suite 500, San Antonio, TX 78216 >www.siriuscom.com >|mark.bush at siriuscom.com > > >This message (including any attachments) is intended only for the use >of the individual or entity to which it is addressed and may contain >information that is non-public, proprietary, privileged, confidential, >and exempt from disclosure under applicable law. If you are not the >intended recipient, you are hereby notified that any use, >dissemination, distribution, or copying of this communication is >strictly prohibited. This message may be viewed by parties at Sirius >Computer Solutions other than those named in the message header. This >message does not contain an official representation of Sirius Computer >Solutions. If you have received this communication in error, notify >Sirius Computer Solutions immediately and (i) destroy this message if a >facsimile or (ii) delete this message immediately if this is an >electronic communication. Thank you. > >Sirius Computer Solutions -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Mark.Bush at siriuscom.com Mon Feb 27 19:50:17 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 19:50:17 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: [root at n1 ~]# mmsmb export list share2 export path browseable guest ok smb encrypt share2 /gpfs/fs1/sales yes no auto [root at n1 ~]# ls -l /gpfs/fs1 total 0 drwxrwxrwx 2 root root 4096 Feb 25 12:33 sales From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 27, 2017 at 1:46 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] SMB and AD authentication Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D29100.6E55CCF0] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. [cid:image002.png at 01D29100.6E55CCF0] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 8746 bytes Desc: image002.png URL: From christof.schmitt at us.ibm.com Mon Feb 27 19:59:46 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 27 Feb 2017 12:59:46 -0700 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: --unixmap-domains 'sirius(10000-20000)' specifies that for the domain SIRIUS, all uid and gids are stored as rfc2307 attributes in the user and group objects in AD. If "id Sirius\\administrator" does not work, that might already point to missing data in AD. The requirement is that the user has a uidNumber defined, and the user's primary group in AD has to have a gidNumber defined. Note that a gidNumber defined for the user is not read by Spectrum Scale at this point. All uidNumber and gidNumber attributes have to fall in the defined range (10000-20000). If verifying the above points does not help, then a winbindd trace might help to point to the missing step: /usr/lpp/mmfs/bin/smbcontrol winbindd debug 10 id Sirius\\administrator /usr/lpp/mmfs/bin/smbcontrol winbindd debug 1 /var/adm/ras/log.winbindd-idmap is the log file for the idmap queries; it might show a failing ldap query in this case. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 12:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. 
If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From YARD at il.ibm.com Mon Feb 27 20:04:09 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 27 Feb 2017 22:04:09 +0200 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Hi What does the command return when you run it on the protocols nodes: #id 'DOM\user' Please follow this steps: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/ibmspectrumscale42_content.html SA23-1452-06 05/2016 IBM Spectrum Scale V4.2: Administration and Programming Reference Page - 135 Creating SMB share Use the following information to create an SMB share: 1. Create the directory to be exported through SMB: mmcrfileset fs01 fileset --inode-space=new mmlinkfileset fs01 fileset -J /gpfs/fs01/fileset mkdir /gpfs/fs01/fileset/smb Note: IBM recommends an independent fileset for SMB shares. Create a new independent fileset with these commands: mmcrfileset fs01 fileset --inode-space=new mmlinkfileset fs01 fileset -J /gpfs/fs01/fileset If the directory to be exported does not exist, create the directory first by running the following command: mkdir /gpfs/fs01/fileset/smb" 2. The recommended approach for managing access to the SMB share is to manage the ACLs from a Windows client machine. To change the ACLs from a Windows client, change the owner of the share folder to a user ID that will be used to make the ACL changes by running the following command: chown ?DOMAIN\smbadmin? /gpfs/fs01/fileset/smb 3. Create the actual SMB share on the existing directory: mmsmb export add smbexport /gpfs/fs01/fileset/smb Additional options can be set during share creation. For the documentation of all supported options, see ?mmsmb command? on page 663. 4. Verify that the share has been created: mmsmb export list 5. Access the share from a Windows client using the user ID that has been previously made the owner of the folder. 6. Right-click the folder in the Windows Explorer, open the Security tab, click Advanced, and modify the Access Control List as required. Note: An SMB share can only be created when the ACL setting of the underlying file system is -k nfsv4. In all other cases, mmsmb export create will fail with an error. See ?Authorizing protocol users? 
on page 200 for details and limitations Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:50 PM Subject: Re: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org [root at n1 ~]# mmsmb export list share2 export path browseable guest ok smb encrypt share2 /gpfs/fs1/sales yes no auto [root at n1 ~]# ls -l /gpfs/fs1 total 0 drwxrwxrwx 2 root root 4096 Feb 25 12:33 sales From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 27, 2017 at 1:46 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] SMB and AD authentication Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 8746 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Mon Feb 27 20:12:23 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 20:12:23 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: That was it. I just didn?t have the ScaleUsers group (special AD group I created) set as AD user Sirius\mark.bush?s primary group. Once I did that bam?shares show up and I can view and id works too. Thanks Christof. On 2/27/17, 1:59 PM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Christof Schmitt" wrote: --unixmap-domains 'sirius(10000-20000)' specifies that for the domain SIRIUS, all uid and gids are stored as rfc2307 attributes in the user and group objects in AD. If "id Sirius\\administrator" does not work, that might already point to missing data in AD. The requirement is that the user has a uidNumber defined, and the user's primary group in AD has to have a gidNumber defined. Note that a gidNumber defined for the user is not read by Spectrum Scale at this point. All uidNumber and gidNumber attributes have to fall in the defined range (10000-20000). If verifying the above points does not help, then a winbindd trace might help to point to the missing step: /usr/lpp/mmfs/bin/smbcontrol winbindd debug 10 id Sirius\\administrator /usr/lpp/mmfs/bin/smbcontrol winbindd debug 1 /var/adm/ras/log.winbindd-idmap is the log file for the idmap queries; it might show a failing ldap query in this case. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 12:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. 
None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions From ewahl at osc.edu Mon Feb 27 20:50:49 2017 From: ewahl at osc.edu (Edward Wahl) Date: Mon, 27 Feb 2017 15:50:49 -0500 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: <20170227155049.22001bb0@osc.edu> I can think of a couple of ways to do this. But using snapshots seems heavy, but so does using mmbackup unless you are already running it every day. Diff the shadow files? Haha could be a _terrible_ idea if you have a couple hundred million files. But it IS possible. Next, I'm NOT a tsm expert, but I know a bit about it: (and I probably stayed at a Holiday Inn express at least once in my heavy travel days) -query objects using '-ina=yes' and yesterdays date? Might be a touch slow. But it probably uses the next one as it's backend: -db2 query inside TSM to see a similar thing. This ought to be the fastest, and I'm sure with a little google'ing you can work this out. Tivoli MUST know exact dates of deletion as it uses that and the retention time to know when to purge/reclaim deleted objects from it's storage pools. 
(retain extra version or RETEXTRA or retain only version) Ed On Mon, 27 Feb 2017 13:32:42 +0000 "Simon Thompson (Research Computing - IT Services)" wrote: > >It has been discussed in the past, but the way to track stuff is to > >enable HSM and then hook into the DSMAPI. That way you can see all the > >file creates and deletes "live". > > Won't work, I already have a "real" HSM client attached to DMAPI > (dsmrecalld). > > I'm not actually wanting to backup for this use case, we already have > mmbackup running to do those things, but it was a list of deleted files > that I was after (I just thought it might be easy given mmbackup is > tracking it already). > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From makaplan at us.ibm.com Mon Feb 27 21:23:52 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 27 Feb 2017 16:23:52 -0500 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: <20170227155049.22001bb0@osc.edu> References: <1488201921.4074.114.camel@buzzard.me.uk> <20170227155049.22001bb0@osc.edu> Message-ID: Diffing file lists can be fast - IF you keep the file lists sorted by a unique key, e.g. the inode number. I believe that's how mmbackup does it. Use the classic set difference algorithm. Standard diff is designed to do something else and is terribly slow on large file lists. From: Edward Wahl To: "Simon Thompson (Research Computing - IT Services)" Cc: gpfsug main discussion list Date: 02/27/2017 03:51 PM Subject: Re: [gpfsug-discuss] Tracking deleted files Sent by: gpfsug-discuss-bounces at spectrumscale.org I can think of a couple of ways to do this. But using snapshots seems heavy, but so does using mmbackup unless you are already running it every day. Diff the shadow files? Haha could be a _terrible_ idea if you have a couple hundred million files. But it IS possible. Next, I'm NOT a tsm expert, but I know a bit about it: (and I probably stayed at a Holiday Inn express at least once in my heavy travel days) -query objects using '-ina=yes' and yesterdays date? Might be a touch slow. But it probably uses the next one as it's backend: -db2 query inside TSM to see a similar thing. This ought to be the fastest, and I'm sure with a little google'ing you can work this out. Tivoli MUST know exact dates of deletion as it uses that and the retention time to know when to purge/reclaim deleted objects from it's storage pools. (retain extra version or RETEXTRA or retain only version) Ed On Mon, 27 Feb 2017 13:32:42 +0000 "Simon Thompson (Research Computing - IT Services)" wrote: > >It has been discussed in the past, but the way to track stuff is to > >enable HSM and then hook into the DSMAPI. That way you can see all the > >file creates and deletes "live". > > Won't work, I already have a "real" HSM client attached to DMAPI > (dsmrecalld). > > I'm not actually wanting to backup for this use case, we already have > mmbackup running to do those things, but it was a list of deleted files > that I was after (I just thought it might be easy given mmbackup is > tracking it already). 
> > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Ed Wahl Ohio Supercomputer Center 614-292-9302 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Mon Feb 27 22:13:46 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 27 Feb 2017 23:13:46 +0100 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: <1488201921.4074.114.camel@buzzard.me.uk> <20170227155049.22001bb0@osc.edu> Message-ID: AFM apparently keeps track og this, so maybe it would be possible to run AFM-SW with disconnected home and query the queue of changes? But would require some way of clearing the queue as well.. -jf On Monday, February 27, 2017, Marc A Kaplan wrote: > Diffing file lists can be fast - IF you keep the file lists sorted by a > unique key, e.g. the inode number. > I believe that's how mmbackup does it. Use the classic set difference > algorithm. > > Standard diff is designed to do something else and is terribly slow on > large file lists. > > > > From: Edward Wahl > > To: "Simon Thompson (Research Computing - IT Services)" < > S.J.Thompson at bham.ac.uk > > > Cc: gpfsug main discussion list > > Date: 02/27/2017 03:51 PM > Subject: Re: [gpfsug-discuss] Tracking deleted files > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------ > > > > I can think of a couple of ways to do this. But using snapshots seems > heavy, > but so does using mmbackup unless you are already running it every day. > > Diff the shadow files? Haha could be a _terrible_ idea if you have a > couple > hundred million files. But it IS possible. > > > Next, I'm NOT a tsm expert, but I know a bit about it: (and I probably > stayed > at a Holiday Inn express at least once in my heavy travel days) > > -query objects using '-ina=yes' and yesterdays date? Might be a touch > slow. But > it probably uses the next one as it's backend: > > -db2 query inside TSM to see a similar thing. This ought to be the > fastest, > and I'm sure with a little google'ing you can work this out. Tivoli MUST > know > exact dates of deletion as it uses that and the retention time to know > when to purge/reclaim deleted objects from it's storage pools. > (retain extra version or RETEXTRA or retain only version) > > Ed > > On Mon, 27 Feb 2017 13:32:42 +0000 > "Simon Thompson (Research Computing - IT Services)" < > S.J.Thompson at bham.ac.uk > > > wrote: > > > >It has been discussed in the past, but the way to track stuff is to > > >enable HSM and then hook into the DSMAPI. That way you can see all the > > >file creates and deletes "live". > > > > Won't work, I already have a "real" HSM client attached to DMAPI > > (dsmrecalld). > > > > I'm not actually wanting to backup for this use case, we already have > > mmbackup running to do those things, but it was a list of deleted files > > that I was after (I just thought it might be easy given mmbackup is > > tracking it already). 
> > > > Simon > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Tue Feb 28 08:44:26 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Tue, 28 Feb 2017 09:44:26 +0100 (CET) Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster Message-ID: <1031275380.310791.1488271466638@email.1und1.de> An HTML attachment was scrubbed... URL: From ashish.thandavan at cs.ox.ac.uk Tue Feb 28 16:10:44 2017 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Tue, 28 Feb 2017 16:10:44 +0000 Subject: [gpfsug-discuss] mmbackup logging issue Message-ID: Dear all, We have a small GPFS cluster and a separate server running TSM and one of the three NSD servers backs up our GPFS filesystem to the TSM server using mmbackup. After a recent upgrade from v3.5 to 4.1.1, we've noticed that mmbackup no longer logs stuff like it used to : ... Thu Jan 19 05:45:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 2 failed. Thu Jan 19 06:15:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 3 failed. Thu Jan 19 06:45:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 3 failed. ... instead of ... Sat Dec 3 12:01:00 2016 mmbackup:Backing up files: 105030 backed up, 635456 expired, 30 failed. Sat Dec 3 12:31:00 2016 mmbackup:Backing up files: 205934 backed up, 635456 expired, 57 failed. Sat Dec 3 13:01:00 2016 mmbackup:Backing up files: 321702 backed up, 635456 expired, 169 failed. ... like it used to pre-upgrade. I am therefore unable to see how far long it has got, and indeed if it completed successfully, as this is what it logs at the end of a job : ... Tue Jan 17 18:07:31 2017 mmbackup:Completed policy backup run with 0 policy errors, 10012 files failed, 0 severe errors, returning rc=9. Tue Jan 17 18:07:31 2017 mmbackup:Policy for backup returned 9 Highest TSM error 12 mmbackup: TSM Summary Information: Total number of objects inspected: 20617273 Total number of objects backed up: 0 Total number of objects updated: 0 Total number of objects rebound: 0 Total number of objects deleted: 0 Total number of objects expired: 1 Total number of objects failed: 10012 Total number of objects encrypted: 0 Total number of bytes inspected: 3821624716861 Total number of bytes transferred: 3712040943672 Tue Jan 17 18:07:31 2017 mmbackup:Audit files /cs/mmbackup.audit.gpfs* contain 0 failed paths but there were 10012 failures. Cannot reconcile shadow database. Unable to compensate for all TSM errors in new shadow database. Preserving previous shadow database. Run next mmbackup with -q to synchronize shadow database. exit 12 If it helps, the mmbackup job is kicked off with the following options : /usr/lpp/mmfs/bin/mmbackup gpfs -n 8 -t full -B 20000 -L 1 --tsm-servers gpfs_weekly_stanza -N glossop1a | /usr/bin/tee /var/log/mmbackup/gpfs_weekly/backup_log.`date +%Y%m%d_%H_%M` (The excerpts above are from the backup_log. file.) Our NSD servers are running GPFS 4.1.1-11, TSM is at 7.1.1.100 and the File system version is 12.06 (3.4.0.3). 
Has anyone else seen this behaviour with mmbackup and if so, found a fix? Thanks, Regards, Ash -- ------------------------- Ashish Thandavan UNIX Support Computing Officer Department of Computer Science University of Oxford Wolfson Building Parks Road Oxford OX1 3QD Phone: 01865 610733 Email: ashish.thandavan at cs.ox.ac.uk From TOMP at il.ibm.com Tue Feb 28 17:08:29 2017 From: TOMP at il.ibm.com (Tomer Perry) Date: Tue, 28 Feb 2017 19:08:29 +0200 Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster In-Reply-To: <1031275380.310791.1488271466638@email.1und1.de> References: <1031275380.310791.1488271466638@email.1und1.de> Message-ID: Hans-Joachim, Since I'm the one that gave this answer...I'll work on adding it to the FAQ. But, in general: 1. The maximum number of "outbound clusters" - meaning "how many clusters can a client join - is limited to 31 ( 32 including the local cluster) 2. The maximum number or "inbound cluster" - meaning "how many clusters can join my cluster) - is not really limited. Thus, since the smallest cluster possible is a single node cluster, it means that 16383 nodes can join my cluster ( 16384 - 1). Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Hans-Joachim Ehlers To: gpfsug main discussion list Date: 28/02/2017 10:44 Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster Sent by: gpfsug-discuss-bounces at spectrumscale.org First thx to all for the support on this list. It is highly appreciated. My new question: i have currently with IBM a discussion about the maximum number of remote clusters mounting GPFS from a local cluster. The answer was that there is almost no limit to the amount of REMOTE clusters accessing a given cluster. From memory I thought there was a limit of 24 remote clusters and the total amount of node must not exceed 16k nodes. The later is described in the GPFS FAQ but about the maximum number of remote cluster accessing a local cluster I could not find anything within the FAQ. So is there a limit of remote clusters accessing a given GPFS cluster or could I really have almost 16k-n(*) remote clusters ( One node cluster ) as long as the max amount of nodes does not exceed the 16K ? (*) n is the amount of local nodes. Maybe this info should be added also to the FAQ ? Info from the FAQ: https://www.ibm.com/support/knowledgecenter/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.pdf Q5.4: What is the current limit on the number of nodes that may concurrently join a cluster? A5.4: As of GPFS V3.4.0.18 and GPFS V3.5.0.5, the total number of nodes that may concurrently join a cluster is limited to a maximum of 16384 nodes. tia Hajo -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Tue Feb 28 17:45:57 2017 From: service at metamodul.com (service at metamodul.com) Date: Tue, 28 Feb 2017 18:45:57 +0100 Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster Message-ID: Thx a lot Perry I never thought about outbound or inbound cluster access. Wish you all the best Hajo --? Unix Systems Engineer MetaModul GmbH +49 177 4393994 -------------- next part -------------- An HTML attachment was scrubbed... URL: From MDIETZ at de.ibm.com Wed Feb 1 09:04:14 2017 From: MDIETZ at de.ibm.com (Mathias Dietz) Date: Wed, 1 Feb 2017 10:04:14 +0100 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: Message-ID: >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. 
~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. 
For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? 
by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Wed Feb 1 09:28:25 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 1 Feb 2017 09:28:25 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: Message-ID: Pmr opened... send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von:"Mathias Dietz" An:"gpfsug main discussion list" Datum:Mi. 01.02.2017 10:05Betreff:Re: [gpfsug-discuss] CES doesn't assign addresses to nodes >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. 
The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von:"Simon Thompson (Research Computing - IT Services)" An:"gpfsug main discussion list" Datum:Di. 31.01.2017 21:07Betreff:Re: [gpfsug-discuss] CES doesn't assign addresses to nodes We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. 
If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? 
Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dave.bond at diamond.ac.uk Thu Feb 2 10:08:06 2017 From: dave.bond at diamond.ac.uk (dave.bond at diamond.ac.uk) Date: Thu, 2 Feb 2017 10:08:06 +0000 Subject: [gpfsug-discuss] GPFS meta data performance monitoring Message-ID: Hello Mailing list, Beyond mmpmon how are people monitoring their metadata performance? There are two parts I imagine to this question, the first being how do you get a detailed snapshot view of performance read and write etc. Then the second is does anyone collate this information for historical graphing, if so thoughts and ideas are very welcome. 
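For reference on the snapshot side of the question, mmpmon itself can be run as a simple poller; a minimal sketch (file names and intervals are arbitrary, and the fs_io_s counters, which include opens, closes, reads, writes, readdir and inode updates per file system, are about as close to metadata rates as mmpmon gets):

echo fs_io_s > /tmp/mmpmon.cmd
# -p gives machine-parseable output, -d is the delay between samples in milliseconds,
# -r is the number of repeats (0 repeats until interrupted)
/usr/lpp/mmfs/bin/mmpmon -i /tmp/mmpmon.cmd -p -d 60000 -r 1440 >> /var/log/mmpmon-fs_io_s.log

The counters are cumulative, so anything graphing them historically has to diff successive samples.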
mmpmon is certainly useful but I would like to dig a little deeper, ideally without turning anything on that could impact stability or performance of a production file system.

Dave (Diamond Light Source)

--
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom

From olaf.weiser at de.ibm.com Thu Feb 2 15:55:44 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 2 Feb 2017 16:55:44 +0100
Subject: [gpfsug-discuss] GPFS meta data performance monitoring In-Reply-To: References: Message-ID:

An HTML attachment was scrubbed... URL:

From eric.wonderley at vt.edu Thu Feb 2 17:03:51 2017
From: eric.wonderley at vt.edu (J. Eric Wonderley)
Date: Thu, 2 Feb 2017 12:03:51 -0500
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears Message-ID:

Is there a way to accomplish this so the rest of the cluster knows it's down?

My state now:

[root at cl001 ~]# mmgetstate -aL
cl004.cl.arc.internal: mmremote: determineMode: Missing file /var/mmfs/gen/mmsdrfs.
cl004.cl.arc.internal: mmremote: This node does not belong to a GPFS cluster.
mmdsh: cl004.cl.arc.internal remote shell process had return code 1.

 Node number  Node name  Quorum  Nodes up  Total nodes  GPFS state   Remarks
------------------------------------------------------------------------------------
           1  cl001           5         7            8  active       quorum node
           2  cl002           5         7            8  active       quorum node
           3  cl003           5         7            8  active       quorum node
           4  cl004           0         0            8  unknown      quorum node
           5  cl005           5         7            8  active       quorum node
           6  cl006           5         7            8  active       quorum node
           7  cl007           5         7            8  active       quorum node
           8  cl008           5         7            8  active       quorum node

cl004 we think has an internal raid controller blowout

-------------- next part --------------
An HTML attachment was scrubbed... URL:

From olaf.weiser at de.ibm.com Thu Feb 2 17:28:22 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 2 Feb 2017 18:28:22 +0100
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: Message-ID:

An HTML attachment was scrubbed... URL:

From jonathon.anderson at colorado.edu Thu Feb 2 17:44:45 2017
From: jonathon.anderson at colorado.edu (Jonathon A Anderson)
Date: Thu, 2 Feb 2017 17:44:45 +0000
Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: Message-ID: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu>

Any chance I can get that PMR# also, so I can reference it in my DDN case?

~jonathon

From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Pmr opened...
send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Mathias Dietz" An: "gpfsug main discussion list" Datum: Mi. 01.02.2017 10:05 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... 
but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. 
which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. 
Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL:

From olaf.weiser at de.ibm.com Thu Feb 2 18:02:22 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 2 Feb 2017 19:02:22 +0100
Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> Message-ID:

An HTML attachment was scrubbed... URL:

From valdis.kletnieks at vt.edu Thu Feb 2 19:28:05 2017
From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu)
Date: Thu, 02 Feb 2017 14:28:05 -0500
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: Message-ID: <15501.1486063685@turing-police.cc.vt.edu>

On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said:

> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's why you
> see a message like this..
> have you reinstalled that node / any backup/restore thing ?

The internal RAID controller died a horrid death and basically took all the OS partitions with it. So the node was just sort of limping along, where the mmfsd process was still coping because it wasn't doing any I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work because that requires accessing stuff in /var.

At that point, it starts getting tempting to just use ipmitool from another node to power the comatose one down - but that often causes a cascade of other issues while things are stuck waiting for timeouts.

From aaron.s.knister at nasa.gov Thu Feb 2 19:33:41 2017
From: aaron.s.knister at nasa.gov (Aaron Knister)
Date: Thu, 2 Feb 2017 14:33:41 -0500
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: <15501.1486063685@turing-police.cc.vt.edu> References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID:

You could forcibly expel the node (one of my favorite GPFS commands):

mmexpelnode -N $nodename

and then power it off after the expulsion is complete and then do

mmexpelnode -r -N $nodename

which will allow it to join the cluster next time you try and start up GPFS on it. You'll still likely have to go through recovery but you'll skip the part where GPFS wonders where the node went prior to it expelling it.

-Aaron

On 2/2/17 2:28 PM, valdis.kletnieks at vt.edu wrote:
> On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said:
>
>> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's why you
>> see a message like this..
>> have you reinstalled that node / any backup/restore thing ?
>
> The internal RAID controller died a horrid death and basically took
> all the OS partitions with it. So the node was just sort of limping along,
> where the mmfsd process was still coping because it wasn't doing any
> I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work
> because that requires accessing stuff in /var.
>
> At that point, it starts getting tempting to just use ipmitool from
> another node to power the comatose one down - but that often causes
> a cascade of other issues while things are stuck waiting for timeouts.
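Put together, the sequence described above might look like the following; the node name matches this thread, but the BMC address and credentials are placeholders, and mmexpelnode -r (reset) is what clears the expulsion so the node may rejoin:

# expel the unreachable node so the rest of the cluster stops waiting on it
mmexpelnode -N cl004
# once the expulsion completes, power the node off out-of-band
ipmitool -I lanplus -H cl004-bmc -U admin -P secret chassis power off
# after repairs, clear the expel state so the node is allowed back in
mmexpelnode -r -N cl004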
> > _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>

--
Aaron Knister
NASA Center for Climate Simulation (Code 606.2)
Goddard Space Flight Center
(301) 286-2776

From olaf.weiser at de.ibm.com Thu Feb 2 21:28:01 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Thu, 2 Feb 2017 22:28:01 +0100
Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID:

An HTML attachment was scrubbed... URL:

From andreas.mattsson at maxiv.lu.se Fri Feb 3 12:46:30 2017
From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson)
Date: Fri, 3 Feb 2017 12:46:30 +0000
Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES Message-ID:

I'm having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on Spectrum Scale 4.2.1.1. The NFS clients are up to date CentOS and Debian machines. All Scale servers and NFS clients have correct date and time via NTP.

Creating a file, for instance 'touch file00', gives a correct timestamp.
Moving the file, 'mv file00 file01', gives a correct timestamp.
Copying the file, 'cp file01 file02', gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar.

This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. Doing the same operation over NFS to other NFS servers works correctly; it is only when operating on the NFS share from the Spectrum Scale CES nodes that the issue occurs.

Have anyone seen this before?

Regards,
Andreas Mattsson
_____________________________________________

Andreas Mattsson
Systems Engineer

MAX IV Laboratory
Lund University
P.O. Box 118, SE-221 00 Lund, Sweden
Visiting address: Fotongatan 2, 225 94 Lund
Mobile: +46 706 64 95 44
andreas.mattsson at maxiv.se
www.maxiv.se

-------------- next part --------------
An HTML attachment was scrubbed... URL:

From ulmer at ulmer.org Fri Feb 3 13:05:37 2017
From: ulmer at ulmer.org (Stephen Ulmer)
Date: Fri, 3 Feb 2017 08:05:37 -0500
Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES In-Reply-To: References: Message-ID:

That's a cool one. :)

What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03')?

--
Stephen

> On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote:
>
> I'm having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1
> The NFS clients are up to date Centos and Debian machines.
> All Scale servers and NFS clients have correct date and time via NTP.
>
> Creating a file, for instance 'touch file00', gives correct timestamp.
> Moving the file, 'mv file00 file01', gives correct timestamp
> Copying the file, 'cp file01 file02', gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar.
>
> This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp.
> Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs.
>
> Have anyone seen this before?
> > Regards, > Andreas Mattsson > _____________________________________________ > > > Andreas Mattsson > Systems Engineer > > MAX IV Laboratory > Lund University > P.O. Box 118, SE-221 00 Lund, Sweden > Visiting address: Fotongatan 2, 225 94 Lund > Mobile: +46 706 64 95 44 > andreas.mattsson at maxiv.se > www.maxiv.se > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From andreas.mattsson at maxiv.lu.se Fri Feb 3 13:19:37 2017 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Fri, 3 Feb 2017 13:19:37 +0000 Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES In-Reply-To: References: Message-ID: That works. ?touch test100? Feb 3 14:16 test100 ?cp test100 test101? Feb 3 14:16 test100 Apr 21 2027 test101 ?touch ?r test100 test101? Feb 3 14:16 test100 Feb 3 14:16 test101 /Andreas That?s a cool one. :) What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03?)? -- Stephen On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote: I?m having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1 The NFS clients are up to date Centos and Debian machines. All Scale servers and NFS clients have correct date and time via NTP. Creating a file, for instance ?touch file00?, gives correct timestamp. Moving the file, ?mv file00 file01?, gives correct timestamp Copying the file, ?cp file01 file02?, gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar. This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs. Have anyone seen this before? Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 225 94 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.se www.maxiv.se _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Feb 3 13:35:21 2017 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 3 Feb 2017 08:35:21 -0500 Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES In-Reply-To: References: Message-ID: Does the cp actually complete? As in, does it copy all of the blocks? What?s the exit code? A cp?d file should have ?new? metadata. That is, it should have it?s own dates, owners, etc. (not necessarily copied from the source file). I ran ?strace cp foo1 foo2?, and it was pretty instructive, maybe that would get you more info. On CentOS strace is in it?s own package, YMMV. -- Stephen > On Feb 3, 2017, at 8:19 AM, Andreas Mattsson > wrote: > > That works. > > ?touch test100? > > Feb 3 14:16 test100 > > ?cp test100 test101? > > Feb 3 14:16 test100 > Apr 21 2027 test101 > > ?touch ?r test100 test101? > > Feb 3 14:16 test100 > Feb 3 14:16 test101 > > /Andreas > > > That?s a cool one. 
:) > > What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03?)? > > -- > Stephen > > > > On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote: > > I?m having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1 > The NFS clients are up to date Centos and Debian machines. > All Scale servers and NFS clients have correct date and time via NTP. > > Creating a file, for instance ?touch file00?, gives correct timestamp. > Moving the file, ?mv file00 file01?, gives correct timestamp > Copying the file, ?cp file01 file02?, gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar. > > This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. > Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs. > > Have anyone seen this before? > > Regards, > Andreas Mattsson > _____________________________________________ > > > Andreas Mattsson > Systems Engineer > > MAX IV Laboratory > Lund University > P.O. Box 118, SE-221 00 Lund, Sweden > Visiting address: Fotongatan 2, 225 94 Lund > Mobile: +46 706 64 95 44 > andreas.mattsson at maxiv.se > www.maxiv.se > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Fri Feb 3 13:46:49 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 3 Feb 2017 08:46:49 -0500 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID: Well we got it into the down state using mmsdrrestore -p to recover stuff into /var/mmfs/gen to cl004. Anyhow we ended up unknown for cl004 when it powered off. Short of removing node, unknown is the state you get. Unknown seems stable for a hopefully short outage of cl004. Thanks On Thu, Feb 2, 2017 at 4:28 PM, Olaf Weiser wrote: > many ways lead to Rome .. and I agree .. mmexpelnode is a nice command .. > another approach... > power it off .. (not reachable by ping) .. mmdelnode ... power on/boot ... > mmaddnode .. > > > > From: Aaron Knister > To: > Date: 02/02/2017 08:37 PM > Subject: Re: [gpfsug-discuss] proper gpfs shutdown when node > disappears > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > You could forcibly expel the node (one of my favorite GPFS commands): > > mmexpelnode -N $nodename > > and then power it off after the expulsion is complete and then do > > mmepelenode -r -N $nodename > > which will allow it to join the cluster next time you try and start up > GPFS on it. You'll still likely have to go through recovery but you'll > skip the part where GPFS wonders where the node went prior to it > expelling it. > > -Aaron > > On 2/2/17 2:28 PM, valdis.kletnieks at vt.edu wrote: > > On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said: > > > >> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's > why you > >> see a message like this.. 
> >> have you reinstalled that node / any backup/restore thing ? > > > > The internal RAID controller died a horrid death and basically took > > all the OS partitions with it. So the node was just sort of limping > along, > > where the mmfsd process was still coping because it wasn't doing any > > I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work > > because that requires accessing stuff in /var. > > > > At that point, it starts getting tempting to just use ipmitool from > > another node to power the comatose one down - but that often causes > > a cascade of other issues while things are stuck waiting for timeouts. > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Feb 3 14:06:58 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 3 Feb 2017 15:06:58 +0100 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID: An HTML attachment was scrubbed... URL: From service at metamodul.com Fri Feb 3 16:13:35 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Fri, 3 Feb 2017 17:13:35 +0100 (CET) Subject: [gpfsug-discuss] Mount of file set Message-ID: <738987264.170895.1486138416028@email.1und1.de> An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Feb 3 20:03:18 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 3 Feb 2017 20:03:18 +0000 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? Message-ID: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> I can?t seem to find some of these on fix central, have they been pulled? Specifically, I want: Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux https://www-945.ibm.com/support/fixcentral/swg/selectFixes?product=ibm%2FStorageSoftware%2FIBM+Spectrum+Scale&fixids=Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux&source=myna&myns=s033&mynp=OCSTXKQY&mync=E&cm_sp=s033-_-OCSTXKQY-_-E&function=fixId&parent=Software%20defined%20storage Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: IBM My Notifications Date: Monday, January 30, 2017 at 10:49 AM To: "Oesterlin, Robert" Subject: [EXTERNAL] IBM My notifications - Storage [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/headset.png] Check out the IBM Support beta [BM] [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/megaphone-m.png] Here are your weekly updates from IBM My Notifications. Contents: IBM Spectrum Scale IBM Spectrum Scale Spectrum_Scale_Protocols_Advanced-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. 
Spectrum_Scale_Advanced-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. The pre-built SELinux policy within RHEL7.x conflicts with IBM Spectrum Scale NFS Ganesha [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] Ganesha running on CES nodes with seLinux in enforcing mode and selinux-policy-targeted-3.13.1-60.el7_2.7 installed causes the start of ganesha to fail and thus all CES nodes get UNHEALTHY. See https://bugzilla.redhat.com/show_bug.cgi?id=1383784 Note: IBM Spectrum Scale does not support CES with seLinux in enforcing mode Spectrum_Scale_Protocols_Data_Management-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Data_Management-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Advanced-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Data_Management-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. 
Spectrum_Scale_Advanced-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-x86_64-Windows [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-x86_64-Windows [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/information.png] Your support Notifications display in English by default. Machine translation based on your IBM profile language setting is added if you specify this option in Delivery preferences within My Notifications. 
(Note: Not all languages are available at this time, and the English version always takes precedence over the machine translated version.) Manage your My Notifications subscriptions, or send questions and comments. Subscribe or Unsubscribe | Feedback Follow us on Twitter. To ensure proper delivery please add mynotify at stg.events.ihost.com to your address book. You received this email because you are subscribed to IBM My Notifications as: oester at gmail.com Please do not reply to this message as it is generated by an automated service machine. ?International Business Machines Corporation 2017. All rights reserved. -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Feb 3 19:57:29 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 3 Feb 2017 20:57:29 +0100 Subject: [gpfsug-discuss] Mount of file set In-Reply-To: <738987264.170895.1486138416028@email.1und1.de> References: <738987264.170895.1486138416028@email.1und1.de> Message-ID: An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Sun Feb 5 14:02:57 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Sun, 5 Feb 2017 14:02:57 +0000 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 912 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 1463 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 6365 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 2881 bytes Desc: not available URL: From martin at uni-mainz.de Mon Feb 6 11:15:31 2017 From: martin at uni-mainz.de (Christoph Martin) Date: Mon, 6 Feb 2017 12:15:31 +0100 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: I have already updated two GPFS installations with 4.2.2.2 with a download from Jan, 31. What issues with Ganesha do I have to expect until the fixed version is available? How can I see that the downloads have changed and are fixed? The information on the download site was: > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux-install (537.58 MB) > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux-install.md5 (97 bytes) > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux.readme.html (24.59 KB) Christoph Am 05.02.2017 um 15:02 schrieb Achim Rehor: > Yes, they have been pulled, all protocol 4.2.2.2 packages. there wsa an > issue with ganesha > > It was expected to see them back before the weekend, which is obviously > not the case. > So, i guess, a little patience is needed. -- ============================================================================ Christoph Martin, Leiter Unix-Systeme Zentrum f?r Datenverarbeitung, Uni-Mainz, Germany Anselm Franz von Bentzel-Weg 12, 55128 Mainz Telefon: +49(6131)3926337 Instant-Messaging: Jabber: martin at jabber.uni-mainz.de (Siehe http://www.zdv.uni-mainz.de/4010.php) -------------- next part -------------- A non-text attachment was scrubbed... Name: martin.vcf Type: text/x-vcard Size: 421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: OpenPGP digital signature URL: From bbanister at jumptrading.com Mon Feb 6 14:54:11 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 6 Feb 2017 14:54:11 +0000 Subject: [gpfsug-discuss] Mount of file set In-Reply-To: References: <738987264.170895.1486138416028@email.1und1.de> Message-ID: Is there an RFE for this yet that we can all vote up? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Olaf Weiser Sent: Friday, February 03, 2017 1:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Mount of file set Hi Ha-Jo, we do the same here .. so no news so far as I know... gruss vom laff From: Hans-Joachim Ehlers > To: gpfsug main discussion list > Date: 02/03/2017 05:14 PM Subject: [gpfsug-discuss] Mount of file set Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Moin Moin, is it nowaday possible to mount directly a GPFS Fileset ? In the old day i mounted the whole GPFS to a Mount point with 000 rights and did a Sub Mount of the needed Fileset. It works but it is ugly. -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Tue Feb 7 18:01:41 2017 From: mweil at wustl.edu (Matt Weil) Date: Tue, 7 Feb 2017 12:01:41 -0600 Subject: [gpfsug-discuss] stuck GracePeriodThread Message-ID: running cnfs # rpm -qa | grep gpfs gpfs.gpl-4.1.1-7.noarch gpfs.base-4.1.1-7.x86_64 gpfs.docs-4.1.1-7.noarch gpfs.gplbin-3.10.0-327.18.2.el7.x86_64-4.1.1-7.x86_64 pcp-pmda-gpfs-3.10.6-2.el7.x86_64 gpfs.ext-4.1.1-7.x86_64 gpfs.gskit-8.0.50-47.x86_64 gpfs.msg.en_US-4.1.1-7.noarch === mmdiag: waiters === 0x7F95F0008CF0 ( 19022) waiting 89.838355000 seconds, GracePeriodThread: delaying for 40.161645000 more seconds, reason: delayed do these cause issues and is there any other way besides stopping and restarting mmfsd to get rid of them. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From TROPPENS at de.ibm.com Wed Feb 8 08:36:45 2017 From: TROPPENS at de.ibm.com (Ulf Troppens) Date: Wed, 8 Feb 2017 09:36:45 +0100 Subject: [gpfsug-discuss] Spectrum Scale User Meeting - March 8+9 , 2017 - Ehningen, Germany Message-ID: There is an IBM organized Spectrum Scale User Meeting in Germany. Though, agenda and spirit are very close to user group organized events. Conference language is German. This is a two-day event. There is an introduction day for Spectrum Scale beginners a day before on March 7. See here for agenda and registration: https://www.spectrumscale.org/spectrum-scale-user-meeting-march-89-2027-ehningen-germany/ -- IBM Spectrum Scale Development - Client Engagements & Solutions Delivery Consulting IT Specialist Author "Storage Networks Explained" IBM Deutschland Research & Development GmbH Vorsitzende des Aufsichtsrats: Martina Koederitz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Wed Feb 8 08:48:06 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Wed, 8 Feb 2017 09:48:06 +0100 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: An HTML attachment was scrubbed... 
URL: From p.childs at qmul.ac.uk Thu Feb 9 14:30:18 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 9 Feb 2017 14:30:18 +0000 Subject: [gpfsug-discuss] AFM OpenFiles Message-ID: We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs From Mark.Bush at siriuscom.com Thu Feb 9 14:40:03 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Thu, 9 Feb 2017 14:40:03 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> Message-ID: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Has any headway been made on this issue? I just ran into it as well. The CES ip addresses just disappeared from my two protocol nodes (4.2.2.0). From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Thursday, February 2, 2017 at 12:02 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes pls contact me directly olaf.weiser at de.ibm.com Mit freundlichen Gr??en / Kind regards Olaf Weiser EMEA Storage Competence Center Mainz, German / IBM Systems, Storage Platform, ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland IBM Allee 1 71139 Ehningen Phone: +49-170-579-44-66 E-Mail: olaf.weiser at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 From: Jonathon A Anderson To: gpfsug main discussion list Date: 02/02/2017 06:45 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Any chance I can get that PMR# also, so I can reference it in my DDN case? ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Pmr opened... send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Mathias Dietz" An: "gpfsug main discussion list" Datum: Mi. 
01.02.2017 10:05 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 
31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. 
cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. 
Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. 
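A rough way to check whether a given cluster is exposed to the same truncation, using only commands already shown in this thread (the ~3983-byte ceiling and the '-opa' hostname pattern are taken from the output above and will differ on other clusters):

---
#!/bin/bash
# Compare what the daemon reports as "up" with the cluster membership.
# Assumes /usr/lpp/mmfs/bin is in PATH and node names contain no commas.
up_bytes=$(tsctl shownodes up | wc -c)                 # size of the raw reply
up_nodes=$(tsctl shownodes up | tr ',' '\n' | wc -l)   # nodes tsctl reports as up
all_nodes=$(mmlscluster | grep -c '\-opa')             # site-specific pattern; adjust for your node names

echo "tsctl shownodes up: ${up_bytes} bytes, ${up_nodes} nodes"
echo "mmlscluster:        ${all_nodes} nodes"

# If the reply stops near 3983 bytes and lists fewer nodes than the cluster has,
# getDownCesNodeList will wrongly treat the missing nodes (including CES nodes) as down.
if [ "$up_nodes" -lt "$all_nodes" ] && [ "$up_bytes" -ge 3900 ]; then
    echo "WARNING: output looks truncated; CES address assignment may be affected"
fi
---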
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Thu Feb 9 15:10:58 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Thu, 9 Feb 2017 20:40:58 +0530 Subject: [gpfsug-discuss] AFM OpenFiles In-Reply-To: References: Message-ID: What is the version of GPFS ? There was an issue fixed in Spectrum Scale 4.2.2 for file count(file_nr) leak. This issue mostly happens on Linux kernel version >= 3.6. ~Venkat (vpuvvada at in.ibm.com) From: Peter Childs To: gpfsug main discussion list Date: 02/09/2017 08:00 PM Subject: [gpfsug-discuss] AFM OpenFiles Sent by: gpfsug-discuss-bounces at spectrumscale.org We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From p.childs at qmul.ac.uk Thu Feb 9 15:34:25 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 9 Feb 2017 15:34:25 +0000 Subject: [gpfsug-discuss] AFM OpenFiles In-Reply-To: References: , Message-ID: 4.2.1.1 or CentOs 7. So that might account for it. Thanks Peter Childs ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Venkateswara R Puvvada Sent: Thursday, February 9, 2017 3:10:58 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM OpenFiles What is the version of GPFS ? There was an issue fixed in Spectrum Scale 4.2.2 for file count(file_nr) leak. This issue mostly happens on Linux kernel version >= 3.6. ~Venkat (vpuvvada at in.ibm.com) From: Peter Childs To: gpfsug main discussion list Date: 02/09/2017 08:00 PM Subject: [gpfsug-discuss] AFM OpenFiles Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From olaf.weiser at de.ibm.com Thu Feb 9 15:34:55 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 9 Feb 2017 16:34:55 +0100 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Message-ID: An HTML attachment was scrubbed... URL: From jonathon.anderson at colorado.edu Thu Feb 9 17:32:55 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Thu, 9 Feb 2017 17:32:55 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Message-ID: I was thinking that whether or not CES knows your nodes are up or not is dependent on how recently they were added to the cluster; but I?m starting to wonder if it?s dependent on the order in which nodes are brought up. Presumably you are running your CES nodes in a GPFS cluster with a large number of nodes? What happens if you bring your CES nodes up earlier (e.g., before your compute nodes)? 
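One way to test that ordering hypothesis, sketched only from commands that already appear in this thread (mmces address list is assumed to be available on 4.2.x; adjust the node class names to your setup):

---
# Bring GPFS up on the protocol nodes first and confirm the daemon sees them
mmstartup -N protocol          # 'protocol' node class as created earlier in this thread
mmgetstate -N protocol

# Only then start the remaining (compute) nodes
mmstartup -a

# Check whether the CES addresses are assigned this time; rebalance if needed
mmces address list
mmces address move --rebalance
---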
~jonathon From: on behalf of "Mark.Bush at siriuscom.com" Reply-To: gpfsug main discussion list Date: Thursday, February 9, 2017 at 7:40 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Has any headway been made on this issue? I just ran into it as well. The CES ip addresses just disappeared from my two protocol nodes (4.2.2.0). From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Thursday, February 2, 2017 at 12:02 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes pls contact me directly olaf.weiser at de.ibm.com Mit freundlichen Gr??en / Kind regards Olaf Weiser EMEA Storage Competence Center Mainz, German / IBM Systems, Storage Platform, ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland IBM Allee 1 71139 Ehningen Phone: +49-170-579-44-66 E-Mail: olaf.weiser at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 From: Jonathon A Anderson To: gpfsug main discussion list Date: 02/02/2017 06:45 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Any chance I can get that PMR# also, so I can reference it in my DDN case? ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Pmr opened... send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Mathias Dietz" An: "gpfsug main discussion list" Datum: Mi. 01.02.2017 10:05 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 
2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? 
From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." 
/usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
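For reference, the address assignment state after a restart can be checked with something like the following (a sketch; the exact output formats vary by release):
---
mmlscluster --ces     # which nodes are CES-enabled
mmces node list       # CES node flags, e.g. suspended nodes
mmces address list    # which CES IPs are currently assigned, and to which node
mmces state show -a   # per-node state of the enabled CES services
---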
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. 
If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Fri Feb 10 16:33:26 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Fri, 10 Feb 2017 16:33:26 +0000 Subject: [gpfsug-discuss] Reverting to older versions Message-ID: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> Is there a documented way to go down a level of GPFS code?. For example since 4.2.2.x has broken my protocol nodes, is there a straight forward way to revert back to 4.2.1.x?. Can I just stop my cluster remove RPMS and add older version RPMS? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From S.J.Thompson at bham.ac.uk Fri Feb 10 16:51:43 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 10 Feb 2017 16:51:43 +0000 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> References: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> Message-ID: Is it the 4.2.2 code or the protocol packages that broke? We found the 4.2.2.0 SMB packages don't work for us. We just reverted to the older SMB packages. Support have advised us to try the 4.2.2.1 packages, but it means a service break to upgrade protocol packages so we are trying to schedule in. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Mark.Bush at siriuscom.com [Mark.Bush at siriuscom.com] Sent: 10 February 2017 16:33 To: gpfsug main discussion list Subject: [gpfsug-discuss] Reverting to older versions Is there a documented way to go down a level of GPFS code?. For example since 4.2.2.x has broken my protocol nodes, is there a straight forward way to revert back to 4.2.1.x?. Can I just stop my cluster remove RPMS and add older version RPMS? [id:image001.png at 01D2709D.6EF65720] Mark R. 
Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From olaf.weiser at de.ibm.com Fri Feb 10 16:57:23 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 10 Feb 2017 17:57:23 +0100 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> References: <484E02BE-463F-499D-90B8-47E6F10753E3@siriuscom.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 8745 bytes Desc: not available URL: From duersch at us.ibm.com Fri Feb 10 17:05:23 2017 From: duersch at us.ibm.com (Steve Duersch) Date: Fri, 10 Feb 2017 17:05:23 +0000 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Fri Feb 10 17:08:48 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Fri, 10 Feb 2017 17:08:48 +0000 Subject: [gpfsug-discuss] Reverting to older versions In-Reply-To: References: Message-ID: Excellent. Thanks to all. From: on behalf of Steve Duersch Reply-To: gpfsug main discussion list Date: Friday, February 10, 2017 at 11:05 AM To: "gpfsug-discuss at spectrumscale.org" Subject: Re: [gpfsug-discuss] Reverting to older versions See chapter 12 of the Concepts, Planning, and Installation guide. There is a section on reverting to a previous version. https://www.ibm.com/support/knowledgecenter/STXKQY/ibmspectrumscale_content.html Steve Duersch Spectrum Scale 845-433-7902 IBM Poughkeepsie, New York ----- Original message ----- From: gpfsug-discuss-request at spectrumscale.org Sent by: gpfsug-discuss-bounces at spectrumscale.org To: gpfsug-discuss at spectrumscale.org Cc: Subject: gpfsug-discuss Digest, Vol 61, Issue 18 Date: Fri, Feb 10, 2017 11:52 AM Message: 1 Date: Fri, 10 Feb 2017 16:33:26 +0000 From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Subject: [gpfsug-discuss] Reverting to older versions Message-ID: <484E02BE-463F-499D-90B8-47E6F10753E3 at siriuscom.com> Content-Type: text/plain; charset="utf-8" Is there a documented way to go down a level of GPFS code?. For example since 4.2.2.x has broken my protocol nodes, is there a straight forward way to revert back to 4.2.1.x?. 
Can I just stop my cluster remove RPMS and add older version RPMS? This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Fri Feb 10 21:56:55 2017 From: zgiles at gmail.com (Zachary Giles) Date: Fri, 10 Feb 2017 16:56:55 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression Message-ID: Hello All, I've been seeing some less than desirable behavior with mmap and compression in GPFS. Curious if others see similar or have any ideas if this is accurate.. The guys here want me to open an IBM ticket, but I figured I'd see if anyone has had this experience before. We have an internally developed app that runs on our cluster referencing data sitting in GPFS. It is using mmap to access the files due to a library we're using that requires it. If we run the app against some data on GPFS, it performs well.. finishing in a few minutes time -- Great. However, if we compress the file (in GPFS), the app is still running after 2 days time. stracing the app shows that is polling on a file descriptor, forever.. as if a data block is still pending. I know mmap is supported with compression according to the manual (with some stipulations), and that performance is expected to be much less since it's more large-block oriented due to decompressed in groups.. no problem. But it seems like some data should get returned. I'm surprised to find that a very small amount of data is sitting in the buffers (mmfsadm dump buffers) in reference to the inodes. The decompression thread is running continuously, while the app is still polling for data from memory and sleeping, retrying, sleeping, repeat. What I believe is happening is that the 4k pages are being pulled out of large decompression groups from an mmap read request, put in the buffer, then the compression group data is thrown away since it has the result it wants, only to need another piece of data that would have been in that group slightly later, which is recalled, put in the buffer.. etc. Thus an infinite slowdown. Perhaps also the data is expiring out of the buffer before the app has a chance to read it. I can't tell. In any case, the app makes zero progress. I tried without our app, using fio.. mmap on an uncompressed file with 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not impressive). However, on a compressed file it is only 20KB/s max. ( far less impressive ). Reading a file using aio etc is over 3GB/s on a single thread without even trying. What do you think? Anyone see anything like this? Perhaps there are some tunings to waste a bit more memory on cached blocks rather than make decompression recycle? 
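For reference, the fio run mentioned above corresponds roughly to the following invocation (the file path and size here are placeholders):
---
# single thread, queue depth 1, 4k random reads through an mmap'd file
fio --name=mmap-randread --ioengine=mmap --rw=randread --bs=4k \
    --iodepth=1 --numjobs=1 --size=10g --filename=/path/to/gpfs/testfile
---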
I've searched back the archives a bit. There's a May 2013 thread about slowness as well. I think we're seeing much much less than that. Our page pools are of decent size. Its not just slowness, it's as if the app never gets a block back at all. ( We could handle slowness .. ) Thanks. Open to ideas.. -Zach Giles From mweil at wustl.edu Sat Feb 11 18:32:54 2017 From: mweil at wustl.edu (Matt Weil) Date: Sat, 11 Feb 2017 12:32:54 -0600 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later In-Reply-To: References: Message-ID: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> https://access.redhat.com/solutions/2437991 I ran into this issue the other day even with the echo "4096" > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that larger to get to the 2M IO size. So I never really seen this issue until the other day. I may have triggered it myself because I was adding new storage. Was wondering what version of GPFS fixes this. I really do not want to step back to and older kernel version. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From leoluan at us.ibm.com Sat Feb 11 22:23:24 2017 From: leoluan at us.ibm.com (Leo Luan) Date: Sat, 11 Feb 2017 22:23:24 +0000 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression Message-ID: An HTML attachment was scrubbed... URL: From janfrode at tanso.net Sun Feb 12 17:30:38 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Sun, 12 Feb 2017 18:30:38 +0100 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later In-Reply-To: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> References: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> Message-ID: The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -jf On Sat, Feb 11, 2017 at 7:32 PM, Matt Weil wrote: > https://access.redhat.com/solutions/2437991 > > I ran into this issue the other day even with the echo "4096" > > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that > larger to get to the 2M IO size. So I never really seen this issue > until the other day. I may have triggered it myself because I was > adding new storage. > > Was wondering what version of GPFS fixes this. I really do not want to > step back to and older kernel version. > > Thanks > Matt > > ________________________________ > The materials in this message are private and may contain Protected > Healthcare Information or other information of a sensitive nature. If you > are not the intended recipient, be advised that any unauthorized use, > disclosure, copying or the taking of any action in reliance on the contents > of this information is strictly prohibited. 
If you have received this email > in error, please immediately notify the sender via telephone or return mail. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Feb 13 15:46:27 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 13 Feb 2017 09:46:27 -0600 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later In-Reply-To: References: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> Message-ID: excellent Thanks. On 2/12/17 11:30 AM, Jan-Frode Myklebust wrote: The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -jf On Sat, Feb 11, 2017 at 7:32 PM, Matt Weil > wrote: https://access.redhat.com/solutions/2437991 I ran into this issue the other day even with the echo "4096" > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that larger to get to the 2M IO size. So I never really seen this issue until the other day. I may have triggered it myself because I was adding new storage. Was wondering what version of GPFS fixes this. I really do not want to step back to and older kernel version. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Feb 13 15:49:07 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 13 Feb 2017 15:49:07 +0000 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later Message-ID: Alas, I ran into this as well ? only seems to impact some my older JBOD storage. The fix is vague, should I be worried about this turning up later, or will it happen right away? 
(if it does) Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Matt Weil Reply-To: gpfsug main discussion list Date: Monday, February 13, 2017 at 9:46 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Feb 13 17:00:10 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 13 Feb 2017 17:00:10 +0000 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later Message-ID: <34F66C99-B56D-4742-8C40-B6377B914FC0@nuance.com> See this technote for an alternative fix and details: http://www-01.ibm.com/support/docview.wss?uid=isg3T1024840&acss=danl_4184_web Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Matt Weil Reply-To: gpfsug main discussion list Date: Monday, February 13, 2017 at 9:46 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Feb 13 17:27:55 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 13 Feb 2017 12:27:55 -0500 Subject: [gpfsug-discuss] mmbackup examples using policy Message-ID: Anyone have any examples of this? I have a filesystem that has 2 pools and several filesets and would like daily progressive incremental backups of its contents. I found some stuff here(nothing real close to what I wanted however): /usr/lpp/mmfs/samples/ilm I have the tsm client installed on the server nsds. Thanks much -------------- next part -------------- An HTML attachment was scrubbed... URL: From usa-principal at gpfsug.org Tue Feb 14 06:07:05 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Mon, 13 Feb 2017 22:07:05 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: References: Message-ID: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> Just a follow up reminder to save the date, April 4-5, for a two-day Spectrum Scale Users Group event hosted by NERSC in Berkeley, California. We are working on the registration form and agenda and hope to be able to share more details soon. Best, Kristy & Bob On , usa-principal-gpfsug.org wrote: > Hello all and happy new year (depending upon where you are right now > :-) ). > > We'll have more details in 2017, but for now please save the date for > a two-day users group meeting at NERSC in Berkeley, California. 
> > April 4-5, 2017 > National Energy Research Scientific Computing Center (nersc.gov) > Berkeley, California > > We look forward to offering our first two-day event in the US. > > Best, > Kristy & Bob From zgiles at gmail.com Tue Feb 14 16:10:13 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 14 Feb 2017 11:10:13 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: Message-ID: Hi Leo, I agree with your view on compression and what it should be used for, in general. The read bandwidth amplification is definitely something we're seeing. Just a little more background on the files: The files themselves are not "cold" (archive), however, they are very lightly used. The data set is thousands of files that are each 100-200GB, totaling about a PB. the read pattern is a few GB from about 20% of the files once a month. So the total read is only several TB out of a PB every month. ( approximate ). We can get a compression of about 5:1 using GPFS with these files, so we can gain back 800TB with compression. The total run time of the app (reading all those all chunks, when uncompressed) is maybe an hour total. Although leaving the files uncompressed would let the app work, there's a huge gain to be had if we can make compression work by saving ~800TB As it's such a small amount of data read each time, and also not too predictable (it's semi-random historical), and as the length of the job is short enough, it's hard to justify decompressing large chunks of the system to run 1 job. I would have to decompress 200TB to read 10TB, recompress them, and decompress a different (overlapping) 200TB next month. The compression / decompression of sizable portions of the data takes days. I think there maybe more of an issue that just performance though.. The decompression thread is running, internal file metadata is read fine, most of the file is read fine. Just at times it gets stuck.. the decompression thread is running in GPFS, the app is polling, it just never comes back with the block. I feel like there's a race condition here where a block is read, available for the app, but thrown away before the app can read it, only to be decompressed again. It's strange how some block positions are slow (expected) and others just never come back (it will poll for days on a certain address). However, reading the file in-order is fine. Is this a block caching issue? Can we tune up the amount of blocks kept? I think with mmap the blocks are not kept in page pool, correct? -Zach On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan wrote: > Hi Zachary, > > When a compressed file is mmapped, each 4K read in your tests causes the > accessed part of the file to be decompressed (in the granularity of 10 GPFS > blocks). For usual file sizes, the parts being accessed will be > decompressed and IOs speed will be normal except for the first 4K IO in each > 10-GPFS-block group. For very large files, a large percentage of small > random IOs may keep getting amplified to 10-block decompression IO for a > long time. This is probably what happened in your mmap application run. > > The suggestion is to not compress files until they have become cold (not > likely to be accessed any time soon) and avoid compressing very large files > that may be accessed through mmap later. The product already has a built-in > protection preventing compression of files that are mmapped at compression > time. 
You can add an exclude rule in the compression policy run for files > that are identified to have mmap performance issues (in case they get > mmapped after being compressed in a periodical policy run). > > Leo Luan > > From: Zachary Giles > To: gpfsug main discussion list > Date: 02/10/2017 01:57 PM > Subject: [gpfsug-discuss] Questions about mmap GPFS and compression > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ________________________________ > > > > Hello All, > > I've been seeing some less than desirable behavior with mmap and > compression in GPFS. Curious if others see similar or have any ideas > if this is accurate.. > The guys here want me to open an IBM ticket, but I figured I'd see if > anyone has had this experience before. > > We have an internally developed app that runs on our cluster > referencing data sitting in GPFS. It is using mmap to access the files > due to a library we're using that requires it. > > If we run the app against some data on GPFS, it performs well.. > finishing in a few minutes time -- Great. However, if we compress the > file (in GPFS), the app is still running after 2 days time. > stracing the app shows that is polling on a file descriptor, forever.. > as if a data block is still pending. > > I know mmap is supported with compression according to the manual > (with some stipulations), and that performance is expected to be much > less since it's more large-block oriented due to decompressed in > groups.. no problem. But it seems like some data should get returned. > > I'm surprised to find that a very small amount of data is sitting in > the buffers (mmfsadm dump buffers) in reference to the inodes. The > decompression thread is running continuously, while the app is still > polling for data from memory and sleeping, retrying, sleeping, repeat. > > What I believe is happening is that the 4k pages are being pulled out > of large decompression groups from an mmap read request, put in the > buffer, then the compression group data is thrown away since it has > the result it wants, only to need another piece of data that would > have been in that group slightly later, which is recalled, put in the > buffer.. etc. Thus an infinite slowdown. Perhaps also the data is > expiring out of the buffer before the app has a chance to read it. I > can't tell. In any case, the app makes zero progress. > > I tried without our app, using fio.. mmap on an uncompressed file with > 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not > impressive). However, on a compressed file it is only 20KB/s max. ( > far less impressive ). Reading a file using aio etc is over 3GB/s on a > single thread without even trying. > > What do you think? > Anyone see anything like this? Perhaps there are some tunings to waste > a bit more memory on cached blocks rather than make decompression > recycle? > > I've searched back the archives a bit. There's a May 2013 thread about > slowness as well. I think we're seeing much much less than that. Our > page pools are of decent size. Its not just slowness, it's as if the > app never gets a block back at all. ( We could handle slowness .. ) > > Thanks. Open to ideas.. 
> > -Zach Giles > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Zach Giles zgiles at gmail.com From zgiles at gmail.com Tue Feb 14 16:25:09 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 14 Feb 2017 11:25:09 -0500 Subject: [gpfsug-discuss] read replica fastest tuning for short distance Message-ID: Hello all, ( Making good use of the mailing list recently.. :) ) I have two datacenters that are fairly close to each other (about 0.5ms away by-the-wire) and have a fairly small pipe between them ( single 40Gbit ). There is a stretched filesystem between the datacenters, two failure groups, and replicas=2 on all data and metadata. I'm trying to ensure that clients on each side only read their local replica instead of filling the pipe with reads from the other side. While readreplica=local would make sense, text suggests that it mostly checks to see if you're in the same subnet to check for local reads. This won't work for me since there are many many subnets on each side. The newer option of readreplica=fastest looks like a good idea, except that the latency of the connection between the datacenters is so small compared to the disk latency that reads often come from the wrong side. I've tried tuning fastestPolicyCmpThreshold down to 5 and fastestPolicyMinDiffPercent down to 10, but I still see reads from both sides. Does anyone have any pointers for tuning read replica using fastest on close-by multidatacenter installs to help ensure reads are only from one side? Any numbers that have been shown to work? I haven't been able to find a way to inspect the GPFS read latencies that it is using to make the decision. I looked in the dumps, but don't seem to see anything. Anyone know if it's possible and where they are? Thanks -Zach -- Zach Giles zgiles at gmail.com From usa-principal at gpfsug.org Tue Feb 14 19:29:04 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Tue, 14 Feb 2017 11:29:04 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> References: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> Message-ID: <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> I should have also asked for anyone interested in giving a talk, as usual, the users group meeting is not meant to be used as a sales and marketing platform, but user experiences are always welcome. If you're interested, or have an idea for a talk, please let us know so we can include it in the agenda. Thanks, Kristy & Bob On , usa-principal-gpfsug.org wrote: > Just a follow up reminder to save the date, April 4-5, for a two-day > Spectrum Scale Users Group event hosted by NERSC in Berkeley, > California. > > We are working on the registration form and agenda and hope to be able > to share more details soon. > > Best, > Kristy & Bob > > > On , usa-principal-gpfsug.org wrote: >> Hello all and happy new year (depending upon where you are right now >> :-) ). >> >> We'll have more details in 2017, but for now please save the date for >> a two-day users group meeting at NERSC in Berkeley, California. >> >> April 4-5, 2017 >> National Energy Research Scientific Computing Center (nersc.gov) >> Berkeley, California >> >> We look forward to offering our first two-day event in the US. 
>> >> Best, >> Kristy & Bob From mweil at wustl.edu Tue Feb 14 20:17:36 2017 From: mweil at wustl.edu (Matt Weil) Date: Tue, 14 Feb 2017 14:17:36 -0600 Subject: [gpfsug-discuss] GUI access Message-ID: Hello all, Some how we misplaced the password for our dev instance. Is there any way to reset it? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From r.sobey at imperial.ac.uk Tue Feb 14 20:31:16 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 14 Feb 2017 20:31:16 +0000 Subject: [gpfsug-discuss] GUI access In-Reply-To: References: Message-ID: Hi Matt This is what I got from support a few months ago when I had a problem with our "admin" user disappearing. "We have occasionally seen this issue in the past where it has been resolved by : /usr/lpp/mmfs/gui/cli/mkuser admin -p Passw0rd -g Administrator,SecurityAdmin This creates a new user named "admin" with the password "Passw0rd" " I was running 4.2.1-0 at the time iirc. ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Matt Weil Sent: 14 February 2017 20:17 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] GUI access Hello all, Some how we misplaced the password for our dev instance. Is there any way to reset it? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sfadden at us.ibm.com Tue Feb 14 21:02:06 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Tue, 14 Feb 2017 21:02:06 +0000 Subject: [gpfsug-discuss] GUI access In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From leoluan at us.ibm.com Wed Feb 15 00:14:12 2017 From: leoluan at us.ibm.com (Leo Luan) Date: Wed, 15 Feb 2017 00:14:12 +0000 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Wed Feb 15 13:17:40 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Wed, 15 Feb 2017 08:17:40 -0500 Subject: [gpfsug-discuss] Fw: mmbackup examples using policy In-Reply-To: References: Message-ID: Hi Steven: Yes that is more or less what we want to do. We have tivoli here for backup so I'm somewhat familiar with inclexcl files. The filesystem I want to backup is a shared home. Right now I do have a policy...mmlspolicy home -L does return a policy. 
So if I did not want to backup core and cache files I could create a backup policy using /var/mmfs/mmbackup/.mmbackupRules.home and place in it?: EXCLUDE "/gpfs/home/.../core" EXCLUDE "/igpfs/home/.../.opera/cache4" EXCLUDE "/gpfs/home/.../.netscape/cache/.../*" EXCLUDE "/gpfs/home/.../.mozilla/default/.../Cache" EXCLUDE "/gpfs/home/.../.mozilla/.../Cache/*" EXCLUDE "/gpfs/home/.../.mozilla/.../Cache" EXCLUDE "/gpfs/home/.../.cache/mozilla/*" EXCLUDE.DIR "/gpfs/home/.../.mozilla/firefox/.../Cache" I did a test run of mmbackup and I noticed I got a template put in that location: [root at cl002 ~]# ll -al /var/mmfs/mmbackup/ total 12 drwxr-xr-x 2 root root 4096 Feb 15 07:43 . drwxr-xr-x 10 root root 4096 Jan 4 10:42 .. -r-------- 1 root root 1177 Feb 15 07:43 .mmbackupRules.home So I can copy this off into /var/mmfs/etc for example and to use next time with my edits. What is normally used to schedule the mmbackup? Cronjob? dsmcad? Thanks much. On Tue, Feb 14, 2017 at 11:21 AM, Steven Berman wrote: > Eric, > What specifically do you wish to accomplish? It sounds to me like > you want to use mmbackup to do incremental backup of parts or all of your > file system. But your question did not specify what specifically other > than "whole file system incremental" you want to accomplish. Mmbackup by > default, with "-t incremental" will back up the whole file system, > including all filesets of either variety, and without regard to storage > pools. If you wish to back up only a sub-tree of the file system, it must > be in an independent fileset (--inode-space=new) and the current product > supports doing the backup of just that fileset. If you want to backup > parts of the file system but exclude things in certain storage pools, from > anywhere in the tree, you can either use "include exclude rules" in your > Spectrum Protect (formerly TSM) configuration file, or you can hand-edit > the policy rules for mmbackup which can be copied from /var/mmfs/mmbackup/.mmbackupRules. system name> (only persistent during mmbackup execution). Copy that > file to a new location, hand-edit and run mmbackup next time with -P policy rules file>. Is there something else you want to accomplish? > > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/ > com.ibm.spectrum.scale.v4r22.doc/bl1adv_semaprul.htm > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/ > com.ibm.spectrum.scale.v4r22.doc/bl1adm_backupusingmmbackup.htm > > Steven Berman Spectrum Scale / HPC General Parallel File > System Dev. > Pittsburgh, PA (412) 667-6993 Tie-Line 989-6993 > sberman at us.ibm.com > ----Every once in a while, it is a good idea to call out, "Computer, end > program!" just to check. --David Noelle > ----All Your Base Are Belong To Us. --CATS > > > > > > From: "J. Eric Wonderley" > To: gpfsug main discussion list > > Date: 02/13/2017 10:28 AM > Subject: [gpfsug-discuss] mmbackup examples using policy > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Anyone have any examples of this? I have a filesystem that has 2 pools > and several filesets and would like daily progressive incremental backups > of its contents. > > I found some stuff here(nothing real close to what I wanted however): > /usr/lpp/mmfs/samples/ilm > > I have the tsm client installed on the server nsds. 
> > Thanks much_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Wed Feb 15 16:43:37 2017 From: zgiles at gmail.com (Zachary Giles) Date: Wed, 15 Feb 2017 11:43:37 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: Message-ID: Just checked, we are definitely using PROT_READ, and the users only have read permission to the files, so it should be purely read. I guess that furthers the concern since we shouldn't be seeing the IO overhead as you mentioned. We also use madvise.. not sure if that helps or hurts. On Tue, Feb 14, 2017 at 7:14 PM, Leo Luan wrote: > Does your internally developed application do only reads during in its > monthly run? If so, can you change it to use PROT_READ flag during the > mmap call? That way you will not get the 10-block decompression IO overhead > and your files will remain compressed. The decompression happens upon > pagein's only if the mmap call includes the PROT_WRITE flag (or upon actual > writes for non-mmap IOs). > > Leo > > > ----- Original message ----- > From: Zachary Giles > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Questions about mmap GPFS and compression > Date: Tue, Feb 14, 2017 8:10 AM > > Hi Leo, > > I agree with your view on compression and what it should be used for, > in general. The read bandwidth amplification is definitely something > we're seeing. > > Just a little more background on the files: > The files themselves are not "cold" (archive), however, they are very > lightly used. The data set is thousands of files that are each > 100-200GB, totaling about a PB. the read pattern is a few GB from > about 20% of the files once a month. So the total read is only several > TB out of a PB every month. ( approximate ). We can get a compression > of about 5:1 using GPFS with these files, so we can gain back 800TB > with compression. The total run time of the app (reading all those all > chunks, when uncompressed) is maybe an hour total. > > Although leaving the files uncompressed would let the app work, > there's a huge gain to be had if we can make compression work by > saving ~800TB As it's such a small amount of data read each time, and > also not too predictable (it's semi-random historical), and as the > length of the job is short enough, it's hard to justify decompressing > large chunks of the system to run 1 job. I would have to decompress > 200TB to read 10TB, recompress them, and decompress a different > (overlapping) 200TB next month. The compression / decompression of > sizable portions of the data takes days. > > I think there maybe more of an issue that just performance though.. > The decompression thread is running, internal file metadata is read > fine, most of the file is read fine. Just at times it gets stuck.. the > decompression thread is running in GPFS, the app is polling, it just > never comes back with the block. I feel like there's a race condition > here where a block is read, available for the app, but thrown away > before the app can read it, only to be decompressed again. > It's strange how some block positions are slow (expected) and others > just never come back (it will poll for days on a certain address). > However, reading the file in-order is fine. 
> > Is this a block caching issue? Can we tune up the amount of blocks kept? > I think with mmap the blocks are not kept in page pool, correct? > > -Zach > > On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan wrote: >> Hi Zachary, >> >> When a compressed file is mmapped, each 4K read in your tests causes the >> accessed part of the file to be decompressed (in the granularity of 10 >> GPFS >> blocks). For usual file sizes, the parts being accessed will be >> decompressed and IOs speed will be normal except for the first 4K IO in >> each >> 10-GPFS-block group. For very large files, a large percentage of small >> random IOs may keep getting amplified to 10-block decompression IO for a >> long time. This is probably what happened in your mmap application run. >> >> The suggestion is to not compress files until they have become cold (not >> likely to be accessed any time soon) and avoid compressing very large >> files >> that may be accessed through mmap later. The product already has a >> built-in >> protection preventing compression of files that are mmapped at compression >> time. You can add an exclude rule in the compression policy run for files >> that are identified to have mmap performance issues (in case they get >> mmapped after being compressed in a periodical policy run). >> >> Leo Luan >> >> From: Zachary Giles >> To: gpfsug main discussion list >> Date: 02/10/2017 01:57 PM >> Subject: [gpfsug-discuss] Questions about mmap GPFS and compression >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> ________________________________ >> >> >> >> Hello All, >> >> I've been seeing some less than desirable behavior with mmap and >> compression in GPFS. Curious if others see similar or have any ideas >> if this is accurate.. >> The guys here want me to open an IBM ticket, but I figured I'd see if >> anyone has had this experience before. >> >> We have an internally developed app that runs on our cluster >> referencing data sitting in GPFS. It is using mmap to access the files >> due to a library we're using that requires it. >> >> If we run the app against some data on GPFS, it performs well.. >> finishing in a few minutes time -- Great. However, if we compress the >> file (in GPFS), the app is still running after 2 days time. >> stracing the app shows that is polling on a file descriptor, forever.. >> as if a data block is still pending. >> >> I know mmap is supported with compression according to the manual >> (with some stipulations), and that performance is expected to be much >> less since it's more large-block oriented due to decompressed in >> groups.. no problem. But it seems like some data should get returned. >> >> I'm surprised to find that a very small amount of data is sitting in >> the buffers (mmfsadm dump buffers) in reference to the inodes. The >> decompression thread is running continuously, while the app is still >> polling for data from memory and sleeping, retrying, sleeping, repeat. >> >> What I believe is happening is that the 4k pages are being pulled out >> of large decompression groups from an mmap read request, put in the >> buffer, then the compression group data is thrown away since it has >> the result it wants, only to need another piece of data that would >> have been in that group slightly later, which is recalled, put in the >> buffer.. etc. Thus an infinite slowdown. Perhaps also the data is >> expiring out of the buffer before the app has a chance to read it. I >> can't tell. In any case, the app makes zero progress. 
>> >> I tried without our app, using fio.. mmap on an uncompressed file with >> 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not >> impressive). However, on a compressed file it is only 20KB/s max. ( >> far less impressive ). Reading a file using aio etc is over 3GB/s on a >> single thread without even trying. >> >> What do you think? >> Anyone see anything like this? Perhaps there are some tunings to waste >> a bit more memory on cached blocks rather than make decompression >> recycle? >> >> I've searched back the archives a bit. There's a May 2013 thread about >> slowness as well. I think we're seeing much much less than that. Our >> page pools are of decent size. Its not just slowness, it's as if the >> app never gets a block back at all. ( We could handle slowness .. ) >> >> Thanks. Open to ideas.. >> >> -Zach Giles >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > -- > Zach Giles > zgiles at gmail.com > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Zach Giles zgiles at gmail.com From aaron.s.knister at nasa.gov Fri Feb 17 15:52:19 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 10:52:19 -0500 Subject: [gpfsug-discuss] bizarre performance behavior Message-ID: This is a good one. I've got an NSD server with 4x 16GB fibre connections coming in and 1x FDR10 and 1x QDR connection going out to the clients. I was having a really hard time getting anything resembling sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for reads). The back-end is a DDN SFA12K and I *know* it can do better than that. I don't remember quite how I figured this out but simply by running "openssl speed -multi 16" on the nsd server to drive up the load I saw an almost 4x performance jump which is pretty much goes against every sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to quadruple your i/o performance"). This feels like some type of C-states frequency scaling shenanigans that I haven't quite ironed down yet. I booted the box with the following kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which didn't seem to make much of a difference. I also tried setting the frequency governer to userspace and setting the minimum frequency to 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have to run something to drive up the CPU load and then performance improves. I'm wondering if this could be an issue with the C1E state? I'm curious if anyone has seen anything like this. The node is a dx360 M4 (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Fri Feb 17 16:43:34 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 17 Feb 2017 16:43:34 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: Maybe its related to interrupt handlers somehow? 
You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] Sent: 17 February 2017 15:52 To: gpfsug main discussion list Subject: [gpfsug-discuss] bizarre performance behavior This is a good one. I've got an NSD server with 4x 16GB fibre connections coming in and 1x FDR10 and 1x QDR connection going out to the clients. I was having a really hard time getting anything resembling sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for reads). The back-end is a DDN SFA12K and I *know* it can do better than that. I don't remember quite how I figured this out but simply by running "openssl speed -multi 16" on the nsd server to drive up the load I saw an almost 4x performance jump which is pretty much goes against every sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to quadruple your i/o performance"). This feels like some type of C-states frequency scaling shenanigans that I haven't quite ironed down yet. I booted the box with the following kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which didn't seem to make much of a difference. I also tried setting the frequency governer to userspace and setting the minimum frequency to 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have to run something to drive up the CPU load and then performance improves. I'm wondering if this could be an issue with the C1E state? I'm curious if anyone has seen anything like this. The node is a dx360 M4 (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Fri Feb 17 16:53:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 11:53:00 -0500 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: <104dc3f8-a91c-d9ae-3a86-88136c46de39@nasa.gov> Well, disabling the C1E state seems to have done the trick. I removed the kernel parameters I mentioned and set the cpu governer back to ondemand with a minimum of 1.2ghz. I'm now getting 6.2GB/s of reads which I believe is pretty darned close to theoretical peak performance. -Aaron On 2/17/17 10:52 AM, Aaron Knister wrote: > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). 
> > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Fri Feb 17 17:13:08 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 12:13:08 -0500 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
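A quick way to test the C1E theory without a trip into the BIOS is to watch idle-state residency and core frequency while the read test runs, then disable the suspect state from userspace and re-measure. A sketch, assuming the cpupower utility is installed; the state index used below is an assumption and must be confirmed from the state names first.

  # Sketch: inspect and disable a suspect idle state from userspace.
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name

  # Watch per-core frequency and C-state residency while the I/O test runs:
  cpupower monitor

  # Disable one state across all CPUs (state index 2 here is an assumption):
  for f in /sys/devices/system/cpu/cpu*/cpuidle/state2/disable; do echo 1 > "$f"; done

  # Pin the governor as well if frequency scaling is still suspected:
  cpupower frequency-set -g performance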
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Robert.Oesterlin at nuance.com Fri Feb 17 17:26:29 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 17 Feb 2017 17:26:29 +0000 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages Message-ID: Any way to suppress these? I get them every time mmpmon is run: Feb 17 11:54:02 nrg5-gpfs01 mmfs[10375]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Feb 17 11:55:01 nrg5-gpfs01 mmfs[13668]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Feb 17 11:56:02 nrg5-gpfs01 mmfs[17318]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From syi at ca.ibm.com Fri Feb 17 17:54:39 2017 From: syi at ca.ibm.com (Yi Sun) Date: Fri, 17 Feb 2017 12:54:39 -0500 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages In-Reply-To: References: Message-ID: It may relate to CommandAudit http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1xx_soc.htm Yi Sun > ------------------------------ > > Message: 5 > Date: Fri, 17 Feb 2017 17:26:29 +0000 > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages > Message-ID: > Content-Type: text/plain; charset="utf-8" > > Any way to suppress these? I get them every time mmpmon is run: > > Feb 17 11:54:02 nrg5-gpfs01 mmfs[10375]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > Feb 17 11:55:01 nrg5-gpfs01 mmfs[13668]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > Feb 17 11:56:02 nrg5-gpfs01 mmfs[17318]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Feb 17 17:58:28 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 17 Feb 2017 17:58:28 +0000 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages Message-ID: <3E007FA1-7152-45FB-B78E-2C92A34B7727@nuance.com> Bingo, that was it. I wish I could control it in a more fine-grained manner. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Yi Sun Reply-To: gpfsug main discussion list Date: Friday, February 17, 2017 at 11:54 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmpmon messages in /var/log/messages It may relate to CommandAudit http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1xx_soc.htm -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From janfrode at tanso.net Fri Feb 17 18:29:46 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 17 Feb 2017 18:29:46 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Message-ID: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, > I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
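For anyone wanting to try the same two knobs, the change can be scoped to the affected NSD server rather than the whole cluster. A sketch only: the node name and the values below are assumptions taken from Jan-Frode's numbers, and prefetchThreads may need GPFS restarted on that node before it takes effect.

  # Sketch: scope the tuning to one NSD server, then confirm and re-test.
  mmchconfig maxMBpS=100 -N nsdserver01 -i
  mmchconfig prefetchThreads=32 -N nsdserver01
  mmlsconfig maxMBpS
  mmlsconfig prefetchThreads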
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 20 15:35:09 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 15:35:09 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM Message-ID: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Feb 20 15:40:39 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 20 Feb 2017 15:40:39 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: Message-ID: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. 
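When chasing these waiters it usually helps to confirm what the daemon is actually running with, and how deep the backlog is, before adjusting verbsRdmasPerConnection further. A sketch of the usual checks on the affected NSD server; the dump component name is an assumption to verify on your release.

  # Sketch: confirm active verbs settings and measure the RDMA waiter backlog.
  mmdiag --config | grep -i verbs
  mmdiag --waiters | grep -c 'conn rdmas < conn maxrdmas'

  # Deeper per-connection detail, if needed (component name may vary by release):
  mmfsadm dump verbs | less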
Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Mon Feb 20 15:47:57 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 17:47:57 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Mon Feb 20 15:55:47 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 15:55:47 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D28B5F.82432C40] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: From orichards at pixitmedia.com Mon Feb 20 16:00:50 2017 From: orichards at pixitmedia.com (Orlando Richards) Date: Mon, 20 Feb 2017 16:00:50 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Woo! Still going strong! Lovely to hear it still being useful - thanks Kevin :) -- *Orlando Richards* VP Product Development, Pixit Media 07930742808|orichards at pixitmedia.com www.pixitmedia.com |Tw:@pixitmedia On 20/02/2017 15:40, Buterbaugh, Kevin L wrote: > Hi Mark, > > Are you referring to this? > > http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html > > It?s not magical, but it?s pretty good! ;-) Seriously, we use it any > time we want to move stuff around in our GPFS filesystems. > > Kevin > >> On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com >> wrote: >> >> I have a client that has around 200 filesets (must be a good reason >> for it) and they need to migrate data but it?s really looking like >> this might bring AFM to its knees. At one point, I had heard of some >> magical version of RSYNC that IBM developed that could do something >> like this. Anyone have any details on such a tool and is it >> available. Or is there some other way I might do this? >> >> *Mark R. Bush*| *Storage Architect* >> Mobile: 210-237-8415 >> Twitter:@bushmr | LinkedIn:/markreedbush >> >> 10100 Reunion Place, Suite 500, San Antonio, TX 78216 >> www.siriuscom.com >> |mark.bush at siriuscom.com >> > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu > - (615)875-9633 > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Mon Feb 20 16:04:26 2017 From: zgiles at gmail.com (Zachary Giles) Date: Mon, 20 Feb 2017 11:04:26 -0500 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Message-ID: Hey Mark, I'm curious about the idea behind 200 filesets bring AFM to its knees. Any specific part you're concerned about? -Zach On Mon, Feb 20, 2017 at 11:00 AM, Orlando Richards wrote: > Woo! Still going strong! 
Lovely to hear it still being useful - thanks > Kevin :) > > > -- > *Orlando Richards* > VP Product Development, Pixit Media > 07930742808 | orichards at pixitmedia.com > www.pixitmedia.com | Tw:@pixitmedia > > > > On 20/02/2017 15:40, Buterbaugh, Kevin L wrote: > > Hi Mark, > > Are you referring to this? > > http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012- > October/000169.html > > It?s not magical, but it?s pretty good! ;-) Seriously, we use it any > time we want to move stuff around in our GPFS filesystems. > > Kevin > > On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: > > I have a client that has around 200 filesets (must be a good reason for > it) and they need to migrate data but it?s really looking like this might > bring AFM to its knees. At one point, I had heard of some magical version > of RSYNC that IBM developed that could do something like this. Anyone have > any details on such a tool and is it available. Or is there some other way > I might do this? > > > > > *Mark R. Bush*| *Storage Architect* > Mobile: 210-237-8415 <(210)%20237-8415> > Twitter: @bushmr | LinkedIn: /markreedbush > > 10100 Reunion Place, Suite 500, San Antonio, TX 78216 > www.siriuscom.com |mark.bush at siriuscom.com > > > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 <(615)%20875-9633> > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other person. > Please notify the sender immediately and delete this email from your > computer system. Any opinions expressed are not necessarily those of the > company from which this email was sent and, whilst to the best of our > knowledge no viruses or defects exist, no responsibility can be accepted > for any loss or damage arising from its receipt or subsequent use of this > email. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- Zach Giles zgiles at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Mon Feb 20 16:05:27 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 18:05:27 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Message-ID: Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Not sure. It?s a 3.5 based cluster currently. 
From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Mon Feb 20 16:35:03 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Mon, 20 Feb 2017 17:35:03 +0100 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu><4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Message-ID: An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 20 16:54:23 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 16:54:23 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Message-ID: <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> Regular rsync apparently takes one week to sync up. I?m just the messenger getting more info from my client soon. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 10:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D28B67.B160D010] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image002.gif at 01D28B67.B160D010] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. 
Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 1853 bytes Desc: image002.gif URL: From YARD at il.ibm.com Mon Feb 20 17:03:29 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 19:03:29 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu><05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> Message-ID: Hi Split rsync into the directory level so u can run parallel rsync session , this way you maximize the network usage. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 06:54 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Regular rsync apparently takes one week to sync up. I?m just the messenger getting more info from my client soon. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 10:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. 
Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1853 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Feb 21 13:53:21 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 21 Feb 2017 13:53:21 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: Message-ID: Hey, we?ve got 400+ filesets and still adding more ? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark.Bush at siriuscom.com Sent: 20 February 2017 15:35 To: gpfsug main discussion list Subject: [gpfsug-discuss] 200 filesets and AFM I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From jonathon.anderson at colorado.edu Tue Feb 21 21:39:48 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Tue, 21 Feb 2017 21:39:48 +0000 Subject: [gpfsug-discuss] Bad performance with GPFS system monitoring (mmsysmon) in GPFS 4.2.1.1 Message-ID: This thread happened before I joined gpfsug-discuss; but be advised that we also experienced severe (1.5x-3x) performance degradation in user applications when running mmsysmon. In particular, we?re running a Haswell+OPA system. The issue appears to only happen when the user application is simultaneously using all available cores *and* communicating over the network. Synthetic cpu tests with HPL did not expose the issue, nor did OSU micro-benchmarks that were designed to maximize the network without necessarily using all CPUs. I?ve stopped mmsysmon by hand[^1] for now; but I haven?t yet gone so far as to remove the config file to prevent it from starting in the future. We intend to run further tests; but I wanted to share our experiences so far (as this took us way longer than I wish it had to diagnose). ~jonathon From dod2014 at med.cornell.edu Wed Feb 22 15:57:46 2017 From: dod2014 at med.cornell.edu (Douglas Duckworth) Date: Wed, 22 Feb 2017 10:57:46 -0500 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node Message-ID: Hello! I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. In addition I tried: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost However the same result. When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. So far I consulted the following documentation: http://ibm.co/2mcjK3P http://ibm.co/2lFSInH Could anyone please help? We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. Thanks so much! 
Best Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Wed Feb 22 16:12:15 2017 From: david_johnson at brown.edu (David D. Johnson) Date: Wed, 22 Feb 2017 11:12:15 -0500 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node In-Reply-To: References: Message-ID: I have a feeling that this is how mmchconfig is supposed to work. You?ve asked it to change the configuration of one node, but the database of configuration settings needs to be propagated to the entire cluster whenever a change is made. You?ll find a section in the mmlsconfig output specific to the node(s) that have been changed [node155] ?. At this point your configuration may be out of sync on any number of nodes. ? ddj Dave Johnson Brown University CCV/CIS > On Feb 22, 2017, at 10:57 AM, Douglas Duckworth wrote: > > Hello! > > I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! > > We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... > > Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: > > sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 > > When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. > > In addition I tried: > > sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost > > However the same result. > > When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. > > So far I consulted the following documentation: > > http://ibm.co/2mcjK3P > http://ibm.co/2lFSInH > > Could anyone please help? > > We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. > > Thanks so much! > > Best > Doug > > > Thanks, > > Douglas Duckworth, MSc, LFCS > HPC System Administrator > Scientific Computing Unit > Physiology and Biophysics > Weill Cornell Medicine > E: doug at med.cornell.edu > O: 212-746-6305 > F: 212-746-8690 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bbanister at jumptrading.com Wed Feb 22 16:17:09 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 22 Feb 2017 16:17:09 +0000 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node In-Reply-To: References: Message-ID: I agree with this assessment. I would also recommend looking into user defined node classes so that your mmlsconfig output is more easily readable, otherwise each node will be listed in the mmlsconfig output. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of David D. Johnson Sent: Wednesday, February 22, 2017 10:12 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing verbsPorts On Single Node I have a feeling that this is how mmchconfig is supposed to work. You?ve asked it to change the configuration of one node, but the database of configuration settings needs to be propagated to the entire cluster whenever a change is made. You?ll find a section in the mmlsconfig output specific to the node(s) that have been changed [node155] ?. At this point your configuration may be out of sync on any number of nodes. ? ddj Dave Johnson Brown University CCV/CIS On Feb 22, 2017, at 10:57 AM, Douglas Duckworth > wrote: Hello! I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. In addition I tried: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost However the same result. When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. So far I consulted the following documentation: http://ibm.co/2mcjK3P http://ibm.co/2lFSInH Could anyone please help? We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. Thanks so much! 
Best Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Feb 23 15:46:20 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 23 Feb 2017 15:46:20 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Message-ID: For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" Reply-To: "dW-notify at us.ibm.com" Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28DB9.AEDC8740] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
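Before acting on the flash, it may help to confirm which Scale level each node is actually running; a hedged sketch (package names and the availability of mmdsh can vary by install):

# Daemon build level on the local node
mmdiag --version

# Installed GPFS packages on the local node
rpm -qa | grep -i gpfs

# Or, across all nodes at once, if mmdsh is available
mmdsh -N all "rpm -q gpfs.base"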
Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From aaron.s.knister at nasa.gov Thu Feb 23 17:03:18 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 23 Feb 2017 12:03:18 -0500 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> On a particularly heavy loaded NSD server I'm seeing a lot of these messages: 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' I've tried tweaking verbsRdmasPerConnection but the issue seems to persist. Has anyone has encountered this and if so how'd you fix it? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Thu Feb 23 17:12:40 2017 From: oehmes at gmail.com (Sven Oehme) Date: Thu, 23 Feb 2017 17:12:40 +0000 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: all this waiter shows is that you have more in flight than the node or connection can currently serve. the reasons for that can be misconfiguration or you simply run out of resources on the node, not the connection. with latest code you shouldn't see this anymore for node limits as the system automatically adjusts the number of maximum RDMA's according to the systems Node capabilities : you should see messages in your mmfslog like : 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes verbsRdmaUseCompVectors=yes 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so (version >= 1.1) loaded and initialized. 
2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased from* 3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes.* 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE we want to eliminate all this configurable limits eventually, but this takes time, but as you can see above, we make progress on each release :-) Sven On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister wrote: > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
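A small sketch of the checks implied above, run on the busy NSD server (default log location assumed):

# Are these long-lived waiters or just transient micro-waits?
mmdiag --waiters | grep "conn rdmas"

# Current per-connection RDMA limit in effect on this node
mmdiag --config | grep -i verbsRdmasPerConnection

# Did the daemon auto-raise the per-node limit at startup, as in the log above?
grep "verbsRdmasPerNode" /var/adm/ras/mmfs.log.latest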
URL: From usa-principal at gpfsug.org Thu Feb 23 21:54:01 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Thu, 23 Feb 2017 13:54:01 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> References: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> Message-ID: <06d616c6d0da5b6aabae1f8d4bbc0b84@webmail.gpfsug.org> Hello, Information, including the registration form, for the April 4-5 User Group Meeting at NERSC (Berkeley, CA) is now available. Please register as early as possible so we can make final decisions about room selection and a science facility tour. The agenda is still be being finalized and we will continue to update the online agenda as details get settled. *We still have room for 2-3 20-minute user talks, if you are interested, please let us know.* Details, and a link to the registration form can be found here: https://www.nersc.gov/research-and-development/data-analytics/spectrum-user-group-meeting/ Looking forward to seeing you in April. Cheers, Kristy & Bob On , usa-principal-gpfsug.org wrote: > I should have also asked for anyone interested in giving a talk, as > usual, the users group meeting is not meant to be used as a sales and > marketing platform, but user experiences are always welcome. > > If you're interested, or have an idea for a talk, please let us know > so we can include it in the agenda. > > Thanks, > Kristy & Bob > > > On , usa-principal-gpfsug.org wrote: >> Just a follow up reminder to save the date, April 4-5, for a two-day >> Spectrum Scale Users Group event hosted by NERSC in Berkeley, >> California. >> >> We are working on the registration form and agenda and hope to be able >> to share more details soon. >> >> Best, >> Kristy & Bob >> >> >> On , usa-principal-gpfsug.org wrote: >>> Hello all and happy new year (depending upon where you are right now >>> :-) ). >>> >>> We'll have more details in 2017, but for now please save the date for >>> a two-day users group meeting at NERSC in Berkeley, California. >>> >>> April 4-5, 2017 >>> National Energy Research Scientific Computing Center (nersc.gov) >>> Berkeley, California >>> >>> We look forward to offering our first two-day event in the US. >>> >>> Best, >>> Kristy & Bob From willi.engeli at id.ethz.ch Fri Feb 24 12:39:03 2017 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Fri, 24 Feb 2017 12:39:03 +0000 Subject: [gpfsug-discuss] Performance Tests using Bonnie++ forces expell of the client running the test Message-ID: Dear all, Does one of you know if Bonnie++ io Test is compatible with GPFS and if, what could force expell of the client from the cluster? Thanks Willi -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From Achim.Rehor at de.ibm.com Fri Feb 24 13:24:50 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Fri, 24 Feb 2017 14:24:50 +0100 Subject: [gpfsug-discuss] Performance Tests using Bonnie++ forces expell of the client running the test In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
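Bonnie++ only issues ordinary POSIX I/O, so there is nothing inherently incompatible with GPFS; an expel during a heavy run more often points at the daemon network (for example lease timeouts) than at the benchmark itself. A hedged sketch, with the mount point, size and user as placeholders:

# -s larger than RAM to defeat caching, -n 0 to skip the small-file phase,
# -u is required when running as root
bonnie++ -d /gpfs/test/bonnie -s 64g -n 0 -u someuser

# If the client is expelled, look for the reason on both sides
grep -i -e expel -e lease /var/adm/ras/mmfs.log.latest   # on the expelled client
mmlsmgr -c                                               # identify the cluster manager
grep -i -e expel -e lease /var/adm/ras/mmfs.log.latest   # then on that manager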
Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: From bbanister at jumptrading.com Fri Feb 24 14:08:19 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 14:08:19 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E75.28281900] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From Paul.Sanchez at deshaw.com Fri Feb 24 15:15:59 2017 From: Paul.Sanchez at deshaw.com (Sanchez, Paul) Date: Fri, 24 Feb 2017 15:15:59 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? 
Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E86.6F1F9BB0] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From bbanister at jumptrading.com Fri Feb 24 15:25:14 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> I just got word that you only need to update the active file system manager node? 
I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E7F.E769D830] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From jfosburg at mdanderson.org Fri Feb 24 15:29:41 2017 From: jfosburg at mdanderson.org (Fosburgh,Jonathan) Date: Fri, 24 Feb 2017 15:29:41 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> References: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> Message-ID: <1487950179.11933.2.camel@mdanderson.org> FWIW, my contact said to do everything, even client only clusters. -- Jonathan Fosburgh Principal Application Systems Analyst Storage Team IT Operations jfosburg at mdanderson.org (713) 745-9346 -----Original Message----- Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption To: gpfsug main discussion list > Reply-to: gpfsug main discussion list From: Bryan Banister > I just got word that you only need to update the active file system manager node? I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. 
Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:1487950179.36938.0.camel at mdanderson.org] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The information contained in this e-mail message may be privileged, confidential, and/or protected from disclosure. This e-mail message may contain protected health information (PHI); dissemination of PHI should comply with applicable federal and state laws. If you are not the intended recipient, or an authorized representative of the intended recipient, any further review, disclosure, use, dissemination, distribution, or copying of this message or any attachment (or the information contained therein) is strictly prohibited. If you think that you have received this e-mail message in error, please notify the sender by return e-mail and delete all references to it and its contents from your systems. 
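Whichever interpretation turns out to be right, it is quick to see which nodes currently hold (or could take over) the file system manager role that performs log recovery; a sketch:

# Current file system manager for each file system, plus the cluster manager
mmlsmgr

# Nodes designated as manager-capable, any of which could be appointed later
mmlscluster | grep -i manager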
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From bbanister at jumptrading.com Fri Feb 24 16:21:07 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 16:21:07 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: <1487950179.11933.2.camel@mdanderson.org> References: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> <1487950179.11933.2.camel@mdanderson.org> Message-ID: Here is the latest I got from IBM: The fix only needs to be installed on the file system manager nodes. About how to know if your cluster is affected already, you can check if there was any MMFS_FSSTRUCT error in the system logs. If you encounter any lookup failure, funny ls cmd outputs. Or if any cmd would give some replica mismatch error or warning. If you encountered the following kind of Assertion failure you hit the bug. Thu Jul 21 03:26:32.373 2016: [X] *** Assert exp(prevIndEntryP->nextP->dataBlockNum > dataBlockNum) in line 4552 of file /project/sprelbmd/build/rbmd1629a/src/avs/fs/mmfs/ts/log/repUpdate.C Thu Jul 21 03:26:32.374 2016: [E] *** Traceback: Thu Jul 21 03:26:32.375 2016: [E] 2:0x7FE6E141AB36 logAssertFailed + 0x2D6 at Logger.C:546 Thu Jul 21 03:26:32.376 2016: [E] 3:0x7FE6E13FCD25 InodeRecoveryList::addInodeAndIndBlock(long long, unsigned int, RepDiskAddr const&, InodeRecoveryList::FlagsToSet, long long, RepDiskAddr const&) + 0x355 at repUpdate.C:4552 Thu Jul 21 03:26:32.377 2016: [E] 4:0x7FE6E1066879 RecoverDirEntry(StripeGroup*, LogRecovery*, LogFile*, LogRecordType, long long, int, unsigned int*, char*, int*, RepDiskAddr) + 0x1089 at direct.C:2312 Thu Jul 21 03:26:32.378 2016: [E] 5:0x7FE6E13F8741 LogRecovery::recoverOneObject(long long) + 0x1E1 at recoverlog.C:362 Thu Jul 21 03:26:32.379 2016: [E] 6:0x7FE6E0F29B25 MultiThreadWork::doNextStep() + 0xC5 at workthread.C:533 Thu Jul 21 03:26:32.380 2016: [E] 7:0x7FE6E0F29FBB MultiThreadWork::helperThreadBody(void*) + 0xCB at workthread.C:455 Thu Jul 21 03:26:32.381 2016: [E] 8:0x7FE6E0F5FB26 Thread::callBody(Thread*) + 0x46 at thread.C:393 Thu Jul 21 03:26:32.382 2016: [E] 9:0x7FE6E0F4DD12 Thread::callBodyWrapper(Thread*) + 0xA2 at mastdep.C:1077 Thu Jul 21 03:26:32.383 2016: [E] 10:0x7FE6E0667851 start_thread + 0xD1 at mastdep.C:1077 Thu Jul 21 03:26:32.384 2016: [E] 11:0x7FE6DF7BE90D clone + 0x6D at mastdep.C:1077 Hope that helps, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Fosburgh,Jonathan Sent: Friday, February 24, 2017 9:30 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption FWIW, my contact said to do everything, even client only clusters. 
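A hedged sketch of those checks, assuming the default log locations (/var/log/messages for the system log, /var/adm/ras/mmfs.log.* for the GPFS daemon log):

# FSSTRUCT errors raised by the daemon are reported through the system log
grep MMFS_FSSTRUCT /var/log/messages*

# The log-recovery assert quoted above, on nodes that have acted as
# file system manager for the affected file system
grep "prevIndEntryP->nextP->dataBlockNum" /var/adm/ras/mmfs.log*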
-- Jonathan Fosburgh Principal Application Systems Analyst Storage Team IT Operations jfosburg at mdanderson.org (713) 745-9346 -----Original Message----- Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption To: gpfsug main discussion list > Reply-to: gpfsug main discussion list > From: Bryan Banister > I just got word that you only need to update the active file system manager node? I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list > Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E87.B52EFB90] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. 
This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The information contained in this e-mail message may be privileged, confidential, and/or protected from disclosure. This e-mail message may contain protected health information (PHI); dissemination of PHI should comply with applicable federal and state laws. If you are not the intended recipient, or an authorized representative of the intended recipient, any further review, disclosure, use, dissemination, distribution, or copying of this message or any attachment (or the information contained therein) is strictly prohibited. If you think that you have received this e-mail message in error, please notify the sender by return e-mail and delete all references to it and its contents from your systems. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From SAnderson at convergeone.com Fri Feb 24 16:58:34 2017 From: SAnderson at convergeone.com (Shaun Anderson) Date: Fri, 24 Feb 2017 16:58:34 +0000 Subject: [gpfsug-discuss] NFS Permission matchup to mmnfs command Message-ID: <1487955513211.95497@convergeone.com> I have a customer currently using native NFS and we are going to move them over the CES. I'm looking at the mmnfs command and trying to map the nfs export arguments with the CES arguments. 
My customer has these currently: no_wdelay, nohide, rw, sync, no_root_squash, no_all_squash I have this so far: mmnfs export add /gpfs/ltfsee/ --client XX.XX.XX.XX ( Access_Type=RW, Squash=no_root_squash,noidsquash, NFS_COMMIT=true ) So the only arguments that don't appear accounted for is the 'nohide' parameter. Does this look right? SHAUN ANDERSON STORAGE ARCHITECT O 208.577.2112 M 214.263.7014 NOTICE: This email message and any attachments here to may contain confidential information. Any unauthorized review, use, disclosure, or distribution of such information is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy the original message and all copies of it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Feb 24 19:31:08 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 24 Feb 2017 14:31:08 -0500 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Fri Feb 24 19:39:30 2017 From: oehmes at gmail.com (Sven Oehme) Date: Fri, 24 Feb 2017 19:39:30 +0000 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: its more likely you run out of verbsRdmasPerNode which is the top limit across all connections for a given node. Sven On Fri, Feb 24, 2017 at 11:31 AM Aaron Knister wrote: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Wei1.Guo at UTSouthwestern.edu Fri Feb 24 23:10:07 2017 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Fri, 24 Feb 2017 23:10:07 +0000 Subject: [gpfsug-discuss] Hardening sudo wrapper? In-Reply-To: References: Message-ID: <1487977807260.32706@UTSouthwestern.edu> As per the knowledge page suggested (https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_configsudo.htm), a sudo wapper can work around with PermitRootLogin no. However, giving sudo right to a gpfsadmin account with /usr/bin/scp could be dangerous in the case of this gpfsadmin account been compromised. eg. [gpfsadmin at adminNode ~] $ sudo /usr/bin/scp `/bin/echo /dev/random` /path/to/any_important_files.txt Is it possible to remove scp from the sudoers commands? Instead of the recommended here, # Allow members of the gpfs group to run all commands but only selected commands without a password: %gpfsadmin ALL=(ALL) PASSWD: ALL, NOPASSWD: /usr/lpp/mmfs/bin/mmremote, /usr/bin/scp, /bin/echo, /usr/lpp/mmfs/bin/mmsdrrestore We would like to have this line like this: # Disabled command alias Cmnd_alias MMDELCMDS = /usr/lpp/mmfs/bin/mmdeldisk, /usr/lpp/mmfs/bin/mmdelfileset, /usr/lpp/mmfs/bin/mmdelfs, /usr/lpp/mmfs/bin/mmdelnsd, /usr/lpp/mmfs/bin/mmdelsnapshot %gpfsadmin ALL=(root : gpfsadmin) NOPASSWD: /bin/echo, /usr/lpp/mmfs/bin/?, !MMDELCMDS In this case, we limit the gpfsadmin group user to run only selected mm commands, also not including /usr/bin/scp. In the event of system breach, by loosing gpfsadmin group user account, scp will overwrite system config / user data. From my initial test, this seems to be OK for basic admin commands (such as mmstartup, mmshutdown, mmrepquota, mmchfs), but it did not pass the mmcommon test scpwrap command. ?[gpfsadmin at adminNode ~]$ sudo /usr/lpp/mmfs/bin/mmcommon test scpwrap node1 sudo: no tty present and no askpass program specified lost connection mmcommon: Remote copy file command "/usr/lpp/mmfs/bin/scpwrap" failed (push operation). Return code is 1. mmcommon test scpwrap: Command failed. Examine previous error messages to determine cause. [gpfsadmin at adminNode ~]$ sudo /usr/lpp/mmfs/bin/mmcommon test sshwrap node1 mmcommon test sshwrap: Command successfully completed It is unclear to me now that what exactly does the scp do in the sudo wrapper in the GPFS 4.2.0 version as per Yuri Volobuev's note GPFS and Remote Shell (https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20and%20Remote%20Shell). Will the mmsdrrestore still use scp or rcp to copy the cluster configuration file mmsdrfs around from the central node? Or it uses RPC to synchronize? Are we OK to drop scp/rcp and limit the commands to run? Is there any risk, security wise and performance wise? Can we limit the gpfsadmin account to a very very small level of privilege? I have send this message to gpfs at us.ibm.com and posted at developer works, but I think the answer could benefit other users. 
Thanks Wei Guo ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Friday, February 24, 2017 1:39 PM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 61, Issue 46 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. NFS Permission matchup to mmnfs command (Shaun Anderson) 2. Re: waiting for conn rdmas < conn maxrdmas (Aaron Knister) 3. Re: waiting for conn rdmas < conn maxrdmas (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Fri, 24 Feb 2017 16:58:34 +0000 From: Shaun Anderson To: gpfsug main discussion list Subject: [gpfsug-discuss] NFS Permission matchup to mmnfs command Message-ID: <1487955513211.95497 at convergeone.com> Content-Type: text/plain; charset="iso-8859-1" I have a customer currently using native NFS and we are going to move them over the CES. I'm looking at the mmnfs command and trying to map the nfs export arguments with the CES arguments. My customer has these currently: no_wdelay, nohide, rw, sync, no_root_squash, no_all_squash I have this so far: mmnfs export add /gpfs/ltfsee/ --client XX.XX.XX.XX ( Access_Type=RW, Squash=no_root_squash,noidsquash, NFS_COMMIT=true ) So the only arguments that don't appear accounted for is the 'nohide' parameter. Does this look right? SHAUN ANDERSON STORAGE ARCHITECT O 208.577.2112 M 214.263.7014 NOTICE: This email message and any attachments here to may contain confidential information. Any unauthorized review, use, disclosure, or distribution of such information is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy the original message and all copies of it. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Fri, 24 Feb 2017 14:31:08 -0500 From: Aaron Knister To: Subject: Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: Content-Type: text/plain; charset="windows-1252"; format=flowed Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ------------------------------ Message: 3 Date: Fri, 24 Feb 2017 19:39:30 +0000 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: Content-Type: text/plain; charset="utf-8" its more likely you run out of verbsRdmasPerNode which is the top limit across all connections for a given node. Sven On Fri, Feb 24, 2017 at 11:31 AM Aaron Knister wrote: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
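For the 'waiting for conn rdmas < conn maxrdmas' thread above, a minimal sketch of how the limits Sven describes can be checked on an NSD server. The commands are the standard mmlsconfig / mmdiag / mmchconfig ones; the log path, the example value and the node class name are only placeholders.

# current settings for the two limits discussed above
mmlsconfig verbsRdmasPerNode
mmlsconfig verbsRdmasPerConnection

# what the daemon actually settled on at startup (auto-tuned on newer code)
grep verbsRdmasPerNode /var/adm/ras/mmfs.log.latest

# are the VERBS waiters still accumulating?
mmdiag --waiters | grep 'conn rdmas'

# if the per-node limit really is the bottleneck on older code it can be raised
# explicitly; the value and node class are examples only, and the daemon normally
# needs to be recycled on the affected nodes for the change to take effect
mmchconfig verbsRdmasPerNode=3740 -N nsdNodes
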
URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 61, Issue 46 ********************************************** ________________________________ UT Southwestern Medical Center The future of medicine, today. From service at metamodul.com Mon Feb 27 10:22:48 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 27 Feb 2017 11:22:48 +0100 (CET) Subject: [gpfsug-discuss] Q: backup with dsmc & .snapshots directory Message-ID: <459383319.282012.1488190969081@email.1und1.de> An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Feb 27 11:13:59 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 27 Feb 2017 11:13:59 +0000 Subject: [gpfsug-discuss] Q: backup with dsmc & .snapshots directory In-Reply-To: <459383319.282012.1488190969081@email.1und1.de> Message-ID: I usually exclude them. Otherwise you will end up with lots of data on the TSM backend. -- Cheers > On 27 Feb 2017, at 12.23, Hans-Joachim Ehlers wrote: > > Hi, > > short question: if we are using the native TSM dsmc Client, should we exclude the "./.snapshots/." directory from the backup or is it best practise to backup the .snapshots as well. > > Note: We DO NOT use a dedicated .snapshots directory for backups right now. The snapshots directory is created by a policy which is not adapted for TSM so the snapshot creation and deletion is not synchronized with TSM. In the near future we might use dedicated .snapshots for the backup. > > tia > > Hajo > > - > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 > Mail: service at metamodul.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 11:30:15 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 11:30:15 +0000 Subject: [gpfsug-discuss] Tracking deleted files Message-ID: Hi, Is there a way to track files which have been deleted easily? I'm assuming that we can't easily use a policy scan as they files are no longer in the file-system unless we do some sort of diff? I'm assuming there must be a way of doing this as mmbackup must track deleted files to notify TSM of expired objects. Basically I want a list of new files, changed files and deleted files since a certain time. I'm assuming the first two will be relatively simple with a policyscan, but the latter I'm not sure about. Thanks Simon From jtucker at pixitmedia.com Mon Feb 27 11:59:44 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 27 Feb 2017 11:59:44 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Hi Simon I presented exactly this (albeit briefly) at the 2016 UG. 
See the snapdiff section of the presentation at: http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf We can track creations, modifications, deletions and moves (from, to) for files and directories between one point in time and another. The selections can be returned via a manner of your choice. If anyone wants to know more, hit me up directly. Incidentally - I will be at BVE this week (http://www.bvexpo.com/) showing new things driven by the Python API and GPFS - so if anyone is in the area and wants to chat about technicals in person rather than on mail, drop me a line and we can sort that out. Best, Jez On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT Services) wrote: > Hi, > > Is there a way to track files which have been deleted easily? I'm assuming > that we can't easily use a policy scan as they files are no longer in the > file-system unless we do some sort of diff? > > I'm assuming there must be a way of doing this as mmbackup must track > deleted files to notify TSM of expired objects. > > Basically I want a list of new files, changed files and deleted files > since a certain time. I'm assuming the first two will be relatively simple > with a policyscan, but the latter I'm not sure about. > > Thanks > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Mon Feb 27 12:00:54 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 27 Feb 2017 13:00:54 +0100 (CET) Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: <783766399.287097.1488196854922@email.1und1.de> An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 12:39:02 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 12:39:02 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Yeah but that uses snapshots, which is pretty heavy-weight for what I want to do, particularly given mmbackup seems to have a way of tracking deletes... Simon From: > on behalf of Jez Tucker > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Monday, 27 February 2017 at 11:59 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files Hi Simon I presented exactly this (albeit briefly) at the 2016 UG. See the snapdiff section of the presentation at: http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf We can track creations, modifications, deletions and moves (from, to) for files and directories between one point in time and another. The selections can be returned via a manner of your choice. 
If anyone wants to know more, hit me up directly. Incidentally - I will be at BVE this week (http://www.bvexpo.com/) showing new things driven by the Python API and GPFS - so if anyone is in the area and wants to chat about technicals in person rather than on mail, drop me a line and we can sort that out. Best, Jez On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT Services) > wrote: Hi, Is there a way to track files which have been deleted easily? I'm assuming that we can't easily use a policy scan as they files are no longer in the file-system unless we do some sort of diff? I'm assuming there must be a way of doing this as mmbackup must track deleted files to notify TSM of expired objects. Basically I want a list of new files, changed files and deleted files since a certain time. I'm assuming the first two will be relatively simple with a policyscan, but the latter I'm not sure about. Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- [http://www.pixitmedia.com/sig/pxone_pt1.png][http://www.pixitmedia.com/sig/pxone_pt2.png][http://www.pixitmedia.com/sig/pxone_pt3.png][http://www.pixitmedia.com/sig/pxone_pt4.png] [http://pixitmedia.com/sig/BVE-Banner4.png] This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtucker at pixitmedia.com Mon Feb 27 13:11:59 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 27 Feb 2017 13:11:59 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Hi Whilst it does use snapshots, I'd argue that snapshot creation is pretty lightweight - and always consistent. Your alternative via the mmbackup 'tracking' route is to parse out the mmbackup shadow file. AFAIK to do this /properly in a timely fashion/ you'd need to do this as an inline post process after the scan phase of mmbackup has run, else you're instead looking at the outdated view of the shadow file post previous mmbackup run. mmbackup does not 'track' file changes, it performs a comparison pass between the filesystem contents and what TSM _believes_ is the known state of the file system during each run. If a change is made oob of TSM then you need to re-generate the show file to regain total consistency. Sensibly you should be running any mmbackup process from a snapshot to perform consistent backups without dsmc errors. So all things being equal, using snapshots for exact consistency and not having to regenerate (very heavyweight) or parse out a shadow file periodically is a lighter weight, smoother and reliably consistent workflow. YMMV with either approach depending on your management of TSM and your interpretation of 'consistent view' vs 'good enough'. 
Jez On Mon, 27 Feb 2017 at 12:39, Simon Thompson (Research Computing - IT Services) wrote: > Yeah but that uses snapshots, which is pretty heavy-weight for what I want > to do, particularly given mmbackup seems to have a way of tracking > deletes... > > Simon > > From: on behalf of Jez Tucker < > jtucker at pixitmedia.com> > Reply-To: "gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > Date: Monday, 27 February 2017 at 11:59 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files > > Hi Simon > > I presented exactly this (albeit briefly) at the 2016 UG. > > See the snapdiff section of the presentation at: > > > http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf > > We can track creations, modifications, deletions and moves (from, to) for > files and directories between one point in time and another. > > The selections can be returned via a manner of your choice. > > If anyone wants to know more, hit me up directly. > > Incidentally - I will be at BVE this week (http://www.bvexpo.com/) > showing new things driven by the Python API and GPFS - so if anyone is in > the area and wants to chat about technicals in person rather than on mail, > drop me a line and we can sort that out. > > Best, > > Jez > > > On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT > Services) wrote: > > Hi, > > Is there a way to track files which have been deleted easily? I'm assuming > that we can't easily use a policy scan as they files are no longer in the > file-system unless we do some sort of diff? > > I'm assuming there must be a way of doing this as mmbackup must track > deleted files to notify TSM of expired objects. > > Basically I want a list of new files, changed files and deleted files > since a certain time. I'm assuming the first two will be relatively simple > with a policyscan, but the latter I'm not sure about. > > Thanks > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other person. > Please notify the sender immediately and delete this email from your > computer system. Any opinions expressed are not necessarily those of the > company from which this email was sent and, whilst to the best of our > knowledge no viruses or defects exist, no responsibility can be accepted > for any loss or damage arising from its receipt or subsequent use of this > email. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. 
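For the tracking question itself, a rough sketch of the policy-scan-and-compare approach the thread keeps circling around. The rule names, paths and the <prefix>.list.<listname> output naming are illustrative and worth checking against the policy documentation for your release. A two-rule policy file, say list-all.pol, that simply lists every file:

RULE EXTERNAL LIST 'all' EXEC ''
RULE 'all' LIST 'all'

Run it deferred so it only writes the list, once per day (or against a snapshot for a consistent view, as Jez notes):

mmapplypolicy /gpfs/fs1 -P list-all.pol -I defer -f /tmp/scan-$(date +%F)

Then the classic sorted set difference between two runs:

# records only in the older list = deleted (or renamed) since that scan
comm -23 <(sort /tmp/scan-2017-02-26.list.all) <(sort /tmp/scan-2017-02-27.list.all)

# records only in the newer list = new files
comm -13 <(sort /tmp/scan-2017-02-26.list.all) <(sort /tmp/scan-2017-02-27.list.all)

Changed files would additionally need the timestamp, for example a SHOW clause carrying MODIFICATION_TIME or a separate WHERE rule on it.
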
-------------- next part -------------- An HTML attachment was scrubbed... URL: From jonathan at buzzard.me.uk Mon Feb 27 13:25:21 2017 From: jonathan at buzzard.me.uk (Jonathan Buzzard) Date: Mon, 27 Feb 2017 13:25:21 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: <1488201921.4074.114.camel@buzzard.me.uk> On Mon, 2017-02-27 at 12:39 +0000, Simon Thompson (Research Computing - IT Services) wrote: > Yeah but that uses snapshots, which is pretty heavy-weight for what I > want to do, particularly given mmbackup seems to have a way of > tracking deletes... > It has been discussed in the past, but the way to track stuff is to enable HSM and then hook into the DSMAPI. That way you can see all the file creates and deletes "live". I can't however find a reference to it now. I have a feeling it was in the IBM GPFS forum however. It would however require you to get your hands dirty writing code. JAB. -- Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk Fife, United Kingdom. From luis.bolinches at fi.ibm.com Mon Feb 27 13:25:15 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 27 Feb 2017 13:25:15 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 13:32:42 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 13:32:42 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: <1488201921.4074.114.camel@buzzard.me.uk> References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: >It has been discussed in the past, but the way to track stuff is to >enable HSM and then hook into the DSMAPI. That way you can see all the >file creates and deletes "live". Won't work, I already have a "real" HSM client attached to DMAPI (dsmrecalld). I'm not actually wanting to backup for this use case, we already have mmbackup running to do those things, but it was a list of deleted files that I was after (I just thought it might be easy given mmbackup is tracking it already). Simon From oehmes at gmail.com Mon Feb 27 13:37:46 2017 From: oehmes at gmail.com (Sven Oehme) Date: Mon, 27 Feb 2017 13:37:46 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: <1488201921.4074.114.camel@buzzard.me.uk> References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: a couple of years ago tridge demonstrated things you can do with DMAPI interface and even delivered some non supported example code to demonstrate it : https://www.samba.org/~tridge/hacksm/ keep in mind that the DMAPI interface has some severe limitations in terms of scaling, it can only run on one node and can have only one subscriber. we are working on a more scalable and supported solution to accomplish what is asks for (track operations, not just delete) , stay tuned in one of the next user group meetings where i will present (Germany and/or London). Sven On Mon, Feb 27, 2017 at 5:25 AM Jonathan Buzzard wrote: > On Mon, 2017-02-27 at 12:39 +0000, Simon Thompson (Research Computing - > IT Services) wrote: > > Yeah but that uses snapshots, which is pretty heavy-weight for what I > > want to do, particularly given mmbackup seems to have a way of > > tracking deletes... > > > > It has been discussed in the past, but the way to track stuff is to > enable HSM and then hook into the DSMAPI. That way you can see all the > file creates and deletes "live". 
> > I can't however find a reference to it now. I have a feeling it was in > the IBM GPFS forum however. > > It would however require you to get your hands dirty writing code. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 13:41:47 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 13:41:47 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: Manchester ... The UK meeting is most likely going to be in Manchester ... 9th/10th May if you wanted to pencil something in (we're just waiting for final confirmation of the venue being booked). Simon From: > on behalf of Sven Oehme > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Monday, 27 February 2017 at 13:37 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files we are working on a more scalable and supported solution to accomplish what is asks for (track operations, not just delete) , stay tuned in one of the next user group meetings where i will present (Germany and/or London). -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Mon Feb 27 13:55:26 2017 From: stef.coene at docum.org (Stef Coene) Date: Mon, 27 Feb 2017 14:55:26 +0100 Subject: [gpfsug-discuss] Policy question Message-ID: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> Hi, I have a file system with 2 pools: V500001 and NAS01. I want to use pool V500001 as the default and migrate the oldest files to the pool NAS01 when the pool V500001 fills up. Whatever rule combination I tried, I can not get this working. This is the currently defined policy (created by the GUI): RULE 'Migration' MIGRATE FROM POOL 'V500001' THRESHOLD(95,85) WEIGHT(100000 - DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) TO POOL 'NAS01' RULE 'Default to V5000' SET POOL 'V500001' And also, how can I monitor the migration processes? Stef From makaplan at us.ibm.com Mon Feb 27 16:00:24 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 27 Feb 2017 11:00:24 -0500 Subject: [gpfsug-discuss] Policy questions In-Reply-To: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> References: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> Message-ID: I think you have the sign wrong on your weight. A simple way of ordering the files oldest first is WEIGHT(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) adding 100,000 does nothing to change the order. WEIGHT can be any numeric SQL expression. So come to think of it WEIGHT( - DAYS(ACCESS_TIME) ) is even simpler and will yield the same ordering Also, you must run or schedule the mmapplypolicy command to run to actually do the migration. It doesn't happen until the mmapplypolicy command is running. You can run mmapplypolicy periodically (e.g. 
with crontab) or on demand with mmaddcallback (GPFS events facility) This is all covered in the very fine official Spectrum Scale documentation and/or some of the supplemental IBM red books, all available for free downloads from ibm.com --marc of GPFS From: Stef Coene To: gpfsug main discussion list Date: 02/27/2017 08:55 AM Subject: [gpfsug-discuss] Policy question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, I have a file system with 2 pools: V500001 and NAS01. I want to use pool V500001 as the default and migrate the oldest files to the pool NAS01 when the pool V500001 fills up. Whatever rule combination I tried, I can not get this working. This is the currently defined policy (created by the GUI): RULE 'Migration' MIGRATE FROM POOL 'V500001' THRESHOLD(95,85) WEIGHT(100000 - DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) TO POOL 'NAS01' RULE 'Default to V5000' SET POOL 'V500001' And also, how can I monitor the migration processes? Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 27 19:40:57 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 19:40:57 +0000 Subject: [gpfsug-discuss] SMB and AD authentication Message-ID: For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
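Picking up Marc's reply on the migration policy a little further up the digest, the corrected rule set would look roughly like this; the filesystem name fs01 and the policy file name are placeholders:

RULE 'Migration' MIGRATE FROM POOL 'V500001'
     THRESHOLD(95,85)
     WEIGHT(- DAYS(ACCESS_TIME))
     TO POOL 'NAS01'
RULE 'Default to V5000' SET POOL 'V500001'

The placement rule stays installed with mmchpolicy; the MIGRATE rule only moves data when mmapplypolicy actually runs (from cron or from a low-space callback), and running it with a higher -L level is one way to watch the migration progress:

mmapplypolicy fs01 -P migrate.pol -I yes -L 2
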
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From YARD at il.ibm.com Mon Feb 27 19:46:07 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 27 Feb 2017 21:46:07 +0200 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 8745 bytes Desc: not available URL: From laurence at qsplace.co.uk Mon Feb 27 19:46:59 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 27 Feb 2017 19:46:59 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Do you have UID/GID for the user in your AD schema? or the rfc 2307 extended schema? AFAIK it uses winbinds IDMAP so requires rfc 2307 attributes rather than using the windows SID and working the UID/GID using autorid etc. -- Lauz On 27 February 2017 19:40:57 GMT+00:00, "Mark.Bush at siriuscom.com" wrote: >For some reason, I just can?t seem to get this to work. I have >configured my protocol nodes to authenticate to AD using the following > >mmuserauth service create --type ad --data-access-method file --servers >192.168.88.3 --user-name administrator --netbios-name scale >--idmap-role master --password ********* --idmap-range-size 1000000 >--idmap-range 10000000-299999999 --enable-nfs-kerberos >--unixmap-domains 'sirius(10000-20000)' > > >All goes well, I see the nodes in AD and all of the wbinfo commands >show good (id Sirius\\administrator doesn?t work though), but when I >try to mount an SMB share (after doing all the necessary mmsmb export >stuff) I get permission denied. I?m curious if I missed a step >(followed the docs pretty much to the letter). I?m trying >Administrator, mark.bush, and a dummy aduser I created. None seem to >gain access to the share. > >Protocol gurus help! Any ideas are appreciated. > > >[id:image001.png at 01D2709D.6EF65720] >Mark R. Bush| Storage Architect >Mobile: 210-237-8415 >Twitter: @bushmr | LinkedIn: >/markreedbush >10100 Reunion Place, Suite 500, San Antonio, TX 78216 >www.siriuscom.com >|mark.bush at siriuscom.com > > >This message (including any attachments) is intended only for the use >of the individual or entity to which it is addressed and may contain >information that is non-public, proprietary, privileged, confidential, >and exempt from disclosure under applicable law. If you are not the >intended recipient, you are hereby notified that any use, >dissemination, distribution, or copying of this communication is >strictly prohibited. This message may be viewed by parties at Sirius >Computer Solutions other than those named in the message header. This >message does not contain an official representation of Sirius Computer >Solutions. If you have received this communication in error, notify >Sirius Computer Solutions immediately and (i) destroy this message if a >facsimile or (ii) delete this message immediately if this is an >electronic communication. Thank you. > >Sirius Computer Solutions -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... 
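A quick way to check the rfc2307 side Laurence mentions before going into winbind traces. wbinfo normally ships with the CES Samba packages under /usr/lpp/mmfs/bin; the ldapsearch bind DN, base and domain suffix below are purely illustrative, only the server address and the id range come from the original post:

# does winbind resolve the user and produce a uid/gid mapping?
/usr/lpp/mmfs/bin/wbinfo -i 'SIRIUS\mark.bush'
id 'SIRIUS\mark.bush'

# the user needs a uidNumber, and the user's primary AD group a gidNumber,
# both inside the configured 10000-20000 range
ldapsearch -x -LLL -H ldap://192.168.88.3 -D 'administrator@sirius.local' -W \
  -b 'dc=sirius,dc=local' '(sAMAccountName=mark.bush)' uidNumber primaryGroupID
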
URL: From Mark.Bush at siriuscom.com Mon Feb 27 19:50:17 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 19:50:17 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: [root at n1 ~]# mmsmb export list share2 export path browseable guest ok smb encrypt share2 /gpfs/fs1/sales yes no auto [root at n1 ~]# ls -l /gpfs/fs1 total 0 drwxrwxrwx 2 root root 4096 Feb 25 12:33 sales From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 27, 2017 at 1:46 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] SMB and AD authentication Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D29100.6E55CCF0] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. [cid:image002.png at 01D29100.6E55CCF0] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 8746 bytes Desc: image002.png URL: From christof.schmitt at us.ibm.com Mon Feb 27 19:59:46 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 27 Feb 2017 12:59:46 -0700 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: --unixmap-domains 'sirius(10000-20000)' specifies that for the domain SIRIUS, all uid and gids are stored as rfc2307 attributes in the user and group objects in AD. If "id Sirius\\administrator" does not work, that might already point to missing data in AD. The requirement is that the user has a uidNumber defined, and the user's primary group in AD has to have a gidNumber defined. Note that a gidNumber defined for the user is not read by Spectrum Scale at this point. All uidNumber and gidNumber attributes have to fall in the defined range (10000-20000). If verifying the above points does not help, then a winbindd trace might help to point to the missing step: /usr/lpp/mmfs/bin/smbcontrol winbindd debug 10 id Sirius\\administrator /usr/lpp/mmfs/bin/smbcontrol winbindd debug 1 /var/adm/ras/log.winbindd-idmap is the log file for the idmap queries; it might show a failing ldap query in this case. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 12:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. 
If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From YARD at il.ibm.com Mon Feb 27 20:04:09 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 27 Feb 2017 22:04:09 +0200 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Hi What does the command return when you run it on the protocols nodes: #id 'DOM\user' Please follow this steps: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/ibmspectrumscale42_content.html SA23-1452-06 05/2016 IBM Spectrum Scale V4.2: Administration and Programming Reference Page - 135 Creating SMB share Use the following information to create an SMB share: 1. Create the directory to be exported through SMB: mmcrfileset fs01 fileset --inode-space=new mmlinkfileset fs01 fileset -J /gpfs/fs01/fileset mkdir /gpfs/fs01/fileset/smb Note: IBM recommends an independent fileset for SMB shares. Create a new independent fileset with these commands: mmcrfileset fs01 fileset --inode-space=new mmlinkfileset fs01 fileset -J /gpfs/fs01/fileset If the directory to be exported does not exist, create the directory first by running the following command: mkdir /gpfs/fs01/fileset/smb" 2. The recommended approach for managing access to the SMB share is to manage the ACLs from a Windows client machine. To change the ACLs from a Windows client, change the owner of the share folder to a user ID that will be used to make the ACL changes by running the following command: chown ?DOMAIN\smbadmin? /gpfs/fs01/fileset/smb 3. Create the actual SMB share on the existing directory: mmsmb export add smbexport /gpfs/fs01/fileset/smb Additional options can be set during share creation. For the documentation of all supported options, see ?mmsmb command? on page 663. 4. Verify that the share has been created: mmsmb export list 5. Access the share from a Windows client using the user ID that has been previously made the owner of the folder. 6. Right-click the folder in the Windows Explorer, open the Security tab, click Advanced, and modify the Access Control List as required. Note: An SMB share can only be created when the ACL setting of the underlying file system is -k nfsv4. In all other cases, mmsmb export create will fail with an error. See ?Authorizing protocol users? 
on page 200 for details and limitations Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:50 PM Subject: Re: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org [root at n1 ~]# mmsmb export list share2 export path browseable guest ok smb encrypt share2 /gpfs/fs1/sales yes no auto [root at n1 ~]# ls -l /gpfs/fs1 total 0 drwxrwxrwx 2 root root 4096 Feb 25 12:33 sales From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 27, 2017 at 1:46 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] SMB and AD authentication Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 8746 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Mon Feb 27 20:12:23 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 20:12:23 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: That was it. I just didn?t have the ScaleUsers group (special AD group I created) set as AD user Sirius\mark.bush?s primary group. Once I did that bam?shares show up and I can view and id works too. Thanks Christof. On 2/27/17, 1:59 PM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Christof Schmitt" wrote: --unixmap-domains 'sirius(10000-20000)' specifies that for the domain SIRIUS, all uid and gids are stored as rfc2307 attributes in the user and group objects in AD. If "id Sirius\\administrator" does not work, that might already point to missing data in AD. The requirement is that the user has a uidNumber defined, and the user's primary group in AD has to have a gidNumber defined. Note that a gidNumber defined for the user is not read by Spectrum Scale at this point. All uidNumber and gidNumber attributes have to fall in the defined range (10000-20000). If verifying the above points does not help, then a winbindd trace might help to point to the missing step: /usr/lpp/mmfs/bin/smbcontrol winbindd debug 10 id Sirius\\administrator /usr/lpp/mmfs/bin/smbcontrol winbindd debug 1 /var/adm/ras/log.winbindd-idmap is the log file for the idmap queries; it might show a failing ldap query in this case. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 12:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. 
None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions From ewahl at osc.edu Mon Feb 27 20:50:49 2017 From: ewahl at osc.edu (Edward Wahl) Date: Mon, 27 Feb 2017 15:50:49 -0500 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: <20170227155049.22001bb0@osc.edu> I can think of a couple of ways to do this. But using snapshots seems heavy, but so does using mmbackup unless you are already running it every day. Diff the shadow files? Haha could be a _terrible_ idea if you have a couple hundred million files. But it IS possible. Next, I'm NOT a tsm expert, but I know a bit about it: (and I probably stayed at a Holiday Inn express at least once in my heavy travel days) -query objects using '-ina=yes' and yesterdays date? Might be a touch slow. But it probably uses the next one as it's backend: -db2 query inside TSM to see a similar thing. This ought to be the fastest, and I'm sure with a little google'ing you can work this out. Tivoli MUST know exact dates of deletion as it uses that and the retention time to know when to purge/reclaim deleted objects from it's storage pools. 
(retain extra version or RETEXTRA or retain only version) Ed On Mon, 27 Feb 2017 13:32:42 +0000 "Simon Thompson (Research Computing - IT Services)" wrote: > >It has been discussed in the past, but the way to track stuff is to > >enable HSM and then hook into the DSMAPI. That way you can see all the > >file creates and deletes "live". > > Won't work, I already have a "real" HSM client attached to DMAPI > (dsmrecalld). > > I'm not actually wanting to backup for this use case, we already have > mmbackup running to do those things, but it was a list of deleted files > that I was after (I just thought it might be easy given mmbackup is > tracking it already). > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Ed Wahl Ohio Supercomputer Center 614-292-9302 From makaplan at us.ibm.com Mon Feb 27 21:23:52 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 27 Feb 2017 16:23:52 -0500 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: <20170227155049.22001bb0@osc.edu> References: <1488201921.4074.114.camel@buzzard.me.uk> <20170227155049.22001bb0@osc.edu> Message-ID: Diffing file lists can be fast - IF you keep the file lists sorted by a unique key, e.g. the inode number. I believe that's how mmbackup does it. Use the classic set difference algorithm. Standard diff is designed to do something else and is terribly slow on large file lists. From: Edward Wahl To: "Simon Thompson (Research Computing - IT Services)" Cc: gpfsug main discussion list Date: 02/27/2017 03:51 PM Subject: Re: [gpfsug-discuss] Tracking deleted files Sent by: gpfsug-discuss-bounces at spectrumscale.org I can think of a couple of ways to do this. But using snapshots seems heavy, but so does using mmbackup unless you are already running it every day. Diff the shadow files? Haha could be a _terrible_ idea if you have a couple hundred million files. But it IS possible. Next, I'm NOT a tsm expert, but I know a bit about it: (and I probably stayed at a Holiday Inn express at least once in my heavy travel days) -query objects using '-ina=yes' and yesterdays date? Might be a touch slow. But it probably uses the next one as it's backend: -db2 query inside TSM to see a similar thing. This ought to be the fastest, and I'm sure with a little google'ing you can work this out. Tivoli MUST know exact dates of deletion as it uses that and the retention time to know when to purge/reclaim deleted objects from it's storage pools. (retain extra version or RETEXTRA or retain only version) Ed On Mon, 27 Feb 2017 13:32:42 +0000 "Simon Thompson (Research Computing - IT Services)" wrote: > >It has been discussed in the past, but the way to track stuff is to > >enable HSM and then hook into the DSMAPI. That way you can see all the > >file creates and deletes "live". > > Won't work, I already have a "real" HSM client attached to DMAPI > (dsmrecalld). > > I'm not actually wanting to backup for this use case, we already have > mmbackup running to do those things, but it was a list of deleted files > that I was after (I just thought it might be easy given mmbackup is > tracking it already). 
> > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- Ed Wahl Ohio Supercomputer Center 614-292-9302 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From janfrode at tanso.net Mon Feb 27 22:13:46 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Mon, 27 Feb 2017 23:13:46 +0100 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: <1488201921.4074.114.camel@buzzard.me.uk> <20170227155049.22001bb0@osc.edu> Message-ID: AFM apparently keeps track og this, so maybe it would be possible to run AFM-SW with disconnected home and query the queue of changes? But would require some way of clearing the queue as well.. -jf On Monday, February 27, 2017, Marc A Kaplan wrote: > Diffing file lists can be fast - IF you keep the file lists sorted by a > unique key, e.g. the inode number. > I believe that's how mmbackup does it. Use the classic set difference > algorithm. > > Standard diff is designed to do something else and is terribly slow on > large file lists. > > > > From: Edward Wahl > > To: "Simon Thompson (Research Computing - IT Services)" < > S.J.Thompson at bham.ac.uk > > > Cc: gpfsug main discussion list > > Date: 02/27/2017 03:51 PM > Subject: Re: [gpfsug-discuss] Tracking deleted files > Sent by: gpfsug-discuss-bounces at spectrumscale.org > > ------------------------------ > > > > I can think of a couple of ways to do this. But using snapshots seems > heavy, > but so does using mmbackup unless you are already running it every day. > > Diff the shadow files? Haha could be a _terrible_ idea if you have a > couple > hundred million files. But it IS possible. > > > Next, I'm NOT a tsm expert, but I know a bit about it: (and I probably > stayed > at a Holiday Inn express at least once in my heavy travel days) > > -query objects using '-ina=yes' and yesterdays date? Might be a touch > slow. But > it probably uses the next one as it's backend: > > -db2 query inside TSM to see a similar thing. This ought to be the > fastest, > and I'm sure with a little google'ing you can work this out. Tivoli MUST > know > exact dates of deletion as it uses that and the retention time to know > when to purge/reclaim deleted objects from it's storage pools. > (retain extra version or RETEXTRA or retain only version) > > Ed > > On Mon, 27 Feb 2017 13:32:42 +0000 > "Simon Thompson (Research Computing - IT Services)" < > S.J.Thompson at bham.ac.uk > > > wrote: > > > >It has been discussed in the past, but the way to track stuff is to > > >enable HSM and then hook into the DSMAPI. That way you can see all the > > >file creates and deletes "live". > > > > Won't work, I already have a "real" HSM client attached to DMAPI > > (dsmrecalld). > > > > I'm not actually wanting to backup for this use case, we already have > > mmbackup running to do those things, but it was a list of deleted files > > that I was after (I just thought it might be easy given mmbackup is > > tracking it already). 
> > > > Simon > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > > Ed Wahl > Ohio Supercomputer Center > 614-292-9302 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Tue Feb 28 08:44:26 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Tue, 28 Feb 2017 09:44:26 +0100 (CET) Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster Message-ID: <1031275380.310791.1488271466638@email.1und1.de> An HTML attachment was scrubbed... URL: From ashish.thandavan at cs.ox.ac.uk Tue Feb 28 16:10:44 2017 From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan) Date: Tue, 28 Feb 2017 16:10:44 +0000 Subject: [gpfsug-discuss] mmbackup logging issue Message-ID: Dear all, We have a small GPFS cluster and a separate server running TSM and one of the three NSD servers backs up our GPFS filesystem to the TSM server using mmbackup. After a recent upgrade from v3.5 to 4.1.1, we've noticed that mmbackup no longer logs stuff like it used to : ... Thu Jan 19 05:45:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 2 failed. Thu Jan 19 06:15:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 3 failed. Thu Jan 19 06:45:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 3 failed. ... instead of ... Sat Dec 3 12:01:00 2016 mmbackup:Backing up files: 105030 backed up, 635456 expired, 30 failed. Sat Dec 3 12:31:00 2016 mmbackup:Backing up files: 205934 backed up, 635456 expired, 57 failed. Sat Dec 3 13:01:00 2016 mmbackup:Backing up files: 321702 backed up, 635456 expired, 169 failed. ... like it used to pre-upgrade. I am therefore unable to see how far long it has got, and indeed if it completed successfully, as this is what it logs at the end of a job : ... Tue Jan 17 18:07:31 2017 mmbackup:Completed policy backup run with 0 policy errors, 10012 files failed, 0 severe errors, returning rc=9. Tue Jan 17 18:07:31 2017 mmbackup:Policy for backup returned 9 Highest TSM error 12 mmbackup: TSM Summary Information: Total number of objects inspected: 20617273 Total number of objects backed up: 0 Total number of objects updated: 0 Total number of objects rebound: 0 Total number of objects deleted: 0 Total number of objects expired: 1 Total number of objects failed: 10012 Total number of objects encrypted: 0 Total number of bytes inspected: 3821624716861 Total number of bytes transferred: 3712040943672 Tue Jan 17 18:07:31 2017 mmbackup:Audit files /cs/mmbackup.audit.gpfs* contain 0 failed paths but there were 10012 failures. Cannot reconcile shadow database. Unable to compensate for all TSM errors in new shadow database. Preserving previous shadow database. Run next mmbackup with -q to synchronize shadow database. exit 12 If it helps, the mmbackup job is kicked off with the following options : /usr/lpp/mmfs/bin/mmbackup gpfs -n 8 -t full -B 20000 -L 1 --tsm-servers gpfs_weekly_stanza -N glossop1a | /usr/bin/tee /var/log/mmbackup/gpfs_weekly/backup_log.`date +%Y%m%d_%H_%M` (The excerpts above are from the backup_log. file.) Our NSD servers are running GPFS 4.1.1-11, TSM is at 7.1.1.100 and the File system version is 12.06 (3.4.0.3). 
Has anyone else seen this behaviour with mmbackup and if so, found a fix? Thanks, Regards, Ash -- ------------------------- Ashish Thandavan UNIX Support Computing Officer Department of Computer Science University of Oxford Wolfson Building Parks Road Oxford OX1 3QD Phone: 01865 610733 Email: ashish.thandavan at cs.ox.ac.uk From TOMP at il.ibm.com Tue Feb 28 17:08:29 2017 From: TOMP at il.ibm.com (Tomer Perry) Date: Tue, 28 Feb 2017 19:08:29 +0200 Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster In-Reply-To: <1031275380.310791.1488271466638@email.1und1.de> References: <1031275380.310791.1488271466638@email.1und1.de> Message-ID: Hans-Joachim, Since I'm the one that gave this answer...I'll work on adding it to the FAQ. But, in general: 1. The maximum number of "outbound clusters" - meaning "how many clusters can a client join - is limited to 31 ( 32 including the local cluster) 2. The maximum number or "inbound cluster" - meaning "how many clusters can join my cluster) - is not really limited. Thus, since the smallest cluster possible is a single node cluster, it means that 16383 nodes can join my cluster ( 16384 - 1). Regards, Tomer Perry Scalable I/O Development (Spectrum Scale) email: tomp at il.ibm.com 1 Azrieli Center, Tel Aviv 67021, Israel Global Tel: +1 720 3422758 Israel Tel: +972 3 9188625 Mobile: +972 52 2554625 From: Hans-Joachim Ehlers To: gpfsug main discussion list Date: 28/02/2017 10:44 Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster Sent by: gpfsug-discuss-bounces at spectrumscale.org First thx to all for the support on this list. It is highly appreciated. My new question: i have currently with IBM a discussion about the maximum number of remote clusters mounting GPFS from a local cluster. The answer was that there is almost no limit to the amount of REMOTE clusters accessing a given cluster. From memory I thought there was a limit of 24 remote clusters and the total amount of node must not exceed 16k nodes. The later is described in the GPFS FAQ but about the maximum number of remote cluster accessing a local cluster I could not find anything within the FAQ. So is there a limit of remote clusters accessing a given GPFS cluster or could I really have almost 16k-n(*) remote clusters ( One node cluster ) as long as the max amount of nodes does not exceed the 16K ? (*) n is the amount of local nodes. Maybe this info should be added also to the FAQ ? Info from the FAQ: https://www.ibm.com/support/knowledgecenter/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.pdf Q5.4: What is the current limit on the number of nodes that may concurrently join a cluster? A5.4: As of GPFS V3.4.0.18 and GPFS V3.5.0.5, the total number of nodes that may concurrently join a cluster is limited to a maximum of 16384 nodes. tia Hajo -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From service at metamodul.com Tue Feb 28 17:45:57 2017 From: service at metamodul.com (service at metamodul.com) Date: Tue, 28 Feb 2017 18:45:57 +0100 Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster Message-ID: Thx a lot Perry I never thought about outbound or inbound cluster access. Wish you all the best Hajo --? Unix Systems Engineer MetaModul GmbH +49 177 4393994 -------------- next part -------------- An HTML attachment was scrubbed... URL: From MDIETZ at de.ibm.com Wed Feb 1 09:04:14 2017 From: MDIETZ at de.ibm.com (Mathias Dietz) Date: Wed, 1 Feb 2017 10:04:14 +0100 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: Message-ID: >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. 
~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. 
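Two quick things that may be worth doing while the PMR is worked, both unofficial sketches rather than a fix. First, size the complete daemon-node list against the ~3983 bytes tsctl actually returns (the '-opa' pattern is just the site-specific match used earlier in this thread; column 2 of mmlscluster's node section is the daemon node name). Second, a rough illustration of deriving the up-node list from mmgetstate instead of tsctl -- note that mmgetstate may print short or admin node names rather than the daemon FQDNs CES compares against, so names would need mapping before this could replace anything in the mmces scripts.

  # how many bytes would the complete comma-separated list need?
  mmlscluster | grep '\-opa' | awk '{printf "%s,", $2}' | wc -c
  tsctl shownodes up | wc -c

  # workaround sketch only: list nodes that mmgetstate reports as active
  /usr/lpp/mmfs/bin/mmgetstate -a | awk '$3 == "active" {print $2}' | sort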
For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? 
by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
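Alongside the "Things I've tried" list that follows, a few read-only checks can help show where the assignment is getting stuck. These are standard mmces subcommands (exact output varies a little by release):

  mmces node list          # are both sgate nodes CES-enabled and not suspended?
  mmces address list       # which node, if any, each CES IP is currently assigned to
  mmces state show -a      # per-node state of the enabled protocol services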
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Wed Feb 1 09:28:25 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Wed, 1 Feb 2017 09:28:25 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: Message-ID: Pmr opened... send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von:"Mathias Dietz" An:"gpfsug main discussion list" Datum:Mi. 01.02.2017 10:05Betreff:Re: [gpfsug-discuss] CES doesn't assign addresses to nodes >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. 
The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von:"Simon Thompson (Research Computing - IT Services)" An:"gpfsug main discussion list" Datum:Di. 31.01.2017 21:07Betreff:Re: [gpfsug-discuss] CES doesn't assign addresses to nodes We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. 
If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? 
Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. 
Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From dave.bond at diamond.ac.uk Thu Feb 2 10:08:06 2017 From: dave.bond at diamond.ac.uk (dave.bond at diamond.ac.uk) Date: Thu, 2 Feb 2017 10:08:06 +0000 Subject: [gpfsug-discuss] GPFS meta data performance monitoring Message-ID: Hello Mailing list, Beyond mmpmon how are people monitoring their metadata performance? There are two parts I imagine to this question, the first being how do you get a detailed snapshot view of performance read and write etc. Then the second is does anyone collate this information for historical graphing, if so thoughts and ideas are very welcome. 
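One low-impact way to get a snapshot of per-filesystem metadata activity is mmpmon's fs_io_s request, polled on an interval and appended somewhere a graphing tool can collect. A minimal sketch, run as root on each node of interest (the interval and output path are arbitrary; in the machine-readable output _oc_, _cc_, _rdc_, _wc_, _dir_ and _iu_ are the open, close, read, write, readdir and inode-update counters):

  echo fs_io_s > /tmp/mmpmon.cmd
  while sleep 30; do
    /usr/lpp/mmfs/bin/mmpmon -p -i /tmp/mmpmon.cmd |
      awk -v now="$(date +%s)" '/_fs_io_s_/ {print now "," $0}' >> /var/tmp/fs_io_s.log
  done

The counters are cumulative, so whatever graphs them needs to take deltas between samples. On 4.2.x the bundled performance monitoring (pmsensors/pmcollector, queried with mmperfmon) can keep this kind of history for you, if enabling those sensors on a production file system is acceptable.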
mmpmon is certainly useful but I would like to dig a little deeper, ideally without turning anything on that could impact stability or performance of a production file system. Dave (Diamond Light Source) -- This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom From olaf.weiser at de.ibm.com Thu Feb 2 15:55:44 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 2 Feb 2017 16:55:44 +0100 Subject: [gpfsug-discuss] GPFS meta data performance monitoring In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Thu Feb 2 17:03:51 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Thu, 2 Feb 2017 12:03:51 -0500 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears Message-ID: Is there a way to accomplish this so the rest of cluster knows its down? My state now: [root at cl001 ~]# mmgetstate -aL cl004.cl.arc.internal: mmremote: determineMode: Missing file /var/mmfs/gen/mmsdrfs. cl004.cl.arc.internal: mmremote: This node does not belong to a GPFS cluster. mmdsh: cl004.cl.arc.internal remote shell process had return code 1. Node number Node name Quorum Nodes up Total nodes GPFS state Remarks ------------------------------------------------------------------------------------ 1 cl001 5 7 8 active quorum node 2 cl002 5 7 8 active quorum node 3 cl003 5 7 8 active quorum node 4 cl004 0 0 8 unknown quorum node 5 cl005 5 7 8 active quorum node 6 cl006 5 7 8 active quorum node 7 cl007 5 7 8 active quorum node 8 cl008 5 7 8 active quorum node cl004 we think has an internal raid controller blowout -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Thu Feb 2 17:28:22 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 2 Feb 2017 18:28:22 +0100 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: From jonathon.anderson at colorado.edu Thu Feb 2 17:44:45 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Thu, 2 Feb 2017 17:44:45 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: Message-ID: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> Any chance I can get that PMR# also, so I can reference it in my DDN case? ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Pmr opened... 
send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Mathias Dietz" An: "gpfsug main discussion list" Datum: Mi. 01.02.2017 10:05 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... 
but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. 
which truncates the output to the last 10 items... should be easy to fix.. cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. 
Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From olaf.weiser at de.ibm.com Thu Feb 2 18:02:22 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 2 Feb 2017 19:02:22 +0100 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> Message-ID: An HTML attachment was scrubbed... URL: From valdis.kletnieks at vt.edu Thu Feb 2 19:28:05 2017 From: valdis.kletnieks at vt.edu (valdis.kletnieks at vt.edu) Date: Thu, 02 Feb 2017 14:28:05 -0500 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: Message-ID: <15501.1486063685@turing-police.cc.vt.edu> On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said: > but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's why you > see a message like this.. > have you reinstalled that node / any backup/restore thing ? The internal RAID controller died a horrid death and basically took all the OS partitions with it. So the node was just sort of limping along, where the mmfsd process was still coping because it wasn't doing any I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work because that requires accessing stuff in /var. At that point, it starts getting tempting to just use ipmitool from another node to power the comatose one down - but that often causes a cascade of other issues while things are stuck waiting for timeouts. From aaron.s.knister at nasa.gov Thu Feb 2 19:33:41 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 2 Feb 2017 14:33:41 -0500 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: <15501.1486063685@turing-police.cc.vt.edu> References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID: You could forcibly expel the node (one of my favorite GPFS commands): mmexpelnode -N $nodename and then power it off after the expulsion is complete and then do mmepelenode -r -N $nodename which will allow it to join the cluster next time you try and start up GPFS on it. You'll still likely have to go through recovery but you'll skip the part where GPFS wonders where the node went prior to it expelling it. -Aaron On 2/2/17 2:28 PM, valdis.kletnieks at vt.edu wrote: > On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said: > >> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's why you >> see a message like this.. >> have you reinstalled that node / any backup/restore thing ? > > The internal RAID controller died a horrid death and basically took > all the OS partitions with it. So the node was just sort of limping along, > where the mmfsd process was still coping because it wasn't doing any > I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work > because that requires accessing stuff in /var. > > At that point, it starts getting tempting to just use ipmitool from > another node to power the comatose one down - but that often causes > a cascade of other issues while things are stuck waiting for timeouts. 
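Spelled out as a sketch (the node name matches Eric's earlier output and the BMC details are placeholders; note the second GPFS command above has a small typo -- it is mmexpelnode with -r):

  # expel the unreachable node so the rest of the cluster stops waiting on it
  mmexpelnode -N cl004

  # then power it off out-of-band; BMC address and credentials are placeholders
  ipmitool -I lanplus -H cl004-bmc.example.com -U ADMIN -P PASSWORD chassis power off

  # once the hardware is repaired, clear the expel so the node can rejoin
  mmexpelnode -r -N cl004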
> > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From olaf.weiser at de.ibm.com Thu Feb 2 21:28:01 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 2 Feb 2017 22:28:01 +0100 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID: An HTML attachment was scrubbed... URL: From andreas.mattsson at maxiv.lu.se Fri Feb 3 12:46:30 2017 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Fri, 3 Feb 2017 12:46:30 +0000 Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES Message-ID: I'm having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1 The NFS clients are up to date Centos and Debian machines. All Scale servers and NFS clients have correct date and time via NTP. Creating a file, for instance 'touch file00', gives correct timestamp. Moving the file, 'mv file00 file01', gives correct timestamp Copying the file, 'cp file01 file02', gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar. This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs. Have anyone seen this before? Regards, Andreas Mattsson _____________________________________________ [cid:part1.08040705.03090509 at maxiv.lu.se] Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 225 94 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.se www.maxiv.se -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 5610 bytes Desc: image001.png URL: From ulmer at ulmer.org Fri Feb 3 13:05:37 2017 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 3 Feb 2017 08:05:37 -0500 Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES In-Reply-To: References: Message-ID: That?s a cool one. :) What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03?)? -- Stephen > On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote: > > I?m having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1 > The NFS clients are up to date Centos and Debian machines. > All Scale servers and NFS clients have correct date and time via NTP. > > Creating a file, for instance ?touch file00?, gives correct timestamp. > Moving the file, ?mv file00 file01?, gives correct timestamp > Copying the file, ?cp file01 file02?, gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar. > > This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. > Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs. > > Have anyone seen this before? 
> > Regards, > Andreas Mattsson > _____________________________________________ > > > Andreas Mattsson > Systems Engineer > > MAX IV Laboratory > Lund University > P.O. Box 118, SE-221 00 Lund, Sweden > Visiting address: Fotongatan 2, 225 94 Lund > Mobile: +46 706 64 95 44 > andreas.mattsson at maxiv.se > www.maxiv.se > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From andreas.mattsson at maxiv.lu.se Fri Feb 3 13:19:37 2017 From: andreas.mattsson at maxiv.lu.se (Andreas Mattsson) Date: Fri, 3 Feb 2017 13:19:37 +0000 Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES In-Reply-To: References: Message-ID: That works. ?touch test100? Feb 3 14:16 test100 ?cp test100 test101? Feb 3 14:16 test100 Apr 21 2027 test101 ?touch ?r test100 test101? Feb 3 14:16 test100 Feb 3 14:16 test101 /Andreas That?s a cool one. :) What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03?)? -- Stephen On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote: I?m having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1 The NFS clients are up to date Centos and Debian machines. All Scale servers and NFS clients have correct date and time via NTP. Creating a file, for instance ?touch file00?, gives correct timestamp. Moving the file, ?mv file00 file01?, gives correct timestamp Copying the file, ?cp file01 file02?, gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar. This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs. Have anyone seen this before? Regards, Andreas Mattsson _____________________________________________ Andreas Mattsson Systems Engineer MAX IV Laboratory Lund University P.O. Box 118, SE-221 00 Lund, Sweden Visiting address: Fotongatan 2, 225 94 Lund Mobile: +46 706 64 95 44 andreas.mattsson at maxiv.se www.maxiv.se _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From ulmer at ulmer.org Fri Feb 3 13:35:21 2017 From: ulmer at ulmer.org (Stephen Ulmer) Date: Fri, 3 Feb 2017 08:35:21 -0500 Subject: [gpfsug-discuss] Strange timestamp behaviour on NFS via CES In-Reply-To: References: Message-ID: Does the cp actually complete? As in, does it copy all of the blocks? What?s the exit code? A cp?d file should have ?new? metadata. That is, it should have it?s own dates, owners, etc. (not necessarily copied from the source file). I ran ?strace cp foo1 foo2?, and it was pretty instructive, maybe that would get you more info. On CentOS strace is in it?s own package, YMMV. -- Stephen > On Feb 3, 2017, at 8:19 AM, Andreas Mattsson > wrote: > > That works. > > ?touch test100? > > Feb 3 14:16 test100 > > ?cp test100 test101? > > Feb 3 14:16 test100 > Apr 21 2027 test101 > > ?touch ?r test100 test101? > > Feb 3 14:16 test100 > Feb 3 14:16 test101 > > /Andreas > > > That?s a cool one. 
:) > > What if you use the "random date" file as a time reference to touch another file (like, 'touch -r file02 file03?)? > > -- > Stephen > > > > On Feb 3, 2017, at 7:46 AM, Andreas Mattsson > wrote: > > I?m having some really strange timestamp behaviour when doing file operations on NFS mounts shared via CES on spectrum scale 4.2.1.1 > The NFS clients are up to date Centos and Debian machines. > All Scale servers and NFS clients have correct date and time via NTP. > > Creating a file, for instance ?touch file00?, gives correct timestamp. > Moving the file, ?mv file00 file01?, gives correct timestamp > Copying the file, ?cp file01 file02?, gives a random timestamp anywhere in time, for instance Oct 12 2095 or Feb 29 1976 or something similar. > > This is only via NFS. Copying the file via a native gpfs-mount or via SMB gives a correct timestamp. > Doing the same operation over NFS to other NFS-servers works correct, it is only when operating on the NFS-share from the Spectrum Scale CES the issue occurs. > > Have anyone seen this before? > > Regards, > Andreas Mattsson > _____________________________________________ > > > Andreas Mattsson > Systems Engineer > > MAX IV Laboratory > Lund University > P.O. Box 118, SE-221 00 Lund, Sweden > Visiting address: Fotongatan 2, 225 94 Lund > Mobile: +46 706 64 95 44 > andreas.mattsson at maxiv.se > www.maxiv.se > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Fri Feb 3 13:46:49 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Fri, 3 Feb 2017 08:46:49 -0500 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID: Well we got it into the down state using mmsdrrestore -p to recover stuff into /var/mmfs/gen to cl004. Anyhow we ended up unknown for cl004 when it powered off. Short of removing node, unknown is the state you get. Unknown seems stable for a hopefully short outage of cl004. Thanks On Thu, Feb 2, 2017 at 4:28 PM, Olaf Weiser wrote: > many ways lead to Rome .. and I agree .. mmexpelnode is a nice command .. > another approach... > power it off .. (not reachable by ping) .. mmdelnode ... power on/boot ... > mmaddnode .. > > > > From: Aaron Knister > To: > Date: 02/02/2017 08:37 PM > Subject: Re: [gpfsug-discuss] proper gpfs shutdown when node > disappears > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > You could forcibly expel the node (one of my favorite GPFS commands): > > mmexpelnode -N $nodename > > and then power it off after the expulsion is complete and then do > > mmepelenode -r -N $nodename > > which will allow it to join the cluster next time you try and start up > GPFS on it. You'll still likely have to go through recovery but you'll > skip the part where GPFS wonders where the node went prior to it > expelling it. > > -Aaron > > On 2/2/17 2:28 PM, valdis.kletnieks at vt.edu wrote: > > On Thu, 02 Feb 2017 18:28:22 +0100, "Olaf Weiser" said: > > > >> but the /var/mmfs DIR is obviously damaged/empty .. what ever.. that's > why you > >> see a message like this.. 
> >> have you reinstalled that node / any backup/restore thing ? > > > > The internal RAID controller died a horrid death and basically took > > all the OS partitions with it. So the node was just sort of limping > along, > > where the mmfsd process was still coping because it wasn't doing any > > I/O to the OS partitions - but 'ssh bad-node mmshutdown' wouldn't work > > because that requires accessing stuff in /var. > > > > At that point, it starts getting tempting to just use ipmitool from > > another node to power the comatose one down - but that often causes > > a cascade of other issues while things are stuck waiting for timeouts. > > > > > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Feb 3 14:06:58 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 3 Feb 2017 15:06:58 +0100 Subject: [gpfsug-discuss] proper gpfs shutdown when node disappears In-Reply-To: References: <15501.1486063685@turing-police.cc.vt.edu> Message-ID: An HTML attachment was scrubbed... URL: From service at metamodul.com Fri Feb 3 16:13:35 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Fri, 3 Feb 2017 17:13:35 +0100 (CET) Subject: [gpfsug-discuss] Mount of file set Message-ID: <738987264.170895.1486138416028@email.1und1.de> An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Feb 3 20:03:18 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 3 Feb 2017 20:03:18 +0000 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? Message-ID: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> I can?t seem to find some of these on fix central, have they been pulled? Specifically, I want: Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux https://www-945.ibm.com/support/fixcentral/swg/selectFixes?product=ibm%2FStorageSoftware%2FIBM+Spectrum+Scale&fixids=Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux&source=myna&myns=s033&mynp=OCSTXKQY&mync=E&cm_sp=s033-_-OCSTXKQY-_-E&function=fixId&parent=Software%20defined%20storage Bob Oesterlin Sr Principal Storage Engineer, Nuance 507-269-0413 From: IBM My Notifications Date: Monday, January 30, 2017 at 10:49 AM To: "Oesterlin, Robert" Subject: [EXTERNAL] IBM My notifications - Storage [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/headset.png] Check out the IBM Support beta [BM] [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/megaphone-m.png] Here are your weekly updates from IBM My Notifications. Contents: IBM Spectrum Scale IBM Spectrum Scale Spectrum_Scale_Protocols_Advanced-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. 
Spectrum_Scale_Advanced-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. The pre-built SELinux policy within RHEL7.x conflicts with IBM Spectrum Scale NFS Ganesha [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] Ganesha running on CES nodes with seLinux in enforcing mode and selinux-policy-targeted-3.13.1-60.el7_2.7 installed causes the start of ganesha to fail and thus all CES nodes get UNHEALTHY. See https://bugzilla.redhat.com/show_bug.cgi?id=1383784 Note: IBM Spectrum Scale does not support CES with seLinux in enforcing mode Spectrum_Scale_Protocols_Data_Management-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Data_Management-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Advanced-4.2.2.2-ppc64LE-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Advanced-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Data_Management-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. 
Spectrum_Scale_Advanced-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Data_Management-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Standard-4.2.2.2-x86_64-Windows [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-s390x-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Protocols_Standard-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-x86_64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Advanced-4.2.2.2-ppc64-Linux [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-x86_64-Windows [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. Spectrum_Scale_Express-4.2.2.2-ppc64-AIX [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/forward-arrow.png] This fixpack is cumulative and includes all fixes completed since the last release. [http://www.ibm.com/it-infrastructure/us-en/support/csn/images/information.png] Your support Notifications display in English by default. Machine translation based on your IBM profile language setting is added if you specify this option in Delivery preferences within My Notifications. 
(Note: Not all languages are available at this time, and the English version always takes precedence over the machine translated version.) Manage your My Notifications subscriptions, or send questions and comments. Subscribe or Unsubscribe | Feedback Follow us on Twitter. To ensure proper delivery please add mynotify at stg.events.ihost.com to your address book. You received this email because you are subscribed to IBM My Notifications as: oester at gmail.com Please do not reply to this message as it is generated by an automated service machine. ?International Business Machines Corporation 2017. All rights reserved. -------------- next part -------------- An HTML attachment was scrubbed... URL: From olaf.weiser at de.ibm.com Fri Feb 3 19:57:29 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Fri, 3 Feb 2017 20:57:29 +0100 Subject: [gpfsug-discuss] Mount of file set In-Reply-To: <738987264.170895.1486138416028@email.1und1.de> References: <738987264.170895.1486138416028@email.1und1.de> Message-ID: An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Sun Feb 5 14:02:57 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Sun, 5 Feb 2017 14:02:57 +0000 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 912 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 1463 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 6365 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 371 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 2881 bytes Desc: not available URL: From martin at uni-mainz.de Mon Feb 6 11:15:31 2017 From: martin at uni-mainz.de (Christoph Martin) Date: Mon, 6 Feb 2017 12:15:31 +0100 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: I have already updated two GPFS installations with 4.2.2.2 with a download from Jan, 31. What issues with Ganesha do I have to expect until the fixed version is available? How can I see that the downloads have changed and are fixed? The information on the download site was: > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux-install (537.58 MB) > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux-install.md5 (97 bytes) > Spectrum_Scale_Protocols_Standard-4.2.2.2-x86_64-Linux.readme.html (24.59 KB) Christoph Am 05.02.2017 um 15:02 schrieb Achim Rehor: > Yes, they have been pulled, all protocol 4.2.2.2 packages. there wsa an > issue with ganesha > > It was expected to see them back before the weekend, which is obviously > not the case. > So, i guess, a little patience is needed. -- ============================================================================ Christoph Martin, Leiter Unix-Systeme Zentrum f?r Datenverarbeitung, Uni-Mainz, Germany Anselm Franz von Bentzel-Weg 12, 55128 Mainz Telefon: +49(6131)3926337 Instant-Messaging: Jabber: martin at jabber.uni-mainz.de (Siehe http://www.zdv.uni-mainz.de/4010.php) -------------- next part -------------- A non-text attachment was scrubbed... Name: martin.vcf Type: text/x-vcard Size: 421 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: signature.asc Type: application/pgp-signature Size: 801 bytes Desc: OpenPGP digital signature URL: From bbanister at jumptrading.com Mon Feb 6 14:54:11 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Mon, 6 Feb 2017 14:54:11 +0000 Subject: [gpfsug-discuss] Mount of file set In-Reply-To: References: <738987264.170895.1486138416028@email.1und1.de> Message-ID: Is there an RFE for this yet that we can all vote up? -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Olaf Weiser Sent: Friday, February 03, 2017 1:57 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Mount of file set Hi Ha-Jo, we do the same here .. so no news so far as I know... gruss vom laff From: Hans-Joachim Ehlers > To: gpfsug main discussion list > Date: 02/03/2017 05:14 PM Subject: [gpfsug-discuss] Mount of file set Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Moin Moin, is it nowaday possible to mount directly a GPFS Fileset ? In the old day i mounted the whole GPFS to a Mount point with 000 rights and did a Sub Mount of the needed Fileset. It works but it is ugly. -- Unix Systems Engineer -------------------------------------------------- MetaModul GmbH S?derstr. 12 25336 Elmshorn HRB: 11873 PI UstID: DE213701983 Mobil: + 49 177 4393994 Mail: service at metamodul.com_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Tue Feb 7 18:01:41 2017 From: mweil at wustl.edu (Matt Weil) Date: Tue, 7 Feb 2017 12:01:41 -0600 Subject: [gpfsug-discuss] stuck GracePeriodThread Message-ID: running cnfs # rpm -qa | grep gpfs gpfs.gpl-4.1.1-7.noarch gpfs.base-4.1.1-7.x86_64 gpfs.docs-4.1.1-7.noarch gpfs.gplbin-3.10.0-327.18.2.el7.x86_64-4.1.1-7.x86_64 pcp-pmda-gpfs-3.10.6-2.el7.x86_64 gpfs.ext-4.1.1-7.x86_64 gpfs.gskit-8.0.50-47.x86_64 gpfs.msg.en_US-4.1.1-7.noarch === mmdiag: waiters === 0x7F95F0008CF0 ( 19022) waiting 89.838355000 seconds, GracePeriodThread: delaying for 40.161645000 more seconds, reason: delayed do these cause issues and is there any other way besides stopping and restarting mmfsd to get rid of them. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From TROPPENS at de.ibm.com Wed Feb 8 08:36:45 2017 From: TROPPENS at de.ibm.com (Ulf Troppens) Date: Wed, 8 Feb 2017 09:36:45 +0100 Subject: [gpfsug-discuss] Spectrum Scale User Meeting - March 8+9 , 2017 - Ehningen, Germany Message-ID: There is an IBM organized Spectrum Scale User Meeting in Germany. Though, agenda and spirit are very close to user group organized events. Conference language is German. This is a two-day event. There is an introduction day for Spectrum Scale beginners a day before on March 7. See here for agenda and registration: https://www.spectrumscale.org/spectrum-scale-user-meeting-march-89-2027-ehningen-germany/ -- IBM Spectrum Scale Development - Client Engagements & Solutions Delivery Consulting IT Specialist Author "Storage Networks Explained" IBM Deutschland Research & Development GmbH Vorsitzende des Aufsichtsrats: Martina Koederitz Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 -------------- next part -------------- An HTML attachment was scrubbed... URL: From Achim.Rehor at de.ibm.com Wed Feb 8 08:48:06 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Wed, 8 Feb 2017 09:48:06 +0100 Subject: [gpfsug-discuss] 4.2.2-2 downloads - not on fix central? In-Reply-To: References: <61BF998A-544D-4201-9280-9729624DFD7C@nuance.com> Message-ID: An HTML attachment was scrubbed... 
URL: From p.childs at qmul.ac.uk Thu Feb 9 14:30:18 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 9 Feb 2017 14:30:18 +0000 Subject: [gpfsug-discuss] AFM OpenFiles Message-ID: We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs From Mark.Bush at siriuscom.com Thu Feb 9 14:40:03 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Thu, 9 Feb 2017 14:40:03 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> Message-ID: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Has any headway been made on this issue? I just ran into it as well. The CES ip addresses just disappeared from my two protocol nodes (4.2.2.0). From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Thursday, February 2, 2017 at 12:02 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes pls contact me directly olaf.weiser at de.ibm.com Mit freundlichen Gr??en / Kind regards Olaf Weiser EMEA Storage Competence Center Mainz, German / IBM Systems, Storage Platform, ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland IBM Allee 1 71139 Ehningen Phone: +49-170-579-44-66 E-Mail: olaf.weiser at de.ibm.com ------------------------------------------------------------------------------------------------------------------------------------------- IBM Deutschland GmbH / Vorsitzender des Aufsichtsrats: Martin Jetter Gesch?ftsf?hrung: Martina Koederitz (Vorsitzende), Susanne Peter, Norbert Janzen, Dr. Christian Keller, Ivo Koerner, Markus Koerner Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, HRB 14562 / WEEE-Reg.-Nr. DE 99369940 From: Jonathon A Anderson To: gpfsug main discussion list Date: 02/02/2017 06:45 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Any chance I can get that PMR# also, so I can reference it in my DDN case? ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Wednesday, February 1, 2017 at 2:28 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Pmr opened... send the # directly to u Gesendet von IBM Verse Mathias Dietz --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Mathias Dietz" An: "gpfsug main discussion list" Datum: Mi. 
01.02.2017 10:05 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ >I ll open a pmr here for my env ... the issue may hurt you inralf a ces env. only... but needs to be fixed in core gpfs.base i think Thanks for opening the PMR. The problem is inside the gpfs base code and we are working on a fix right now. In the meantime until the fix is available we will use the PMR to propose/discuss potential work arounds. Mit freundlichen Gr??en / Kind regards Mathias Dietz Spectrum Scale - Release Lead Architect (4.2.X Release) System Health and Problem Determination Architect IBM Certified Software Engineer ---------------------------------------------------------------------------------------------------------- IBM Deutschland Hechtsheimer Str. 2 55131 Mainz Phone: +49-6131-84-2027 Mobile: +49-15152801035 E-Mail: mdietz at de.ibm.com ---------------------------------------------------------------------------------------------------------- IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Koederitz, Gesch?ftsf?hrung: Dirk Wittkopp Sitz der Gesellschaft: B?blingen / Registergericht: Amtsgericht Stuttgart, HRB 243294 From: Olaf Weiser/Germany/IBM at IBMDE To: "gpfsug main discussion list" Cc: "gpfsug main discussion list" Date: 01/31/2017 11:47 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Yeah... depending on the #nodes you 're affected or not. ..... So if your remote ces cluster is small enough in terms of the #nodes ... you'll neuer hit into this issue Gesendet von IBM Verse Simon Thompson (Research Computing - IT Services) --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Simon Thompson (Research Computing - IT Services)" An: "gpfsug main discussion list" Datum: Di. 31.01.2017 21:07 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ We use multicluster for our environment, storage systems in a separate cluster to hpc nodes on a separate cluster from protocol nodes. According to the docs, this isn't supported, but we haven't seen any issues. Note unsupported as opposed to broken. Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Jonathon A Anderson [jonathon.anderson at colorado.edu] Sent: 31 January 2017 17:47 To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Yeah, I searched around for places where ` tsctl shownodes up` appears in the GPFS code I have access to (i.e., the ksh and python stuff); but it?s only in CES. I suspect there just haven?t been that many people exporting CES out of an HPC cluster environment. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 10:45 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes I ll open a pmr here for my env ... the issue may hurt you in a ces env. only... but needs to be fixed in core gpfs.base i thi k Gesendet von IBM Verse Jonathon A Anderson --- Re: [gpfsug-discuss] CES doesn't assign addresses to nodes --- Von: "Jonathon A Anderson" An: "gpfsug main discussion list" Datum: Di. 
31.01.2017 17:32 Betreff: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ________________________________ No, I?m having trouble getting this through DDN support because, while we have a GPFS server license and GRIDScaler support, apparently we don?t have ?protocol node? support, so they?ve pushed back on supporting this as an overall CES-rooted effort. I do have a DDN case open, though: 78804. If you are (as I suspect) a GPFS developer, do you mind if I cite your info from here in my DDN case to get them to open a PMR? Thanks. ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 8:42 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes ok.. so obviously ... it seems , that we have several issues.. the 3983 characters is obviously a defect have you already raised a PMR , if so , can you send me the number ? From: Jonathon A Anderson To: gpfsug main discussion list Date: 01/31/2017 04:14 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ The tail isn?t the issue; that? my addition, so that I didn?t have to paste the hundred or so line nodelist into the thread. The actual command is tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile But you can see in my tailed output that the last hostname listed is cut-off halfway through the hostname. Less obvious in the example, but true, is the fact that it?s only showing the first 120 hosts, when we have 403 nodes in our gpfs cluster. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | wc -l 120 [root at sgate2 ~]# mmlscluster | grep '\-opa' | wc -l 403 Perhaps more explicitly, it looks like `tsctl shownodes up` can only transmit 3983 characters. [root at sgate2 ~]# tsctl shownodes up | wc -c 3983 Again, I?m convinced this is a bug not only because the command doesn?t actually produce a list of all of the up nodes in our cluster; but because the last name listed is incomplete. [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail -n 1 shas0260-opa.rc.int.col[root at sgate2 ~]# I?d continue my investigation within tsctl itself but, alas, it?s a binary with no source code available to me. :) I?m trying to get this opened as a bug / PMR; but I?m still working through the DDN support infrastructure. Thanks for reporting it, though. For the record: [root at sgate2 ~]# rpm -qa | grep -i gpfs gpfs.base-4.2.1-2.x86_64 gpfs.msg.en_US-4.2.1-2.noarch gpfs.gplbin-3.10.0-327.el7.x86_64-4.2.1-0.x86_64 gpfs.gskit-8.0.50-57.x86_64 gpfs.gpl-4.2.1-2.noarch nfs-ganesha-gpfs-2.3.2-0.ibm24.el7.x86_64 gpfs.ext-4.2.1-2.x86_64 gpfs.gplbin-3.10.0-327.36.3.el7.x86_64-4.2.1-2.x86_64 gpfs.docs-4.2.1-2.noarch ~jonathon From: on behalf of Olaf Weiser Reply-To: gpfsug main discussion list Date: Tuesday, January 31, 2017 at 1:30 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Hi ...same thing here.. everything after 10 nodes will be truncated.. though I don't have an issue with it ... I 'll open a PMR .. and I recommend you to do the same thing.. ;-) the reason seems simple.. it is the "| tail" .at the end of the command.. .. which truncates the output to the last 10 items... should be easy to fix.. 
cheers olaf From: Jonathon A Anderson To: "gpfsug-discuss at spectrumscale.org" Date: 01/30/2017 11:11 PM Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ In trying to figure this out on my own, I?m relatively certain I?ve found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation: ## GPFS is up on sgate2 [root at sgate2 ~]# mmgetstate Node number Node name GPFS state ------------------------------------------ 414 sgate2-opa active ## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down [root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa mmces address move: GPFS is down on this node. mmces address move: Command failed. Examine previous error messages to determine cause. ## the ?GPFS is down on this node? message is defined as code 109 in mmglobfuncs [root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs 109 ) msgTxt=\ "%s: GPFS is down on this node." ## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as ?down? by getDownCesNodeList [root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress downNodeList=$(getDownCesNodeList) for downNode in $downNodeList do if [[ $toNodeName == $downNode ]] then printErrorMsg 109 "$mmcmd" ## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up` [root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs function getDownCesNodeList { typeset sourceFile="mmcesfuncs.sh" [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] &&set -x $mmTRACE_ENTER "$*" typeset upnodefile=${cmdTmpDir}upnodefile typeset downNodeList # get all CES nodes $sort -o $nodefile $mmfsCesNodes.dae $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile downNodeList=$($comm -23 $nodefile $upnodefile) print -- $downNodeList } #----- end of function getDownCesNodeList -------------------- ## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated [root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail shas0251-opa.rc.int.colorado.edu shas0252-opa.rc.int.colorado.edu shas0253-opa.rc.int.colorado.edu shas0254-opa.rc.int.colorado.edu shas0255-opa.rc.int.colorado.edu shas0256-opa.rc.int.colorado.edu shas0257-opa.rc.int.colorado.edu shas0258-opa.rc.int.colorado.edu shas0259-opa.rc.int.colorado.edu shas0260-opa.rc.int.col[root at sgate2 ~]# ## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`. On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote: I think I'm having the same issue described here: http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804) We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. 
Here's the steps I took: --- mmcrnodeclass protocol -N sgate1-opa,sgate2-opa mmcrnodeclass nfs -N sgate1-opa,sgate2-opa mmchconfig cesSharedRoot=/gpfs/summit/ces mmchcluster --ccr-enable mmchnode --ces-enable -N protocol mmces service enable NFS mmces service start NFS -N nfs mmces address add --ces-ip 10.225.71.104,10.225.71.105 mmces address policy even-coverage mmces address move --rebalance --- This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot. Things I've tried: * disabling ces on the sgate nodes and re-running the above procedure * moving the cluster and filesystem managers to different snsd nodes * deleting and re-creating the cesSharedRoot directory Meanwhile, the following log entry appears in mmfs.log.latest every ~30s: --- Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1 Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+ Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+ --- Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries): --- 2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275 2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333 --- For the record, here's the interface I expect to get the address on sgate1: --- 11: bond0: mtu 9000 qdisc noqueue state UP link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0 valid_lft forever preferred_lft forever inet6 fe80::3efd:feff:fe08:a7c0/64 scope link valid_lft forever preferred_lft forever --- which is a bond of p2p1 and p2p2. --- 6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff 7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000 link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff --- A similar bond0 exists on sgate2. I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far. 
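A quick way to check whether a node is hitting the same `tsctl shownodes up` truncation discussed earlier in this thread is to compare that output against the cluster directly. Something along these lines (the sgate host names are this cluster's, adjust as needed; the mmces listing subcommands assume a 4.2.x level that provides them):

---
# How many nodes does tsctl report as up, and is its output suspiciously
# close to the ~3983-character limit seen above?
tsctl shownodes up | tr ',' '\n' | wc -l
tsctl shownodes up | wc -c

# Do the protocol nodes ever appear in that list?
tsctl shownodes up | tr ',' '\n' | grep -E 'sgate[12]-opa'

# Compare against what the cluster and CES think
mmgetstate -a | grep sgate
mmces node list
mmces address list
---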
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: From vpuvvada at in.ibm.com Thu Feb 9 15:10:58 2017 From: vpuvvada at in.ibm.com (Venkateswara R Puvvada) Date: Thu, 9 Feb 2017 20:40:58 +0530 Subject: [gpfsug-discuss] AFM OpenFiles In-Reply-To: References: Message-ID: What is the version of GPFS ? There was an issue fixed in Spectrum Scale 4.2.2 for file count(file_nr) leak. This issue mostly happens on Linux kernel version >= 3.6. ~Venkat (vpuvvada at in.ibm.com) From: Peter Childs To: gpfsug main discussion list Date: 02/09/2017 08:00 PM Subject: [gpfsug-discuss] AFM OpenFiles Sent by: gpfsug-discuss-bounces at spectrumscale.org We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From p.childs at qmul.ac.uk Thu Feb 9 15:34:25 2017 From: p.childs at qmul.ac.uk (Peter Childs) Date: Thu, 9 Feb 2017 15:34:25 +0000 Subject: [gpfsug-discuss] AFM OpenFiles In-Reply-To: References: , Message-ID: 4.2.1.1 or CentOs 7. So that might account for it. Thanks Peter Childs ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Venkateswara R Puvvada Sent: Thursday, February 9, 2017 3:10:58 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] AFM OpenFiles What is the version of GPFS ? There was an issue fixed in Spectrum Scale 4.2.2 for file count(file_nr) leak. This issue mostly happens on Linux kernel version >= 3.6. ~Venkat (vpuvvada at in.ibm.com) From: Peter Childs To: gpfsug main discussion list Date: 02/09/2017 08:00 PM Subject: [gpfsug-discuss] AFM OpenFiles Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ We are trying to preform a file migration from our old GPFS cluster to our New GPFS Cluster using AFM. Currently we have 142 AFM Filesets setup one for each fileset on the old cluster, and are attempting to prefetch the files. in batched of 100,000 files with "mmafmctl home prefetch -j $fileset --list-file=$curfile --home-fs-path=/data/$fileset 2>&1" I'm doing this on a separate gateway node from our main gpfs servers and its work quiet well, However there seams to be a leak in AFM with file handles and after a couple of days of prefetch the gateway will run out of file handles and need rebooting before we can continue. We thought to begin with this was improved by not doing --metadata-only on the prefetch but in fact (As we where attempting to get the metadata before getting the main data) but in truth the machine was just lasting a little longer. Does anyone know of any setting that may help this or what is wrong? Thanks Peter Childs _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From olaf.weiser at de.ibm.com Thu Feb 9 15:34:55 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Thu, 9 Feb 2017 16:34:55 +0100 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Message-ID: An HTML attachment was scrubbed... URL: From jonathon.anderson at colorado.edu Thu Feb 9 17:32:55 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Thu, 9 Feb 2017 17:32:55 +0000 Subject: [gpfsug-discuss] CES doesn't assign addresses to nodes In-Reply-To: <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> References: <52DA499E-6C85-4136-93FA-F691DDE714E4@colorado.edu> <24AE8C99-6452-470A-A3BC-23579B1D557D@siriuscom.com> Message-ID: I was thinking that whether or not CES knows your nodes are up or not is dependent on how recently they were added to the cluster; but I?m starting to wonder if it?s dependent on the order in which nodes are brought up. Presumably you are running your CES nodes in a GPFS cluster with a large number of nodes? What happens if you bring your CES nodes up earlier (e.g., before your compute nodes)? 
From: on behalf of "Mark.Bush at siriuscom.com"
Reply-To: gpfsug main discussion list
Date: Thursday, February 9, 2017 at 7:40 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Has any headway been made on this issue? I just ran into it as well. The CES IP addresses just disappeared from my two protocol nodes (4.2.2.0).

From: on behalf of Olaf Weiser
Reply-To: gpfsug main discussion list
Date: Thursday, February 2, 2017 at 12:02 PM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

pls contact me directly olaf.weiser at de.ibm.com

Mit freundlichen Grüßen / Kind regards

Olaf Weiser
EMEA Storage Competence Center Mainz, Germany / IBM Systems, Storage Platform

From: Jonathon A Anderson
To: gpfsug main discussion list
Date: 02/02/2017 06:45 PM
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by: gpfsug-discuss-bounces at spectrumscale.org

Any chance I can get that PMR# also, so I can reference it in my DDN case?

~jonathon

From: on behalf of Olaf Weiser
Reply-To: gpfsug main discussion list
Date: Wednesday, February 1, 2017 at 2:28 AM
To: gpfsug main discussion list
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes

Pmr opened... send the # directly to you.

From: Jonathon A Anderson
To: "gpfsug-discuss at spectrumscale.org"
Date: 01/30/2017 11:11 PM
Subject: Re: [gpfsug-discuss] CES doesn't assign addresses to nodes
Sent by: gpfsug-discuss-bounces at spectrumscale.org

In trying to figure this out on my own, I'm relatively certain I've found a bug in GPFS related to the truncation of output from `tsctl shownodes up`. Any chance someone in development can confirm? Here are the details of my investigation:

## GPFS is up on sgate2

[root at sgate2 ~]# mmgetstate
Node number  Node name   GPFS state
------------------------------------------
       414   sgate2-opa  active

## but if I tell ces to explicitly put one of our ces addresses on that node, it says that GPFS is down

[root at sgate2 ~]# mmces address move --ces-ip 10.225.71.102 --ces-node sgate2-opa
mmces address move: GPFS is down on this node.
mmces address move: Command failed. Examine previous error messages to determine cause.

## the "GPFS is down on this node." message is defined as code 109 in mmglobfuncs

[root at sgate2 ~]# grep --before-context=1 "GPFS is down on this node." /usr/lpp/mmfs/bin/mmglobfuncs
109 ) msgTxt=\
"%s: GPFS is down on this node."

## and is generated by printErrorMsg in mmcesnetmvaddress when it detects that the current node is identified as "down" by getDownCesNodeList

[root at sgate2 ~]# grep --before-context=5 'printErrorMsg 109' /usr/lpp/mmfs/bin/mmcesnetmvaddress
downNodeList=$(getDownCesNodeList)
for downNode in $downNodeList
do
  if [[ $toNodeName == $downNode ]]
  then
    printErrorMsg 109 "$mmcmd"

## getDownCesNodeList is the intersection of all ces nodes with GPFS cluster nodes listed in `tsctl shownodes up`

[root at sgate2 ~]# grep --after-context=16 '^function getDownCesNodeList' /usr/lpp/mmfs/bin/mmcesfuncs
function getDownCesNodeList
{
  typeset sourceFile="mmcesfuncs.sh"
  [[ -n $DEBUG || -n $DEBUGgetDownCesNodeList ]] && set -x
  $mmTRACE_ENTER "$*"
  typeset upnodefile=${cmdTmpDir}upnodefile
  typeset downNodeList

  # get all CES nodes
  $sort -o $nodefile $mmfsCesNodes.dae
  $tsctl shownodes up | $tr ',' '\n' | $sort -o $upnodefile
  downNodeList=$($comm -23 $nodefile $upnodefile)
  print -- $downNodeList
} #----- end of function getDownCesNodeList --------------------

## but not only are the sgate nodes not listed by `tsctl shownodes up`; its output is obviously and erroneously truncated

[root at sgate2 ~]# tsctl shownodes up | tr ',' '\n' | tail
shas0251-opa.rc.int.colorado.edu
shas0252-opa.rc.int.colorado.edu
shas0253-opa.rc.int.colorado.edu
shas0254-opa.rc.int.colorado.edu
shas0255-opa.rc.int.colorado.edu
shas0256-opa.rc.int.colorado.edu
shas0257-opa.rc.int.colorado.edu
shas0258-opa.rc.int.colorado.edu
shas0259-opa.rc.int.colorado.edu
shas0260-opa.rc.int.col[root at sgate2 ~]#

## I expect that this is a bug in GPFS, likely related to a maximum output buffer for `tsctl shownodes up`.

On 1/24/17, 12:48 PM, "Jonathon A Anderson" wrote:

I think I'm having the same issue described here:

http://www.spectrumscale.org/pipermail/gpfsug-discuss/2016-October/002288.html

Any advice or further troubleshooting steps would be much appreciated. Full disclosure: I also have a DDN case open. (78804)

We've got a four-node (snsd{1..4}) DDN gridscaler system. I'm trying to add two CES protocol nodes (sgate{1,2}) to serve NFS. Here are the steps I took:

---
mmcrnodeclass protocol -N sgate1-opa,sgate2-opa
mmcrnodeclass nfs -N sgate1-opa,sgate2-opa
mmchconfig cesSharedRoot=/gpfs/summit/ces
mmchcluster --ccr-enable
mmchnode --ces-enable -N protocol
mmces service enable NFS
mmces service start NFS -N nfs
mmces address add --ces-ip 10.225.71.104,10.225.71.105
mmces address policy even-coverage
mmces address move --rebalance
---

This worked the very first time I ran it, but the CES addresses weren't re-distributed after restarting GPFS or a node reboot.
Things I've tried:

* disabling ces on the sgate nodes and re-running the above procedure
* moving the cluster and filesystem managers to different snsd nodes
* deleting and re-creating the cesSharedRoot directory

Meanwhile, the following log entry appears in mmfs.log.latest every ~30s:

---
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.104
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Found unassigned address 10.225.71.105
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: handleNetworkProblem with lock held: assignIP 10.225.71.104_0-_+,10.225.71.105_0-_+ 1
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: Assigning addresses: 10.225.71.104_0-_+,10.225.71.105_0-_+
Mon Jan 23 20:31:20 MST 2017: mmcesnetworkmonitor: moveCesIPs: 10.225.71.104_0-_+,10.225.71.105_0-_+
---

Also notable, whenever I add or remove addresses now, I see this in mmsysmonitor.log (among a lot of other entries):

---
2017-01-23T20:40:56.363 sgate1 D ET_cesnetwork Entity state without requireUnique: ces_network_ips_down WARNING No CES relevant NICs detected - Service.calculateAndUpdateState:275
2017-01-23T20:40:11.364 sgate1 D ET_cesnetwork Update multiple entities at once {'p2p2': 1, 'bond0': 1, 'p2p1': 1} - Service.setLocalState:333
---

For the record, here's the interface I expect to get the address on sgate1:

---
11: bond0: mtu 9000 qdisc noqueue state UP
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
    inet 10.225.71.107/20 brd 10.225.79.255 scope global bond0
       valid_lft forever preferred_lft forever
    inet6 fe80::3efd:feff:fe08:a7c0/64 scope link
       valid_lft forever preferred_lft forever
---

which is a bond of p2p1 and p2p2.

---
6: p2p1: mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
7: p2p2: mtu 9000 qdisc mq master bond0 state UP qlen 1000
    link/ether 3c:fd:fe:08:a7:c0 brd ff:ff:ff:ff:ff:ff
---

A similar bond0 exists on sgate2.

I crawled around in /usr/lpp/mmfs/lib/mmsysmon/CESNetworkService.py for a while trying to figure it out, but have been unsuccessful so far.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
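Until the underlying defect is fixed, a crude watchdog along these lines can flag when the up-node list a protocol node sees has been truncated. The node-name patterns are taken from the observations above, not from any official limit, so treat this purely as a sketch to adapt:

  #!/bin/bash
  # Flag the truncation symptom described in this thread.
  out=$(tsctl shownodes up)
  bytes=$(printf '%s' "$out" | wc -c)
  up=$(printf '%s\n' "$out" | tr ',' '\n' | wc -l)
  total=$(mmlscluster | grep -c -- '-opa')          # adjust to match your daemon node names
  last=$(printf '%s\n' "$out" | tr ',' '\n' | tail -n 1)

  echo "tsctl shownodes up: $bytes bytes, $up nodes (mmlscluster sees $total)"
  if [ "$up" -lt "$total" ]; then
      echo "WARNING: up-node list is shorter than the cluster node list"
  fi
  case "$last" in
      *.rc.int.colorado.edu) ;;                      # example suffix; replace with your domain
      *) echo "WARNING: last entry '$last' looks truncated" ;;
  esac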
From Mark.Bush at siriuscom.com Fri Feb 10 16:33:26 2017
From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com)
Date: Fri, 10 Feb 2017 16:33:26 +0000
Subject: [gpfsug-discuss] Reverting to older versions

Is there a documented way to go down a level of GPFS code? For example, since 4.2.2.x has broken my protocol nodes, is there a straightforward way to revert back to 4.2.1.x? Can I just stop my cluster, remove the RPMs, and add the older-version RPMs?

Mark R. Bush | Storage Architect
Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush
10100 Reunion Place, Suite 500, San Antonio, TX 78216
www.siriuscom.com | mark.bush at siriuscom.com

From S.J.Thompson at bham.ac.uk Fri Feb 10 16:51:43 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services))
Date: Fri, 10 Feb 2017 16:51:43 +0000
Subject: [gpfsug-discuss] Reverting to older versions

Is it the 4.2.2 code or the protocol packages that broke?

We found the 4.2.2.0 SMB packages don't work for us. We just reverted to the older SMB packages. Support have advised us to try the 4.2.2.1 packages, but it means a service break to upgrade protocol packages, so we are trying to schedule it in.

Simon
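Since the answer to "what broke" determines how much has to be rolled back, it can help to first capture exactly which GPFS and protocol package levels each protocol node is running. A quick sketch, assuming the predefined cesNodes node class and the standard package names; the SMB-only rollback commented below is illustrative and presumes the older rpm is still available locally:

  # Record GPFS core and protocol package levels on every CES node.
  mmdsh -N cesNodes 'hostname; rpm -qa | egrep "^gpfs\.|nfs-ganesha" | sort'

  # If only the SMB package is at fault, a rollback of just that package
  # on the protocol nodes (service outage) might look like:
  #   mmces service stop SMB -a
  #   yum downgrade gpfs.smb
  #   mmces service start SMB -a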
From olaf.weiser at de.ibm.com Fri Feb 10 16:57:23 2017
From: olaf.weiser at de.ibm.com (Olaf Weiser)
Date: Fri, 10 Feb 2017 17:57:23 +0100
Subject: [gpfsug-discuss] Reverting to older versions

An HTML attachment was scrubbed...

From duersch at us.ibm.com Fri Feb 10 17:05:23 2017
From: duersch at us.ibm.com (Steve Duersch)
Date: Fri, 10 Feb 2017 17:05:23 +0000
Subject: [gpfsug-discuss] Reverting to older versions

An HTML attachment was scrubbed...

From Mark.Bush at siriuscom.com Fri Feb 10 17:08:48 2017
From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com)
Date: Fri, 10 Feb 2017 17:08:48 +0000
Subject: [gpfsug-discuss] Reverting to older versions

Excellent. Thanks to all.

From: on behalf of Steve Duersch
Reply-To: gpfsug main discussion list
Date: Friday, February 10, 2017 at 11:05 AM
To: "gpfsug-discuss at spectrumscale.org"
Subject: Re: [gpfsug-discuss] Reverting to older versions

See chapter 12 of the Concepts, Planning, and Installation guide. There is a section on reverting to a previous version.
https://www.ibm.com/support/knowledgecenter/STXKQY/ibmspectrumscale_content.html

Steve Duersch
Spectrum Scale
845-433-7902
IBM Poughkeepsie, New York
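For the archive: the guide Steve points to describes the supported revert procedure and is authoritative. Purely as an illustration, and only if the cluster has not yet been committed to the new level (no mmchconfig release=LATEST or mmchfs -V after the upgrade), the rpm-level flow looks roughly like this; package names follow the rpm listing shown earlier in this thread, and the yum downgrade assumes the older rpms are still in a reachable repository:

  # Sketch of a full downgrade; consult the documented procedure first.
  mmshutdown -a                         # stop GPFS across the cluster (outage)
  # then, on each node, drop back to the 4.2.1.x packages, e.g.:
  rpm -qa | grep ^gpfs                  # record current levels first
  yum downgrade gpfs.base gpfs.gpl gpfs.gskit gpfs.ext gpfs.msg.en_US gpfs.docs
  mmbuildgpl                            # rebuild the portability layer against the running kernel
  mmstartup -a

  # If a prebuilt gpfs.gplbin kernel-module rpm is used instead of mmbuildgpl,
  # install the matching older gpfs.gplbin package instead.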
From zgiles at gmail.com Fri Feb 10 21:56:55 2017
From: zgiles at gmail.com (Zachary Giles)
Date: Fri, 10 Feb 2017 16:56:55 -0500
Subject: [gpfsug-discuss] Questions about mmap GPFS and compression

Hello All,

I've been seeing some less than desirable behavior with mmap and compression in GPFS. Curious if others see similar or have any ideas if this is accurate.. The guys here want me to open an IBM ticket, but I figured I'd see if anyone has had this experience before.

We have an internally developed app that runs on our cluster referencing data sitting in GPFS. It is using mmap to access the files due to a library we're using that requires it.

If we run the app against some data on GPFS, it performs well.. finishing in a few minutes time -- Great. However, if we compress the file (in GPFS), the app is still running after 2 days time. stracing the app shows that it is polling on a file descriptor, forever.. as if a data block is still pending.

I know mmap is supported with compression according to the manual (with some stipulations), and that performance is expected to be much less since it's more large-block oriented due to decompression in groups.. no problem. But it seems like some data should get returned.

I'm surprised to find that a very small amount of data is sitting in the buffers (mmfsadm dump buffers) in reference to the inodes. The decompression thread is running continuously, while the app is still polling for data from memory and sleeping, retrying, sleeping, repeat.

What I believe is happening is that the 4k pages are being pulled out of large decompression groups from an mmap read request, put in the buffer, then the compression group data is thrown away since it has the result it wants, only to need another piece of data that would have been in that group slightly later, which is recalled, put in the buffer.. etc. Thus an infinite slowdown. Perhaps also the data is expiring out of the buffer before the app has a chance to read it. I can't tell. In any case, the app makes zero progress.

I tried without our app, using fio.. mmap on an uncompressed file with 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not impressive). However, on a compressed file it is only 20KB/s max. (far less impressive). Reading a file using aio etc is over 3GB/s on a single thread without even trying.

What do you think? Anyone see anything like this? Perhaps there are some tunings to waste a bit more memory on cached blocks rather than make decompression recycle?
I've searched back the archives a bit. There's a May 2013 thread about slowness as well. I think we're seeing much much less than that. Our page pools are of decent size. It's not just slowness, it's as if the app never gets a block back at all. (We could handle slowness..)

Thanks. Open to ideas..

-Zach Giles

From mweil at wustl.edu Sat Feb 11 18:32:54 2017
From: mweil at wustl.edu (Matt Weil)
Date: Sat, 11 Feb 2017 12:32:54 -0600
Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later

https://access.redhat.com/solutions/2437991

I ran into this issue the other day even with the echo "4096" > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that larger to get to the 2M IO size, so I never really saw this issue until the other day. I may have triggered it myself because I was adding new storage.

Was wondering what version of GPFS fixes this. I really do not want to step back to an older kernel version.

Thanks
Matt

From leoluan at us.ibm.com Sat Feb 11 22:23:24 2017
From: leoluan at us.ibm.com (Leo Luan)
Date: Sat, 11 Feb 2017 22:23:24 +0000
Subject: [gpfsug-discuss] Questions about mmap GPFS and compression

An HTML attachment was scrubbed...

From janfrode at tanso.net Sun Feb 12 17:30:38 2017
From: janfrode at tanso.net (Jan-Frode Myklebust)
Date: Sun, 12 Feb 2017 18:30:38 +0100
Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later

The 4.2.2.2 readme says:

* Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device.

-jf
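As a companion to the readme note, one way to see whether a given path is exposed is to compare max_sectors_kb with max_hw_sectors_kb on the underlying devices, in the spirit of Matt's echo above. A throwaway sketch; the device names and the 4096 value are examples only, and the 4.2.2.2 fix (or the technote Bob mentions next) is the real answer rather than this tuning:

  # Show current limits on every SCSI block device queue.
  for q in /sys/block/sd*/queue; do
      dev=$(basename "$(dirname "$q")")
      echo "$dev: max_sectors_kb=$(cat "$q/max_sectors_kb") max_hw_sectors_kb=$(cat "$q/max_hw_sectors_kb")"
  done

  # Example adjustment (non-persistent; a udev rule is needed to survive reboots):
  # for dev in sdb sdc sdd; do echo 4096 > /sys/block/$dev/queue/max_sectors_kb; done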
If you have received this email > in error, please immediately notify the sender via telephone or return mail. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From mweil at wustl.edu Mon Feb 13 15:46:27 2017 From: mweil at wustl.edu (Matt Weil) Date: Mon, 13 Feb 2017 09:46:27 -0600 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later In-Reply-To: References: <28f0013b-19a5-98fa-e348-a2f5cd70860a@wustl.edu> Message-ID: excellent Thanks. On 2/12/17 11:30 AM, Jan-Frode Myklebust wrote: The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -jf On Sat, Feb 11, 2017 at 7:32 PM, Matt Weil > wrote: https://access.redhat.com/solutions/2437991 I ran into this issue the other day even with the echo "4096" > /sys/block/$ii/queue/max_sectors_kb; in place. I have always made that larger to get to the 2M IO size. So I never really seen this issue until the other day. I may have triggered it myself because I was adding new storage. Was wondering what version of GPFS fixes this. I really do not want to step back to and older kernel version. Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Feb 13 15:49:07 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 13 Feb 2017 15:49:07 +0000 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later Message-ID: Alas, I ran into this as well ? only seems to impact some my older JBOD storage. The fix is vague, should I be worried about this turning up later, or will it happen right away? 
(if it does) Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Matt Weil Reply-To: gpfsug main discussion list Date: Monday, February 13, 2017 at 9:46 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Mon Feb 13 17:00:10 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Mon, 13 Feb 2017 17:00:10 +0000 Subject: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later Message-ID: <34F66C99-B56D-4742-8C40-B6377B914FC0@nuance.com> See this technote for an alternative fix and details: http://www-01.ibm.com/support/docview.wss?uid=isg3T1024840&acss=danl_4184_web Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Matt Weil Reply-To: gpfsug main discussion list Date: Monday, February 13, 2017 at 9:46 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] Getting 'blk_cloned_rq_check_limits: over max size limit' errors after updating the systems to kernel 2.6.32-642.el6 or later The 4.2.2.2 readme says: * Fix a multipath device failure that reads "blk_cloned_rq_check_limits: over max size limit" which can occur when kernel function bio_get_nr_vecs() returns a value which is larger than the value of max sectors of the block device. -------------- next part -------------- An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Mon Feb 13 17:27:55 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Mon, 13 Feb 2017 12:27:55 -0500 Subject: [gpfsug-discuss] mmbackup examples using policy Message-ID: Anyone have any examples of this? I have a filesystem that has 2 pools and several filesets and would like daily progressive incremental backups of its contents. I found some stuff here(nothing real close to what I wanted however): /usr/lpp/mmfs/samples/ilm I have the tsm client installed on the server nsds. Thanks much -------------- next part -------------- An HTML attachment was scrubbed... URL: From usa-principal at gpfsug.org Tue Feb 14 06:07:05 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Mon, 13 Feb 2017 22:07:05 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: References: Message-ID: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> Just a follow up reminder to save the date, April 4-5, for a two-day Spectrum Scale Users Group event hosted by NERSC in Berkeley, California. We are working on the registration form and agenda and hope to be able to share more details soon. Best, Kristy & Bob On , usa-principal-gpfsug.org wrote: > Hello all and happy new year (depending upon where you are right now > :-) ). > > We'll have more details in 2017, but for now please save the date for > a two-day users group meeting at NERSC in Berkeley, California. 
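On the C-state/frequency theory: before pinning anything, it is worth simply watching what the cores do while the NSD server is idle versus artificially loaded. A rough sampling sketch using standard sysfs paths; turbostat or cpupower, where installed, give the same information more conveniently, and the intel_idle path assumes the driver typical of a RHEL-era Sandy Bridge box:

  #!/bin/bash
  # Sample per-core governor and current frequency a few times to see whether
  # cores are downclocking or parking between I/O bursts.
  for sample in 1 2 3; do
      date
      for cpu in /sys/devices/system/cpu/cpu0 /sys/devices/system/cpu/cpu1; do
          gov=$(cat "$cpu/cpufreq/scaling_governor" 2>/dev/null)
          khz=$(cat "$cpu/cpufreq/scaling_cur_freq" 2>/dev/null)
          echo "  $(basename "$cpu"): governor=${gov:-n/a} cur_freq_kHz=${khz:-n/a}"
      done
      cat /sys/module/intel_idle/parameters/max_cstate 2>/dev/null \
          | sed 's/^/  intel_idle.max_cstate=/'
      sleep 5
  done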
> > April 4-5, 2017 > National Energy Research Scientific Computing Center (nersc.gov) > Berkeley, California > > We look forward to offering our first two-day event in the US. > > Best, > Kristy & Bob From zgiles at gmail.com Tue Feb 14 16:10:13 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 14 Feb 2017 11:10:13 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: Message-ID: Hi Leo, I agree with your view on compression and what it should be used for, in general. The read bandwidth amplification is definitely something we're seeing. Just a little more background on the files: The files themselves are not "cold" (archive), however, they are very lightly used. The data set is thousands of files that are each 100-200GB, totaling about a PB. the read pattern is a few GB from about 20% of the files once a month. So the total read is only several TB out of a PB every month. ( approximate ). We can get a compression of about 5:1 using GPFS with these files, so we can gain back 800TB with compression. The total run time of the app (reading all those all chunks, when uncompressed) is maybe an hour total. Although leaving the files uncompressed would let the app work, there's a huge gain to be had if we can make compression work by saving ~800TB As it's such a small amount of data read each time, and also not too predictable (it's semi-random historical), and as the length of the job is short enough, it's hard to justify decompressing large chunks of the system to run 1 job. I would have to decompress 200TB to read 10TB, recompress them, and decompress a different (overlapping) 200TB next month. The compression / decompression of sizable portions of the data takes days. I think there maybe more of an issue that just performance though.. The decompression thread is running, internal file metadata is read fine, most of the file is read fine. Just at times it gets stuck.. the decompression thread is running in GPFS, the app is polling, it just never comes back with the block. I feel like there's a race condition here where a block is read, available for the app, but thrown away before the app can read it, only to be decompressed again. It's strange how some block positions are slow (expected) and others just never come back (it will poll for days on a certain address). However, reading the file in-order is fine. Is this a block caching issue? Can we tune up the amount of blocks kept? I think with mmap the blocks are not kept in page pool, correct? -Zach On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan wrote: > Hi Zachary, > > When a compressed file is mmapped, each 4K read in your tests causes the > accessed part of the file to be decompressed (in the granularity of 10 GPFS > blocks). For usual file sizes, the parts being accessed will be > decompressed and IOs speed will be normal except for the first 4K IO in each > 10-GPFS-block group. For very large files, a large percentage of small > random IOs may keep getting amplified to 10-block decompression IO for a > long time. This is probably what happened in your mmap application run. > > The suggestion is to not compress files until they have become cold (not > likely to be accessed any time soon) and avoid compressing very large files > that may be accessed through mmap later. The product already has a built-in > protection preventing compression of files that are mmapped at compression > time. 
You can add an exclude rule in the compression policy run for files > that are identified to have mmap performance issues (in case they get > mmapped after being compressed in a periodical policy run). > > Leo Luan > > From: Zachary Giles > To: gpfsug main discussion list > Date: 02/10/2017 01:57 PM > Subject: [gpfsug-discuss] Questions about mmap GPFS and compression > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ________________________________ > > > > Hello All, > > I've been seeing some less than desirable behavior with mmap and > compression in GPFS. Curious if others see similar or have any ideas > if this is accurate.. > The guys here want me to open an IBM ticket, but I figured I'd see if > anyone has had this experience before. > > We have an internally developed app that runs on our cluster > referencing data sitting in GPFS. It is using mmap to access the files > due to a library we're using that requires it. > > If we run the app against some data on GPFS, it performs well.. > finishing in a few minutes time -- Great. However, if we compress the > file (in GPFS), the app is still running after 2 days time. > stracing the app shows that is polling on a file descriptor, forever.. > as if a data block is still pending. > > I know mmap is supported with compression according to the manual > (with some stipulations), and that performance is expected to be much > less since it's more large-block oriented due to decompressed in > groups.. no problem. But it seems like some data should get returned. > > I'm surprised to find that a very small amount of data is sitting in > the buffers (mmfsadm dump buffers) in reference to the inodes. The > decompression thread is running continuously, while the app is still > polling for data from memory and sleeping, retrying, sleeping, repeat. > > What I believe is happening is that the 4k pages are being pulled out > of large decompression groups from an mmap read request, put in the > buffer, then the compression group data is thrown away since it has > the result it wants, only to need another piece of data that would > have been in that group slightly later, which is recalled, put in the > buffer.. etc. Thus an infinite slowdown. Perhaps also the data is > expiring out of the buffer before the app has a chance to read it. I > can't tell. In any case, the app makes zero progress. > > I tried without our app, using fio.. mmap on an uncompressed file with > 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not > impressive). However, on a compressed file it is only 20KB/s max. ( > far less impressive ). Reading a file using aio etc is over 3GB/s on a > single thread without even trying. > > What do you think? > Anyone see anything like this? Perhaps there are some tunings to waste > a bit more memory on cached blocks rather than make decompression > recycle? > > I've searched back the archives a bit. There's a May 2013 thread about > slowness as well. I think we're seeing much much less than that. Our > page pools are of decent size. Its not just slowness, it's as if the > app never gets a block back at all. ( We could handle slowness .. ) > > Thanks. Open to ideas.. 
> > -Zach Giles > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Zach Giles zgiles at gmail.com From zgiles at gmail.com Tue Feb 14 16:25:09 2017 From: zgiles at gmail.com (Zachary Giles) Date: Tue, 14 Feb 2017 11:25:09 -0500 Subject: [gpfsug-discuss] read replica fastest tuning for short distance Message-ID: Hello all, ( Making good use of the mailing list recently.. :) ) I have two datacenters that are fairly close to each other (about 0.5ms away by-the-wire) and have a fairly small pipe between them ( single 40Gbit ). There is a stretched filesystem between the datacenters, two failure groups, and replicas=2 on all data and metadata. I'm trying to ensure that clients on each side only read their local replica instead of filling the pipe with reads from the other side. While readreplica=local would make sense, text suggests that it mostly checks to see if you're in the same subnet to check for local reads. This won't work for me since there are many many subnets on each side. The newer option of readreplica=fastest looks like a good idea, except that the latency of the connection between the datacenters is so small compared to the disk latency that reads often come from the wrong side. I've tried tuning fastestPolicyCmpThreshold down to 5 and fastestPolicyMinDiffPercent down to 10, but I still see reads from both sides. Does anyone have any pointers for tuning read replica using fastest on close-by multidatacenter installs to help ensure reads are only from one side? Any numbers that have been shown to work? I haven't been able to find a way to inspect the GPFS read latencies that it is using to make the decision. I looked in the dumps, but don't seem to see anything. Anyone know if it's possible and where they are? Thanks -Zach -- Zach Giles zgiles at gmail.com From usa-principal at gpfsug.org Tue Feb 14 19:29:04 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Tue, 14 Feb 2017 11:29:04 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> References: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> Message-ID: <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> I should have also asked for anyone interested in giving a talk, as usual, the users group meeting is not meant to be used as a sales and marketing platform, but user experiences are always welcome. If you're interested, or have an idea for a talk, please let us know so we can include it in the agenda. Thanks, Kristy & Bob On , usa-principal-gpfsug.org wrote: > Just a follow up reminder to save the date, April 4-5, for a two-day > Spectrum Scale Users Group event hosted by NERSC in Berkeley, > California. > > We are working on the registration form and agenda and hope to be able > to share more details soon. > > Best, > Kristy & Bob > > > On , usa-principal-gpfsug.org wrote: >> Hello all and happy new year (depending upon where you are right now >> :-) ). >> >> We'll have more details in 2017, but for now please save the date for >> a two-day users group meeting at NERSC in Berkeley, California. >> >> April 4-5, 2017 >> National Energy Research Scientific Computing Center (nersc.gov) >> Berkeley, California >> >> We look forward to offering our first two-day event in the US. 
>> >> Best, >> Kristy & Bob From mweil at wustl.edu Tue Feb 14 20:17:36 2017 From: mweil at wustl.edu (Matt Weil) Date: Tue, 14 Feb 2017 14:17:36 -0600 Subject: [gpfsug-discuss] GUI access Message-ID: Hello all, Some how we misplaced the password for our dev instance. Is there any way to reset it? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. From r.sobey at imperial.ac.uk Tue Feb 14 20:31:16 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 14 Feb 2017 20:31:16 +0000 Subject: [gpfsug-discuss] GUI access In-Reply-To: References: Message-ID: Hi Matt This is what I got from support a few months ago when I had a problem with our "admin" user disappearing. "We have occasionally seen this issue in the past where it has been resolved by : /usr/lpp/mmfs/gui/cli/mkuser admin -p Passw0rd -g Administrator,SecurityAdmin This creates a new user named "admin" with the password "Passw0rd" " I was running 4.2.1-0 at the time iirc. ________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of Matt Weil Sent: 14 February 2017 20:17 To: gpfsug-discuss at spectrumscale.org Subject: [gpfsug-discuss] GUI access Hello all, Some how we misplaced the password for our dev instance. Is there any way to reset it? Thanks Matt ________________________________ The materials in this message are private and may contain Protected Healthcare Information or other information of a sensitive nature. If you are not the intended recipient, be advised that any unauthorized use, disclosure, copying or the taking of any action in reliance on the contents of this information is strictly prohibited. If you have received this email in error, please immediately notify the sender via telephone or return mail. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From sfadden at us.ibm.com Tue Feb 14 21:02:06 2017 From: sfadden at us.ibm.com (Scott Fadden) Date: Tue, 14 Feb 2017 21:02:06 +0000 Subject: [gpfsug-discuss] GUI access In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From leoluan at us.ibm.com Wed Feb 15 00:14:12 2017 From: leoluan at us.ibm.com (Leo Luan) Date: Wed, 15 Feb 2017 00:14:12 +0000 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: , Message-ID: An HTML attachment was scrubbed... URL: From eric.wonderley at vt.edu Wed Feb 15 13:17:40 2017 From: eric.wonderley at vt.edu (J. Eric Wonderley) Date: Wed, 15 Feb 2017 08:17:40 -0500 Subject: [gpfsug-discuss] Fw: mmbackup examples using policy In-Reply-To: References: Message-ID: Hi Steven: Yes that is more or less what we want to do. We have tivoli here for backup so I'm somewhat familiar with inclexcl files. The filesystem I want to backup is a shared home. Right now I do have a policy...mmlspolicy home -L does return a policy. 
So if I did not want to backup core and cache files I could create a backup policy using /var/mmfs/mmbackup/.mmbackupRules.home and place in it?: EXCLUDE "/gpfs/home/.../core" EXCLUDE "/igpfs/home/.../.opera/cache4" EXCLUDE "/gpfs/home/.../.netscape/cache/.../*" EXCLUDE "/gpfs/home/.../.mozilla/default/.../Cache" EXCLUDE "/gpfs/home/.../.mozilla/.../Cache/*" EXCLUDE "/gpfs/home/.../.mozilla/.../Cache" EXCLUDE "/gpfs/home/.../.cache/mozilla/*" EXCLUDE.DIR "/gpfs/home/.../.mozilla/firefox/.../Cache" I did a test run of mmbackup and I noticed I got a template put in that location: [root at cl002 ~]# ll -al /var/mmfs/mmbackup/ total 12 drwxr-xr-x 2 root root 4096 Feb 15 07:43 . drwxr-xr-x 10 root root 4096 Jan 4 10:42 .. -r-------- 1 root root 1177 Feb 15 07:43 .mmbackupRules.home So I can copy this off into /var/mmfs/etc for example and to use next time with my edits. What is normally used to schedule the mmbackup? Cronjob? dsmcad? Thanks much. On Tue, Feb 14, 2017 at 11:21 AM, Steven Berman wrote: > Eric, > What specifically do you wish to accomplish? It sounds to me like > you want to use mmbackup to do incremental backup of parts or all of your > file system. But your question did not specify what specifically other > than "whole file system incremental" you want to accomplish. Mmbackup by > default, with "-t incremental" will back up the whole file system, > including all filesets of either variety, and without regard to storage > pools. If you wish to back up only a sub-tree of the file system, it must > be in an independent fileset (--inode-space=new) and the current product > supports doing the backup of just that fileset. If you want to backup > parts of the file system but exclude things in certain storage pools, from > anywhere in the tree, you can either use "include exclude rules" in your > Spectrum Protect (formerly TSM) configuration file, or you can hand-edit > the policy rules for mmbackup which can be copied from /var/mmfs/mmbackup/.mmbackupRules. system name> (only persistent during mmbackup execution). Copy that > file to a new location, hand-edit and run mmbackup next time with -P policy rules file>. Is there something else you want to accomplish? > > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/ > com.ibm.spectrum.scale.v4r22.doc/bl1adv_semaprul.htm > https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.2/ > com.ibm.spectrum.scale.v4r22.doc/bl1adm_backupusingmmbackup.htm > > Steven Berman Spectrum Scale / HPC General Parallel File > System Dev. > Pittsburgh, PA (412) 667-6993 Tie-Line 989-6993 > sberman at us.ibm.com > ----Every once in a while, it is a good idea to call out, "Computer, end > program!" just to check. --David Noelle > ----All Your Base Are Belong To Us. --CATS > > > > > > From: "J. Eric Wonderley" > To: gpfsug main discussion list > > Date: 02/13/2017 10:28 AM > Subject: [gpfsug-discuss] mmbackup examples using policy > Sent by: gpfsug-discuss-bounces at spectrumscale.org > ------------------------------ > > > > Anyone have any examples of this? I have a filesystem that has 2 pools > and several filesets and would like daily progressive incremental backups > of its contents. > > I found some stuff here(nothing real close to what I wanted however): > /usr/lpp/mmfs/samples/ilm > > I have the tsm client installed on the server nsds. 
> > Thanks much_______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Wed Feb 15 16:43:37 2017 From: zgiles at gmail.com (Zachary Giles) Date: Wed, 15 Feb 2017 11:43:37 -0500 Subject: [gpfsug-discuss] Questions about mmap GPFS and compression In-Reply-To: References: Message-ID: Just checked, we are definitely using PROT_READ, and the users only have read permission to the files, so it should be purely read. I guess that furthers the concern since we shouldn't be seeing the IO overhead as you mentioned. We also use madvise.. not sure if that helps or hurts. On Tue, Feb 14, 2017 at 7:14 PM, Leo Luan wrote: > Does your internally developed application do only reads during in its > monthly run? If so, can you change it to use PROT_READ flag during the > mmap call? That way you will not get the 10-block decompression IO overhead > and your files will remain compressed. The decompression happens upon > pagein's only if the mmap call includes the PROT_WRITE flag (or upon actual > writes for non-mmap IOs). > > Leo > > > ----- Original message ----- > From: Zachary Giles > Sent by: gpfsug-discuss-bounces at spectrumscale.org > To: gpfsug main discussion list > Cc: > Subject: Re: [gpfsug-discuss] Questions about mmap GPFS and compression > Date: Tue, Feb 14, 2017 8:10 AM > > Hi Leo, > > I agree with your view on compression and what it should be used for, > in general. The read bandwidth amplification is definitely something > we're seeing. > > Just a little more background on the files: > The files themselves are not "cold" (archive), however, they are very > lightly used. The data set is thousands of files that are each > 100-200GB, totaling about a PB. the read pattern is a few GB from > about 20% of the files once a month. So the total read is only several > TB out of a PB every month. ( approximate ). We can get a compression > of about 5:1 using GPFS with these files, so we can gain back 800TB > with compression. The total run time of the app (reading all those all > chunks, when uncompressed) is maybe an hour total. > > Although leaving the files uncompressed would let the app work, > there's a huge gain to be had if we can make compression work by > saving ~800TB As it's such a small amount of data read each time, and > also not too predictable (it's semi-random historical), and as the > length of the job is short enough, it's hard to justify decompressing > large chunks of the system to run 1 job. I would have to decompress > 200TB to read 10TB, recompress them, and decompress a different > (overlapping) 200TB next month. The compression / decompression of > sizable portions of the data takes days. > > I think there maybe more of an issue that just performance though.. > The decompression thread is running, internal file metadata is read > fine, most of the file is read fine. Just at times it gets stuck.. the > decompression thread is running in GPFS, the app is polling, it just > never comes back with the block. I feel like there's a race condition > here where a block is read, available for the app, but thrown away > before the app can read it, only to be decompressed again. > It's strange how some block positions are slow (expected) and others > just never come back (it will poll for days on a certain address). > However, reading the file in-order is fine. 
> > Is this a block caching issue? Can we tune up the amount of blocks kept? > I think with mmap the blocks are not kept in page pool, correct? > > -Zach > > On Sat, Feb 11, 2017 at 5:23 PM, Leo Luan wrote: >> Hi Zachary, >> >> When a compressed file is mmapped, each 4K read in your tests causes the >> accessed part of the file to be decompressed (in the granularity of 10 >> GPFS >> blocks). For usual file sizes, the parts being accessed will be >> decompressed and IOs speed will be normal except for the first 4K IO in >> each >> 10-GPFS-block group. For very large files, a large percentage of small >> random IOs may keep getting amplified to 10-block decompression IO for a >> long time. This is probably what happened in your mmap application run. >> >> The suggestion is to not compress files until they have become cold (not >> likely to be accessed any time soon) and avoid compressing very large >> files >> that may be accessed through mmap later. The product already has a >> built-in >> protection preventing compression of files that are mmapped at compression >> time. You can add an exclude rule in the compression policy run for files >> that are identified to have mmap performance issues (in case they get >> mmapped after being compressed in a periodical policy run). >> >> Leo Luan >> >> From: Zachary Giles >> To: gpfsug main discussion list >> Date: 02/10/2017 01:57 PM >> Subject: [gpfsug-discuss] Questions about mmap GPFS and compression >> Sent by: gpfsug-discuss-bounces at spectrumscale.org >> ________________________________ >> >> >> >> Hello All, >> >> I've been seeing some less than desirable behavior with mmap and >> compression in GPFS. Curious if others see similar or have any ideas >> if this is accurate.. >> The guys here want me to open an IBM ticket, but I figured I'd see if >> anyone has had this experience before. >> >> We have an internally developed app that runs on our cluster >> referencing data sitting in GPFS. It is using mmap to access the files >> due to a library we're using that requires it. >> >> If we run the app against some data on GPFS, it performs well.. >> finishing in a few minutes time -- Great. However, if we compress the >> file (in GPFS), the app is still running after 2 days time. >> stracing the app shows that is polling on a file descriptor, forever.. >> as if a data block is still pending. >> >> I know mmap is supported with compression according to the manual >> (with some stipulations), and that performance is expected to be much >> less since it's more large-block oriented due to decompressed in >> groups.. no problem. But it seems like some data should get returned. >> >> I'm surprised to find that a very small amount of data is sitting in >> the buffers (mmfsadm dump buffers) in reference to the inodes. The >> decompression thread is running continuously, while the app is still >> polling for data from memory and sleeping, retrying, sleeping, repeat. >> >> What I believe is happening is that the 4k pages are being pulled out >> of large decompression groups from an mmap read request, put in the >> buffer, then the compression group data is thrown away since it has >> the result it wants, only to need another piece of data that would >> have been in that group slightly later, which is recalled, put in the >> buffer.. etc. Thus an infinite slowdown. Perhaps also the data is >> expiring out of the buffer before the app has a chance to read it. I >> can't tell. In any case, the app makes zero progress. 
>> >> I tried without our app, using fio.. mmap on an uncompressed file with >> 1 thread 1 iodepth, random read, 4k blocks, yields ~76MB/s (not >> impressive). However, on a compressed file it is only 20KB/s max. ( >> far less impressive ). Reading a file using aio etc is over 3GB/s on a >> single thread without even trying. >> >> What do you think? >> Anyone see anything like this? Perhaps there are some tunings to waste >> a bit more memory on cached blocks rather than make decompression >> recycle? >> >> I've searched back the archives a bit. There's a May 2013 thread about >> slowness as well. I think we're seeing much much less than that. Our >> page pools are of decent size. Its not just slowness, it's as if the >> app never gets a block back at all. ( We could handle slowness .. ) >> >> Thanks. Open to ideas.. >> >> -Zach Giles >> >> >> _______________________________________________ >> gpfsug-discuss mailing list >> gpfsug-discuss at spectrumscale.org >> http://gpfsug.org/mailman/listinfo/gpfsug-discuss >> > > > > -- > Zach Giles > zgiles at gmail.com > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Zach Giles zgiles at gmail.com From aaron.s.knister at nasa.gov Fri Feb 17 15:52:19 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 10:52:19 -0500 Subject: [gpfsug-discuss] bizarre performance behavior Message-ID: This is a good one. I've got an NSD server with 4x 16GB fibre connections coming in and 1x FDR10 and 1x QDR connection going out to the clients. I was having a really hard time getting anything resembling sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for reads). The back-end is a DDN SFA12K and I *know* it can do better than that. I don't remember quite how I figured this out but simply by running "openssl speed -multi 16" on the nsd server to drive up the load I saw an almost 4x performance jump which is pretty much goes against every sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to quadruple your i/o performance"). This feels like some type of C-states frequency scaling shenanigans that I haven't quite ironed down yet. I booted the box with the following kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which didn't seem to make much of a difference. I also tried setting the frequency governer to userspace and setting the minimum frequency to 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have to run something to drive up the CPU load and then performance improves. I'm wondering if this could be an issue with the C1E state? I'm curious if anyone has seen anything like this. The node is a dx360 M4 (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From S.J.Thompson at bham.ac.uk Fri Feb 17 16:43:34 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Fri, 17 Feb 2017 16:43:34 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: Maybe its related to interrupt handlers somehow? 
You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) Simon ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] Sent: 17 February 2017 15:52 To: gpfsug main discussion list Subject: [gpfsug-discuss] bizarre performance behavior This is a good one. I've got an NSD server with 4x 16GB fibre connections coming in and 1x FDR10 and 1x QDR connection going out to the clients. I was having a really hard time getting anything resembling sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for reads). The back-end is a DDN SFA12K and I *know* it can do better than that. I don't remember quite how I figured this out but simply by running "openssl speed -multi 16" on the nsd server to drive up the load I saw an almost 4x performance jump which is pretty much goes against every sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to quadruple your i/o performance"). This feels like some type of C-states frequency scaling shenanigans that I haven't quite ironed down yet. I booted the box with the following kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which didn't seem to make much of a difference. I also tried setting the frequency governer to userspace and setting the minimum frequency to 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have to run something to drive up the CPU load and then performance improves. I'm wondering if this could be an issue with the C1E state? I'm curious if anyone has seen anything like this. The node is a dx360 M4 (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From aaron.s.knister at nasa.gov Fri Feb 17 16:53:00 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 11:53:00 -0500 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: <104dc3f8-a91c-d9ae-3a86-88136c46de39@nasa.gov> Well, disabling the C1E state seems to have done the trick. I removed the kernel parameters I mentioned and set the cpu governer back to ondemand with a minimum of 1.2ghz. I'm now getting 6.2GB/s of reads which I believe is pretty darned close to theoretical peak performance. -Aaron On 2/17/17 10:52 AM, Aaron Knister wrote: > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). 
> > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. > > -Aaron > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From aaron.s.knister at nasa.gov Fri Feb 17 17:13:08 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 17 Feb 2017 12:13:08 -0500 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: References: Message-ID: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Well, I'm somewhat scrounging for hardware. This is in our test environment :) And yep, it's got the 2U gpu-tray in it although even without the riser it has 2 PCIe slots onboard (excluding the on-board dual-port mezz card) so I think it would make a fine NSD server even without the riser. -Aaron On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) wrote: > Maybe its related to interrupt handlers somehow? You drive the load up on one socket, you push all the interrupt handling to the other socket where the fabric card is attached? > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, I assume its some 2U gpu-tray riser one or something !) > > Simon > ________________________________________ > From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [aaron.s.knister at nasa.gov] > Sent: 17 February 2017 15:52 > To: gpfsug main discussion list > Subject: [gpfsug-discuss] bizarre performance behavior > > This is a good one. I've got an NSD server with 4x 16GB fibre > connections coming in and 1x FDR10 and 1x QDR connection going out to > the clients. I was having a really hard time getting anything resembling > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > reads). The back-end is a DDN SFA12K and I *know* it can do better than > that. > > I don't remember quite how I figured this out but simply by running > "openssl speed -multi 16" on the nsd server to drive up the load I saw > an almost 4x performance jump which is pretty much goes against every > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > quadruple your i/o performance"). > > This feels like some type of C-states frequency scaling shenanigans that > I haven't quite ironed down yet. I booted the box with the following > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > didn't seem to make much of a difference. I also tried setting the > frequency governer to userspace and setting the minimum frequency to > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > to run something to drive up the CPU load and then performance improves. > > I'm wondering if this could be an issue with the C1E state? I'm curious > if anyone has seen anything like this. The node is a dx360 M4 > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
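For anyone wanting to check the same thing on their own nodes, the idle states the kernel actually exposes are visible from sysfs; a rough sketch only, since the state names vary by CPU and intel_idle version and C1E itself is usually a BIOS/UEFI option:

  cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name   # e.g. POLL C1-SNB C1E-SNB C3-SNB C6-SNB
  cpupower monitor                                       # per-core C-state residency while a test is running
  cpupower frequency-set -g performance                  # pin the governor for the duration of the test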
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From Robert.Oesterlin at nuance.com Fri Feb 17 17:26:29 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 17 Feb 2017 17:26:29 +0000 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages Message-ID: Any way to suppress these? I get them every time mmpmon is run: Feb 17 11:54:02 nrg5-gpfs01 mmfs[10375]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Feb 17 11:55:01 nrg5-gpfs01 mmfs[13668]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Feb 17 11:56:02 nrg5-gpfs01 mmfs[17318]: CLI root root [EXIT, CHANGE] 'mmpmon -p -s -t 30' RC=0 Bob Oesterlin Sr Principal Storage Engineer, Nuance -------------- next part -------------- An HTML attachment was scrubbed... URL: From syi at ca.ibm.com Fri Feb 17 17:54:39 2017 From: syi at ca.ibm.com (Yi Sun) Date: Fri, 17 Feb 2017 12:54:39 -0500 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages In-Reply-To: References: Message-ID: It may relate to CommandAudit http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1xx_soc.htm Yi Sun > ------------------------------ > > Message: 5 > Date: Fri, 17 Feb 2017 17:26:29 +0000 > From: "Oesterlin, Robert" > To: gpfsug main discussion list > Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages > Message-ID: > Content-Type: text/plain; charset="utf-8" > > Any way to suppress these? I get them every time mmpmon is run: > > Feb 17 11:54:02 nrg5-gpfs01 mmfs[10375]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > Feb 17 11:55:01 nrg5-gpfs01 mmfs[13668]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > Feb 17 11:56:02 nrg5-gpfs01 mmfs[17318]: CLI root root [EXIT, > CHANGE] 'mmpmon -p -s -t 30' RC=0 > > Bob Oesterlin > Sr Principal Storage Engineer, Nuance > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Fri Feb 17 17:58:28 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Fri, 17 Feb 2017 17:58:28 +0000 Subject: [gpfsug-discuss] mmpmon messages in /var/log/messages Message-ID: <3E007FA1-7152-45FB-B78E-2C92A34B7727@nuance.com> Bingo, that was it. I wish I could control it in a more fine-grained manner. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: on behalf of Yi Sun Reply-To: gpfsug main discussion list Date: Friday, February 17, 2017 at 11:54 AM To: "gpfsug-discuss at spectrumscale.org" Subject: [EXTERNAL] Re: [gpfsug-discuss] mmpmon messages in /var/log/messages It may relate to CommandAudit http://www.ibm.com/support/knowledgecenter/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1xx_soc.htm -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From janfrode at tanso.net Fri Feb 17 18:29:46 2017 From: janfrode at tanso.net (Jan-Frode Myklebust) Date: Fri, 17 Feb 2017 18:29:46 +0000 Subject: [gpfsug-discuss] bizarre performance behavior In-Reply-To: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> References: <2a946193-259f-9dcb-0381-12fd571c5413@nasa.gov> Message-ID: I just had a similar experience from a sandisk infiniflash system SAS-attached to s single host. Gpfsperf reported 3,2 Gbyte/s for writes. and 250-300 Mbyte/s on sequential reads!! Random reads were on the order of 2 Gbyte/s. After a bit head scratching snd fumbling around I found out that reducing maxMBpS from 10000 to 100 fixed the problem! Digging further I found that reducing prefetchThreads from default=72 to 32 also fixed it, while leaving maxMBpS at 10000. Can now also read at 3,2 GByte/s. Could something like this be the problem on your box as well? -jf fre. 17. feb. 2017 kl. 18.13 skrev Aaron Knister : > Well, I'm somewhat scrounging for hardware. This is in our test > environment :) And yep, it's got the 2U gpu-tray in it although even > without the riser it has 2 PCIe slots onboard (excluding the on-board > dual-port mezz card) so I think it would make a fine NSD server even > without the riser. > > -Aaron > > On 2/17/17 11:43 AM, Simon Thompson (Research Computing - IT Services) > wrote: > > Maybe its related to interrupt handlers somehow? You drive the load up > on one socket, you push all the interrupt handling to the other socket > where the fabric card is attached? > > > > Dunno ... (Though I am intrigued you use idataplex nodes as NSD servers, > I assume its some 2U gpu-tray riser one or something !) > > > > Simon > > ________________________________________ > > From: gpfsug-discuss-bounces at spectrumscale.org [ > gpfsug-discuss-bounces at spectrumscale.org] on behalf of Aaron Knister [ > aaron.s.knister at nasa.gov] > > Sent: 17 February 2017 15:52 > > To: gpfsug main discussion list > > Subject: [gpfsug-discuss] bizarre performance behavior > > > > This is a good one. I've got an NSD server with 4x 16GB fibre > > connections coming in and 1x FDR10 and 1x QDR connection going out to > > the clients. I was having a really hard time getting anything resembling > > sensible performance out of it (4-5Gb/s writes but maybe 1.2Gb/s for > > reads). The back-end is a DDN SFA12K and I *know* it can do better than > > that. > > > > I don't remember quite how I figured this out but simply by running > > "openssl speed -multi 16" on the nsd server to drive up the load I saw > > an almost 4x performance jump which is pretty much goes against every > > sysadmin fiber in me (i.e. "drive up the cpu load with unrelated crap to > > quadruple your i/o performance"). > > > > This feels like some type of C-states frequency scaling shenanigans that > > I haven't quite ironed down yet. I booted the box with the following > > kernel parameters "intel_idle.max_cstate=0 processor.max_cstate=0" which > > didn't seem to make much of a difference. I also tried setting the > > frequency governer to userspace and setting the minimum frequency to > > 2.6ghz (it's a 2.6ghz cpu). None of that really matters-- I still have > > to run something to drive up the CPU load and then performance improves. > > > > I'm wondering if this could be an issue with the C1E state? I'm curious > > if anyone has seen anything like this. The node is a dx360 M4 > > (Sandybridge) with 16 2.6GHz cores and 32GB of RAM. 
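Both of those knobs are ordinary mmchconfig attributes; a sketch of trying them on a single NSD server (the node name is a placeholder), keeping in mind that maxMBpS can be changed on the fly while prefetchThreads is only picked up when the daemon is restarted:

  mmchconfig maxMBpS=100 -i -N nsdserver1        # -i makes the change take effect immediately and persist
  mmchconfig prefetchThreads=32 -N nsdserver1    # applied after mmshutdown/mmstartup on that node
  mmlsconfig | grep -E 'maxMBpS|prefetchThreads'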
> > > > -Aaron > > > > -- > > Aaron Knister > > NASA Center for Climate Simulation (Code 606.2) > > Goddard Space Flight Center > > (301) 286-2776 > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > _______________________________________________ > > gpfsug-discuss mailing list > > gpfsug-discuss at spectrumscale.org > > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 20 15:35:09 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 15:35:09 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM Message-ID: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From Kevin.Buterbaugh at Vanderbilt.Edu Mon Feb 20 15:40:39 2017 From: Kevin.Buterbaugh at Vanderbilt.Edu (Buterbaugh, Kevin L) Date: Mon, 20 Feb 2017 15:40:39 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: Message-ID: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. 
Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Mon Feb 20 15:47:57 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 17:47:57 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Mon Feb 20 15:55:47 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 15:55:47 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D28B5F.82432C40] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: From orichards at pixitmedia.com Mon Feb 20 16:00:50 2017 From: orichards at pixitmedia.com (Orlando Richards) Date: Mon, 20 Feb 2017 16:00:50 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> Message-ID: <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Woo! Still going strong! Lovely to hear it still being useful - thanks Kevin :) -- *Orlando Richards* VP Product Development, Pixit Media 07930742808|orichards at pixitmedia.com www.pixitmedia.com |Tw:@pixitmedia On 20/02/2017 15:40, Buterbaugh, Kevin L wrote: > Hi Mark, > > Are you referring to this? > > http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html > > It?s not magical, but it?s pretty good! ;-) Seriously, we use it any > time we want to move stuff around in our GPFS filesystems. > > Kevin > >> On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com >> wrote: >> >> I have a client that has around 200 filesets (must be a good reason >> for it) and they need to migrate data but it?s really looking like >> this might bring AFM to its knees. At one point, I had heard of some >> magical version of RSYNC that IBM developed that could do something >> like this. Anyone have any details on such a tool and is it >> available. Or is there some other way I might do this? >> >> *Mark R. Bush*| *Storage Architect* >> Mobile: 210-237-8415 >> Twitter:@bushmr | LinkedIn:/markreedbush >> >> 10100 Reunion Place, Suite 500, San Antonio, TX 78216 >> www.siriuscom.com >> |mark.bush at siriuscom.com >> > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu > - (615)875-9633 > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From zgiles at gmail.com Mon Feb 20 16:04:26 2017 From: zgiles at gmail.com (Zachary Giles) Date: Mon, 20 Feb 2017 11:04:26 -0500 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Message-ID: Hey Mark, I'm curious about the idea behind 200 filesets bring AFM to its knees. Any specific part you're concerned about? -Zach On Mon, Feb 20, 2017 at 11:00 AM, Orlando Richards wrote: > Woo! Still going strong! 
Lovely to hear it still being useful - thanks > Kevin :) > > > -- > *Orlando Richards* > VP Product Development, Pixit Media > 07930742808 | orichards at pixitmedia.com > www.pixitmedia.com | Tw:@pixitmedia > > > > On 20/02/2017 15:40, Buterbaugh, Kevin L wrote: > > Hi Mark, > > Are you referring to this? > > http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012- > October/000169.html > > It?s not magical, but it?s pretty good! ;-) Seriously, we use it any > time we want to move stuff around in our GPFS filesystems. > > Kevin > > On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.com wrote: > > I have a client that has around 200 filesets (must be a good reason for > it) and they need to migrate data but it?s really looking like this might > bring AFM to its knees. At one point, I had heard of some magical version > of RSYNC that IBM developed that could do something like this. Anyone have > any details on such a tool and is it available. Or is there some other way > I might do this? > > > > > *Mark R. Bush*| *Storage Architect* > Mobile: 210-237-8415 <(210)%20237-8415> > Twitter: @bushmr | LinkedIn: /markreedbush > > 10100 Reunion Place, Suite 500, San Antonio, TX 78216 > www.siriuscom.com |mark.bush at siriuscom.com > > > > ? > Kevin Buterbaugh - Senior System Administrator > Vanderbilt University - Advanced Computing Center for Research and > Education > Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633 <(615)%20875-9633> > > > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.orghttp://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other person. > Please notify the sender immediately and delete this email from your > computer system. Any opinions expressed are not necessarily those of the > company from which this email was sent and, whilst to the best of our > knowledge no viruses or defects exist, no responsibility can be accepted > for any loss or damage arising from its receipt or subsequent use of this > email. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- Zach Giles zgiles at gmail.com -------------- next part -------------- An HTML attachment was scrubbed... URL: From YARD at il.ibm.com Mon Feb 20 16:05:27 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 18:05:27 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Message-ID: Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Not sure. It?s a 3.5 based cluster currently. 
From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: From olaf.weiser at de.ibm.com Mon Feb 20 16:35:03 2017 From: olaf.weiser at de.ibm.com (Olaf Weiser) Date: Mon, 20 Feb 2017 17:35:03 +0100 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu><4d99df7c-4a60-cc6e-403c-6b41cfdc3bdd@pixitmedia.com> Message-ID: An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 20 16:54:23 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 20 Feb 2017 16:54:23 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu> <05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> Message-ID: <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> Regular rsync apparently takes one week to sync up. I?m just the messenger getting more info from my client soon. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 10:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D28B67.B160D010] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image002.gif at 01D28B67.B160D010] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. 
Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.gif Type: image/gif Size: 1853 bytes Desc: image002.gif URL: From YARD at il.ibm.com Mon Feb 20 17:03:29 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 20 Feb 2017 19:03:29 +0200 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> References: <16441B29-FE5B-4B45-BCE2-875D5E61A03C@vanderbilt.edu><05DA0658-2E68-45ED-8C58-22153D61C7D0@siriuscom.com> <708DDEF5-B11A-4399-BADF-AABDF339AB34@siriuscom.com> Message-ID: Hi Split rsync into the directory level so u can run parallel rsync session , this way you maximize the network usage. Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 06:54 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Regular rsync apparently takes one week to sync up. I?m just the messenger getting more info from my client soon. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 10:05 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which protocols used to access data ? GPFS + NFS ? If yes, you can use standard rsync. 
Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/20/2017 05:56 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Not sure. It?s a 3.5 based cluster currently. From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 20, 2017 at 9:47 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] 200 filesets and AFM Hi Which ACLs you have in your FS ? Do u have NFSv4 Acls - which use NFS + Windows Acls ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Buterbaugh, Kevin L" To: gpfsug main discussion list Date: 02/20/2017 05:41 PM Subject: Re: [gpfsug-discuss] 200 filesets and AFM Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi Mark, Are you referring to this? http://www.spectrumscale.org/pipermail/gpfsug-discuss/2012-October/000169.html It?s not magical, but it?s pretty good! ;-) Seriously, we use it any time we want to move stuff around in our GPFS filesystems. Kevin On Feb 20, 2017, at 9:35 AM, Mark.Bush at siriuscom.comwrote: I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com ? Kevin Buterbaugh - Senior System Administrator Vanderbilt University - Advanced Computing Center for Research and Education Kevin.Buterbaugh at vanderbilt.edu- (615)875-9633 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1853 bytes Desc: not available URL: From r.sobey at imperial.ac.uk Tue Feb 21 13:53:21 2017 From: r.sobey at imperial.ac.uk (Sobey, Richard A) Date: Tue, 21 Feb 2017 13:53:21 +0000 Subject: [gpfsug-discuss] 200 filesets and AFM In-Reply-To: References: Message-ID: Hey, we?ve got 400+ filesets and still adding more ? From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Mark.Bush at siriuscom.com Sent: 20 February 2017 15:35 To: gpfsug main discussion list Subject: [gpfsug-discuss] 200 filesets and AFM I have a client that has around 200 filesets (must be a good reason for it) and they need to migrate data but it?s really looking like this might bring AFM to its knees. At one point, I had heard of some magical version of RSYNC that IBM developed that could do something like this. Anyone have any details on such a tool and is it available. Or is there some other way I might do this? [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From jonathon.anderson at colorado.edu Tue Feb 21 21:39:48 2017 From: jonathon.anderson at colorado.edu (Jonathon A Anderson) Date: Tue, 21 Feb 2017 21:39:48 +0000 Subject: [gpfsug-discuss] Bad performance with GPFS system monitoring (mmsysmon) in GPFS 4.2.1.1 Message-ID: This thread happened before I joined gpfsug-discuss; but be advised that we also experienced severe (1.5x-3x) performance degradation in user applications when running mmsysmon. In particular, we?re running a Haswell+OPA system. The issue appears to only happen when the user application is simultaneously using all available cores *and* communicating over the network. Synthetic cpu tests with HPL did not expose the issue, nor did OSU micro-benchmarks that were designed to maximize the network without necessarily using all CPUs. I?ve stopped mmsysmon by hand[^1] for now; but I haven?t yet gone so far as to remove the config file to prevent it from starting in the future. We intend to run further tests; but I wanted to share our experiences so far (as this took us way longer than I wish it had to diagnose). ~jonathon From dod2014 at med.cornell.edu Wed Feb 22 15:57:46 2017 From: dod2014 at med.cornell.edu (Douglas Duckworth) Date: Wed, 22 Feb 2017 10:57:46 -0500 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node Message-ID: Hello! I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. In addition I tried: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost However the same result. When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. So far I consulted the following documentation: http://ibm.co/2mcjK3P http://ibm.co/2lFSInH Could anyone please help? We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. Thanks so much! 
Best Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 -------------- next part -------------- An HTML attachment was scrubbed... URL: From david_johnson at brown.edu Wed Feb 22 16:12:15 2017 From: david_johnson at brown.edu (David D. Johnson) Date: Wed, 22 Feb 2017 11:12:15 -0500 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node In-Reply-To: References: Message-ID: I have a feeling that this is how mmchconfig is supposed to work. You?ve asked it to change the configuration of one node, but the database of configuration settings needs to be propagated to the entire cluster whenever a change is made. You?ll find a section in the mmlsconfig output specific to the node(s) that have been changed [node155] ?. At this point your configuration may be out of sync on any number of nodes. ? ddj Dave Johnson Brown University CCV/CIS > On Feb 22, 2017, at 10:57 AM, Douglas Duckworth wrote: > > Hello! > > I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! > > We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... > > Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: > > sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 > > When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. > > In addition I tried: > > sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost > > However the same result. > > When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. > > So far I consulted the following documentation: > > http://ibm.co/2mcjK3P > http://ibm.co/2lFSInH > > Could anyone please help? > > We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. > > Thanks so much! > > Best > Doug > > > Thanks, > > Douglas Duckworth, MSc, LFCS > HPC System Administrator > Scientific Computing Unit > Physiology and Biophysics > Weill Cornell Medicine > E: doug at med.cornell.edu > O: 212-746-6305 > F: 212-746-8690 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From bbanister at jumptrading.com Wed Feb 22 16:17:09 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Wed, 22 Feb 2017 16:17:09 +0000 Subject: [gpfsug-discuss] Changing verbsPorts On Single Node In-Reply-To: References: Message-ID: I agree with this assessment. I would also recommend looking into user defined node classes so that your mmlsconfig output is more easily readable, otherwise each node will be listed in the mmlsconfig output. HTH, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of David D. Johnson Sent: Wednesday, February 22, 2017 10:12 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Changing verbsPorts On Single Node I have a feeling that this is how mmchconfig is supposed to work. You?ve asked it to change the configuration of one node, but the database of configuration settings needs to be propagated to the entire cluster whenever a change is made. You?ll find a section in the mmlsconfig output specific to the node(s) that have been changed [node155] ?. At this point your configuration may be out of sync on any number of nodes. ? ddj Dave Johnson Brown University CCV/CIS On Feb 22, 2017, at 10:57 AM, Douglas Duckworth > wrote: Hello! I am an HPC admin at Weill Cornell Medicine in the Upper East Side of Manhattan. It's a great place with researchers working in many computationally demanding fields. I am asked to do many new things all of the time so it's never boring. Yesterday we deployed a server that's intended to create atomic-level image of a ribosome. Pretty serious science! We have two DDN GridScaler GPFS clusters with around 3PB of storage. FDR Infiniband provides the interconnect. Our compute nodes are Dell PowerEdge 12/13G servers running Centos 6 and 7 while we're using SGE for scheduling. Hopefully soon Slurm. We also have some GPU servers from Pengiun Computing, with GTX 1080s, as well a new Ryft FPGA accelerator. I am hoping our next round of computing power will come from AMD... Anyway, I've been using Ansible to deploy our new GPFS nodes as well as build all other things we need at WCM. I thought that this was complete. However, apparently, the GPFS client's been trying RDMA over port mlx4_0/2 though we need to use mlx4_0/1! Rather than running mmchconfig against the entire cluster, I have been trying it locally on the node that needs to be addressed. For example: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 When ran locally the desired change becomes permanent and we see RDMA active after restarting GPFS service on node. Though mmchconfig still tries to run against all nodes in the cluster! I kill it of course at the known_hosts step. In addition I tried: sudo mmchconfig verbsPorts=mlx4_0/1 -i -N node155 NodeClass=localhost However the same result. When doing capital "i" mmchconfig does attempt ssh with all nodes. Yet the change does not persist after restarting GPFS. So far I consulted the following documentation: http://ibm.co/2mcjK3P http://ibm.co/2lFSInH Could anyone please help? We're using GPFS client version 4.1.1-3 on Centos 6 nodes as well as 4.2.1-2 on those which are running Centos 7. Thanks so much! 
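As a rough illustration of the user-defined node class approach Bryan suggests above (the class and node names here are made up; adjust to your environment):

  # create a node class for the nodes that need the non-default port
  mmcrnodeclass ib1port -N node155,node156

  # apply the override to the class instead of to individual nodes
  mmchconfig verbsPorts=mlx4_0/1 -N ib1port

  # mmlsconfig then shows a single [ib1port] stanza rather than one stanza per node
  mmlsconfig verbsPorts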
Best Doug Thanks, Douglas Duckworth, MSc, LFCS HPC System Administrator Scientific Computing Unit Physiology and Biophysics Weill Cornell Medicine E: doug at med.cornell.edu O: 212-746-6305 F: 212-746-8690 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: From Robert.Oesterlin at nuance.com Thu Feb 23 15:46:20 2017 From: Robert.Oesterlin at nuance.com (Oesterlin, Robert) Date: Thu, 23 Feb 2017 15:46:20 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Message-ID: For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" Reply-To: "dW-notify at us.ibm.com" Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28DB9.AEDC8740] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From aaron.s.knister at nasa.gov Thu Feb 23 17:03:18 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Thu, 23 Feb 2017 12:03:18 -0500 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> On a particularly heavy loaded NSD server I'm seeing a lot of these messages: 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting for conn rdmas < conn maxrdmas' I've tried tweaking verbsRdmasPerConnection but the issue seems to persist. Has anyone has encountered this and if so how'd you fix it? -Aaron -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Thu Feb 23 17:12:40 2017 From: oehmes at gmail.com (Sven Oehme) Date: Thu, 23 Feb 2017 17:12:40 +0000 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: all this waiter shows is that you have more in flight than the node or connection can currently serve. the reasons for that can be misconfiguration or you simply run out of resources on the node, not the connection. with latest code you shouldn't see this anymore for node limits as the system automatically adjusts the number of maximum RDMA's according to the systems Node capabilities : you should see messages in your mmfslog like : 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes verbsRdmaUseCompVectors=yes 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so (version >= 1.1) loaded and initialized. 
2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased from* 3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes.* 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE we want to eliminate all this configurable limits eventually, but this takes time, but as you can see above, we make progress on each release :-) Sven On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister wrote: > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From usa-principal at gpfsug.org Thu Feb 23 21:54:01 2017 From: usa-principal at gpfsug.org (usa-principal-gpfsug.org) Date: Thu, 23 Feb 2017 13:54:01 -0800 Subject: [gpfsug-discuss] Save the Date April 4-5 2017 Users Group Meeting at NERSC In-Reply-To: <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> References: <62535d44554b14d77fcea20735183ab3@mail.gpfsug.org> <9420c3f6c74149d2eb95b072f20ca4ba@mail.gpfsug.org> Message-ID: <06d616c6d0da5b6aabae1f8d4bbc0b84@webmail.gpfsug.org> Hello, Information, including the registration form, for the April 4-5 User Group Meeting at NERSC (Berkeley, CA) is now available. Please register as early as possible so we can make final decisions about room selection and a science facility tour. The agenda is still be being finalized and we will continue to update the online agenda as details get settled. *We still have room for 2-3 20-minute user talks, if you are interested, please let us know.* Details, and a link to the registration form can be found here: https://www.nersc.gov/research-and-development/data-analytics/spectrum-user-group-meeting/ Looking forward to seeing you in April. Cheers, Kristy & Bob On , usa-principal-gpfsug.org wrote: > I should have also asked for anyone interested in giving a talk, as > usual, the users group meeting is not meant to be used as a sales and > marketing platform, but user experiences are always welcome. > > If you're interested, or have an idea for a talk, please let us know > so we can include it in the agenda. > > Thanks, > Kristy & Bob > > > On , usa-principal-gpfsug.org wrote: >> Just a follow up reminder to save the date, April 4-5, for a two-day >> Spectrum Scale Users Group event hosted by NERSC in Berkeley, >> California. >> >> We are working on the registration form and agenda and hope to be able >> to share more details soon. >> >> Best, >> Kristy & Bob >> >> >> On , usa-principal-gpfsug.org wrote: >>> Hello all and happy new year (depending upon where you are right now >>> :-) ). >>> >>> We'll have more details in 2017, but for now please save the date for >>> a two-day users group meeting at NERSC in Berkeley, California. >>> >>> April 4-5, 2017 >>> National Energy Research Scientific Computing Center (nersc.gov) >>> Berkeley, California >>> >>> We look forward to offering our first two-day event in the US. >>> >>> Best, >>> Kristy & Bob From willi.engeli at id.ethz.ch Fri Feb 24 12:39:03 2017 From: willi.engeli at id.ethz.ch (Engeli Willi (ID SD)) Date: Fri, 24 Feb 2017 12:39:03 +0000 Subject: [gpfsug-discuss] Performance Tests using Bonnie++ forces expell of the client running the test Message-ID: Dear all, Does one of you know if Bonnie++ io Test is compatible with GPFS and if, what could force expell of the client from the cluster? Thanks Willi -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: smime.p7s Type: application/pkcs7-signature Size: 5461 bytes Desc: not available URL: From Achim.Rehor at de.ibm.com Fri Feb 24 13:24:50 2017 From: Achim.Rehor at de.ibm.com (Achim Rehor) Date: Fri, 24 Feb 2017 14:24:50 +0100 Subject: [gpfsug-discuss] Performance Tests using Bonnie++ forces expell of the client running the test In-Reply-To: References: Message-ID: An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/gif Size: 7182 bytes Desc: not available URL: From bbanister at jumptrading.com Fri Feb 24 14:08:19 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 14:08:19 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E75.28281900] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From Paul.Sanchez at deshaw.com Fri Feb 24 15:15:59 2017 From: Paul.Sanchez at deshaw.com (Sanchez, Paul) Date: Fri, 24 Feb 2017 15:15:59 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? 
Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E86.6F1F9BB0] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From bbanister at jumptrading.com Fri Feb 24 15:25:14 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: References: Message-ID: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> I just got word that you only need to update the active file system manager node? 
I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:image001.png at 01D28E7F.E769D830] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. 
If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From jfosburg at mdanderson.org Fri Feb 24 15:29:41 2017 From: jfosburg at mdanderson.org (Fosburgh,Jonathan) Date: Fri, 24 Feb 2017 15:29:41 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> References: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> Message-ID: <1487950179.11933.2.camel@mdanderson.org> FWIW, my contact said to do everything, even client only clusters. -- Jonathan Fosburgh Principal Application Systems Analyst Storage Team IT Operations jfosburg at mdanderson.org (713) 745-9346 -----Original Message----- Date: Fri, 24 Feb 2017 15:25:14 +0000 Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption To: gpfsug main discussion list > Reply-to: gpfsug main discussion list From: Bryan Banister > I just got word that you only need to update the active file system manager node? I?ll let you know if I hear differently, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Sanchez, Paul Sent: Friday, February 24, 2017 9:16 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Can anyone from IBM confirm whether this only affects manager nodes or if parallel log recovery is expected to happen on any other nodes? Thx Paul From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Bryan Banister Sent: Friday, February 24, 2017 9:08 AM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Has anyone been hit by this data corruption issue and if so how did you determine the file system had corruption? Thanks! -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Oesterlin, Robert Sent: Thursday, February 23, 2017 9:46 AM To: gpfsug main discussion list > Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption For those not subscribed, see below. 
Bob Oesterlin Sr Principal Storage Engineer, Nuance From: "dW-notify at us.ibm.com" > Reply-To: "dW-notify at us.ibm.com" > Date: Thursday, February 23, 2017 at 9:42 AM Subject: [EXTERNAL] [Forums] 'gpfs at us.ibm.com' replied to the 'IBM Spectrum Scale V4.2.2 announcements' topic thread in the 'General Parallel File System - Announce (GPFS - Announce)' forum. [cid:1487950179.36938.0.camel at mdanderson.org] gpfs at us.ibm.com replied to the IBM Spectrum Scale V4.2.2 announcements topic thread in the General Parallel File System - Announce (GPFS - Announce) forum. Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption Abstract IBM has identified a problem with the IBM Spectrum Scale parallel log recovery function in V4.2.1/V4.2.2, which may result in undetected data corruption during the course of a file system recovery. See the complete Flash at http://www-01.ibm.com/support/docview.wss?uid=ssg1S1009965 ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. ________________________________ Note: This email is for the confidential use of the named addressee(s) only and may contain proprietary, confidential or privileged information. If you are not the intended recipient, you are hereby notified that any review, dissemination or copying of this email is strictly prohibited, and to please notify the sender immediately and destroy this email and any attachments. Email transmission cannot be guaranteed to be secure or error-free. The Company, therefore, does not make any guarantees as to the completeness or accuracy of this email or any attachments. This email is for informational purposes only and does not constitute a recommendation, offer, request or solicitation of any kind to buy, sell, subscribe, redeem or perform any type of transaction of a financial product. _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss The information contained in this e-mail message may be privileged, confidential, and/or protected from disclosure. This e-mail message may contain protected health information (PHI); dissemination of PHI should comply with applicable federal and state laws. If you are not the intended recipient, or an authorized representative of the intended recipient, any further review, disclosure, use, dissemination, distribution, or copying of this message or any attachment (or the information contained therein) is strictly prohibited. If you think that you have received this e-mail message in error, please notify the sender by return e-mail and delete all references to it and its contents from your systems. 
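If you want a quick inventory of which nodes are running an affected 4.2.1/4.2.2 level before applying the fix, something along these lines can help (mmdsh ships with GPFS but is not a formally documented interface; any parallel shell does the same job):

  # installed GPFS packages per node
  /usr/lpp/mmfs/bin/mmdsh -N all "rpm -q gpfs.base"

  # and the level the cluster itself is committed to
  mmlsconfig minReleaseLevel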
-------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From bbanister at jumptrading.com Fri Feb 24 16:21:07 2017 From: bbanister at jumptrading.com (Bryan Banister) Date: Fri, 24 Feb 2017 16:21:07 +0000 Subject: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption In-Reply-To: <1487950179.11933.2.camel@mdanderson.org> References: <341f173d39f94bcfaa39fbe17616426e@jumptrading.com> <1487950179.11933.2.camel@mdanderson.org> Message-ID: Here is the latest I got from IBM: The fix only needs to be installed on the file system manager nodes. About how to know if your cluster is affected already, you can check if there was any MMFS_FSSTRUCT error in the system logs. If you encounter any lookup failure, funny ls cmd outputs. Or if any cmd would give some replica mismatch error or warning. If you encountered the following kind of Assertion failure you hit the bug. Thu Jul 21 03:26:32.373 2016: [X] *** Assert exp(prevIndEntryP->nextP->dataBlockNum > dataBlockNum) in line 4552 of file /project/sprelbmd/build/rbmd1629a/src/avs/fs/mmfs/ts/log/repUpdate.C Thu Jul 21 03:26:32.374 2016: [E] *** Traceback: Thu Jul 21 03:26:32.375 2016: [E] 2:0x7FE6E141AB36 logAssertFailed + 0x2D6 at Logger.C:546 Thu Jul 21 03:26:32.376 2016: [E] 3:0x7FE6E13FCD25 InodeRecoveryList::addInodeAndIndBlock(long long, unsigned int, RepDiskAddr const&, InodeRecoveryList::FlagsToSet, long long, RepDiskAddr const&) + 0x355 at repUpdate.C:4552 Thu Jul 21 03:26:32.377 2016: [E] 4:0x7FE6E1066879 RecoverDirEntry(StripeGroup*, LogRecovery*, LogFile*, LogRecordType, long long, int, unsigned int*, char*, int*, RepDiskAddr) + 0x1089 at direct.C:2312 Thu Jul 21 03:26:32.378 2016: [E] 5:0x7FE6E13F8741 LogRecovery::recoverOneObject(long long) + 0x1E1 at recoverlog.C:362 Thu Jul 21 03:26:32.379 2016: [E] 6:0x7FE6E0F29B25 MultiThreadWork::doNextStep() + 0xC5 at workthread.C:533 Thu Jul 21 03:26:32.380 2016: [E] 7:0x7FE6E0F29FBB MultiThreadWork::helperThreadBody(void*) + 0xCB at workthread.C:455 Thu Jul 21 03:26:32.381 2016: [E] 8:0x7FE6E0F5FB26 Thread::callBody(Thread*) + 0x46 at thread.C:393 Thu Jul 21 03:26:32.382 2016: [E] 9:0x7FE6E0F4DD12 Thread::callBodyWrapper(Thread*) + 0xA2 at mastdep.C:1077 Thu Jul 21 03:26:32.383 2016: [E] 10:0x7FE6E0667851 start_thread + 0xD1 at mastdep.C:1077 Thu Jul 21 03:26:32.384 2016: [E] 11:0x7FE6DF7BE90D clone + 0x6D at mastdep.C:1077 Hope that helps, -Bryan From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Fosburgh,Jonathan Sent: Friday, February 24, 2017 9:30 AM To: gpfsug-discuss at spectrumscale.org Subject: Re: [gpfsug-discuss] Fw: Flash (Alert) IBM Spectrum Scale V4.2.1/4.2.2 parallel log recovery function may result in undetected data corruption FWIW, my contact said to do everything, even client only clusters. 
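To look for the symptoms Bryan describes above (FSSTRUCT errors and the parallel log recovery assert), a rough check might be the following; log locations vary by distribution and syslog configuration:

  # MMFS_FSSTRUCT events are normally raised through syslog
  grep -i MMFS_FSSTRUCT /var/log/messages*

  # the assert signature from the flash, in the GPFS daemon logs
  grep "Assert exp(prevIndEntryP" /var/adm/ras/mmfs.log*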
_______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 289 bytes Desc: image001.png URL: From SAnderson at convergeone.com Fri Feb 24 16:58:34 2017 From: SAnderson at convergeone.com (Shaun Anderson) Date: Fri, 24 Feb 2017 16:58:34 +0000 Subject: [gpfsug-discuss] NFS Permission matchup to mmnfs command Message-ID: <1487955513211.95497 at convergeone.com> I have a customer currently using native NFS and we are going to move them over the CES. I'm looking at the mmnfs command and trying to map the nfs export arguments with the CES arguments.
My customer has these currently: no_wdelay, nohide, rw, sync, no_root_squash, no_all_squash I have this so far: mmnfs export add /gpfs/ltfsee/ --client XX.XX.XX.XX ( Access_Type=RW, Squash=no_root_squash,noidsquash, NFS_COMMIT=true ) So the only arguments that don't appear accounted for is the 'nohide' parameter. Does this look right? SHAUN ANDERSON STORAGE ARCHITECT O 208.577.2112 M 214.263.7014 NOTICE: This email message and any attachments here to may contain confidential information. Any unauthorized review, use, disclosure, or distribution of such information is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy the original message and all copies of it. -------------- next part -------------- An HTML attachment was scrubbed... URL: From aaron.s.knister at nasa.gov Fri Feb 24 19:31:08 2017 From: aaron.s.knister at nasa.gov (Aaron Knister) Date: Fri, 24 Feb 2017 14:31:08 -0500 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? 
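A quick way to see where these limits sit on a given node, and whether waiters are actually piling up against them (mmdiag is available in 4.1/4.2; output details vary by release):

  # effective verbs settings on this node
  mmdiag --config | egrep -i "verbsRdmasPerNode|verbsRdmasPerConnection"

  # watch for the corresponding waiters while under load
  mmdiag --waiters | grep "conn rdmas"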
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 From oehmes at gmail.com Fri Feb 24 19:39:30 2017 From: oehmes at gmail.com (Sven Oehme) Date: Fri, 24 Feb 2017 19:39:30 +0000 Subject: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas In-Reply-To: References: <2e9d8c50-4de2-a27c-7473-7f0d28b02639@nasa.gov> Message-ID: its more likely you run out of verbsRdmasPerNode which is the top limit across all connections for a given node. Sven On Fri, Feb 24, 2017 at 11:31 AM Aaron Knister wrote: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Wei1.Guo at UTSouthwestern.edu Fri Feb 24 23:10:07 2017 From: Wei1.Guo at UTSouthwestern.edu (Wei Guo) Date: Fri, 24 Feb 2017 23:10:07 +0000 Subject: [gpfsug-discuss] Hardening sudo wrapper? In-Reply-To: References: Message-ID: <1487977807260.32706@UTSouthwestern.edu>
As per the knowledge center page suggested (https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.1/com.ibm.spectrum.scale.v4r21.doc/bl1adm_configsudo.htm), a sudo wrapper lets the cluster run with PermitRootLogin no. However, giving the gpfsadmin account sudo rights to /usr/bin/scp could be dangerous if that account is ever compromised, e.g.
[gpfsadmin at adminNode ~] $ sudo /usr/bin/scp `/bin/echo /dev/random` /path/to/any_important_files.txt
Is it possible to remove scp from the sudoers commands? Instead of the recommended entry here,
# Allow members of the gpfs group to run all commands but only selected commands without a password: %gpfsadmin ALL=(ALL) PASSWD: ALL, NOPASSWD: /usr/lpp/mmfs/bin/mmremote, /usr/bin/scp, /bin/echo, /usr/lpp/mmfs/bin/mmsdrrestore
We would like to have this line like this:
# Disabled command alias Cmnd_Alias MMDELCMDS = /usr/lpp/mmfs/bin/mmdeldisk, /usr/lpp/mmfs/bin/mmdelfileset, /usr/lpp/mmfs/bin/mmdelfs, /usr/lpp/mmfs/bin/mmdelnsd, /usr/lpp/mmfs/bin/mmdelsnapshot %gpfsadmin ALL=(root : gpfsadmin) NOPASSWD: /bin/echo, /usr/lpp/mmfs/bin/?, !MMDELCMDS
In this case we limit the gpfsadmin group to a selected set of mm commands, not including /usr/bin/scp. In the event of a system breach in which the gpfsadmin account is lost, scp could otherwise be used to overwrite system configuration or user data. From my initial test this seems to be OK for basic admin commands (such as mmstartup, mmshutdown, mmrepquota, mmchfs), but it did not pass the mmcommon test scpwrap command.
[gpfsadmin at adminNode ~]$ sudo /usr/lpp/mmfs/bin/mmcommon test scpwrap node1 sudo: no tty present and no askpass program specified lost connection mmcommon: Remote copy file command "/usr/lpp/mmfs/bin/scpwrap" failed (push operation). Return code is 1. mmcommon test scpwrap: Command failed. Examine previous error messages to determine cause. [gpfsadmin at adminNode ~]$ sudo /usr/lpp/mmfs/bin/mmcommon test sshwrap node1 mmcommon test sshwrap: Command successfully completed
It is unclear to me what exactly scp does in the sudo wrapper in GPFS 4.2.0, as described in Yuri Volobuev's note GPFS and Remote Shell (https://www.ibm.com/developerworks/community/wikis/home?lang=en#!/wiki/General%20Parallel%20File%20System%20(GPFS)/page/GPFS%20and%20Remote%20Shell). Will mmsdrrestore still use scp or rcp to copy the cluster configuration file mmsdrfs around from the central node, or does it use RPC to synchronize? Are we OK to drop scp/rcp and limit the commands that can be run? Is there any risk, security-wise and performance-wise? Can we limit the gpfsadmin account to a very small set of privileges? I have sent this message to gpfs at us.ibm.com and posted it at developerWorks, but I think the answer could benefit other users.
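One possible way to tighten this further, if the workflow really only needs a handful of administrative commands, is an explicit allow-list rather than a wildcard plus exclusions. This is only a sketch: the command list, the group name and the omission of scp are assumptions that would need to be tested at your site (for example with mmcommon test sshwrap/scpwrap and a full mmsdrrestore drill):

  # illustrative allow-list; extend with whatever your admins actually run
  Cmnd_Alias GPFS_ADMIN_CMDS = /usr/lpp/mmfs/bin/mmstartup, \
                               /usr/lpp/mmfs/bin/mmshutdown, \
                               /usr/lpp/mmfs/bin/mmgetstate, \
                               /usr/lpp/mmfs/bin/mmrepquota
  %gpfsadmin ALL=(root) NOPASSWD: GPFS_ADMIN_CMDS

Whether dropping scp breaks mmsdrrestore-style recovery is exactly the open question above, so any such change should be proven on a test cluster before it is relied on.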
Thanks Wei Guo ________________________________________ From: gpfsug-discuss-bounces at spectrumscale.org on behalf of gpfsug-discuss-request at spectrumscale.org Sent: Friday, February 24, 2017 1:39 PM To: gpfsug-discuss at spectrumscale.org Subject: gpfsug-discuss Digest, Vol 61, Issue 46 Send gpfsug-discuss mailing list submissions to gpfsug-discuss at spectrumscale.org To subscribe or unsubscribe via the World Wide Web, visit http://gpfsug.org/mailman/listinfo/gpfsug-discuss or, via email, send a message with subject or body 'help' to gpfsug-discuss-request at spectrumscale.org You can reach the person managing the list at gpfsug-discuss-owner at spectrumscale.org When replying, please edit your Subject line so it is more specific than "Re: Contents of gpfsug-discuss digest..." Today's Topics: 1. NFS Permission matchup to mmnfs command (Shaun Anderson) 2. Re: waiting for conn rdmas < conn maxrdmas (Aaron Knister) 3. Re: waiting for conn rdmas < conn maxrdmas (Sven Oehme) ---------------------------------------------------------------------- Message: 1 Date: Fri, 24 Feb 2017 16:58:34 +0000 From: Shaun Anderson To: gpfsug main discussion list Subject: [gpfsug-discuss] NFS Permission matchup to mmnfs command Message-ID: <1487955513211.95497 at convergeone.com> Content-Type: text/plain; charset="iso-8859-1" I have a customer currently using native NFS and we are going to move them over the CES. I'm looking at the mmnfs command and trying to map the nfs export arguments with the CES arguments. My customer has these currently: no_wdelay, nohide, rw, sync, no_root_squash, no_all_squash I have this so far: mmnfs export add /gpfs/ltfsee/ --client XX.XX.XX.XX ( Access_Type=RW, Squash=no_root_squash,noidsquash, NFS_COMMIT=true ) So the only arguments that don't appear accounted for is the 'nohide' parameter. Does this look right? SHAUN ANDERSON STORAGE ARCHITECT O 208.577.2112 M 214.263.7014 NOTICE: This email message and any attachments here to may contain confidential information. Any unauthorized review, use, disclosure, or distribution of such information is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy the original message and all copies of it. -------------- next part -------------- An HTML attachment was scrubbed... URL: ------------------------------ Message: 2 Date: Fri, 24 Feb 2017 14:31:08 -0500 From: Aaron Knister To: Subject: Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: Content-Type: text/plain; charset="windows-1252"; format=flowed Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? 
> > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 ------------------------------ Message: 3 Date: Fri, 24 Feb 2017 19:39:30 +0000 From: Sven Oehme To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] waiting for conn rdmas < conn maxrdmas Message-ID: Content-Type: text/plain; charset="utf-8" its more likely you run out of verbsRdmasPerNode which is the top limit across all connections for a given node. Sven On Fri, Feb 24, 2017 at 11:31 AM Aaron Knister wrote: Interesting, thanks Sven! Could "resources" I'm running out of include NSD server queues? On 2/23/17 12:12 PM, Sven Oehme wrote: > all this waiter shows is that you have more in flight than the node or > connection can currently serve. the reasons for that can be > misconfiguration or you simply run out of resources on the node, not the > connection. with latest code you shouldn't see this anymore for node > limits as the system automatically adjusts the number of maximum RDMA's > according to the systems Node capabilities : > > you should see messages in your mmfslog like : > > 2017-02-23_06:19:50.056-0800: [I] VERBS RDMA starting with > verbsRdmaCm=no verbsRdmaSend=yes verbsRdmaUseMultiCqThreads=yes > verbsRdmaUseCompVectors=yes > 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA library libibverbs.so > (version >= 1.1) loaded and initialized. 
> 2017-02-23_06:19:50.078-0800: [I] VERBS RDMA verbsRdmasPerNode increased > from*_3072 to 3740 because verbsRdmasPerNodeOptimize is set to yes._* > 2017-02-23_06:19:50.121-0800: [I] VERBS RDMA discover mlx5_5 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB9CD state ACTIVE > 2017-02-23_06:19:50.137-0800: [I] VERBS RDMA discover mlx5_4 port 1 > transport IB link IB NUMA node 16 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB9CC state ACTIVE > 2017-02-23_06:19:50.153-0800: [I] VERBS RDMA discover mlx5_3 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB751 state ACTIVE > 2017-02-23_06:19:50.169-0800: [I] VERBS RDMA discover mlx5_2 port 1 > transport IB link IB NUMA node 1 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB750 state ACTIVE > 2017-02-23_06:19:50.185-0800: [I] VERBS RDMA discover mlx5_1 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000013 id 0xE41D2D0300FDB78D state ACTIVE > 2017-02-23_06:19:50.201-0800: [I] VERBS RDMA discover mlx5_0 port 1 > transport IB link IB NUMA node 0 pkey[0] 0xFFFF gid[0] subnet > 0xFEC0000000000015 id 0xE41D2D0300FDB78C state ACTIVE > > we want to eliminate all this configurable limits eventually, but this > takes time, but as you can see above, we make progress on each release :-) > > Sven > > > > > On Thu, Feb 23, 2017 at 9:05 AM Aaron Knister > wrote: > > On a particularly heavy loaded NSD server I'm seeing a lot of these > messages: > > 0x7FFFF08B63E0 ( 15539) waiting 0.004139456 seconds, NSDThread: on > ThCond 0x7FFFA80772C8 (0x7FFFA80772C8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08EED80 ( 15584) waiting 0.004075718 seconds, NSDThread: on > ThCond 0x7FFF680008F8 (0x7FFF680008F8) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF08FDF00 ( 15596) waiting 0.003965504 seconds, NSDThread: on > ThCond 0x7FFF8C00E288 (0x7FFF8C00E288) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF09185A0 ( 15617) waiting 0.003916346 seconds, NSDThread: on > ThCond 0x7FFF9000CB18 (0x7FFF9000CB18) (VERBSEventWaitCondvar), reason > 'waiting for conn rdmas < conn maxrdmas' > 0x7FFFF092B380 ( 15632) waiting 0.003659610 seconds, NSDThread: on > ThCond 0x1DB04B8 (0x1DB04B8) (VERBSEventWaitCondvar), reason 'waiting > for conn rdmas < conn maxrdmas' > > I've tried tweaking verbsRdmasPerConnection but the issue seems to > persist. Has anyone has encountered this and if so how'd you fix it? > > -Aaron > > -- > Aaron Knister > NASA Center for Climate Simulation (Code 606.2) > Goddard Space Flight Center > (301) 286-2776 > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- Aaron Knister NASA Center for Climate Simulation (Code 606.2) Goddard Space Flight Center (301) 286-2776 _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... 
URL: ------------------------------ _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss End of gpfsug-discuss Digest, Vol 61, Issue 46 ********************************************** ________________________________ UT Southwestern Medical Center The future of medicine, today. From service at metamodul.com Mon Feb 27 10:22:48 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 27 Feb 2017 11:22:48 +0100 (CET) Subject: [gpfsug-discuss] Q: backup with dsmc & .snapshots directory Message-ID: <459383319.282012.1488190969081@email.1und1.de> An HTML attachment was scrubbed... URL: From luis.bolinches at fi.ibm.com Mon Feb 27 11:13:59 2017 From: luis.bolinches at fi.ibm.com (Luis Bolinches) Date: Mon, 27 Feb 2017 11:13:59 +0000 Subject: [gpfsug-discuss] Q: backup with dsmc & .snapshots directory In-Reply-To: <459383319.282012.1488190969081@email.1und1.de> Message-ID: I usually exclude them. Otherwise you will end up with lots of data on the TSM backend. -- Cheers > On 27 Feb 2017, at 12.23, Hans-Joachim Ehlers wrote: > > Hi, > > short question: if we are using the native TSM dsmc Client, should we exclude the "./.snapshots/." directory from the backup or is it best practise to backup the .snapshots as well. > > Note: We DO NOT use a dedicated .snapshots directory for backups right now. The snapshots directory is created by a policy which is not adapted for TSM so the snapshot creation and deletion is not synchronized with TSM. In the near future we might use dedicated .snapshots for the backup. > > tia > > Hajo > > - > Unix Systems Engineer > -------------------------------------------------- > MetaModul GmbH > S?derstr. 12 > 25336 Elmshorn > HRB: 11873 PI > UstID: DE213701983 > Mobil: + 49 177 4393994 > Mail: service at metamodul.com > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > Ellei edell? ole toisin mainittu: / Unless stated otherwise above: Oy IBM Finland Ab PL 265, 00101 Helsinki, Finland Business ID, Y-tunnus: 0195876-3 Registered in Finland -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 11:30:15 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 11:30:15 +0000 Subject: [gpfsug-discuss] Tracking deleted files Message-ID: Hi, Is there a way to track files which have been deleted easily? I'm assuming that we can't easily use a policy scan as they files are no longer in the file-system unless we do some sort of diff? I'm assuming there must be a way of doing this as mmbackup must track deleted files to notify TSM of expired objects. Basically I want a list of new files, changed files and deleted files since a certain time. I'm assuming the first two will be relatively simple with a policyscan, but the latter I'm not sure about. Thanks Simon From jtucker at pixitmedia.com Mon Feb 27 11:59:44 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 27 Feb 2017 11:59:44 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Hi Simon I presented exactly this (albeit briefly) at the 2016 UG. 
See the snapdiff section of the presentation at: http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf We can track creations, modifications, deletions and moves (from, to) for files and directories between one point in time and another. The selections can be returned via a manner of your choice. If anyone wants to know more, hit me up directly. Incidentally - I will be at BVE this week (http://www.bvexpo.com/) showing new things driven by the Python API and GPFS - so if anyone is in the area and wants to chat about technicals in person rather than on mail, drop me a line and we can sort that out. Best, Jez On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT Services) wrote: > Hi, > > Is there a way to track files which have been deleted easily? I'm assuming > that we can't easily use a policy scan as they files are no longer in the > file-system unless we do some sort of diff? > > I'm assuming there must be a way of doing this as mmbackup must track > deleted files to notify TSM of expired objects. > > Basically I want a list of new files, changed files and deleted files > since a certain time. I'm assuming the first two will be relatively simple > with a policyscan, but the latter I'm not sure about. > > Thanks > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From service at metamodul.com Mon Feb 27 12:00:54 2017 From: service at metamodul.com (Hans-Joachim Ehlers) Date: Mon, 27 Feb 2017 13:00:54 +0100 (CET) Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: <783766399.287097.1488196854922@email.1und1.de> An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 12:39:02 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 12:39:02 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Yeah but that uses snapshots, which is pretty heavy-weight for what I want to do, particularly given mmbackup seems to have a way of tracking deletes... Simon From: > on behalf of Jez Tucker > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Monday, 27 February 2017 at 11:59 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files Hi Simon I presented exactly this (albeit briefly) at the 2016 UG. See the snapdiff section of the presentation at: http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf We can track creations, modifications, deletions and moves (from, to) for files and directories between one point in time and another. The selections can be returned via a manner of your choice. 
If anyone wants to know more, hit me up directly. Incidentally - I will be at BVE this week (http://www.bvexpo.com/) showing new things driven by the Python API and GPFS - so if anyone is in the area and wants to chat about technicals in person rather than on mail, drop me a line and we can sort that out. Best, Jez On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT Services) > wrote: Hi, Is there a way to track files which have been deleted easily? I'm assuming that we can't easily use a policy scan as they files are no longer in the file-system unless we do some sort of diff? I'm assuming there must be a way of doing this as mmbackup must track deleted files to notify TSM of expired objects. Basically I want a list of new files, changed files and deleted files since a certain time. I'm assuming the first two will be relatively simple with a policyscan, but the latter I'm not sure about. Thanks Simon _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -- [http://www.pixitmedia.com/sig/pxone_pt1.png][http://www.pixitmedia.com/sig/pxone_pt2.png][http://www.pixitmedia.com/sig/pxone_pt3.png][http://www.pixitmedia.com/sig/pxone_pt4.png] [http://pixitmedia.com/sig/BVE-Banner4.png] This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. -------------- next part -------------- An HTML attachment was scrubbed... URL: From jtucker at pixitmedia.com Mon Feb 27 13:11:59 2017 From: jtucker at pixitmedia.com (Jez Tucker) Date: Mon, 27 Feb 2017 13:11:59 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: Message-ID: Hi Whilst it does use snapshots, I'd argue that snapshot creation is pretty lightweight - and always consistent. Your alternative via the mmbackup 'tracking' route is to parse out the mmbackup shadow file. AFAIK to do this /properly in a timely fashion/ you'd need to do this as an inline post process after the scan phase of mmbackup has run, else you're instead looking at the outdated view of the shadow file post previous mmbackup run. mmbackup does not 'track' file changes, it performs a comparison pass between the filesystem contents and what TSM _believes_ is the known state of the file system during each run. If a change is made oob of TSM then you need to re-generate the show file to regain total consistency. Sensibly you should be running any mmbackup process from a snapshot to perform consistent backups without dsmc errors. So all things being equal, using snapshots for exact consistency and not having to regenerate (very heavyweight) or parse out a shadow file periodically is a lighter weight, smoother and reliably consistent workflow. YMMV with either approach depending on your management of TSM and your interpretation of 'consistent view' vs 'good enough'. 
Jez On Mon, 27 Feb 2017 at 12:39, Simon Thompson (Research Computing - IT Services) wrote: > Yeah but that uses snapshots, which is pretty heavy-weight for what I want > to do, particularly given mmbackup seems to have a way of tracking > deletes... > > Simon > > From: on behalf of Jez Tucker < > jtucker at pixitmedia.com> > Reply-To: "gpfsug-discuss at spectrumscale.org" < > gpfsug-discuss at spectrumscale.org> > Date: Monday, 27 February 2017 at 11:59 > To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files > > Hi Simon > > I presented exactly this (albeit briefly) at the 2016 UG. > > See the snapdiff section of the presentation at: > > > http://files.gpfsug.org/presentations/2016/south-bank/ArcaPix_GPFS_Spectrum_Scale_Python_API_final_17052016.pdf > > We can track creations, modifications, deletions and moves (from, to) for > files and directories between one point in time and another. > > The selections can be returned via a manner of your choice. > > If anyone wants to know more, hit me up directly. > > Incidentally - I will be at BVE this week (http://www.bvexpo.com/) > showing new things driven by the Python API and GPFS - so if anyone is in > the area and wants to chat about technicals in person rather than on mail, > drop me a line and we can sort that out. > > Best, > > Jez > > > On Mon, 27 Feb 2017 at 11:30, Simon Thompson (Research Computing - IT > Services) wrote: > > Hi, > > Is there a way to track files which have been deleted easily? I'm assuming > that we can't easily use a policy scan as they files are no longer in the > file-system unless we do some sort of diff? > > I'm assuming there must be a way of doing this as mmbackup must track > deleted files to notify TSM of expired objects. > > Basically I want a list of new files, changed files and deleted files > since a certain time. I'm assuming the first two will be relatively simple > with a policyscan, but the latter I'm not sure about. > > Thanks > > Simon > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > > -- > > > This email is confidential in that it is intended for the exclusive > attention of the addressee(s) indicated. If you are not the intended > recipient, this email should not be read or disclosed to any other person. > Please notify the sender immediately and delete this email from your > computer system. Any opinions expressed are not necessarily those of the > company from which this email was sent and, whilst to the best of our > knowledge no viruses or defects exist, no responsibility can be accepted > for any loss or damage arising from its receipt or subsequent use of this > email. > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -- -- This email is confidential in that it is intended for the exclusive attention of the addressee(s) indicated. If you are not the intended recipient, this email should not be read or disclosed to any other person. Please notify the sender immediately and delete this email from your computer system. Any opinions expressed are not necessarily those of the company from which this email was sent and, whilst to the best of our knowledge no viruses or defects exist, no responsibility can be accepted for any loss or damage arising from its receipt or subsequent use of this email. 
-------------- next part --------------
An HTML attachment was scrubbed...
URL:
From jonathan at buzzard.me.uk  Mon Feb 27 13:25:21 2017
From: jonathan at buzzard.me.uk (Jonathan Buzzard)
Date: Mon, 27 Feb 2017 13:25:21 +0000
Subject: [gpfsug-discuss] Tracking deleted files
In-Reply-To:
References:
Message-ID: <1488201921.4074.114.camel@buzzard.me.uk>

On Mon, 2017-02-27 at 12:39 +0000, Simon Thompson (Research Computing - IT Services) wrote:
> Yeah but that uses snapshots, which is pretty heavy-weight for what I
> want to do, particularly given mmbackup seems to have a way of
> tracking deletes...
>

It has been discussed in the past, but the way to track stuff is to enable HSM and then hook into the DMAPI. That way you can see all the file creates and deletes "live".

I can't however find a reference to it now. I have a feeling it was in the IBM GPFS forum however.

It would however require you to get your hands dirty writing code.

JAB.

--
Jonathan A. Buzzard                 Email: jonathan (at) buzzard.me.uk
Fife, United Kingdom.

From luis.bolinches at fi.ibm.com  Mon Feb 27 13:25:15 2017
From: luis.bolinches at fi.ibm.com (Luis Bolinches)
Date: Mon, 27 Feb 2017 13:25:15 +0000
Subject: [gpfsug-discuss] Tracking deleted files
In-Reply-To:
References: ,
Message-ID:

An HTML attachment was scrubbed...
URL:
From S.J.Thompson at bham.ac.uk  Mon Feb 27 13:32:42 2017
From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services))
Date: Mon, 27 Feb 2017 13:32:42 +0000
Subject: [gpfsug-discuss] Tracking deleted files
In-Reply-To: <1488201921.4074.114.camel@buzzard.me.uk>
References: <1488201921.4074.114.camel@buzzard.me.uk>
Message-ID:

>It has been discussed in the past, but the way to track stuff is to
>enable HSM and then hook into the DMAPI. That way you can see all the
>file creates and deletes "live".

Won't work, I already have a "real" HSM client attached to DMAPI (dsmrecalld).

I'm not actually wanting to backup for this use case, we already have mmbackup running to do those things, but it was a list of deleted files that I was after (I just thought it might be easy given mmbackup is tracking it already).

Simon

From oehmes at gmail.com  Mon Feb 27 13:37:46 2017
From: oehmes at gmail.com (Sven Oehme)
Date: Mon, 27 Feb 2017 13:37:46 +0000
Subject: [gpfsug-discuss] Tracking deleted files
In-Reply-To: <1488201921.4074.114.camel@buzzard.me.uk>
References: <1488201921.4074.114.camel@buzzard.me.uk>
Message-ID:

A couple of years ago tridge demonstrated things you can do with the DMAPI interface and even delivered some unsupported example code to demonstrate it: https://www.samba.org/~tridge/hacksm/

Keep in mind that the DMAPI interface has some severe limitations in terms of scaling: it can only run on one node and can have only one subscriber.

We are working on a more scalable and supported solution to accomplish what is being asked for (track operations, not just deletes) - stay tuned in one of the next user group meetings where I will present (Germany and/or London).

Sven

On Mon, Feb 27, 2017 at 5:25 AM Jonathan Buzzard wrote:
> On Mon, 2017-02-27 at 12:39 +0000, Simon Thompson (Research Computing -
> IT Services) wrote:
> > Yeah but that uses snapshots, which is pretty heavy-weight for what I
> > want to do, particularly given mmbackup seems to have a way of
> > tracking deletes...
> >
>
> It has been discussed in the past, but the way to track stuff is to
> enable HSM and then hook into the DMAPI. That way you can see all the
> file creates and deletes "live".
> > I can't however find a reference to it now. I have a feeling it was in > the IBM GPFS forum however. > > It would however require you to get your hands dirty writing code. > > JAB. > > -- > Jonathan A. Buzzard Email: jonathan (at) buzzard.me.uk > Fife, United Kingdom. > > _______________________________________________ > gpfsug-discuss mailing list > gpfsug-discuss at spectrumscale.org > http://gpfsug.org/mailman/listinfo/gpfsug-discuss > -------------- next part -------------- An HTML attachment was scrubbed... URL: From S.J.Thompson at bham.ac.uk Mon Feb 27 13:41:47 2017 From: S.J.Thompson at bham.ac.uk (Simon Thompson (Research Computing - IT Services)) Date: Mon, 27 Feb 2017 13:41:47 +0000 Subject: [gpfsug-discuss] Tracking deleted files In-Reply-To: References: <1488201921.4074.114.camel@buzzard.me.uk> Message-ID: Manchester ... The UK meeting is most likely going to be in Manchester ... 9th/10th May if you wanted to pencil something in (we're just waiting for final confirmation of the venue being booked). Simon From: > on behalf of Sven Oehme > Reply-To: "gpfsug-discuss at spectrumscale.org" > Date: Monday, 27 February 2017 at 13:37 To: "gpfsug-discuss at spectrumscale.org" > Subject: Re: [gpfsug-discuss] Tracking deleted files we are working on a more scalable and supported solution to accomplish what is asks for (track operations, not just delete) , stay tuned in one of the next user group meetings where i will present (Germany and/or London). -------------- next part -------------- An HTML attachment was scrubbed... URL: From stef.coene at docum.org Mon Feb 27 13:55:26 2017 From: stef.coene at docum.org (Stef Coene) Date: Mon, 27 Feb 2017 14:55:26 +0100 Subject: [gpfsug-discuss] Policy question Message-ID: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> Hi, I have a file system with 2 pools: V500001 and NAS01. I want to use pool V500001 as the default and migrate the oldest files to the pool NAS01 when the pool V500001 fills up. Whatever rule combination I tried, I can not get this working. This is the currently defined policy (created by the GUI): RULE 'Migration' MIGRATE FROM POOL 'V500001' THRESHOLD(95,85) WEIGHT(100000 - DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) TO POOL 'NAS01' RULE 'Default to V5000' SET POOL 'V500001' And also, how can I monitor the migration processes? Stef From makaplan at us.ibm.com Mon Feb 27 16:00:24 2017 From: makaplan at us.ibm.com (Marc A Kaplan) Date: Mon, 27 Feb 2017 11:00:24 -0500 Subject: [gpfsug-discuss] Policy questions In-Reply-To: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> References: <81a8f882-d3cb-91c6-41d2-d15c03dabfef@docum.org> Message-ID: I think you have the sign wrong on your weight. A simple way of ordering the files oldest first is WEIGHT(DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) adding 100,000 does nothing to change the order. WEIGHT can be any numeric SQL expression. So come to think of it WEIGHT( - DAYS(ACCESS_TIME) ) is even simpler and will yield the same ordering Also, you must run or schedule the mmapplypolicy command to run to actually do the migration. It doesn't happen until the mmapplypolicy command is running. You can run mmapplypolicy periodically (e.g. 
with crontab) or on demand with mmaddcallback (GPFS events facility) This is all covered in the very fine official Spectrum Scale documentation and/or some of the supplemental IBM red books, all available for free downloads from ibm.com --marc of GPFS From: Stef Coene To: gpfsug main discussion list Date: 02/27/2017 08:55 AM Subject: [gpfsug-discuss] Policy question Sent by: gpfsug-discuss-bounces at spectrumscale.org Hi, I have a file system with 2 pools: V500001 and NAS01. I want to use pool V500001 as the default and migrate the oldest files to the pool NAS01 when the pool V500001 fills up. Whatever rule combination I tried, I can not get this working. This is the currently defined policy (created by the GUI): RULE 'Migration' MIGRATE FROM POOL 'V500001' THRESHOLD(95,85) WEIGHT(100000 - DAYS(CURRENT_TIMESTAMP) - DAYS(ACCESS_TIME)) TO POOL 'NAS01' RULE 'Default to V5000' SET POOL 'V500001' And also, how can I monitor the migration processes? Stef _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: From Mark.Bush at siriuscom.com Mon Feb 27 19:40:57 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 19:40:57 +0000 Subject: [gpfsug-discuss] SMB and AD authentication Message-ID: For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. [id:image001.png at 01D2709D.6EF65720] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions -------------- next part -------------- An HTML attachment was scrubbed... 
URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.png Type: image/png Size: 8745 bytes Desc: image001.png URL: From YARD at il.ibm.com Mon Feb 27 19:46:07 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 27 Feb 2017 21:46:07 +0200 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... 
Name: not available Type: image/png Size: 8745 bytes Desc: not available URL: From laurence at qsplace.co.uk Mon Feb 27 19:46:59 2017 From: laurence at qsplace.co.uk (Laurence Horrocks-Barlow) Date: Mon, 27 Feb 2017 19:46:59 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Do you have UID/GID for the user in your AD schema? or the rfc 2307 extended schema? AFAIK it uses winbinds IDMAP so requires rfc 2307 attributes rather than using the windows SID and working the UID/GID using autorid etc. -- Lauz On 27 February 2017 19:40:57 GMT+00:00, "Mark.Bush at siriuscom.com" wrote: >For some reason, I just can?t seem to get this to work. I have >configured my protocol nodes to authenticate to AD using the following > >mmuserauth service create --type ad --data-access-method file --servers >192.168.88.3 --user-name administrator --netbios-name scale >--idmap-role master --password ********* --idmap-range-size 1000000 >--idmap-range 10000000-299999999 --enable-nfs-kerberos >--unixmap-domains 'sirius(10000-20000)' > > >All goes well, I see the nodes in AD and all of the wbinfo commands >show good (id Sirius\\administrator doesn?t work though), but when I >try to mount an SMB share (after doing all the necessary mmsmb export >stuff) I get permission denied. I?m curious if I missed a step >(followed the docs pretty much to the letter). I?m trying >Administrator, mark.bush, and a dummy aduser I created. None seem to >gain access to the share. > >Protocol gurus help! Any ideas are appreciated. > > >[id:image001.png at 01D2709D.6EF65720] >Mark R. Bush| Storage Architect >Mobile: 210-237-8415 >Twitter: @bushmr | LinkedIn: >/markreedbush >10100 Reunion Place, Suite 500, San Antonio, TX 78216 >www.siriuscom.com >|mark.bush at siriuscom.com > > >This message (including any attachments) is intended only for the use >of the individual or entity to which it is addressed and may contain >information that is non-public, proprietary, privileged, confidential, >and exempt from disclosure under applicable law. If you are not the >intended recipient, you are hereby notified that any use, >dissemination, distribution, or copying of this communication is >strictly prohibited. This message may be viewed by parties at Sirius >Computer Solutions other than those named in the message header. This >message does not contain an official representation of Sirius Computer >Solutions. If you have received this communication in error, notify >Sirius Computer Solutions immediately and (i) destroy this message if a >facsimile or (ii) delete this message immediately if this is an >electronic communication. Thank you. > >Sirius Computer Solutions -- Sent from my Android device with K-9 Mail. Please excuse my brevity. -------------- next part -------------- An HTML attachment was scrubbed... 
URL: From Mark.Bush at siriuscom.com Mon Feb 27 19:50:17 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 19:50:17 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: [root at n1 ~]# mmsmb export list share2 export path browseable guest ok smb encrypt share2 /gpfs/fs1/sales yes no auto [root at n1 ~]# ls -l /gpfs/fs1 total 0 drwxrwxrwx 2 root root 4096 Feb 25 12:33 sales From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 27, 2017 at 1:46 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] SMB and AD authentication Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards ________________________________ Yaron Daniel 94 Em Ha'Moshavot Rd [cid:image001.gif at 01D29100.6E55CCF0] Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org ________________________________ For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. [cid:image002.png at 01D29100.6E55CCF0] Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image001.gif Type: image/gif Size: 1852 bytes Desc: image001.gif URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: image002.png Type: image/png Size: 8746 bytes Desc: image002.png URL: From christof.schmitt at us.ibm.com Mon Feb 27 19:59:46 2017 From: christof.schmitt at us.ibm.com (Christof Schmitt) Date: Mon, 27 Feb 2017 12:59:46 -0700 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: --unixmap-domains 'sirius(10000-20000)' specifies that for the domain SIRIUS, all uid and gids are stored as rfc2307 attributes in the user and group objects in AD. If "id Sirius\\administrator" does not work, that might already point to missing data in AD. The requirement is that the user has a uidNumber defined, and the user's primary group in AD has to have a gidNumber defined. Note that a gidNumber defined for the user is not read by Spectrum Scale at this point. All uidNumber and gidNumber attributes have to fall in the defined range (10000-20000). If verifying the above points does not help, then a winbindd trace might help to point to the missing step: /usr/lpp/mmfs/bin/smbcontrol winbindd debug 10 id Sirius\\administrator /usr/lpp/mmfs/bin/smbcontrol winbindd debug 1 /var/adm/ras/log.winbindd-idmap is the log file for the idmap queries; it might show a failing ldap query in this case. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 12:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr | LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com |mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. 
If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss From YARD at il.ibm.com Mon Feb 27 20:04:09 2017 From: YARD at il.ibm.com (Yaron Daniel) Date: Mon, 27 Feb 2017 22:04:09 +0200 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: Hi What does the command return when you run it on the protocols nodes: #id 'DOM\user' Please follow this steps: https://www.ibm.com/support/knowledgecenter/en/STXKQY_4.2.0/ibmspectrumscale42_content.html SA23-1452-06 05/2016 IBM Spectrum Scale V4.2: Administration and Programming Reference Page - 135 Creating SMB share Use the following information to create an SMB share: 1. Create the directory to be exported through SMB: mmcrfileset fs01 fileset --inode-space=new mmlinkfileset fs01 fileset -J /gpfs/fs01/fileset mkdir /gpfs/fs01/fileset/smb Note: IBM recommends an independent fileset for SMB shares. Create a new independent fileset with these commands: mmcrfileset fs01 fileset --inode-space=new mmlinkfileset fs01 fileset -J /gpfs/fs01/fileset If the directory to be exported does not exist, create the directory first by running the following command: mkdir /gpfs/fs01/fileset/smb" 2. The recommended approach for managing access to the SMB share is to manage the ACLs from a Windows client machine. To change the ACLs from a Windows client, change the owner of the share folder to a user ID that will be used to make the ACL changes by running the following command: chown ?DOMAIN\smbadmin? /gpfs/fs01/fileset/smb 3. Create the actual SMB share on the existing directory: mmsmb export add smbexport /gpfs/fs01/fileset/smb Additional options can be set during share creation. For the documentation of all supported options, see ?mmsmb command? on page 663. 4. Verify that the share has been created: mmsmb export list 5. Access the share from a Windows client using the user ID that has been previously made the owner of the folder. 6. Right-click the folder in the Windows Explorer, open the Security tab, click Advanced, and modify the Access Control List as required. Note: An SMB share can only be created when the ACL setting of the underlying file system is -k nfsv4. In all other cases, mmsmb export create will fail with an error. See ?Authorizing protocol users? 
on page 200 for details and limitations Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services - Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:50 PM Subject: Re: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org [root at n1 ~]# mmsmb export list share2 export path browseable guest ok smb encrypt share2 /gpfs/fs1/sales yes no auto [root at n1 ~]# ls -l /gpfs/fs1 total 0 drwxrwxrwx 2 root root 4096 Feb 25 12:33 sales From: on behalf of Yaron Daniel Reply-To: gpfsug main discussion list Date: Monday, February 27, 2017 at 1:46 PM To: gpfsug main discussion list Subject: Re: [gpfsug-discuss] SMB and AD authentication Hi Can you show the share config + ls -l on the share Fileset/Directory from the protocols nodes ? Regards Yaron Daniel 94 Em Ha'Moshavot Rd Server, Storage and Data Services- Team Leader Petach Tiqva, 49527 Global Technology Services Israel Phone: +972-3-916-5672 Fax: +972-3-916-5672 Mobile: +972-52-8395593 e-mail: yard at il.ibm.com IBM Israel From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 09:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated. Mark R. Bush| Storage Architect Mobile: 210-237-8415 Twitter: @bushmr| LinkedIn: /markreedbush 10100 Reunion Place, Suite 500, San Antonio, TX 78216 www.siriuscom.com|mark.bush at siriuscom.com This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you. 
Sirius Computer Solutions _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss _______________________________________________ gpfsug-discuss mailing list gpfsug-discuss at spectrumscale.org http://gpfsug.org/mailman/listinfo/gpfsug-discuss -------------- next part -------------- An HTML attachment was scrubbed... URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1851 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/gif Size: 1852 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: image/png Size: 8746 bytes Desc: not available URL: From Mark.Bush at siriuscom.com Mon Feb 27 20:12:23 2017 From: Mark.Bush at siriuscom.com (Mark.Bush at siriuscom.com) Date: Mon, 27 Feb 2017 20:12:23 +0000 Subject: [gpfsug-discuss] SMB and AD authentication In-Reply-To: References: Message-ID: That was it. I just didn?t have the ScaleUsers group (special AD group I created) set as AD user Sirius\mark.bush?s primary group. Once I did that bam?shares show up and I can view and id works too. Thanks Christof. On 2/27/17, 1:59 PM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of Christof Schmitt" wrote: --unixmap-domains 'sirius(10000-20000)' specifies that for the domain SIRIUS, all uid and gids are stored as rfc2307 attributes in the user and group objects in AD. If "id Sirius\\administrator" does not work, that might already point to missing data in AD. The requirement is that the user has a uidNumber defined, and the user's primary group in AD has to have a gidNumber defined. Note that a gidNumber defined for the user is not read by Spectrum Scale at this point. All uidNumber and gidNumber attributes have to fall in the defined range (10000-20000). If verifying the above points does not help, then a winbindd trace might help to point to the missing step: /usr/lpp/mmfs/bin/smbcontrol winbindd debug 10 id Sirius\\administrator /usr/lpp/mmfs/bin/smbcontrol winbindd debug 1 /var/adm/ras/log.winbindd-idmap is the log file for the idmap queries; it might show a failing ldap query in this case. Regards, Christof Schmitt || IBM || Spectrum Scale Development || Tucson, AZ christof.schmitt at us.ibm.com || +1-520-799-2469 (T/L: 321-2469) From: "Mark.Bush at siriuscom.com" To: gpfsug main discussion list Date: 02/27/2017 12:41 PM Subject: [gpfsug-discuss] SMB and AD authentication Sent by: gpfsug-discuss-bounces at spectrumscale.org For some reason, I just can?t seem to get this to work. I have configured my protocol nodes to authenticate to AD using the following mmuserauth service create --type ad --data-access-method file --servers 192.168.88.3 --user-name administrator --netbios-name scale --idmap-role master --password ********* --idmap-range-size 1000000 --idmap-range 10000000-299999999 --enable-nfs-kerberos --unixmap-domains 'sirius(10000-20000)' All goes well, I see the nodes in AD and all of the wbinfo commands show good (id Sirius\\administrator doesn?t work though), but when I try to mount an SMB share (after doing all the necessary mmsmb export stuff) I get permission denied. I?m curious if I missed a step (followed the docs pretty much to the letter). I?m trying Administrator, mark.bush, and a dummy aduser I created. 
None seem to gain access to the share. Protocol gurus help! Any ideas are appreciated.

Mark R. Bush | Storage Architect
Mobile: 210-237-8415
Twitter: @bushmr | LinkedIn: /markreedbush
10100 Reunion Place, Suite 500, San Antonio, TX 78216
www.siriuscom.com | mark.bush at siriuscom.com

This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you.

Sirius Computer Solutions

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. This message may be viewed by parties at Sirius Computer Solutions other than those named in the message header. This message does not contain an official representation of Sirius Computer Solutions. If you have received this communication in error, notify Sirius Computer Solutions immediately and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication. Thank you.

Sirius Computer Solutions

From ewahl at osc.edu Mon Feb 27 20:50:49 2017
From: ewahl at osc.edu (Edward Wahl)
Date: Mon, 27 Feb 2017 15:50:49 -0500
Subject: [gpfsug-discuss] Tracking deleted files
In-Reply-To:
References: <1488201921.4074.114.camel@buzzard.me.uk>
Message-ID: <20170227155049.22001bb0@osc.edu>

I can think of a couple of ways to do this. Using snapshots seems heavy, but so does using mmbackup unless you are already running it every day.

Diff the shadow files? Haha, could be a _terrible_ idea if you have a couple hundred million files. But it IS possible.

Next, I'm NOT a tsm expert, but I know a bit about it (and I probably stayed at a Holiday Inn Express at least once in my heavy travel days):

-query objects using '-ina=yes' and yesterday's date? Might be a touch slow. But it probably uses the next one as its backend:

-db2 query inside TSM to see a similar thing. This ought to be the fastest, and I'm sure with a little google'ing you can work this out. Tivoli MUST know exact dates of deletion as it uses that and the retention time to know when to purge/reclaim deleted objects from its storage pools (retain extra version or RETEXTRA or retain only version).
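For example, the first idea would start out something like this (untested, and the exact dsmc options and output format should be checked against the BA client documentation; the filespace path is a placeholder):

# '-inactive' is the long form of '-ina=yes'; it lists inactive copies as
# well as active ones. Inactive entries that were still active after the
# previous run are the candidates for files deleted since then.
dsmc query backup "/gpfs/fs1/*" -subdir=yes -inactive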
Ed

On Mon, 27 Feb 2017 13:32:42 +0000
"Simon Thompson (Research Computing - IT Services)" wrote:

> >It has been discussed in the past, but the way to track stuff is to
> >enable HSM and then hook into the DSMAPI. That way you can see all the
> >file creates and deletes "live".
>
> Won't work, I already have a "real" HSM client attached to DMAPI
> (dsmrecalld).
>
> I'm not actually wanting to backup for this use case, we already have
> mmbackup running to do those things, but it was a list of deleted files
> that I was after (I just thought it might be easy given mmbackup is
> tracking it already).
>
> Simon
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--
Ed Wahl
Ohio Supercomputer Center
614-292-9302

From makaplan at us.ibm.com Mon Feb 27 21:23:52 2017
From: makaplan at us.ibm.com (Marc A Kaplan)
Date: Mon, 27 Feb 2017 16:23:52 -0500
Subject: [gpfsug-discuss] Tracking deleted files
In-Reply-To: <20170227155049.22001bb0@osc.edu>
References: <1488201921.4074.114.camel@buzzard.me.uk> <20170227155049.22001bb0@osc.edu>
Message-ID:

Diffing file lists can be fast - IF you keep the file lists sorted by a unique key, e.g. the inode number. I believe that's how mmbackup does it. Use the classic set difference algorithm.

Standard diff is designed to do something else and is terribly slow on large file lists.

From: Edward Wahl
To: "Simon Thompson (Research Computing - IT Services)"
Cc: gpfsug main discussion list
Date: 02/27/2017 03:51 PM
Subject: Re: [gpfsug-discuss] Tracking deleted files
Sent by: gpfsug-discuss-bounces at spectrumscale.org

I can think of a couple of ways to do this. Using snapshots seems heavy, but so does using mmbackup unless you are already running it every day.

Diff the shadow files? Haha, could be a _terrible_ idea if you have a couple hundred million files. But it IS possible.

Next, I'm NOT a tsm expert, but I know a bit about it (and I probably stayed at a Holiday Inn Express at least once in my heavy travel days):

-query objects using '-ina=yes' and yesterday's date? Might be a touch slow. But it probably uses the next one as its backend:

-db2 query inside TSM to see a similar thing. This ought to be the fastest, and I'm sure with a little google'ing you can work this out. Tivoli MUST know exact dates of deletion as it uses that and the retention time to know when to purge/reclaim deleted objects from its storage pools (retain extra version or RETEXTRA or retain only version).

Ed

On Mon, 27 Feb 2017 13:32:42 +0000
"Simon Thompson (Research Computing - IT Services)" wrote:

> >It has been discussed in the past, but the way to track stuff is to
> >enable HSM and then hook into the DSMAPI. That way you can see all the
> >file creates and deletes "live".
>
> Won't work, I already have a "real" HSM client attached to DMAPI
> (dsmrecalld).
>
> I'm not actually wanting to backup for this use case, we already have
> mmbackup running to do those things, but it was a list of deleted files
> that I was after (I just thought it might be easy given mmbackup is
> tracking it already).
>
> Simon
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

--
Ed Wahl
Ohio Supercomputer Center
614-292-9302

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From janfrode at tanso.net Mon Feb 27 22:13:46 2017
From: janfrode at tanso.net (Jan-Frode Myklebust)
Date: Mon, 27 Feb 2017 23:13:46 +0100
Subject: [gpfsug-discuss] Tracking deleted files
In-Reply-To:
References: <1488201921.4074.114.camel@buzzard.me.uk> <20170227155049.22001bb0@osc.edu>
Message-ID:

AFM apparently keeps track of this, so maybe it would be possible to run AFM-SW with a disconnected home and query the queue of changes? But that would require some way of clearing the queue as well...

-jf

On Monday, February 27, 2017, Marc A Kaplan wrote:

> Diffing file lists can be fast - IF you keep the file lists sorted by a
> unique key, e.g. the inode number.
> I believe that's how mmbackup does it. Use the classic set difference
> algorithm.
>
> Standard diff is designed to do something else and is terribly slow on
> large file lists.
>
> From: Edward Wahl
> To: "Simon Thompson (Research Computing - IT Services)" <S.J.Thompson at bham.ac.uk>
> Cc: gpfsug main discussion list
> Date: 02/27/2017 03:51 PM
> Subject: Re: [gpfsug-discuss] Tracking deleted files
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
>
> ------------------------------
>
> I can think of a couple of ways to do this. Using snapshots seems heavy,
> but so does using mmbackup unless you are already running it every day.
>
> Diff the shadow files? Haha, could be a _terrible_ idea if you have a
> couple hundred million files. But it IS possible.
>
> Next, I'm NOT a tsm expert, but I know a bit about it (and I probably
> stayed at a Holiday Inn Express at least once in my heavy travel days):
>
> -query objects using '-ina=yes' and yesterday's date? Might be a touch
> slow. But it probably uses the next one as its backend:
>
> -db2 query inside TSM to see a similar thing. This ought to be the
> fastest, and I'm sure with a little google'ing you can work this out.
> Tivoli MUST know exact dates of deletion as it uses that and the
> retention time to know when to purge/reclaim deleted objects from its
> storage pools (retain extra version or RETEXTRA or retain only version).
>
> Ed
>
> On Mon, 27 Feb 2017 13:32:42 +0000
> "Simon Thompson (Research Computing - IT Services)" <S.J.Thompson at bham.ac.uk> wrote:
>
> > >It has been discussed in the past, but the way to track stuff is to
> > >enable HSM and then hook into the DSMAPI. That way you can see all the
> > >file creates and deletes "live".
> >
> > Won't work, I already have a "real" HSM client attached to DMAPI
> > (dsmrecalld).
> >
> > I'm not actually wanting to backup for this use case, we already have
> > mmbackup running to do those things, but it was a list of deleted files
> > that I was after (I just thought it might be easy given mmbackup is
> > tracking it already).
> >
> > Simon
> >
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
> --
> Ed Wahl
> Ohio Supercomputer Center
> 614-292-9302
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss

From service at metamodul.com Tue Feb 28 08:44:26 2017
From: service at metamodul.com (Hans-Joachim Ehlers)
Date: Tue, 28 Feb 2017 09:44:26 +0100 (CET)
Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster
Message-ID: <1031275380.310791.1488271466638@email.1und1.de>

An HTML attachment was scrubbed...
URL:

From ashish.thandavan at cs.ox.ac.uk Tue Feb 28 16:10:44 2017
From: ashish.thandavan at cs.ox.ac.uk (Ashish Thandavan)
Date: Tue, 28 Feb 2017 16:10:44 +0000
Subject: [gpfsug-discuss] mmbackup logging issue
Message-ID:

Dear all,

We have a small GPFS cluster and a separate server running TSM, and one of the three NSD servers backs up our GPFS filesystem to the TSM server using mmbackup. After a recent upgrade from v3.5 to 4.1.1, we've noticed that mmbackup no longer logs progress like it used to:

...
Thu Jan 19 05:45:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 2 failed.
Thu Jan 19 06:15:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 3 failed.
Thu Jan 19 06:45:41 2017 mmbackup:Backing up files: 0 backed up, 870532 expired, 3 failed.
...

instead of

...
Sat Dec 3 12:01:00 2016 mmbackup:Backing up files: 105030 backed up, 635456 expired, 30 failed.
Sat Dec 3 12:31:00 2016 mmbackup:Backing up files: 205934 backed up, 635456 expired, 57 failed.
Sat Dec 3 13:01:00 2016 mmbackup:Backing up files: 321702 backed up, 635456 expired, 169 failed.
...

like it used to pre-upgrade. I am therefore unable to see how far along it has got, and indeed whether it completed successfully, as this is what it logs at the end of a job:

...
Tue Jan 17 18:07:31 2017 mmbackup:Completed policy backup run with 0 policy errors, 10012 files failed, 0 severe errors, returning rc=9.
Tue Jan 17 18:07:31 2017 mmbackup:Policy for backup returned 9 Highest TSM error 12
mmbackup: TSM Summary Information:
Total number of objects inspected: 20617273
Total number of objects backed up: 0
Total number of objects updated: 0
Total number of objects rebound: 0
Total number of objects deleted: 0
Total number of objects expired: 1
Total number of objects failed: 10012
Total number of objects encrypted: 0
Total number of bytes inspected: 3821624716861
Total number of bytes transferred: 3712040943672
Tue Jan 17 18:07:31 2017 mmbackup:Audit files /cs/mmbackup.audit.gpfs* contain 0 failed paths but there were 10012 failures.
Cannot reconcile shadow database.
Unable to compensate for all TSM errors in new shadow database.
Preserving previous shadow database.
Run next mmbackup with -q to synchronize shadow database. exit 12

If it helps, the mmbackup job is kicked off with the following options:

/usr/lpp/mmfs/bin/mmbackup gpfs -n 8 -t full -B 20000 -L 1 --tsm-servers gpfs_weekly_stanza -N glossop1a | /usr/bin/tee /var/log/mmbackup/gpfs_weekly/backup_log.`date +%Y%m%d_%H_%M`

(The excerpts above are from the backup_log. file.)

Our NSD servers are running GPFS 4.1.1-11, TSM is at 7.1.1.100 and the file system version is 12.06 (3.4.0.3).
Has anyone else seen this behaviour with mmbackup and, if so, found a fix?

Thanks,

Regards,
Ash

--
-------------------------
Ashish Thandavan

UNIX Support Computing Officer
Department of Computer Science
University of Oxford
Wolfson Building
Parks Road
Oxford OX1 3QD

Phone: 01865 610733
Email: ashish.thandavan at cs.ox.ac.uk

From TOMP at il.ibm.com Tue Feb 28 17:08:29 2017
From: TOMP at il.ibm.com (Tomer Perry)
Date: Tue, 28 Feb 2017 19:08:29 +0200
Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster
In-Reply-To: <1031275380.310791.1488271466638@email.1und1.de>
References: <1031275380.310791.1488271466638@email.1und1.de>
Message-ID:

Hans-Joachim,

Since I'm the one that gave this answer... I'll work on adding it to the FAQ. But, in general:

1. The maximum number of "outbound clusters" - meaning "how many clusters can a client join" - is limited to 31 (32 including the local cluster).
2. The maximum number of "inbound clusters" - meaning "how many clusters can join my cluster" - is not really limited. Thus, since the smallest cluster possible is a single-node cluster, it means that 16383 nodes can join my cluster (16384 - 1).

Regards,

Tomer Perry
Scalable I/O Development (Spectrum Scale)
email: tomp at il.ibm.com
1 Azrieli Center, Tel Aviv 67021, Israel
Global Tel: +1 720 3422758
Israel Tel: +972 3 9188625
Mobile: +972 52 2554625

From: Hans-Joachim Ehlers
To: gpfsug main discussion list
Date: 28/02/2017 10:44
Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster
Sent by: gpfsug-discuss-bounces at spectrumscale.org

First, thx to all for the support on this list. It is highly appreciated.

My new question: I am currently discussing with IBM the maximum number of remote clusters mounting GPFS from a local cluster. The answer was that there is almost no limit to the number of REMOTE clusters accessing a given cluster. From memory I thought there was a limit of 24 remote clusters, and that the total number of nodes must not exceed 16k. The latter is described in the GPFS FAQ, but I could not find anything in the FAQ about the maximum number of remote clusters accessing a local cluster.

So is there a limit on remote clusters accessing a given GPFS cluster, or could I really have almost 16k-n(*) remote clusters (one-node clusters) as long as the total number of nodes does not exceed the 16k?

(*) n is the number of local nodes.

Maybe this info should also be added to the FAQ?

Info from the FAQ:
https://www.ibm.com/support/knowledgecenter/SSFKCN/com.ibm.cluster.gpfs.doc/gpfs_faqs/gpfsclustersfaq.pdf

Q5.4: What is the current limit on the number of nodes that may concurrently join a cluster?
A5.4: As of GPFS V3.4.0.18 and GPFS V3.5.0.5, the total number of nodes that may concurrently join a cluster is limited to a maximum of 16384 nodes.

tia
Hajo

--
Unix Systems Engineer
--------------------------------------------------
MetaModul GmbH
Süderstr. 12
25336 Elmshorn
HRB: 11873 PI
UstID: DE213701983
Mobil: +49 177 4393994
Mail: service at metamodul.com

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
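For anyone wanting to check their own environment against those two limits, both views can be listed from the CLI. A rough sketch only (untested; the "Cluster name:" label and output layout may vary slightly between releases):

# Inbound view: clusters that have been granted access to this cluster
# (the local cluster itself also appears in the output).
/usr/lpp/mmfs/bin/mmauth show all

# Outbound view: remote clusters and file systems this cluster accesses.
/usr/lpp/mmfs/bin/mmremotecluster show all
/usr/lpp/mmfs/bin/mmremotefs show all

# Rough counts against the limits discussed above:
/usr/lpp/mmfs/bin/mmauth show all | grep -c "Cluster name:"
/usr/lpp/mmfs/bin/mmremotecluster show all | grep -c "Cluster name:"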
From service at metamodul.com Tue Feb 28 17:45:57 2017
From: service at metamodul.com (service at metamodul.com)
Date: Tue, 28 Feb 2017 18:45:57 +0100
Subject: [gpfsug-discuss] Spectrum Scale (GPFS) FAQ - Maximum number of remote clusters accessing a given cluster
Message-ID:

Thx a lot, Perry. I never thought about outbound or inbound cluster access.

Wish you all the best

Hajo

--
Unix Systems Engineer
MetaModul GmbH
+49 177 4393994