<html>

<head>

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

</head>

<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">

Hi Ed,

<div class=""><br class="">

</div>

<div class="">Thanks for the suggestion … that’s basically what I had done yesterday after Googling and getting a hit or two on the IBM DeveloperWorks site.  I’m including some output below which seems to show that I’ve got everything set up but it’s still

 not working.</div>

<div class=""><br class="">

</div>

<div class="">Am I missing something?  We don’t use CCR on our production cluster (and this experience doesn’t make me eager to do so!), so I’m not that familiar with it...</div>

<div class=""><br class="">

</div>

<div class="">Kevin</div>

<div class=""><br class="">

</div>

<div class="">

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# mmdsh -F /tmp/cluster.hostnames "ps -ef | grep mmccr | grep -v grep" | sort</div>

<div class="">testdellnode1:  root      2583     1  0 May30 ?        00:10:33 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testdellnode1:  root      6694  2583  0 11:19 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testgateway:  root      2023  5828  0 11:19 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testgateway:  root      5828     1  0 Sep18 ?        00:00:19 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testnsd1:  root     19356  4628  0 11:19 tty1     00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testnsd1:  root      4628     1  0 Sep19 tty1     00:00:04 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testnsd2:  root     22149  2983  0 11:16 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testnsd2:  root      2983     1  0 Sep18 ?        00:00:27 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testnsd3:  root     15685  6557  0 11:19 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testnsd3:  root      6557     1  0 Sep19 ?        00:00:04 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testsched:  root     29424  6512  0 11:19 ?        00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">testsched:  root      6512     1  0 Sep18 ?        00:00:20 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15</div>

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# mmstartup -a</div>

<div class="">get file failed: Not enough CCR quorum nodes available (err 809)</div>

<div class="">gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158</div>

<div class="">mmstartup: Command failed. Examine previous error messages to determine cause.</div>

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# mmdsh -F /tmp/cluster.hostnames "ls -l /var/mmfs/ccr" | sort</div>

<div class="">testdellnode1:  drwxr-xr-x 2 root root 4096 Mar  3  2017 cached</div>

<div class="">testdellnode1:  drwxr-xr-x 2 root root 4096 Nov 10  2016 committed</div>

<div class="">testdellnode1:  -rw-r--r-- 1 root root   99 Nov 10  2016 ccr.nodes</div>

<div class="">testdellnode1:  total 12</div>

<div class="">testgateway:  drwxr-xr-x. 2 root root 4096 Jun 29  2016 committed</div>

<div class="">testgateway:  drwxr-xr-x. 2 root root 4096 Mar  3  2017 cached</div>

<div class="">testgateway:  -rw-r--r--. 1 root root   99 Jun 29  2016 ccr.nodes</div>

<div class="">testgateway:  total 12</div>

<div class="">testnsd1:  drwxr-xr-x 2 root root  6 Sep 19 15:38 cached</div>

<div class="">testnsd1:  drwxr-xr-x 2 root root  6 Sep 19 15:38 committed</div>

<div class="">testnsd1:  -rw-r--r-- 1 root root  0 Sep 19 15:39 ccr.disks</div>

<div class="">testnsd1:  -rw-r--r-- 1 root root  4 Sep 19 15:38 ccr.noauth</div>

<div class="">testnsd1:  -rw-r--r-- 1 root root 99 Sep 19 15:39 ccr.nodes</div>

<div class="">testnsd1:  total 8</div>

<div class="">testnsd2:  drwxr-xr-x 2 root root   22 Mar  3  2017 cached</div>

<div class="">testnsd2:  drwxr-xr-x 2 root root 4096 Sep 18 11:49 committed</div>

<div class="">testnsd2:  -rw------- 1 root root 4096 Sep 18 11:50 ccr.paxos.1</div>

<div class="">testnsd2:  -rw------- 1 root root 4096 Sep 18 11:50 ccr.paxos.2</div>

<div class="">testnsd2:  -rw-r--r-- 1 root root    0 Jun 29  2016 ccr.disks</div>

<div class="">testnsd2:  -rw-r--r-- 1 root root   99 Jun 29  2016 ccr.nodes</div>

<div class="">testnsd2:  total 16</div>

<div class="">testnsd3:  drwxr-xr-x 2 root root  6 Sep 19 15:41 cached</div>

<div class="">testnsd3:  drwxr-xr-x 2 root root  6 Sep 19 15:41 committed</div>

<div class="">testnsd3:  -rw-r--r-- 1 root root  0 Jun 29  2016 ccr.disks</div>

<div class="">testnsd3:  -rw-r--r-- 1 root root  4 Sep 19 15:41 ccr.noauth</div>

<div class="">testnsd3:  -rw-r--r-- 1 root root 99 Jun 29  2016 ccr.nodes</div>

<div class="">testnsd3:  total 8</div>

<div class="">testsched:  drwxr-xr-x. 2 root root 4096 Jun 29  2016 committed</div>

<div class="">testsched:  drwxr-xr-x. 2 root root 4096 Mar  3  2017 cached</div>

<div class="">testsched:  -rw-r--r--. 1 root root   99 Jun 29  2016 ccr.nodes</div>

<div class="">testsched:  total 12</div>

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# more ../ccr/ccr.nodes</div>

<div class="">3,0,10.0.6.215,,testnsd3.vampire</div>

<div class="">1,0,10.0.6.213,,testnsd1.vampire</div>

<div class="">2,0,10.0.6.214,,testnsd2.vampire</div>

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# mmdsh -F /tmp/cluster.hostnames "ls -l /var/mmfs/gen/mmsdrfs"</div>

<div class="">testnsd1:  -rw-r--r-- 1 root root 20360 Sep 19 15:21 /var/mmfs/gen/mmsdrfs</div>

<div class="">testnsd3:  -rw-r--r-- 1 root root 20360 Sep 19 15:34 /var/mmfs/gen/mmsdrfs</div>

<div class="">testnsd2:  -rw-r--r-- 1 root root 20360 Aug 25 17:34 /var/mmfs/gen/mmsdrfs</div>

<div class="">testdellnode1:  -rw-r--r-- 1 root root 20360 Aug 25 17:43 /var/mmfs/gen/mmsdrfs</div>

<div class="">testgateway:  -rw-r--r--. 1 root root 20360 Aug 25 17:43 /var/mmfs/gen/mmsdrfs</div>

<div class="">testsched:  -rw-r--r--. 1 root root 20360 Aug 25 17:43 /var/mmfs/gen/mmsdrfs</div>

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# mmdsh -F /tmp/cluster.hostnames "md5sum /var/mmfs/gen/mmsdrfs"</div>

<div class="">testnsd1:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs</div>

<div class="">testnsd3:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs</div>

<div class="">testnsd2:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs</div>

<div class="">testdellnode1:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs</div>

<div class="">testgateway:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs</div>

<div class="">testsched:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs</div>

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# mmdsh -F /tmp/cluster.hostnames "md5sum /var/mmfs/ssl/stage/genkeyData1"</div>

<div class="">testnsd3:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1</div>

<div class="">testnsd1:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1</div>

<div class="">testnsd2:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1</div>

<div class="">testdellnode1:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1</div>

<div class="">testgateway:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1</div>

<div class="">testsched:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1</div>

<div class="">/var/mmfs/gen</div>

<div class="">root@testnsd2# </div>

<div><br class="">

<blockquote type="cite" class="">

<div class="">On Sep 20, 2017, at 10:48 AM, Edward Wahl <<a href="mailto:ewahl@osc.edu" class="">ewahl@osc.edu</a>> wrote:</div>

<br class="Apple-interchange-newline">

<div class="">

<div class="">I've run into this before.  We didn't use to use CCR.  And restoring nodes for<br class="">

us is a major pain in the rear as we only allow one-way root SSH, so we have a<br class="">

number of useful little scripts to work around problems like this.<br class="">

<br class="">

Assuming that you have all the necessary files copied to the correct<br class="">

places, you can manually kick off CCR. <br class="">

<br class="">

I think my script does something like:<br class="">

<br class="">

(copy the encryption key info) <br class="">

<br class="">

scp  /var/mmfs/ccr/ccr.nodes &lt;node&gt;:/var/mmfs/ccr/<br class="">

<br class="">

scp /var/mmfs/gen/mmsdrfs &lt;node&gt;:/var/mmfs/gen/<br class="">

<br class="">

scp /var/mmfs/ssl/stage/genkeyData1  &lt;node&gt;:/var/mmfs/ssl/stage/<br class="">

<br class="">

&lt;node&gt;:/usr/lpp/mmfs/bin/mmcommon startCcrMonitor<br class="">

<br class="">

you should then see like 2 copies of it running under mmksh.<br class="">
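<br class="">
A rough Python sketch of those steps, in case it helps. It only builds and prints the commands (nothing is executed); running the last step over ssh and the restore_commands name are assumptions for illustration, not what the real script does.<br class="">
<pre class="">
```python
import shlex

# Source file and destination directory for each copy step listed above.
FILES = [
    ("/var/mmfs/ccr/ccr.nodes", "/var/mmfs/ccr/"),
    ("/var/mmfs/gen/mmsdrfs", "/var/mmfs/gen/"),
    ("/var/mmfs/ssl/stage/genkeyData1", "/var/mmfs/ssl/stage/"),
]


def restore_commands(node):
    """Return the argv lists for one node, in order; nothing is executed."""
    cmds = [["scp", src, f"{node}:{dest}"] for src, dest in FILES]
    # Assumption: the monitor is started on the remote node over ssh.
    cmds.append(["ssh", node, "/usr/lpp/mmfs/bin/mmcommon", "startCcrMonitor"])
    return cmds


# Print the commands for one node so they can be reviewed first.
for argv in restore_commands("testnsd1"):
    print(shlex.join(argv))
```
</pre>
Run the printed commands by hand, or pass each list to subprocess.run, once they look right for your site.<br class="">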

<br class="">

Ed<br class="">

<br class="">

<br class="">

On Wed, 20 Sep 2017 13:55:28 +0000<br class="">

"Buterbaugh, Kevin L" <<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">Kevin.Buterbaugh@Vanderbilt.Edu</a>> wrote:<br class="">

<br class="">

<blockquote type="cite" class="">Hi All,<br class="">

<br class="">

testnsd1 and testnsd3 both had hardware issues (power supply and internal HD<br class="">

respectively).  Given that they were 12 year old boxes, we decided to replace<br class="">

them with other boxes that are a mere 7 years old … keep in mind that this is<br class="">

a test cluster.<br class="">

<br class="">

Disabling CCR does not work, even with the undocumented “--force” option:<br class="">

<br class="">

/var/mmfs/gen<br class="">

root@testnsd2# mmchcluster --ccr-disable -p testnsd2 -s testnsd1 --force<br class="">

mmchcluster: Unable to obtain the GPFS configuration file lock.<br class="">

mmchcluster: GPFS was unable to obtain a lock from node testnsd1.vampire.<br class="">

mmchcluster: Processing continues without lock protection.<br class="">

The authenticity of host 'testnsd3.vampire (10.0.6.215)' can't be established.<br class="">

ECDSA key fingerprint is SHA256:Ky1pkjsC/kvt4RA8PJuEh/W3vcxCJZplr2m1XHr+UwI.<br class="">

ECDSA key fingerprint is MD5:55:59:a0:2a:6e:a1:00:58:85:3d:ac:86:0e:cd:2a:8a.<br class="">

Are you sure you want to continue connecting (yes/no)? The authenticity of<br class="">

host 'testnsd1.vampire (10.0.6.213)' can't be established. ECDSA key<br class="">

fingerprint is SHA256:WPiTtyuyzhuv+lRRpgDjLuHpyHyk/W3+c5N9SabWvnE. ECDSA key<br class="">

fingerprint is MD5:26:26:2a:bf:e4:cb:1d:a8:27:35:96:ef:b5:96:e0:29. Are you<br class="">

sure you want to continue connecting (yes/no)? The authenticity of host<br class="">

'vmp609.vampire (10.0.21.9)' can't be established. ECDSA key fingerprint is<br class="">

SHA256:/gX6eSp/shsRboVFcUFcNCtGSfbBIWQZ/CWjA6gb17Q. ECDSA key fingerprint is<br class="">

MD5:ca:4d:58:8c:91:28:25:7b:5b:b1:0d:a3:72:a3:00:bb. Are you sure you want to<br class="">

continue connecting (yes/no)? The authenticity of host 'vmp608.vampire<br class="">

(10.0.21.8)' can't be established. ECDSA key fingerprint is<br class="">

SHA256:tvtNWN9b7/Qknb/Am8x7FzyMngi6R3f5SHBqATNtLzw. ECDSA key fingerprint is<br class="">

MD5:fc:4e:87:fb:09:82:cd:67:b0:7d:7f:c7:4b:83:b9:6c. Are you sure you want to<br class="">

continue connecting (yes/no)? The authenticity of host 'vmp612.vampire<br class="">

(10.0.21.12)' can't be established. ECDSA key fingerprint is<br class="">

SHA256:zKXqPt8rIMZWSAYavKEuaAVIm31OGVovoWVU+dBTRPM. ECDSA key fingerprint is<br class="">

MD5:72:4d:fb:22:4e:b3:0e:04:37:be:16:74:ae:ea:05:6c. Are you sure you want to<br class="">

continue connecting (yes/no)?<br class="">

<a href="mailto:root@vmp610.vampire" class="">root@vmp610.vampire</a><<a href="mailto:root@vmp610.vampire" class="">mailto:root@vmp610.vampire</a>>'s password:<br class="">

testnsd3.vampire:  Host key verification failed. mmdsh: testnsd3.vampire<br class="">

remote shell process had return code 255. testnsd1.vampire:  Host key<br class="">

verification failed. mmdsh: testnsd1.vampire remote shell process had return<br class="">

code 255. vmp609.vampire:  Host key verification failed. mmdsh:<br class="">

vmp609.vampire remote shell process had return code 255. vmp608.vampire:<br class="">

Host key verification failed. mmdsh: vmp608.vampire remote shell process had<br class="">

return code 255. vmp612.vampire:  Host key verification failed. mmdsh:<br class="">

vmp612.vampire remote shell process had return code 255.<br class="">

<br class="">

<a href="mailto:root@vmp610.vampire" class="">root@vmp610.vampire</a><<a href="mailto:root@vmp610.vampire" class="">mailto:root@vmp610.vampire</a>>'s password: vmp610.vampire:<br class="">

Permission denied, please try again.<br class="">

<br class="">

<a href="mailto:root@vmp610.vampire" class="">root@vmp610.vampire</a><<a href="mailto:root@vmp610.vampire" class="">mailto:root@vmp610.vampire</a>>'s password: vmp610.vampire:<br class="">

Permission denied, please try again.<br class="">

<br class="">

vmp610.vampire:  Permission denied<br class="">

(publickey,gssapi-keyex,gssapi-with-mic,password). mmdsh: vmp610.vampire<br class="">

remote shell process had return code 255.<br class="">

<br class="">

Verifying GPFS is stopped on all nodes ...<br class="">

The authenticity of host 'testnsd3.vampire (10.0.6.215)' can't be established.<br class="">

ECDSA key fingerprint is SHA256:Ky1pkjsC/kvt4RA8PJuEh/W3vcxCJZplr2m1XHr+UwI.<br class="">

ECDSA key fingerprint is MD5:55:59:a0:2a:6e:a1:00:58:85:3d:ac:86:0e:cd:2a:8a.<br class="">

Are you sure you want to continue connecting (yes/no)? The authenticity of<br class="">

host 'vmp612.vampire (10.0.21.12)' can't be established. ECDSA key<br class="">

fingerprint is SHA256:zKXqPt8rIMZWSAYavKEuaAVIm31OGVovoWVU+dBTRPM. ECDSA key<br class="">

fingerprint is MD5:72:4d:fb:22:4e:b3:0e:04:37:be:16:74:ae:ea:05:6c. Are you<br class="">

sure you want to continue connecting (yes/no)? The authenticity of host<br class="">

'vmp608.vampire (10.0.21.8)' can't be established. ECDSA key fingerprint is<br class="">

SHA256:tvtNWN9b7/Qknb/Am8x7FzyMngi6R3f5SHBqATNtLzw. ECDSA key fingerprint is<br class="">

MD5:fc:4e:87:fb:09:82:cd:67:b0:7d:7f:c7:4b:83:b9:6c. Are you sure you want to<br class="">

continue connecting (yes/no)? The authenticity of host 'vmp609.vampire<br class="">

(10.0.21.9)' can't be established. ECDSA key fingerprint is<br class="">

SHA256:/gX6eSp/shsRboVFcUFcNCtGSfbBIWQZ/CWjA6gb17Q. ECDSA key fingerprint is<br class="">

MD5:ca:4d:58:8c:91:28:25:7b:5b:b1:0d:a3:72:a3:00:bb. Are you sure you want to<br class="">

continue connecting (yes/no)? The authenticity of host 'testnsd1.vampire<br class="">

(10.0.6.213)' can't be established. ECDSA key fingerprint is<br class="">

SHA256:WPiTtyuyzhuv+lRRpgDjLuHpyHyk/W3+c5N9SabWvnE. ECDSA key fingerprint is<br class="">

MD5:26:26:2a:bf:e4:cb:1d:a8:27:35:96:ef:b5:96:e0:29. Are you sure you want to<br class="">

continue connecting (yes/no)?<br class="">

<a href="mailto:root@vmp610.vampire" class="">root@vmp610.vampire</a><<a href="mailto:root@vmp610.vampire" class="">mailto:root@vmp610.vampire</a>>'s password:<br class="">

<a href="mailto:root@vmp610.vampire" class="">root@vmp610.vampire</a><<a href="mailto:root@vmp610.vampire" class="">mailto:root@vmp610.vampire</a>>'s password:<br class="">

<a href="mailto:root@vmp610.vampire" class="">root@vmp610.vampire</a><<a href="mailto:root@vmp610.vampire" class="">mailto:root@vmp610.vampire</a>>'s password:<br class="">

<br class="">

testnsd3.vampire:  Host key verification failed.<br class="">

mmdsh: testnsd3.vampire remote shell process had return code 255.<br class="">

vmp612.vampire:  Host key verification failed.<br class="">

mmdsh: vmp612.vampire remote shell process had return code 255.<br class="">

vmp608.vampire:  Host key verification failed.<br class="">

mmdsh: vmp608.vampire remote shell process had return code 255.<br class="">

vmp609.vampire:  Host key verification failed.<br class="">

mmdsh: vmp609.vampire remote shell process had return code 255.<br class="">

testnsd1.vampire:  Host key verification failed.<br class="">

mmdsh: testnsd1.vampire remote shell process had return code 255.<br class="">

vmp610.vampire:  Permission denied, please try again.<br class="">

vmp610.vampire:  Permission denied, please try again.<br class="">

vmp610.vampire:  Permission denied<br class="">

(publickey,gssapi-keyex,gssapi-with-mic,password). mmdsh: vmp610.vampire<br class="">

remote shell process had return code 255. mmchcluster: Command failed.<br class="">

Examine previous error messages to determine cause. /var/mmfs/gen<br class="">

root@testnsd2#<br class="">

<br class="">

I believe that part of the problem may be that there are 4 client nodes that<br class="">

were taken out of service without being removed from the cluster configuration (done by<br class="">

another SysAdmin who was in a hurry to repurpose those machines).  They’re up<br class="">

and pingable but not reachable by GPFS anymore, which I’m pretty sure is<br class="">

making things worse.<br class="">

<br class="">

Nor does Loic’s suggestion of running mmcommon work (but thanks for the<br class="">

suggestion!) … actually the mmcommon part worked, but a subsequent attempt to<br class="">

start the cluster up failed:<br class="">

<br class="">

/var/mmfs/gen<br class="">

root@testnsd2# mmstartup -a<br class="">

get file failed: Not enough CCR quorum nodes available (err 809)<br class="">

gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158<br class="">

mmstartup: Command failed. Examine previous error messages to determine cause.<br class="">

/var/mmfs/gen<br class="">

root@testnsd2#<br class="">

<br class="">

Thanks.<br class="">

<br class="">

Kevin<br class="">

<br class="">

On Sep 19, 2017, at 10:07 PM, IBM Spectrum Scale<br class="">

<<a href="mailto:scale@us.ibm.com" class="">scale@us.ibm.com</a><<a href="mailto:scale@us.ibm.com" class="">mailto:scale@us.ibm.com</a>>> wrote:<br class="">

<br class="">

<br class="">

Hi Kevin,<br class="">

<br class="">

Let me try to understand the problem you have. What do you mean by "node<br class="">

died" here? Do you mean that there is some hardware/OS issue which cannot be<br class="">

fixed, so the OS cannot come up anymore?<br class="">

<br class="">

I agree with Bob that you can try disabling CCR temporarily, restoring the<br class="">

cluster configuration, and then enabling it again.<br class="">

<br class="">

For example:<br class="">

<br class="">

1. Log in to a node which has a proper GPFS config, e.g. NodeA<br class="">

2. Shut down the daemon on all nodes in the cluster.<br class="">

3. mmchcluster --ccr-disable -p NodeA<br class="">

4. mmsdrrestore -a -p NodeA<br class="">

5. mmauth genkey propagate -N testnsd1,testnsd3<br class="">

6. mmchcluster --ccr-enable<br class="">

<br class="">

Regards, The Spectrum Scale (GPFS) team<br class="">

<br class="">

------------------------------------------------------------------------------------------------------------------<br class="">

If you feel that your question can benefit other users of Spectrum Scale<br class="">

(GPFS), then please post it to the public IBM developerWorks Forum at<br class="">

<a href="https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479" class="">https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479</a>.<br class="">

<br class="">

If your query concerns a potential software error in Spectrum Scale (GPFS)<br class="">

and you have an IBM software maintenance contract please contact<br class="">

1-800-237-5511 in the United States or your local IBM Service Center in other<br class="">

countries.<br class="">

<br class="">

The forum is informally monitored as time permits and should not be used for<br class="">

priority messages to the Spectrum Scale (GPFS) team.<br class="">

<br class="">

<graycol.gif>"Oesterlin, Robert" ---09/20/2017 07:39:55 AM---OK – I’ve run<br class="">

across this before, and it’s because of a bug (as I recall) having to do with<br class="">

CCR and<br class="">

<br class="">

From: "Oesterlin, Robert"<br class="">

<<a href="mailto:Robert.Oesterlin@nuance.com" class="">Robert.Oesterlin@nuance.com</a><<a href="mailto:Robert.Oesterlin@nuance.com" class="">mailto:Robert.Oesterlin@nuance.com</a>>> To: gpfsug<br class="">

main discussion list<br class="">

<<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">gpfsug-discuss@spectrumscale.org</a><<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">mailto:gpfsug-discuss@spectrumscale.org</a>>><br class="">

Date: 09/20/2017 07:39 AM Subject: Re: [gpfsug-discuss] CCR cluster down for<br class="">

the count? Sent by:<br class="">

<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">gpfsug-discuss-bounces@spectrumscale.org</a><<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">mailto:gpfsug-discuss-bounces@spectrumscale.org</a>><br class="">

<br class="">

________________________________<br class="">

<br class="">

<br class="">

<br class="">

OK – I’ve run across this before, and it’s because of a bug (as I recall)<br class="">

having to do with CCR and quorum. What I think you can do is set the cluster<br class="">

to non-ccr (mmchcluster --ccr-disable) with all the nodes down, bring it back<br class="">

up and then re-enable ccr.<br class="">

<br class="">

I’ll see if I can find this in one of the recent 4.2 release notes.<br class="">

<br class="">

<br class="">

Bob Oesterlin<br class="">

Sr Principal Storage Engineer, Nuance<br class="">

<br class="">

<br class="">

From:<br class="">

<<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">gpfsug-discuss-bounces@spectrumscale.org</a><<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">mailto:gpfsug-discuss-bounces@spectrumscale.org</a>>><br class="">

on behalf of "Buterbaugh, Kevin L"<br class="">

<<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">Kevin.Buterbaugh@Vanderbilt.Edu</a><<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">mailto:Kevin.Buterbaugh@Vanderbilt.Edu</a>>><br class="">

Reply-To: gpfsug main discussion list<br class="">

<<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">gpfsug-discuss@spectrumscale.org</a><<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">mailto:gpfsug-discuss@spectrumscale.org</a>>><br class="">

Date: Tuesday, September 19, 2017 at 4:03 PM To: gpfsug main discussion list<br class="">

<<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">gpfsug-discuss@spectrumscale.org</a><<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">mailto:gpfsug-discuss@spectrumscale.org</a>>><br class="">

Subject: [EXTERNAL] [gpfsug-discuss] CCR cluster down for the count?<br class="">

<br class="">

Hi All,<br class="">

<br class="">

We have a small test cluster that is CCR enabled. It only had/has 3 NSD<br class="">

servers (testnsd1, 2, and 3) and maybe 3-6 clients. testnsd3 died a while<br class="">

back. I did nothing about it at the time because it was due to be life-cycled<br class="">

as soon as I finished a couple of higher priority projects.<br class="">

<br class="">

Yesterday, testnsd1 also died, which took the whole cluster down. So now<br class="">

resolving this has become higher priority… ;-)<br class="">

<br class="">

I took two other boxes and set them up as testnsd1 and 3, respectively. I’ve<br class="">

done a “mmsdrrestore -p testnsd2 -R /usr/bin/scp” on both of them. I’ve also<br class="">

done a "mmccr setup -F” and copied the ccr.disks and ccr.nodes files from<br class="">

testnsd2 to them. And I’ve copied /var/mmfs/gen/mmsdrfs from testnsd2 to<br class="">

testnsd1 and 3. In case it’s not obvious from the above, networking is fine …<br class="">

ssh without a password between those 3 boxes is fine.<br class="">

<br class="">

However, when I try to start up GPFS … or run any GPFS command I get:<br class="">

<br class="">

/root<br class="">

root@testnsd2# mmstartup -a<br class="">

get file failed: Not enough CCR quorum nodes available (err 809)<br class="">

gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158<br class="">

mmstartup: Command failed. Examine previous error messages to determine cause.<br class="">

/root<br class="">

root@testnsd2#<br class="">

<br class="">

I’ve got to run to a meeting right now, so I hope I’m not leaving out any<br class="">

crucial details here … does anyone have an idea what I need to do? Thanks…<br class="">

<br class="">

—<br class="">

Kevin Buterbaugh - Senior System Administrator<br class="">

Vanderbilt University - Advanced Computing Center for Research and Education<br class="">

<a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" class="">Kevin.Buterbaugh@vanderbilt.edu</a><<a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" class="">mailto:Kevin.Buterbaugh@vanderbilt.edu</a>> -<br class="">

(615)875-9633<br class="">

<br class="">

<br class="">

_______________________________________________<br class="">

gpfsug-discuss mailing list<br class="">

gpfsug-discuss at <a href="http://spectrumscale.org" class="">spectrumscale.org</a><br class="">

<a href="http://gpfsug.org/mailman/listinfo/gpfsug-discuss" class="">http://gpfsug.org/mailman/listinfo/gpfsug-discuss</a><br class="">

<br class="">

<br class="">

<br class="">


<br class="">

</blockquote>

<br class="">

<br class="">

<br class="">

-- <br class="">

<br class="">

Ed Wahl<br class="">

Ohio Supercomputer Center<br class="">

614-292-9302<br class="">

</div>

</div>

</blockquote>

</div>

<br class="">

</div>

<br class="">

<br class="">

<div class="">

<div class="">—</div>

<div class="">Kevin Buterbaugh - Senior System Administrator</div>

<div class="">Vanderbilt University - Advanced Computing Center for Research and Education</div>

<div class=""><a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" class="">Kevin.Buterbaugh@vanderbilt.edu</a> - (615)875-9633</div>

<div class=""><br class="">

</div>

<br class="Apple-interchange-newline">

</div>

<br class="">

</body>

</html>