<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body style="word-wrap: break-word; -webkit-nbsp-mode: space; line-break: after-white-space;" class="">
Hi All,
<div class=""><br class="">
</div>
<div class="">Ralf Eberhard of IBM helped me resolve this off list.  The key was to temporarily make testnsd1 and testnsd3 not be quorum nodes by making sure GPFS was down and then executing:</div>
<div class=""><br class="">
</div>
<div class=""><span style="font-size: small;" class="">mmchnode --nonquorum -N testnsd1,testnsd3 --force</span><br class="">
<div><br class="">
</div>
<div>That gave me some scary messages about overriding normal GPFS quorum semantics, but once that was done I was able to run an “mmstartup -a” and bring up the cluster!  Once it was up and I had verified things were working properly, I shut it back down so that I could rerun the mmchnode command (without the --force option) to make testnsd1 and testnsd3 quorum nodes again.</div>
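<div><br class="">
</div>
<div>For the record, here is the whole sequence in one place (a sketch only; GPFS must be down everywhere before the first mmchnode):</div>
<div><br class="">
</div>
<div><span style="font-size: small;" class=""># demote the rebuilt nodes so that testnsd2 alone satisfies CCR quorum<br class="">
mmchnode --nonquorum -N testnsd1,testnsd3 --force<br class="">
# bring the cluster up and verify it<br class="">
mmstartup -a<br class="">
# then shut down and restore the original quorum configuration<br class="">
mmshutdown -a<br class="">
mmchnode --quorum -N testnsd1,testnsd3<br class="">
mmstartup -a</span></div>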
<div><br class="">
</div>
<div>Thanks to all who helped me out here…</div>
<div><br class="">
</div>
<div>Kevin</div>
<div><br class="">
</div>
<div>
<blockquote type="cite" class="">
<div class="">On Sep 20, 2017, at 2:07 PM, Edward Wahl <<a href="mailto:ewahl@osc.edu" class="">ewahl@osc.edu</a>> wrote:</div>
<br class="Apple-interchange-newline">
<div class="">
<div class=""><br class="">
So who was the ccrmaster before? <br class="">
What is/was the quorum config?  (tiebreaker disks?) <br class="">
<br class="">
What does 'mmccr check' say?<br class="">
<br class="">
<br class="">
Have you set DEBUG=1 and tried mmstartup to see if it teases out any more info<br class="">
from the error?<br class="">
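<br class="">
e.g. something like this (if I remember the mechanics right, the mm commands honor it as an environment variable):<br class="">
<br class="">
DEBUG=1 mmstartup -a<br class="">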
<br class="">
<br class="">
Ed<br class="">
<br class="">
<br class="">
On Wed, 20 Sep 2017 16:27:48 +0000<br class="">
"Buterbaugh, Kevin L" <<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">Kevin.Buterbaugh@Vanderbilt.Edu</a>> wrote:<br class="">
<br class="">
<blockquote type="cite" class="">Hi Ed,<br class="">
<br class="">
Thanks for the suggestion … that’s basically what I had done yesterday after<br class="">
Googling and getting a hit or two on the IBM DeveloperWorks site.  I’m<br class="">
including some output below which seems to show that I’ve got everything set<br class="">
up but it’s still not working.<br class="">
<br class="">
Am I missing something?  We don’t use CCR on our production cluster (and this<br class="">
experience doesn’t make me eager to do so!), so I’m not that familiar with<br class="">
it...<br class="">
<br class="">
Kevin<br class="">
<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmdsh -F /tmp/cluster.hostnames "ps -ef | grep mmccr | grep -v grep" | sort<br class="">
testdellnode1:  root      2583     1  0 May30 ?  00:10:33 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testdellnode1:  root      6694  2583  0 11:19 ?  00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testgateway:  root      2023  5828  0 11:19 ?  00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testgateway:  root      5828     1  0 Sep18 ?  00:00:19 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testnsd1:  root     19356  4628  0 11:19 tty1  00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testnsd1:  root      4628     1  0 Sep19 tty1  00:00:04 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testnsd2:  root     22149  2983  0 11:16 ?  00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testnsd2:  root      2983     1  0 Sep18 ?  00:00:27 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testnsd3:  root     15685  6557  0 11:19 ?  00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testnsd3:  root      6557     1  0 Sep19 ?  00:00:04 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testsched:  root     29424  6512  0 11:19 ?  00:00:00 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
testsched:  root      6512     1  0 Sep18 ?  00:00:20 /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmstartup -a<br class="">
get file failed: Not enough CCR quorum nodes available (err 809)<br class="">
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158<br class="">
mmstartup: Command failed. Examine previous error messages to determine cause.<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmdsh -F /tmp/cluster.hostnames "ls -l /var/mmfs/ccr" | sort<br class="">
testdellnode1:  drwxr-xr-x 2 root root 4096 Mar  3  2017 cached<br class="">
testdellnode1:  drwxr-xr-x 2 root root 4096 Nov 10  2016 committed<br class="">
testdellnode1:  -rw-r--r-- 1 root root   99 Nov 10  2016 ccr.nodes<br class="">
testdellnode1:  total 12<br class="">
testgateway:  drwxr-xr-x. 2 root root 4096 Jun 29  2016 committed<br class="">
testgateway:  drwxr-xr-x. 2 root root 4096 Mar  3  2017 cached<br class="">
testgateway:  -rw-r--r--. 1 root root   99 Jun 29  2016 ccr.nodes<br class="">
testgateway:  total 12<br class="">
testnsd1:  drwxr-xr-x 2 root root  6 Sep 19 15:38 cached<br class="">
testnsd1:  drwxr-xr-x 2 root root  6 Sep 19 15:38 committed<br class="">
testnsd1:  -rw-r--r-- 1 root root  0 Sep 19 15:39 ccr.disks<br class="">
testnsd1:  -rw-r--r-- 1 root root  4 Sep 19 15:38 ccr.noauth<br class="">
testnsd1:  -rw-r--r-- 1 root root 99 Sep 19 15:39 ccr.nodes<br class="">
testnsd1:  total 8<br class="">
testnsd2:  drwxr-xr-x 2 root root   22 Mar  3  2017 cached<br class="">
testnsd2:  drwxr-xr-x 2 root root 4096 Sep 18 11:49 committed<br class="">
testnsd2:  -rw------- 1 root root 4096 Sep 18 11:50 ccr.paxos.1<br class="">
testnsd2:  -rw------- 1 root root 4096 Sep 18 11:50 ccr.paxos.2<br class="">
testnsd2:  -rw-r--r-- 1 root root    0 Jun 29  2016 ccr.disks<br class="">
testnsd2:  -rw-r--r-- 1 root root   99 Jun 29  2016 ccr.nodes<br class="">
testnsd2:  total 16<br class="">
testnsd3:  drwxr-xr-x 2 root root  6 Sep 19 15:41 cached<br class="">
testnsd3:  drwxr-xr-x 2 root root  6 Sep 19 15:41 committed<br class="">
testnsd3:  -rw-r--r-- 1 root root  0 Jun 29  2016 ccr.disks<br class="">
testnsd3:  -rw-r--r-- 1 root root  4 Sep 19 15:41 ccr.noauth<br class="">
testnsd3:  -rw-r--r-- 1 root root 99 Jun 29  2016 ccr.nodes<br class="">
testnsd3:  total 8<br class="">
testsched:  drwxr-xr-x. 2 root root 4096 Jun 29  2016 committed<br class="">
testsched:  drwxr-xr-x. 2 root root 4096 Mar  3  2017 cached<br class="">
testsched:  -rw-r--r--. 1 root root   99 Jun 29  2016 ccr.nodes<br class="">
testsched:  total 12<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# more ../ccr/ccr.nodes<br class="">
3,0,10.0.6.215,,testnsd3.vampire<br class="">
1,0,10.0.6.213,,testnsd1.vampire<br class="">
2,0,10.0.6.214,,testnsd2.vampire<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmdsh -F /tmp/cluster.hostnames "ls -l /var/mmfs/gen/mmsdrfs"<br class="">
testnsd1:  -rw-r--r-- 1 root root 20360 Sep 19 15:21 /var/mmfs/gen/mmsdrfs<br class="">
testnsd3:  -rw-r--r-- 1 root root 20360 Sep 19 15:34 /var/mmfs/gen/mmsdrfs<br class="">
testnsd2:  -rw-r--r-- 1 root root 20360 Aug 25 17:34 /var/mmfs/gen/mmsdrfs<br class="">
testdellnode1:  -rw-r--r-- 1 root root 20360 Aug 25 17:43 /var/mmfs/gen/mmsdrfs<br class="">
testgateway:  -rw-r--r--. 1 root root 20360 Aug 25 17:43 /var/mmfs/gen/mmsdrfs<br class="">
testsched:  -rw-r--r--. 1 root root 20360 Aug 25 17:43 /var/mmfs/gen/mmsdrfs<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmdsh -F /tmp/cluster.hostnames "md5sum /var/mmfs/gen/mmsdrfs"<br class="">
testnsd1:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs<br class="">
testnsd3:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs<br class="">
testnsd2:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs<br class="">
testdellnode1:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs<br class="">
testgateway:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs<br class="">
testsched:  7120c79d9d767466c7629763abb7f730  /var/mmfs/gen/mmsdrfs<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmdsh -F /tmp/cluster.hostnames "md5sum /var/mmfs/ssl/stage/genkeyData1"<br class="">
testnsd3:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1<br class="">
testnsd1:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1<br class="">
testnsd2:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1<br class="">
testdellnode1:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1<br class="">
testgateway:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1<br class="">
testsched:  ee6d345a87202a9f9d613e4862c92811  /var/mmfs/ssl/stage/genkeyData1<br class="">
/var/mmfs/gen<br class="">
root@testnsd2#<br class="">
<br class="">
On Sep 20, 2017, at 10:48 AM, Edward Wahl<br class="">
<<a href="mailto:ewahl@osc.edu" class="">ewahl@osc.edu</a><<a href="mailto:ewahl@osc.edu" class="">mailto:ewahl@osc.edu</a>>> wrote:<br class="">
<br class="">
I've run into this before.  We didn't use to use CCR.  And restoring nodes for<br class="">
us is a major pain in the rear as we only allow one-way root SSH, so we have a<br class="">
number of useful little scripts to work around problems like this.<br class="">
<br class="">
Assuming that you have all the necessary files copied to the correct<br class="">
places, you can manually kick off CCR.<br class="">
<br class="">
I think my script does something like:<br class="">
<br class="">
(copy the encryption key info)<br class="">
<br class="">
scp  /var/mmfs/ccr/ccr.nodes <node>:/var/mmfs/ccr/<br class="">
<br class="">
scp /var/mmfs/gen/mmsdrfs <node>:/var/mmfs/gen/<br class="">
<br class="">
scp /var/mmfs/ssl/stage/genkeyData1  <node>:/var/mmfs/ssl/stage/<br class="">
<br class="">
ssh <node> /usr/lpp/mmfs/bin/mmcommon startCcrMonitor<br class="">
<br class="">
You should then see two copies of it running under mmksh.<br class="">
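<br class="">
A quick way to check (same pattern as your mmdsh output above):<br class="">
<br class="">
ssh <node> "ps -ef | grep mmccrmonitor | grep -v grep"<br class="">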
<br class="">
Ed<br class="">
<br class="">
<br class="">
On Wed, 20 Sep 2017 13:55:28 +0000<br class="">
"Buterbaugh, Kevin L"<br class="">
<<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">Kevin.Buterbaugh@Vanderbilt.Edu</a><<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">mailto:Kevin.Buterbaugh@Vanderbilt.Edu</a>>><br class="">
wrote:<br class="">
<br class="">
Hi All,<br class="">
<br class="">
testnsd1 and testnsd3 both had hardware issues (power supply and internal HD<br class="">
respectively).  Given that they were 12-year-old boxes, we decided to replace<br class="">
them with other boxes that are a mere 7 years old … keep in mind that this is<br class="">
a test cluster.<br class="">
<br class="">
Disabling CCR does not work, even with the undocumented “--force” option:<br class="">
<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmchcluster --ccr-disable -p testnsd2 -s testnsd1 --force<br class="">
mmchcluster: Unable to obtain the GPFS configuration file lock.<br class="">
mmchcluster: GPFS was unable to obtain a lock from node testnsd1.vampire.<br class="">
mmchcluster: Processing continues without lock protection.<br class="">
The authenticity of host 'testnsd3.vampire (10.0.6.215)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:Ky1pkjsC/kvt4RA8PJuEh/W3vcxCJZplr2m1XHr+UwI.<br class="">
ECDSA key fingerprint is MD5:55:59:a0:2a:6e:a1:00:58:85:3d:ac:86:0e:cd:2a:8a.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'testnsd1.vampire (10.0.6.213)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:WPiTtyuyzhuv+lRRpgDjLuHpyHyk/W3+c5N9SabWvnE.<br class="">
ECDSA key fingerprint is MD5:26:26:2a:bf:e4:cb:1d:a8:27:35:96:ef:b5:96:e0:29.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'vmp609.vampire (10.0.21.9)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:/gX6eSp/shsRboVFcUFcNCtGSfbBIWQZ/CWjA6gb17Q.<br class="">
ECDSA key fingerprint is MD5:ca:4d:58:8c:91:28:25:7b:5b:b1:0d:a3:72:a3:00:bb.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'vmp608.vampire (10.0.21.8)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:tvtNWN9b7/Qknb/Am8x7FzyMngi6R3f5SHBqATNtLzw.<br class="">
ECDSA key fingerprint is MD5:fc:4e:87:fb:09:82:cd:67:b0:7d:7f:c7:4b:83:b9:6c.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'vmp612.vampire (10.0.21.12)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:zKXqPt8rIMZWSAYavKEuaAVIm31OGVovoWVU+dBTRPM.<br class="">
ECDSA key fingerprint is MD5:72:4d:fb:22:4e:b3:0e:04:37:be:16:74:ae:ea:05:6c.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
root@vmp610.vampire's password:<br class="">
testnsd3.vampire:  Host key verification failed.<br class="">
mmdsh: testnsd3.vampire remote shell process had return code 255.<br class="">
testnsd1.vampire:  Host key verification failed.<br class="">
mmdsh: testnsd1.vampire remote shell process had return code 255.<br class="">
vmp609.vampire:  Host key verification failed.<br class="">
mmdsh: vmp609.vampire remote shell process had return code 255.<br class="">
vmp608.vampire:  Host key verification failed.<br class="">
mmdsh: vmp608.vampire remote shell process had return code 255.<br class="">
vmp612.vampire:  Host key verification failed.<br class="">
mmdsh: vmp612.vampire remote shell process had return code 255.<br class="">
<br class="">
root@vmp610.vampire's password: vmp610.vampire: Permission denied, please try again.<br class="">
<br class="">
root@vmp610.vampire's password: vmp610.vampire: Permission denied, please try again.<br class="">
<br class="">
vmp610.vampire:  Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).<br class="">
mmdsh: vmp610.vampire remote shell process had return code 255.<br class="">
<br class="">
Verifying GPFS is stopped on all nodes ...<br class="">
The authenticity of host 'testnsd3.vampire (10.0.6.215)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:Ky1pkjsC/kvt4RA8PJuEh/W3vcxCJZplr2m1XHr+UwI.<br class="">
ECDSA key fingerprint is MD5:55:59:a0:2a:6e:a1:00:58:85:3d:ac:86:0e:cd:2a:8a.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'vmp612.vampire (10.0.21.12)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:zKXqPt8rIMZWSAYavKEuaAVIm31OGVovoWVU+dBTRPM.<br class="">
ECDSA key fingerprint is MD5:72:4d:fb:22:4e:b3:0e:04:37:be:16:74:ae:ea:05:6c.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'vmp608.vampire (10.0.21.8)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:tvtNWN9b7/Qknb/Am8x7FzyMngi6R3f5SHBqATNtLzw.<br class="">
ECDSA key fingerprint is MD5:fc:4e:87:fb:09:82:cd:67:b0:7d:7f:c7:4b:83:b9:6c.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'vmp609.vampire (10.0.21.9)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:/gX6eSp/shsRboVFcUFcNCtGSfbBIWQZ/CWjA6gb17Q.<br class="">
ECDSA key fingerprint is MD5:ca:4d:58:8c:91:28:25:7b:5b:b1:0d:a3:72:a3:00:bb.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
The authenticity of host 'testnsd1.vampire (10.0.6.213)' can't be established.<br class="">
ECDSA key fingerprint is SHA256:WPiTtyuyzhuv+lRRpgDjLuHpyHyk/W3+c5N9SabWvnE.<br class="">
ECDSA key fingerprint is MD5:26:26:2a:bf:e4:cb:1d:a8:27:35:96:ef:b5:96:e0:29.<br class="">
Are you sure you want to continue connecting (yes/no)?<br class="">
root@vmp610.vampire's password:<br class="">
root@vmp610.vampire's password:<br class="">
root@vmp610.vampire's password:<br class="">
<br class="">
testnsd3.vampire:  Host key verification failed.<br class="">
mmdsh: testnsd3.vampire remote shell process had return code 255.<br class="">
vmp612.vampire:  Host key verification failed.<br class="">
mmdsh: vmp612.vampire remote shell process had return code 255.<br class="">
vmp608.vampire:  Host key verification failed.<br class="">
mmdsh: vmp608.vampire remote shell process had return code 255.<br class="">
vmp609.vampire:  Host key verification failed.<br class="">
mmdsh: vmp609.vampire remote shell process had return code 255.<br class="">
testnsd1.vampire:  Host key verification failed.<br class="">
mmdsh: testnsd1.vampire remote shell process had return code 255.<br class="">
vmp610.vampire:  Permission denied, please try again.<br class="">
vmp610.vampire:  Permission denied, please try again.<br class="">
vmp610.vampire:  Permission denied (publickey,gssapi-keyex,gssapi-with-mic,password).<br class="">
mmdsh: vmp610.vampire remote shell process had return code 255.<br class="">
mmchcluster: Command failed. Examine previous error messages to determine cause.<br class="">
/var/mmfs/gen<br class="">
root@testnsd2#<br class="">
<br class="">
I believe that part of the problem may be that there are 4 client nodes that<br class="">
were repurposed without first being removed from the cluster configuration<br class="">
(done by another SysAdmin who was in a hurry).  They’re up and pingable but<br class="">
no longer reachable by GPFS, which I’m pretty sure is making things worse.<br class="">
<br class="">
Loic’s suggestion of running mmcommon didn’t fix things either (but thanks for<br class="">
the suggestion!) … actually the mmcommon part itself worked, but a subsequent<br class="">
attempt to start the cluster up failed:<br class="">
<br class="">
/var/mmfs/gen<br class="">
root@testnsd2# mmstartup -a<br class="">
get file failed: Not enough CCR quorum nodes available (err 809)<br class="">
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158<br class="">
mmstartup: Command failed. Examine previous error messages to determine cause.<br class="">
/var/mmfs/gen<br class="">
root@testnsd2#<br class="">
<br class="">
Thanks.<br class="">
<br class="">
Kevin<br class="">
<br class="">
On Sep 19, 2017, at 10:07 PM, IBM Spectrum Scale<br class="">
<<a href="mailto:scale@us.ibm.com" class="">scale@us.ibm.com</a><<a href="mailto:scale@us.ibm.com" class="">mailto:scale@us.ibm.com</a>><<a href="mailto:scale@us.ibm.com" class="">mailto:scale@us.ibm.com</a>>> wrote:<br class="">
<br class="">
<br class="">
Hi Kevin,<br class="">
<br class="">
Let me try to understand the problem you have.  What is the meaning of "node<br class="">
died" here?  Do you mean that there is a hardware/OS issue which cannot be<br class="">
fixed, so the OS cannot come up anymore?<br class="">
<br class="">
I agree with Bob that you can try to disable CCR temporarily, restore the<br class="">
cluster configuration, and enable it again.<br class="">
<br class="">
Such as:<br class="">
<br class="">
1. Log in to a node which has a proper GPFS config, e.g. NodeA<br class="">
2. Shut down the daemon on all nodes in the cluster.<br class="">
3. mmchcluster --ccr-disable -p NodeA<br class="">
4. mmsdrrestore -a -p NodeA<br class="">
5. mmauth genkey propagate -N testnsd1,testnsd3<br class="">
6. mmchcluster --ccr-enable<br class="">
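<br class="">
In shell terms, something like this (NodeA is a placeholder for the node with the intact configuration):<br class="">
<br class="">
mmshutdown -a                                   # step 2: stop the daemon on all nodes<br class="">
mmchcluster --ccr-disable -p NodeA              # step 3: fall back to server-based configuration<br class="">
mmsdrrestore -a -p NodeA                        # step 4: push NodeA's mmsdrfs out to all nodes<br class="">
mmauth genkey propagate -N testnsd1,testnsd3    # step 5: redistribute the SSL key files<br class="">
mmchcluster --ccr-enable                        # step 6: turn CCR back on<br class="">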
<br class="">
Regards, The Spectrum Scale (GPFS) team<br class="">
<br class="">
------------------------------------------------------------------------------------------------------------------<br class="">
If you feel that your question can benefit other users of Spectrum Scale<br class="">
(GPFS), then please post it to the public IBM developerWorks Forum at<br class="">
<a href="https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fcommunity%2Fforums%2Fhtml%2Fforum%3Fid%3D11111111-0000-0000-0000-000000000479&data=02%7C01" class="">https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fcommunity%2Fforums%2Fhtml%2Fforum%3Fid%3D11111111-0000-0000-0000-000000000479&data=02%7C01</a>%7CKevin.Buterbaugh%40Vanderbilt.Edu%7C745cfeaac7264124bb8c08d5003f162a%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636415193316350738&sdata=8OL9COHsb4M%2BZOyWta92acdO8K1Ez8HJfHbrCdDsmRs%3D&reserved=0<<a href="https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fcommunity%2Fforums%2Fhtml%2Fforum%3Fid%3D11111111-0000-0000-0000-000000000479&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C494f0469ec084568b39608d4ffd4b8c2%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636414736486816768&sdata=rDOjWbVnVsp5M75VorQgDtZhxMrgvwIgV%2BReJgt5ZUs%3D&reserved=0" class="">https://na01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.ibm.com%2Fdeveloperworks%2Fcommunity%2Fforums%2Fhtml%2Fforum%3Fid%3D11111111-0000-0000-0000-000000000479&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7C494f0469ec084568b39608d4ffd4b8c2%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636414736486816768&sdata=rDOjWbVnVsp5M75VorQgDtZhxMrgvwIgV%2BReJgt5ZUs%3D&reserved=0</a>>.<br class="">
<br class="">
If your query concerns a potential software error in Spectrum Scale (GPFS)<br class="">
and you have an IBM software maintenance contract please contact<br class="">
1-800-237-5511 in the United States or your local IBM Service Center in other<br class="">
countries.<br class="">
<br class="">
The forum is informally monitored as time permits and should not be used for<br class="">
priority messages to the Spectrum Scale (GPFS) team.<br class="">
<br class="">
<br class="">
From: "Oesterlin, Robert"<br class="">
<<a href="mailto:Robert.Oesterlin@nuance.com" class="">Robert.Oesterlin@nuance.com</a><<a href="mailto:Robert.Oesterlin@nuance.com" class="">mailto:Robert.Oesterlin@nuance.com</a>><<a href="mailto:Robert.Oesterlin@nuance.com" class="">mailto:Robert.Oesterlin@nuance.com</a>>><br class="">
To: gpfsug main discussion list<br class="">
<<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">gpfsug-discuss@spectrumscale.org</a><<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">mailto:gpfsug-discuss@spectrumscale.org</a>><<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">mailto:gpfsug-discuss@spectrumscale.org</a>>><br class="">
Date: 09/20/2017 07:39 AM Subject: Re: [gpfsug-discuss] CCR cluster down for<br class="">
the count? Sent by:<br class="">
<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">gpfsug-discuss-bounces@spectrumscale.org</a><<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">mailto:gpfsug-discuss-bounces@spectrumscale.org</a>><<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">mailto:gpfsug-discuss-bounces@spectrumscale.org</a>><br class="">
<br class="">
________________________________<br class="">
<br class="">
<br class="">
<br class="">
OK – I’ve run across this before, and it’s because of a bug (as I recall)<br class="">
having to do with CCR and quorum. What I think you can do is set the cluster<br class="">
to non-CCR (mmchcluster --ccr-disable) with all the nodes down, bring it back<br class="">
up, and then re-enable CCR.<br class="">
<br class="">
I’ll see if I can find this in one of the recent 4.2 release notes.<br class="">
<br class="">
<br class="">
Bob Oesterlin<br class="">
Sr Principal Storage Engineer, Nuance<br class="">
<br class="">
<br class="">
From: <<a href="mailto:gpfsug-discuss-bounces@spectrumscale.org" class="">gpfsug-discuss-bounces@spectrumscale.org</a>><br class="">
on behalf of "Buterbaugh, Kevin L" <<a href="mailto:Kevin.Buterbaugh@Vanderbilt.Edu" class="">Kevin.Buterbaugh@Vanderbilt.Edu</a>><br class="">
Reply-To: gpfsug main discussion list <<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">gpfsug-discuss@spectrumscale.org</a>><br class="">
Date: Tuesday, September 19, 2017 at 4:03 PM<br class="">
To: gpfsug main discussion list <<a href="mailto:gpfsug-discuss@spectrumscale.org" class="">gpfsug-discuss@spectrumscale.org</a>><br class="">
Subject: [EXTERNAL] [gpfsug-discuss] CCR cluster down for the count?<br class="">
<br class="">
Hi All,<br class="">
<br class="">
We have a small test cluster that is CCR enabled. It only had/has 3 NSD<br class="">
servers (testnsd1, 2, and 3) and maybe 3-6 clients. testnsd3 died a while<br class="">
back. I did nothing about it at the time because it was due to be life-cycled<br class="">
as soon as I finished a couple of higher priority projects.<br class="">
<br class="">
Yesterday, testnsd1 also died, which took the whole cluster down. So now<br class="">
resolving this has become higher priority… ;-)<br class="">
<br class="">
I took two other boxes and set them up as testnsd1 and 3, respectively. I’ve<br class="">
done a “mmsdrrestore -p testnsd2 -R /usr/bin/scp” on both of them. I’ve also<br class="">
done a "mmccr setup -F” and copied the ccr.disks and ccr.nodes files from<br class="">
testnsd2 to them. And I’ve copied /var/mmfs/gen/mmsdrfs from testnsd2 to<br class="">
testnsd1 and 3. In case it’s not obvious from the above, networking is fine …<br class="">
ssh without a password between those 3 boxes is fine.<br class="">
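<br class="">
In other words, on each replacement node, roughly:<br class="">
<br class="">
mmsdrrestore -p testnsd2 -R /usr/bin/scp<br class="">
mmccr setup -F<br class="">
scp testnsd2:/var/mmfs/ccr/ccr.disks /var/mmfs/ccr/<br class="">
scp testnsd2:/var/mmfs/ccr/ccr.nodes /var/mmfs/ccr/<br class="">
scp testnsd2:/var/mmfs/gen/mmsdrfs /var/mmfs/gen/<br class="">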
<br class="">
However, when I try to start up GPFS … or run any GPFS command … I get:<br class="">
<br class="">
/root<br class="">
root@testnsd2# mmstartup -a<br class="">
get file failed: Not enough CCR quorum nodes available (err 809)<br class="">
gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158<br class="">
mmstartup: Command failed. Examine previous error messages to determine cause.<br class="">
/root<br class="">
root@testnsd2#<br class="">
<br class="">
I’ve got to run to a meeting right now, so I hope I’m not leaving out any<br class="">
crucial details here … does anyone have an idea what I need to do? Thanks…<br class="">
<br class="">
—<br class="">
Kevin Buterbaugh - Senior System Administrator<br class="">
Vanderbilt University - Advanced Computing Center for Research and Education<br class="">
<a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" class="">Kevin.Buterbaugh@vanderbilt.edu</a><<a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" class="">mailto:Kevin.Buterbaugh@vanderbilt.edu</a>><<a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" class="">mailto:Kevin.Buterbaugh@vanderbilt.edu</a>><br class="">
- (615)875-9633<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
<br class="">
--<br class="">
<br class="">
Ed Wahl<br class="">
Ohio Supercomputer Center<br class="">
614-292-9302<br class="">
<br class="">
<br class="">
<br class="">
—<br class="">
Kevin Buterbaugh - Senior System Administrator<br class="">
Vanderbilt University - Advanced Computing Center for Research and Education<br class="">
Kevin.Buterbaugh@vanderbilt.edu -<br class="">
(615)875-9633<br class="">
<br class="">
<br class="">
<br class="">
</blockquote>
<br class="">
<br class="">
<br class="">
-- <br class="">
<br class="">
Ed Wahl<br class="">
Ohio Supercomputer Center<br class="">
614-292-9302<br class="">
_______________________________________________<br class="">
gpfsug-discuss mailing list<br class="">
gpfsug-discuss at <a href="http://spectrumscale.org" class="">spectrumscale.org</a><br class="">
<a href="https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Cfabfdb4659d249e2d20308d5005ae1ab%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636415312700069585&sdata=Z59ik0w%2BaK6bV2JsDxSNt%2FsqwR1ESuqkXTQVBlRjDgw%3D&reserved=0" class="">https://na01.safelinks.protection.outlook.com/?url=http%3A%2F%2Fgpfsug.org%2Fmailman%2Flistinfo%2Fgpfsug-discuss&data=02%7C01%7CKevin.Buterbaugh%40vanderbilt.edu%7Cfabfdb4659d249e2d20308d5005ae1ab%7Cba5a7f39e3be4ab3b45067fa80faecad%7C0%7C0%7C636415312700069585&sdata=Z59ik0w%2BaK6bV2JsDxSNt%2FsqwR1ESuqkXTQVBlRjDgw%3D&reserved=0</a><br class="">
</div>
</div>
</blockquote>
</div>
<br class="">
</div>
<br class="">
<br class="">
<div class="">
<div class="">—</div>
<div class="">Kevin Buterbaugh - Senior System Administrator</div>
<div class="">Vanderbilt University - Advanced Computing Center for Research and Education</div>
<div class=""><a href="mailto:Kevin.Buterbaugh@vanderbilt.edu" class="">Kevin.Buterbaugh@vanderbilt.edu</a> - (615)875-9633</div>
<div class=""><br class="">
</div>
<br class="Apple-interchange-newline">
</div>
<br class="">
</body>
</html>