[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
Uwe Falke
UWEFALKE at de.ibm.com
Thu Jun 28 08:44:16 BST 2018
Just some ideas on what to try.
When you attempted mmdelnode, was that node still active with the IP
address known in the cluster? If so, shut it down and try again.
Mind the restrictions of mmdelnode, though (it can't delete NSD servers).
Try to fake one of the currently missing cluster nodes, or restore the old
system backup to the reinstalled server, if available, or temporarily
install the GPFS software there and copy over the GPFS configuration
(/var/mmfs/) from a node that is still active. Configure the admin and
daemon interfaces of the faked node on that machine, then try to start the
cluster and see whether it comes up with quorum. If it does, go ahead and
cleanly de-configure whatever is needed to remove that node from the
cluster gracefully. Once you reach quorum with the remaining nodes, you
are in a safe area.
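
To illustrate why bringing back a single quorum node should be enough, here is a minimal sketch of the arithmetic. It assumes quorum means a strict majority of the designated quorum nodes, and that only 1 of the 3 quorum nodes is currently reachable, which is how I read the thread. The function names quorum_needed and has_quorum are made up for this example and are not GPFS commands.

```shell
#!/usr/bin/env bash
# Sketch only: assumes quorum is a strict majority of the designated
# quorum nodes. Function names are hypothetical, not part of GPFS.

# quorum_needed N -> smallest number of nodes forming a majority of N
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

# has_quorum AVAILABLE TOTAL -> exit status 0 if AVAILABLE is a majority of TOTAL
has_quorum() {
    [ "$1" -ge "$(quorum_needed "$2")" ]
}

# Assumed current state: 3 quorum nodes defined, 1 still reachable.
if has_quorum 1 3; then echo "1 of 3: quorum"; else echo "1 of 3: no quorum"; fi

# After one more quorum node is back (faked or restored): 2 reachable.
if has_quorum 2 3; then echo "2 of 3: quorum"; else echo "2 of 3: no quorum"; fi
```

Under those assumptions, one additional quorum node (faked or restored) takes the cluster from 1 of 3 to 2 of 3, which is a majority.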
Kind regards
Dr. Uwe Falke
IT Specialist
High Performance Computing Services / Integrated Technology Services /
Data Center Services
From: Renata Maria Dart <renata at SLAC.STANFORD.EDU>
To: Simon Thompson <S.J.Thompson at bham.ac.uk>
Cc: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: 27/06/2018 21:30
Subject: Re: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Hi Simon, yes, I ran
mmsdrrestore -p <working node in the cluster>
and that created the /var/mmfs/ccr directory, which was
missing. But it didn't create a ccr.nodes file, so I ended up copying
that over by hand with scp, which I hope was the right thing to do. The one
host that is no longer in service is still in that ccr.nodes file, and
when I try to mmdelnode it I get:
[root@ocio-gpu03 renata]# mmdelnode -N dhcp-os-129-164.slac.stanford.edu
mmdelnode: Unable to obtain the GPFS configuration file lock.
mmdelnode: GPFS was unable to obtain a lock from node dhcp-os-129-164.slac.stanford.edu.
mmdelnode: Command failed. Examine previous error messages to determine cause.
even though that host doesn't respond to ping. The mmstartup on
the newly reinstalled node fails as in my initial email. I should
mention that the two "working" nodes are running 4.2.3.4. The person
who reinstalled the node that won't start up put on 4.2.3.8. I didn't
think that was the cause of this problem, though, and thought I would
try to get the cluster talking again before upgrading the rest of the
nodes or downgrading the reinstalled one.
Thanks,
Renata
On Wed, 27 Jun 2018, Simon Thompson wrote:
>Have you tried running mmsdrrestore on the reinstalled node to re-add it to
>the cluster and then try to start up gpfs on it?
>
>https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1pdg_mmsdrrest.htm
>
>Simon
>________________________________________
>From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Renata Maria Dart [renata at slac.stanford.edu]
>Sent: 27 June 2018 19:09
>To: gpfsug-discuss at spectrumscale.org
>Subject: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
>
>Hi, we have a client cluster of 4 nodes with 3 quorum nodes. One of the
>quorum nodes is no longer in service and the other was reinstalled with
>a newer OS, both without informing the gpfs admins. Gpfs is still
>"working" on the two remaining nodes, that is, they continue to have access
>to the gpfs data on the remote clusters. But, I can no longer get
>any gpfs commands to work. On one of the 2 nodes that are still serving data,
>
>[root@ocio-gpu01 ~]# mmlscluster
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
>mmlscluster: Command failed. Examine previous error messages to determine cause.
>
>
>On the reinstalled node, this fails in the same way:
>
>[root@ocio-gpu02 ccr]# mmstartup
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs. Return code: 158
>mmstartup: Command failed. Examine previous error messages to determine cause.
>
>
>I have looked through the users group interchanges but didn't find anything
>that seems to fit this scenario.
>
>Is there a way to salvage this cluster? Can it be done without
>shutting gpfs down on the 2 nodes that continue to work?
>
>Thanks for any advice,
>
>Renata Dart
>SLAC National Accelerator Lab
>
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>