[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues

Uwe Falke UWEFALKE at de.ibm.com
Thu Jun 28 08:44:16 BST 2018


Just some ideas on what to try.
When you attempted mmdelnode, was that node still active with the IP 
address known in the cluster? If so, shut it down and try again.
Mind the restrictions of mmdelnode, though (it can't delete NSD servers).
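
One way to check whether anything still answers at the old address is a 
quick TCP probe. This is only a minimal Python sketch: the function names 
are made up for illustration, and the ports are assumptions (22 for ssh, 
1191 as the usual default GPFS daemon port); adjust them to your setup.

```python
import socket

def answers_on(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False

def still_active(host: str, ports=(22, 1191)) -> dict:
    """Map each probed port to whether something accepted a connection.

    The default ports are assumptions: 22 for ssh and 1191 as the usual
    GPFS daemon port.
    """
    return {port: answers_on(host, port) for port in ports}
```

Calling still_active() with the name of the retired node returns one 
True/False entry per port. If every entry is False, nothing is listening 
there on those ports, although a firewall can produce the same result.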

Try to fake one of the currently missing cluster nodes: either restore the 
old system backup to the reinstalled server, if available, or temporarily 
install the GPFS software there and copy over the GPFS configuration from a 
node that is still active (/var/mmfs/). Configure the admin and daemon 
interfaces of the faked node on that machine, then try to start the cluster 
and see whether it comes up with quorum. If it does, go ahead and cleanly 
de-configure whatever is needed to remove that node from the cluster 
gracefully. Once you reach quorum with the remaining nodes, you are on safe 
ground.
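
To make the quorum arithmetic concrete, here is a minimal Python sketch. It 
assumes plain node quorum, i.e. a majority of the designated quorum nodes 
and no tiebreaker disks. The function names are invented for illustration 
and are not part of GPFS.

```python
def quorum_needed(quorum_nodes: int) -> int:
    """Smallest majority of the designated quorum nodes."""
    return quorum_nodes // 2 + 1

def has_quorum(quorum_nodes: int, available: int) -> bool:
    """True if enough quorum nodes are available to form a majority."""
    return available >= quorum_needed(quorum_nodes)

# Situation as described in this thread: 3 quorum nodes are defined, one is
# out of service and one was reinstalled, which leaves 1 usable quorum node.
print(quorum_needed(3))   # 2
print(has_quorum(3, 1))   # False
# With one faked or restored quorum node back, 2 of 3 are available.
print(has_quorum(3, 2))   # True
```

With 3 quorum nodes defined, 2 must be available, which is why bringing 
back a single missing quorum node should be enough here.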


 
Mit freundlichen Grüßen / Kind regards

 
Dr. Uwe Falke
 
IT Specialist
High Performance Computing Services / Integrated Technology Services / 
Data Center Services
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland
Rathausstr. 7
09111 Chemnitz
Phone: +49 371 6978 2165
Mobile: +49 175 575 2877
E-Mail: uwefalke at de.ibm.com
-------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Business & Technology Services GmbH / Geschäftsführung: 
Thomas Wolter, Sven Schooß
Sitz der Gesellschaft: Ehningen / Registergericht: Amtsgericht Stuttgart, 
HRB 17122 




From:   Renata Maria Dart <renata at SLAC.STANFORD.EDU>
To:     Simon Thompson <S.J.Thompson at bham.ac.uk>
Cc:     gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date:   27/06/2018 21:30
Subject:        Re: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
Sent by:        gpfsug-discuss-bounces at spectrumscale.org



Hi Simon, yes I ran

mmsdrrestore -p <working node in the cluster>

and that helped to create the /var/mmfs/ccr directory, which was
missing.  But it didn't create a ccr.nodes file, so I ended up scp'ing
that over by hand, which I hope was the right thing to do.  The one
host that is no longer in service is still in that ccr.nodes file, and
when I try to mmdelnode it I get:

[root at ocio-gpu03 renata]# mmdelnode -N dhcp-os-129-164.slac.stanford.edu
mmdelnode: Unable to obtain the GPFS configuration file lock.
mmdelnode: GPFS was unable to obtain a lock from node dhcp-os-129-164.slac.stanford.edu.
mmdelnode: Command failed. Examine previous error messages to determine cause.

despite the fact that it doesn't respond to ping.  The mmstartup on
the newly reinstalled node fails as in my initial email.  I should
mention that the two "working" nodes are running 4.2.3.4.  The person
who reinstalled the node that won't start up put on 4.2.3.8.  I didn't
think that was the cause of this problem though and thought I would
try to get the cluster talking again before upgrading the rest of the
nodes or downgrading the reinstalled one.

Thanks,
Renata




On Wed, 27 Jun 2018, Simon Thompson wrote:

>Have you tried running mmsdrrestore on the reinstalled node to re-add it to
>the cluster, and then tried to start up gpfs on it?
>
>https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1pdg_mmsdrrest.htm
>
>Simon
>________________________________________
>From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Renata Maria Dart [renata at slac.stanford.edu]
>Sent: 27 June 2018 19:09
>To: gpfsug-discuss at spectrumscale.org
>Subject: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
>
>Hi, we have a client cluster of 4 nodes with 3 quorum nodes.  One of the
>quorum nodes is no longer in service and another was reinstalled with
>a newer OS, both without informing the gpfs admins.  Gpfs is still
>"working" on the two remaining nodes, that is, they continue to have access
>to the gpfs data on the remote clusters.  But I can no longer get
>any gpfs commands to work.  On one of the 2 nodes that are still serving data,
>
>[root at ocio-gpu01 ~]# mmlscluster
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
>mmlscluster: Command failed. Examine previous error messages to determine cause.
>
>
>On the reinstalled node, this fails in the same way:
>
>[root at ocio-gpu02 ccr]# mmstartup
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
>mmstartup: Command failed. Examine previous error messages to determine cause.
>
>
>I have looked through the users group interchanges but didn't find anything
>that seems to fit this scenario.
>
>Is there a way to salvage this cluster?  Can it be done without
>shutting gpfs down on the 2 nodes that continue to work?
>
>Thanks for any advice,
>
>Renata Dart
>SLAC National Accelerator Lab
>
>_______________________________________________
>gpfsug-discuss mailing list
>gpfsug-discuss at spectrumscale.org
>http://gpfsug.org/mailman/listinfo/gpfsug-discuss
