[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues

Wed Jun 27 19:54:47 BST 2018

Hi Simon, yes I ran

mmsdrrestore -p <working node in the cluster>

and that helped to create the /var/mmfs/ccr directory, which was
missing.  But it didn't create a ccr.nodes file, so I ended up copying
that over by hand with scp, which I hope was the right thing to do.  The
one host that is no longer in service is still in that ccr.nodes file,
and when I try to mmdelnode it I get:

[root@ocio-gpu03 renata]# mmdelnode -N dhcp-os-129-164.slac.stanford.edu
mmdelnode: Unable to obtain the GPFS configuration file lock.
mmdelnode: GPFS was unable to obtain a lock from node dhcp-os-129-164.slac.stanford.edu.
mmdelnode: Command failed. Examine previous error messages to determine cause.

even though that host doesn't respond to ping.  The mmstartup on
the newly reinstalled node fails as in my initial email.  I should
mention that the two "working" nodes are running 4.2.3.4.  The person
who reinstalled the node that won't start up put on 4.2.3.8.  I didn't
think that was the cause of this problem, though, and thought I would
try to get the cluster talking again before upgrading the rest of the
nodes or downgrading the reinstalled one.
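
For reference, the "Not enough CCR quorum nodes available" error from
the initial email is consistent with simple majority arithmetic.  The
sketch below is only an illustration: it assumes CCR needs a strict
majority of the defined quorum nodes, and that only one of the three
quorum nodes still has usable CCR state (my reading of the situation,
not something the commands reported).  The function name quorum_needed
is made up for this example.

```shell
#!/bin/sh
# Assumption: a strict majority of quorum nodes is required, i.e. floor(N/2) + 1.
quorum_needed() {
    echo $(( $1 / 2 + 1 ))
}

total=3       # quorum nodes defined in the client cluster
reachable=1   # assumption: one quorum node gone, one reinstalled without CCR state
needed=$(quorum_needed "$total")

if [ "$reachable" -ge "$needed" ]; then
    echo "quorum ok: $reachable of $total reachable, $needed needed"
else
    echo "quorum lost: $reachable of $total reachable, $needed needed"
fi
```

With three quorum nodes the majority is two, so a single surviving
quorum node cannot satisfy it on its own.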

Thanks,
Renata

On Wed, 27 Jun 2018, Simon Thompson wrote:

>Have you tried running mmsdrrestore on the reinstalled node to re-add it to the cluster, and then tried starting up GPFS on it?
>
>https://www.ibm.com/support/knowledgecenter/STXKQY_4.2.3/com.ibm.spectrum.scale.v4r23.doc/bl1pdg_mmsdrrest.htm
>
>Simon
>________________________________________
>From: gpfsug-discuss-bounces at spectrumscale.org [gpfsug-discuss-bounces at spectrumscale.org] on behalf of Renata Maria Dart [renata at slac.stanford.edu]
>Sent: 27 June 2018 19:09
>To: gpfsug-discuss at spectrumscale.org
>Subject: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
>
>Hi, we have a client cluster of 4 nodes with 3 quorum nodes.  One of the
>quorum nodes is no longer in service and another was reinstalled with
>a newer OS, both without informing the GPFS admins.  GPFS is still
>"working" on the two remaining nodes, that is, they continue to have access
>to the GPFS data on the remote clusters.  But I can no longer get
>any GPFS commands to work.  On one of the 2 nodes that are still serving data,
>
>[root@ocio-gpu01 ~]# mmlscluster
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
>mmlscluster: Command failed. Examine previous error messages to determine cause.
>
>
>On the reinstalled node, this fails in the same way:
>
>[root@ocio-gpu02 ccr]# mmstartup
>get file failed: Not enough CCR quorum nodes available (err 809)
>gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
>mmstartup: Command failed. Examine previous error messages to determine cause.
>
>
>I have looked through the users group archives but didn't find anything
>that seems to fit this scenario.
>
>Is there a way to salvage this cluster?  Can it be done without
>shutting gpfs down on the 2 nodes that continue to work?
>
>Thanks for any advice,
>
>Renata Dart
>SLAC National Accelerator Lab