[gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues

Thu Jun 28 04:14:30 BST 2018

Can you also check the time differences between nodes?

We recently had a situation where clock mismatches between servers caused failures.
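One quick way to check for clock differences is to read each node's clock over ssh and compare the values. Below is a minimal bash sketch. It assumes passwordless ssh to each node. The helper names `max_skew` and `check_cluster_clocks` are hypothetical and are not Spectrum Scale commands. The script only reads `date +%s`, which has one-second resolution, so a spread of a second or two can come from ssh latency alone. A larger, persistent skew is worth investigating in the nodes' NTP/chrony setup.

```shell
#!/usr/bin/env bash
# Sketch: report clock skew across cluster nodes.
# ASSUMPTIONS: passwordless ssh to every node; helper names are hypothetical.

# Print the spread (max - min) of a list of epoch timestamps in seconds.
max_skew() {
  local min= max= t
  for t in "$@"; do
    if [ -z "$min" ] || [ "$t" -lt "$min" ]; then min=$t; fi
    if [ -z "$max" ] || [ "$t" -gt "$max" ]; then max=$t; fi
  done
  echo $(( max - min ))
}

# Read `date +%s` from each node given as an argument and report the spread.
check_cluster_clocks() {
  local node t stamps=()
  for node in "$@"; do
    # BatchMode makes ssh fail instead of prompting for a password.
    if ! t=$(ssh -o BatchMode=yes "$node" date +%s); then
      echo "$node: unreachable" >&2
      continue
    fi
    stamps+=("$t")
    echo "$node $t"
  done
  echo "skew: $(max_skew "${stamps[@]}") s"
}

# Example (node names taken from this thread):
#   check_cluster_clocks ocio-gpu01 ocio-gpu02
```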

On Thu, Jun 28, 2018 at 2:50 AM, Kevin D Johnson <kevindjo at us.ibm.com>
wrote:

> You can also try converting back to the old primary/secondary configuration
> server model, moving the cluster away from the default CCR configuration.
>
> mmchcluster --ccr-disable -p servername
>
> Then temporarily run with only one quorum node, and add more once the
> cluster comes back up. Once the cluster is up and has at least two quorum
> nodes again, re-enable CCR with mmchcluster --ccr-enable.
>
> Kevin D. Johnson
> Spectrum Computing, Senior Managing Consultant
> MBA, MAcc, MS Global Technology and Development
> IBM Certified Technical Specialist Level 2 Expert
>
> Certified Deployment Professional - Spectrum Scale
> Certified Solution Advisor - Spectrum Computing
> Certified Solution Architect - Spectrum Storage Solutions
>
>
> 720.349.6199 - kevindjo at us.ibm.com
>
> "To think is to achieve." - Thomas J. Watson, Sr.
>
>
>
>
> ----- Original message -----
> From: "IBM Spectrum Scale" <scale at us.ibm.com>
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> To: renata at slac.stanford.edu, gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Cc:
> Subject: Re: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
> Date: Wed, Jun 27, 2018 5:15 PM
>
>
> Hi Renata,
>
> You may want to reduce the set of quorum nodes. If your version supports
> the --force option, you can run
>
> mmchnode --noquorum -N <broken-nodes> --force
>
> It is a good idea to configure tiebreaker disks in a cluster that has only
> 2 quorum nodes, so that the cluster can keep quorum if one of them fails.
>
> Regards, The Spectrum Scale (GPFS) team
>
> ------------------------------------------------------------------------------------------------------------------
> If you feel that your question can benefit other users of Spectrum Scale
> (GPFS), then please post it to the public IBM developerWorks Forum at
> https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.
>
> If your query concerns a potential software error in Spectrum Scale (GPFS)
> and you have an IBM software maintenance contract please contact
> 1-800-237-5511 in the United States or your local IBM Service Center in
> other countries.
>
> The forum is informally monitored as time permits and should not be used
> for priority messages to the Spectrum Scale (GPFS) team.
>
>
> From: Renata Maria Dart <renata at slac.stanford.edu>
> To: gpfsug-discuss at spectrumscale.org
> Date: 06/27/2018 02:21 PM
> Subject: [gpfsug-discuss] gpfs client cluster, lost quorum, ccr issues
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> Hi, we have a client cluster of 4 nodes with 3 quorum nodes. One of the
> quorum nodes is no longer in service and another was reinstalled with
> a newer OS, both without informing the GPFS admins. GPFS is still
> "working" on the two remaining nodes, that is, they continue to have access
> to the GPFS data on the remote clusters. But I can no longer get
> any GPFS commands to work. On one of the 2 nodes that are still serving
> data:
>
> [root at ocio-gpu01 ~]# mmlscluster
> get file failed: Not enough CCR quorum nodes available (err 809)
> gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
> mmlscluster: Command failed. Examine previous error messages to determine cause.
>
>
> On the reinstalled node, this fails in the same way:
>
> [root at ocio-gpu02 ccr]# mmstartup
> get file failed: Not enough CCR quorum nodes available (err 809)
> gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
> mmstartup: Command failed. Examine previous error messages to determine cause.
>
>
> I have looked through the user group discussions but didn't find anything
> that seems to fit this scenario.
>
> Is there a way to salvage this cluster? Can it be done without
> shutting GPFS down on the 2 nodes that continue to work?
>
> Thanks for any advice,
>
> Renata Dart
> SLAC National Accelerator Lab
>
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
>
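The quorum arithmetic behind the "Not enough CCR quorum nodes available" errors explains the advice to shrink the quorum set and to add tiebreaker disks. It can be sketched in a few lines of bash. This sketch assumes the usual node-quorum rule: more than half of the designated quorum nodes must be reachable. It deliberately leaves tiebreaker disks out of the model. `quorum_majority` and `has_quorum` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch of the node-quorum majority rule: with N quorum nodes, at least
# floor(N/2) + 1 of them must be reachable. Tiebreaker disks are NOT modeled.

# Print the number of quorum nodes needed for a majority of $1.
quorum_majority() {
  echo $(( $1 / 2 + 1 ))
}

# Succeed (exit status 0) if $1 reachable quorum nodes out of $2 form a majority.
has_quorum() {
  [ "$1" -ge "$(quorum_majority "$2")" ]
}

# Scenarios as "reachable total": the original 3-node quorum set with one
# healthy member, a temporary single quorum node, and a two-quorum-node
# cluster with one or both members up.
for scenario in "1 3" "1 1" "1 2" "2 2"; do
  read -r up total <<< "$scenario"
  if has_quorum "$up" "$total"; then verdict="quorum"; else verdict="no quorum"; fi
  echo "$up of $total quorum nodes up (need $(quorum_majority "$total")): $verdict"
done
```

With 1 of 3 quorum nodes healthy, the cluster is one node short of a majority, which is consistent with the err 809 failures. Reducing the quorum set to a single node makes that one node a majority again. The 1-of-2 case shows why a cluster with two quorum nodes is fragile without tiebreaker disks.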