[gpfsug-discuss] CCR troubles - CCR and mmXXconfig commands fine with mmshutdown

Sven Oehme oehmes at gmail.com
Thu Jul 28 19:27:20 BST 2016


They should get started as soon as you shut down via mmshutdown.
Could you check a node where the processes are NOT started and simply run
mmshutdown on that node to see whether they get started?
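
As a quick check, something along these lines might help (a minimal
sketch, assuming the standard /usr/lpp/mmfs/bin install path shown later
in this thread):

# check whether the CCR helper daemons are running on this node
pgrep -fl mmccrmonitor
pgrep -fl mmsdrserv

# stop GPFS on this node only; the helpers should be (re)spawned
/usr/lpp/mmfs/bin/mmshutdown

# verify they came back
pgrep -fl mmccrmonitor
pgrep -fl mmsdrserv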



On Thu, Jul 28, 2016 at 10:57 AM, Bryan Banister <bbanister at jumptrading.com>
wrote:

> I now see that these mmccrmonitor and mmsdrserv daemons are required for
> the CCR operations to work.  This is just not clear in the error output.
> Even the GPFS 4.2 Problem Determination Guide doesn’t have anything
> explaining the “Not enough CCR quorum nodes available” or “Unexpected error
> from ccr fget mmsdrfs” error messages.  Thus there is no clear direction on
> how to fix this issue from the command output, the man pages, or the Admin
> Guides.
>
>
>
> [root at fpia-gpfs-jcsdr01 ~]# man -E ascii mmccr
>
> No manual entry for mmccr
>
>
>
> There isn’t a help option for mmccr either, but at least it does print
> some usage info:
>
>
>
> [root at fpia-gpfs-jcsdr01 ~]# mmccr -h
>
> Unknown subcommand: '-h'Usage: mmccr subcommand common-options
> subcommand-options...
>
>
>
> Subcommands:
>
>
>
> Setup and Initialization:
>
> [snip]
>
>
>
> I’m still not sure how to start these mmccrmonitor and mmsdrserv daemons
> without starting GPFS… could you tell me how that would be possible?
>
>
>
> Thanks for sharing details about how this all works Marc, I do appreciate
> your response!
>
> -Bryan
>
>
>
> *From:* gpfsug-discuss-bounces at spectrumscale.org [mailto:
> gpfsug-discuss-bounces at spectrumscale.org] *On Behalf Of *Marc A Kaplan
> *Sent:* Thursday, July 28, 2016 12:25 PM
> *To:* gpfsug main discussion list
> *Subject:* Re: [gpfsug-discuss] CCR troubles - CCR and mmXXconfig
> commands fine with mmshutdown
>
>
>
> Based on experiments on my test cluster, I can assure you that you can
> list and change GPFS configuration parameters with CCR enabled while GPFS
> is down.
>
> I understand you are having a problem with your cluster, but you are
> incorrectly disparaging the CCR.
>
> In fact you can mmshutdown -a AND kill all GPFS related processes,
> including mmsdrserv and mmccrmonitor, and then issue commands like:
>
> mmlscluster, mmlsconfig, mmchconfig
>
> Those will work correctly and, by the way, re-start mmsdrserv and
> mmccrmonitor...
> (Use a command like `ps auxw | grep mm` to find the relevant processes.)
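>
> A minimal sketch of that experiment (assuming pkill is available and the
> standard /usr/lpp/mmfs/bin install path):
>
> # GPFS is already down; kill the CCR helper daemons
> pkill -f mmccrmonitor
> pkill -f mmsdrserv
>
> # running any mm command respawns them as a side effect
> /usr/lpp/mmfs/bin/mmlsconfig
>
> # confirm the helpers are back
> ps auxw | grep mm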
>
> But that will not start the main GPFS file manager process mmfsd.  GPFS
> "proper" remains down...
>
> For the following commands Linux was "up" on all nodes, but GPFS was
> shutdown.
> [root at n2 gpfs-git]# mmgetstate -a
>
>  Node number  Node name        GPFS state
> ------------------------------------------
>        1      n2               down
>        3      n4               down
>        4      n5               down
>        6      n3               down
>
> However, if a majority of the quorum nodes cannot be reached, you WILL
> see a sequence of messages like this, after a noticeable "timeout":
> (For the following test I had three quorum nodes and did a Linux shutdown
> on two of them...)
>
> [root at n2 gpfs-git]# mmlsconfig
> get file failed: Not enough CCR quorum nodes available (err 809)
> gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
> mmlsconfig: Command failed. Examine previous error messages to determine
> cause.
>
> [root at n2 gpfs-git]# mmchconfig worker1Threads=1022
> mmchconfig: Unable to obtain the GPFS configuration file lock.
> mmchconfig: GPFS was unable to obtain a lock from node n2.frozen.
> mmchconfig: Command failed. Examine previous error messages to determine
> cause.
>
> [root at n2 gpfs-git]# mmgetstate -a
> get file failed: Not enough CCR quorum nodes available (err 809)
> gpfsClusterInit: Unexpected error from ccr fget mmsdrfs.  Return code: 158
> mmgetstate: Command failed. Examine previous error messages to determine
> cause.
>
> HMMMM.... notice mmgetstate needs a quorum even to "know" what nodes it
> should check!
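>
> (A quick way to check in advance whether a CCR majority is reachable; a
> rough sketch, since ping reachability is only a proxy for the CCR service
> actually answering, and the node names are from this test cluster:
>
> # count reachable quorum nodes; CCR needs a majority (2 of 3 here)
> up=0
> for node in n2 n3 n4; do
>   ping -c 1 -W 2 $node >/dev/null 2>&1 && up=$((up+1))
> done
> echo "$up of 3 quorum nodes reachable"
> )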
>
> Then, after re-starting Linux... I have two of three quorum nodes active,
> but GPFS is still down...
>
> ##  From n2, log in to node n3, which I just rebooted...
> [root at n2 gpfs-git]# ssh n3
> Last login: Thu Jul 28 09:50:53 2016 from n2.frozen
>
> ## See if any mm processes are running? ... NOPE!
>
> [root at n3 ~]# ps auxw | grep mm
> ps auxw | grep mm
> root      3834  0.0  0.0 112640   972 pts/0    S+   10:12   0:00 grep
> --color=auto mm
>
> ## Check the state...  notice n4 is powered off...
> [root at n3 ~]# mmgetstate -a
> mmgetstate -a
>
>  Node number  Node name        GPFS state
> ------------------------------------------
>        1      n2               down
>        3      n4               unknown
>        4      n5               down
>        6      n3               down
>
> ## Examine the cluster configuration
> [root at n3 ~]# mmlscluster
> mmlscluster
>
> GPFS cluster information
> ========================
>   GPFS cluster name:         madagascar.frozen
>   GPFS cluster id:           7399668614468035547
>   GPFS UID domain:           madagascar.frozen
>   Remote shell command:      /usr/bin/ssh
>   Remote file copy command:  /usr/bin/scp
>   Repository type:           CCR
>
> GPFS cluster configuration servers:
> -----------------------------------
>   Primary server:    n2.frozen (not in use)
>   Secondary server:  n4.frozen (not in use)
>
>  Node  Daemon node name  IP address   Admin node name  Designation
> -------------------------------------------------------------------
>    1   n2.frozen         172.20.0.21  n2.frozen        quorum-manager-perfmon
>    3   n4.frozen         172.20.0.23  n4.frozen        quorum-manager-perfmon
>    4   n5.frozen         172.20.0.24  n5.frozen        perfmon
>    6   n3.frozen         172.20.0.22  n3.frozen        quorum-manager-perfmon
>
> ## notice that mmccrmonitor and mmsdrserv are running but not mmfsd
>
> [root at n3 ~]# ps auxw | grep mm
> ps auxw | grep mm
> root      3882  0.0  0.0 114376  1720 pts/0    S    10:13   0:00
> /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root      3954  0.0  0.0 491244 13040 ?        Ssl  10:13   0:00
> /usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 128 yes
> root      4339  0.0  0.0 114376   796 pts/0    S    10:15   0:00
> /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root      4345  0.0  0.0 112640   972 pts/0    S+   10:16   0:00 grep
> --color=auto mm
>
> ## Now I can mmchconfig ... while GPFS remains down.
>
> [root at n3 ~]# mmchconfig worker1Threads=1022
> mmchconfig worker1Threads=1022
> mmchconfig: Command successfully completed
> mmchconfig: Propagating the cluster configuration data to all
>   affected nodes.  This is an asynchronous process.
> [root at n3 ~]# Thu Jul 28 10:18:16 PDT 2016: mmcommon pushSdr_async:
> mmsdrfs propagation started
> Thu Jul 28 10:18:21 PDT 2016: mmcommon pushSdr_async: mmsdrfs propagation
> completed; mmdsh rc=0
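>
> (To double-check the committed value while GPFS stays down; mmlsconfig
> also accepts attribute names as operands, so:
>
> mmlsconfig worker1Threads
>
> should print the new setting.)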
>
> [root at n3 ~]# mmgetstate -a
> mmgetstate -a
>
>  Node number  Node name        GPFS state
> ------------------------------------------
>        1      n2               down
>        3      n4               unknown
>        4      n5               down
>        6      n3               down
>
> ## Quorum node n4 remains unreachable...  But n2 and n3 are running Linux.
> [root at n3 ~]# ping -c 1 n4
> ping -c 1 n4
> PING n4.frozen (172.20.0.23) 56(84) bytes of data.
> From n3.frozen (172.20.0.22) icmp_seq=1 Destination Host Unreachable
>
> --- n4.frozen ping statistics ---
> 1 packets transmitted, 0 received, +1 errors, 100% packet loss, time 0ms
>
> [root at n3 ~]# exit
> exit
> logout
> Connection to n3 closed.
> [root at n2 gpfs-git]# ps auwx | grep mm
> root      3264  0.0  0.0 114376   812 pts/1    S    10:21   0:00
> /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root      3271  0.0  0.0 112640   980 pts/1    S+   10:21   0:00 grep
> --color=auto mm
> root     31820  0.0  0.0 114376  1728 pts/1    S    09:42   0:00
> /usr/lpp/mmfs/bin/mmksh /usr/lpp/mmfs/bin/mmccrmonitor 15
> root     32058  0.0  0.0 493264 12000 ?        Ssl  09:42   0:00
> /usr/lpp/mmfs/bin/mmsdrserv 1191 10 10 /var/adm/ras/mmsdrserv.log 1
> root     32263  0.0  0.0 1700732 17600 ?       Sl   09:42   0:00 python
> /usr/lpp/mmfs/bin/mmsysmon.py
> [root at n2 gpfs-git]#