[gpfsug-discuss] "mmhealth cluster show" returns error

Thu May 11 13:05:14 BST 2017

I’ve also been exploring mmhealth and the GUI (gpfsgui) for the first time this week.
I have a test cluster where I’m trying out the new features. It is running 4.2.2-2.

mmhealth cluster show says everyone is in nominal status:
Component           Total         Failed       Degraded        Healthy          Other
-------------------------------------------------------------------------------------
NODE                   12              0              0             12              0
GPFS                   12              0              0             12              0
NETWORK                12              0              0             12              0
FILESYSTEM              0              0              0              0              0
DISK                    0              0              0              0              0
GUI                     1              0              0              1              0
PERFMON                12              0              0             12              0

However on the GUI there is conflicting information:
1) Home page shows 3/8 NSD Servers unhealthy.
2) Home page shows 3/21 Nodes unhealthy.
   Where is it getting this number? There are only 12 nodes in the whole cluster!
3) Clicking on either NSD Servers or Nodes leads to the monitoring page,
   where the top half spins forever and the bottom half is empty.
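
For anyone who wants to script the comparison between the CLI and the GUI numbers, here is a minimal sketch that parses the summary table into a dict. It assumes the column layout is exactly what I pasted above (component name followed by five integer columns); other releases may print it differently, and the function name is just my own.

```python
def parse_cluster_show(text):
    """Parse the summary table from `mmhealth cluster show` into a dict.

    Assumes each data row is a component name followed by five integers,
    as in the output pasted in this message.
    """
    columns = ("total", "failed", "degraded", "healthy", "other")
    summary = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header row and the dashed rule; keep rows of NAME + 5 numbers.
        if len(fields) == 6 and all(f.isdigit() for f in fields[1:]):
            summary[fields[0]] = dict(zip(columns, map(int, fields[1:])))
    return summary


sample = """\
Component           Total         Failed       Degraded        Healthy          Other
-------------------------------------------------------------------------------------
NODE                   12              0              0             12              0
GPFS                   12              0              0             12              0
NETWORK                12              0              0             12              0
FILESYSTEM              0              0              0              0              0
DISK                    0              0              0              0              0
GUI                     1              0              0              1              0
PERFMON                12              0              0             12              0
"""

summary = parse_cluster_show(sample)
gui_node_total = 21  # the node count shown on the GUI home page
cli_node_total = summary["NODE"]["total"]
if cli_node_total != gui_node_total:
    print(f"mismatch: mmhealth reports {cli_node_total} nodes, GUI reports {gui_node_total}")
```

In practice the text would come from running the command on a cluster node rather than from a pasted sample.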

I may have installed the pmsensors RPM on a couple of other nodes back in early April,
but have forgotten which ones.  They are in the production cluster.  

Also, the storage in this sandbox cluster has not been turned into a filesystem yet. 
There are a few dozen free NSDs.  Perhaps the “FILESYSTEM CHECKING” status is somehow 
wedging up the GUI?

Node name:      storage005.oscar.ccv.brown.edu
Node status:    HEALTHY
Status Change:  15 hours ago

Component      Status        Status Change     Reasons
------------------------------------------------------
GPFS           HEALTHY       16 hours ago      -
NETWORK        HEALTHY       16 hours ago      -
FILESYSTEM     CHECKING      16 hours ago      -
GUI            HEALTHY       15 hours ago      -
PERFMON        HEALTHY       16 hours ago      -
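
A similar sketch pulls out any component that is not HEALTHY from the per-node output. Again this assumes the layout pasted above (a dashed rule, then one row per component with the status in the second column); the helper name is hypothetical.

```python
def unhealthy_components(text):
    """Return (component, status) pairs whose status is not HEALTHY.

    Only rows after the dashed rule are treated as component rows.
    """
    found = []
    in_table = False
    for line in text.splitlines():
        if line.startswith("---"):
            in_table = True
            continue
        fields = line.split()
        if in_table and len(fields) >= 2 and fields[1] != "HEALTHY":
            found.append((fields[0], fields[1]))
    return found


node_sample = """\
Node name:      storage005.oscar.ccv.brown.edu
Node status:    HEALTHY
Status Change:  15 hours ago

Component      Status        Status Change     Reasons
------------------------------------------------------
GPFS           HEALTHY       16 hours ago      -
NETWORK        HEALTHY       16 hours ago      -
FILESYSTEM     CHECKING      16 hours ago      -
GUI            HEALTHY       15 hours ago      -
PERFMON        HEALTHY       16 hours ago      -
"""

flagged = unhealthy_components(node_sample)
print(flagged)
```

On the sample it flags only the FILESYSTEM row, which is the one I suspect of confusing the GUI.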

I’ve tried restarting the GUI service and also rebooting the GUI server, but it comes back looking the same.

Any thoughts?
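
Regarding the 10-second CLI timeout that Anna describes in the quoted reply below: a rough way to check whether a command such as mmlsmgr returns within that window is sketched here. The 10-second value comes from her message; the function name and the stand-in command are my own, and this is only a probe, not how mmhealth itself does it.

```python
import subprocess
import sys
import time

CLI_TIMEOUT_SECONDS = 10  # default timeout mentioned in the quoted reply


def returns_in_time(command, timeout=CLI_TIMEOUT_SECONDS):
    """Run `command` and report (finished_within_timeout, elapsed_seconds)."""
    start = time.monotonic()
    try:
        subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # the child is killed if it runs longer than this
        )
        finished = True
    except subprocess.TimeoutExpired:
        finished = False
    return finished, time.monotonic() - start


# On a cluster node the command would be ["mmlsmgr"] (or its full path).
# A harmless stand-in is used here so the sketch runs anywhere.
ok, elapsed = returns_in_time([sys.executable, "-c", "pass"])
print(f"returned in time: {ok} ({elapsed:.2f}s)")
```

If the command is not on the PATH, subprocess.run raises FileNotFoundError, so on a real node it is probably safer to give the full path.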

> On May 11, 2017, at 7:28 AM, Anna Christina Wagner <Anna.Wagner at de.ibm.com> wrote:
> 
> Hello Bob,
> 
> 4.2.2 is the release where we introduced "mmhealth cluster show". And you are totally right, it can be a little fragile at times.
> 
> A short explanation:
> We saw this situation on our test machines as well. Because of problems with the system, not only the mm commands but also ordinary Linux commands
> took more than 10 seconds to return. Internally we have a default timeout of 10 seconds for CLI commands. The cluster state manager (CSM) runs on the cluster manager node, so if a failover
> changes the cluster manager and the mmlsmgr command does not return within 10 seconds, the new cluster manager node does not
> learn that it is now the CSM and will not start the corresponding service.
> 
> 
> If you want me to look further into it, or if you have feedback regarding mmhealth, please feel free to send me an email (Anna.Wagner at de.ibm.com).
> 
> Kind regards
> 
> Wagner, Anna Christina
> 
> Software Engineer, Spectrum Scale Development
> IBM Systems
> 
> IBM Deutschland Research & Development GmbH / Chair of the Supervisory Board: Martina Koederitz
> Management: Dirk Wittkopp
> Registered office: Böblingen / Court of registration: Amtsgericht Stuttgart, HRB 243294
> 
> 
> 
> From:        "Oesterlin, Robert" <Robert.Oesterlin at nuance.com>
> To:        gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
> Date:        10.05.2017 18:21
> Subject:        Re: [gpfsug-discuss] "mmhealth cluster show" returns error
> Sent by:        gpfsug-discuss-bounces at spectrumscale.org
> 
> 
> 
> Yea, it’s fine. 
> 
> I did manage to get it to respond after I did an “mmsysmoncontrol restart”, but it’s still not showing proper status across the cluster.
> 
> Seems a bit fragile :-) 
> 
> Bob Oesterlin
> Sr Principal Storage Engineer, Nuance
> 
> 
> 
> On 5/10/17, 10:46 AM, "gpfsug-discuss-bounces at spectrumscale.org on behalf of valdis.kletnieks at vt.edu" <gpfsug-discuss-bounces at spectrumscale.org on behalf of valdis.kletnieks at vt.edu> wrote:
> 
>    On Wed, 10 May 2017 14:13:56 -0000, "Oesterlin, Robert" said:
>    
>    > [root]# mmhealth cluster show
>    > nrg1-gpfs16.nrg1.us.grid.nuance.com: Could not find the cluster state manager. It may be in an failover process. Please try again in a few seconds.
>    
>    Does 'mmlsmgr' return something sane?
>    
> 
> _______________________________________________
> gpfsug-discuss mailing list
> gpfsug-discuss at spectrumscale.org
> http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> 
> 
> 
