[gpfsug-discuss] How to get rid of very old mmhealth events
Dorigo Alvise (PSI)
alvise.dorigo at psi.ch
Thu Jun 28 09:02:07 BST 2018
Dear experts,
I have an IBM GL2 system running Spectrum Scale v4.2.3-6 (RHEL 7.3).
The system is working properly, but running mmhealth reports a DEGRADED status for the NETWORK component:
[root@sf-gssio1 ~]# mmhealth node show
Node name:      sf-gssio1.psi.ch
Node status:    DEGRADED
Status Change:  23 min. ago
Component       Status        Status Change     Reasons
-------------------------------------------------------------------------------------------------------------------------------------------
GPFS            HEALTHY       22 min. ago       -
NETWORK         DEGRADED      145 days ago      ib_rdma_link_down(mlx5_0/2), ib_rdma_nic_down(mlx5_0/2), ib_rdma_nic_unrecognized(mlx5_0/2)
[...]
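All three reasons name the same device/port, mlx5_0/2. The sketch below splits the Reasons cell into (event, argument) pairs to make that explicit. It assumes every event is printed as name(argument), as the three NETWORK events above are; parse_reasons is a hypothetical helper, not part of Spectrum Scale.

```python
import re
from typing import List, Tuple

# Matches one event of the form name(argument), e.g. "ib_rdma_link_down(mlx5_0/2)".
# Assumption: every event in the Reasons column follows this shape.
EVENT_RE = re.compile(r"(\w+)\(([^)]*)\)")


def parse_reasons(reasons: str) -> List[Tuple[str, str]]:
    """Split an mmhealth Reasons cell into (event_name, argument) pairs."""
    return EVENT_RE.findall(reasons)


if __name__ == "__main__":
    reasons = ("ib_rdma_link_down(mlx5_0/2), ib_rdma_nic_down(mlx5_0/2), "
               "ib_rdma_nic_unrecognized(mlx5_0/2)")
    events = parse_reasons(reasons)
    for name, arg in events:
        print(f"{name:28} {arg}")
    # Collect the distinct device/port arguments the events refer to.
    print("affected ports:", sorted({arg for _, arg in events}))
```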
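This event is clearly spurious: the network, RDMA verbs and InfiniBand are all working correctly, and the port named in the events (mlx5_0/2) is not in verbsPorts, which only uses mlx5_0/1: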
[root@sf-gssio1 ~]# mmfsadm test verbs status
VERBS RDMA status: started
[root@sf-gssio1 ~]# mmlsconfig verbsPorts|grep gssio1
verbsPorts mlx5_0/1 [sf-ems1,sf-gssio1,sf-gssio2]
[root@sf-gssio1 ~]# mmdiag --config|grep verbsPorts
! verbsPorts mlx5_0/1
[root@sf-gssio1 ~]# ibstat mlx5_0
CA 'mlx5_0'
    CA type: MT4113
    Number of ports: 2
    Firmware version: 10.16.1020
    Hardware version: 0
    Node GUID: 0xec0d9a03002b5db0
    System image GUID: 0xec0d9a03002b5db0
    Port 1:
        State: Active
        Physical state: LinkUp
        Rate: 56
        Base lid: 42
        LMC: 0
        SM lid: 1
        Capability mask: 0x26516848
        Port GUID: 0xec0d9a03002b5db0
        Link layer: InfiniBand
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 65535
        LMC: 0
        SM lid: 0
        Capability mask: 0x26516848
        Port GUID: 0xec0d9a03002b5db8
        Link layer: InfiniBand
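The reasoning above (the configured verbs port is up, while the port named by the events is down/disabled and not used for verbs) can be checked mechanically. This is a minimal sketch, assuming the ibstat layout shown above and that verbsPorts entries are space-separated Device/Port[/Fabric] tokens with the port defaulting to 1 when omitted; parse_ibstat_ports and parse_verbs_ports are hypothetical helpers. It only re-derives the observation and does not clear the mmhealth event.

```python
import re
from typing import Dict, Optional, Set, Tuple

# Excerpt of the `ibstat mlx5_0` output above (only the fields this sketch reads).
IBSTAT = """\
CA 'mlx5_0'
    Port 1:
        State: Active
        Physical state: LinkUp
    Port 2:
        State: Down
        Physical state: Disabled
"""


def parse_ibstat_ports(text: str) -> Dict[int, Dict[str, str]]:
    """Map port number -> {field: value} from `ibstat <CA>` output."""
    ports: Dict[int, Dict[str, str]] = {}
    current: Optional[int] = None
    for raw in text.splitlines():
        line = raw.strip()
        m = re.match(r"Port (\d+):$", line)
        if m:
            current = int(m.group(1))
            ports[current] = {}
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            ports[current][key.strip()] = value.strip()
    return ports


def parse_verbs_ports(value: str) -> Set[Tuple[str, int]]:
    """Parse a verbsPorts value such as "mlx5_0/1" into {(device, port)}.

    Assumption: a token without an explicit port refers to port 1.
    """
    result = set()
    for token in value.split():
        parts = token.split("/")
        port = int(parts[1]) if len(parts) > 1 else 1
        result.add((parts[0], port))
    return result


if __name__ == "__main__":
    ports = parse_ibstat_ports(IBSTAT)
    configured = parse_verbs_ports("mlx5_0/1")
    event_port = ("mlx5_0", 2)  # device/port named by the mmhealth events
    for port, fields in sorted(ports.items()):
        used = ("mlx5_0", port) in configured
        print(f"port {port}: {fields['State']}/{fields['Physical state']}, "
              f"in verbsPorts: {used}")
    print("event port configured for verbs:", event_port in configured)
```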
The event has been present for 145 days and did not go away after a daemon restart (mmshutdown/mmstartup).
My question is: how can I get rid of this event and restore the mmhealth output to HEALTHY? This matters because I have Nagios sensors that periodically parse the "mmhealth -Y ..." output, and for now I have had to disable their email notifications, which is risky if a real problem occurs.
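For context, a Nagios-style check of this kind might look roughly like the sketch below. Several assumptions apply and are labeled in the code: the -Y output is colon-delimited with a HEADER row naming the fields, State rows carry "component" and "status" fields, values are percent-encoded, and the status-to-severity mapping is a local policy choice. The SAMPLE text is hypothetical, constructed for illustration, and not captured from a real system. Only the exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) are the standard Nagios plugin convention.

```python
from typing import Dict, List, Tuple
from urllib.parse import unquote

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABEL = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}
# Severity ordering used to pick the worst result: CRITICAL outranks UNKNOWN.
RANK = {OK: 0, UNKNOWN: 1, WARNING: 2, CRITICAL: 3}
# Assumed mapping from mmhealth status strings to Nagios codes; adjust locally.
SEVERITY = {"HEALTHY": OK, "DEGRADED": WARNING, "FAILED": CRITICAL}

# Hypothetical sample for illustration only; field names are assumptions.
SAMPLE = """\
mmhealth:State:HEADER:version:reserved:reserved:node:component:entityname:entitytype:status:
mmhealth:State:0:1:::sf-gssio1.psi.ch:GPFS:sf-gssio1.psi.ch:NODE:HEALTHY:
mmhealth:State:0:1:::sf-gssio1.psi.ch:NETWORK:sf-gssio1.psi.ch:NODE:DEGRADED:
"""


def parse_state_rows(text: str) -> List[Dict[str, str]]:
    """Turn colon-delimited -Y output into dicts keyed by the HEADER field names.

    Only rows whose second field is "State" are kept. Assumption: values are
    percent-encoded (e.g. %3A for ':'), so each value is unquoted.
    """
    header: List[str] = []
    rows: List[Dict[str, str]] = []
    for line in text.splitlines():
        fields = line.split(":")
        if len(fields) < 3 or fields[1] != "State":
            continue
        if fields[2] == "HEADER":
            header = fields
        elif header:
            rows.append({k: unquote(v) for k, v in zip(header, fields)})
    return rows


def evaluate(rows: List[Dict[str, str]]) -> Tuple[int, str]:
    """Return (exit_code, message) for the worst component status."""
    if not rows:
        return UNKNOWN, "UNKNOWN - no mmhealth State rows found"
    worst = OK
    problems = []
    for row in rows:
        status = row.get("status", "")
        code = SEVERITY.get(status, UNKNOWN)  # unrecognised status -> UNKNOWN
        if code != OK:
            problems.append(f"{row.get('component', '?')}={status or '?'}")
        if RANK[code] > RANK[worst]:
            worst = code
    detail = ", ".join(problems) if problems else "all components HEALTHY"
    return worst, f"{LABEL[worst]} - {detail}"


if __name__ == "__main__":
    code, message = evaluate(parse_state_rows(SAMPLE))
    # A real plugin would print the message and then call sys.exit(code).
    print(message, f"(exit {code})")
```

Keying rows by the HEADER names rather than by column position keeps the check working if the field order differs between releases; the field names themselves still need to be verified against the real output.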
Thanks,
Alvise