[gpfsug-discuss] mmhealth - where is the info hiding?

Buterbaugh, Kevin L Kevin.Buterbaugh at Vanderbilt.Edu
Thu Jul 19 23:23:06 BST 2018


Hi Valdis,

Is this what you’re looking for (from an IBMer in response to another question a few weeks back)?

Assuming a 4.2.3 code level, this can be done by deleting and recreating the rule with the changed settings:

# mmhealth thresholds list
### Threshold Rules ###
rule_name                metric                error  warn              direction  filterBy  groupBy                                           sensitivity
--------------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule        Fileset_inode         90.0   80.0              high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300
MetaDataCapUtil_Rule     MetaDataPool_capUtil  90.0   80.0              high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
DataCapUtil_Rule         DataPool_capUtil      90.0   80.0              high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
MemFree_Rule             mem_memfree           50000  100000            low                  node                                               300

# mmhealth thresholds delete MetaDataCapUtil_Rule
The rule(s) was(were) deleted successfully


# mmhealth thresholds add MetaDataPool_capUtil --errorlevel 95.0 --warnlevel 85.0 --direction high --sensitivity 300 --name MetaDataCapUtil_Rule --groupby gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name


#  mmhealth thresholds list
### Threshold Rules ###
rule_name                metric                error  warn              direction  filterBy  groupBy                                           sensitivity
--------------------------------------------------------------------------------------------------------------------------------------------------------
InodeCapUtil_Rule        Fileset_inode         90.0   80.0              high                 gpfs_cluster_name,gpfs_fs_name,gpfs_fset_name      300
MemFree_Rule             mem_memfree           50000  100000            low                  node                                               300
DataCapUtil_Rule         DataPool_capUtil      90.0   80.0              high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
MetaDataCapUtil_Rule     MetaDataPool_capUtil  95.0   85.0              high                 gpfs_cluster_name,gpfs_fs_name,gpfs_diskpool_name  300
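
After the rule is recreated, the new settings should get picked up at the next evaluation of the rule (the sensitivity above is 300 seconds). If the degraded state doesn't clear on its own after that, the next thing I'd check is the component state on the node that is complaining. Something along these lines, using a node name from your output below; I haven't double-checked the exact flag spelling on 4.2.3:

# mmhealth node show THRESHOLD -N arproto2-isb.nis.internal
# mmhealth node show FILESYSTEM -N arproto2-isb.nis.internal --verbose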

Kevin

—
Kevin Buterbaugh - Senior System Administrator
Vanderbilt University - Advanced Computing Center for Research and Education
Kevin.Buterbaugh at vanderbilt.edu - (615)875-9633



On Jul 19, 2018, at 4:25 PM, valdis.kletnieks at vt.edu wrote:

So I'm trying to tidy up things like 'mmhealth' etc. I've got most of it fixed, but I'm stuck on
one thing...

Note: I already did a 'mmhealth node eventlog --clear -N all' yesterday, which
cleaned out a bunch of other long-past events that were "stuck" as failed /
degraded even though they were corrected days/weeks ago - keep this in mind as
you read on....
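
(In other words, roughly this: review the log first, then clear it. Treat the exact invocations as approximate and check the mmhealth man page for your level:)

# mmhealth node eventlog -N all
# mmhealth node eventlog --clear -N all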

# mmhealth cluster show

Component           Total         Failed       Degraded        Healthy          Other
-------------------------------------------------------------------------------------
NODE                   10              0              0             10              0
GPFS                   10              0              0             10              0
NETWORK                10              0              0             10              0
FILESYSTEM              1              0              1              0              0
DISK                  102              0              0            102              0
CES                     4              0              0              4              0
GUI                     1              0              0              1              0
PERFMON                10              0              0             10              0
THRESHOLD              10              0              0             10              0

Great.  One hit for 'degraded' filesystem.

# mmhealth node show --unhealthy -N all
(skipping all the nodes that show healthy)

Node name:      arnsd3-vtc.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component      Status        Status Change     Reasons
-----------------------------------------------------------------------------------
FILESYSTEM     FAILED        24 days ago       pool-data_high_error(archive/system)
(...)
Node name:      arproto2-isb.nis.internal
Node status:    HEALTHY
Status Change:  21 hours ago

Component      Status        Status Change     Reasons
----------------------------------------------------------------------------------
FILESYSTEM     DEGRADED      6 days ago        pool-data_high_warn(archive/system)

mmdf tells me:
nsd_isb_01        13103005696        1 No       Yes      1747905536 ( 13%)     111667200 ( 1%)
nsd_isb_02        13103005696        1 No       Yes      1748245504 ( 13%)     111724384 ( 1%)
(94 more LUNs all within 0.2% of these for usage - data is striped out pretty well)

There are also 6 SSD LUNs for metadata:
nsd_isb_flash_01    2956984320        1 Yes      No       2116091904 ( 72%)      26996992 ( 1%)
(again, evenly striped)

So who is remembering that status, and how do I clear it?
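
The obvious places I know of to poke at are the threshold rules and the event definition itself, roughly the commands below. I'm not positive the 'event show' subcommand exists at this code level, so treat these as a best guess:

# mmhealth thresholds list
# mmhealth event show pool-data_high_warn
# mmhealth node show FILESYSTEM -N arproto2-isb.nis.internal --unhealthy
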
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss
