[gpfsug-discuss] Bad disk but not failed in DSS-G

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Mon Jun 24 13:51:59 BST 2024


On 24/06/2024 13:16, Achim Rehor wrote:
> well ... not necessarily 😄
> but on the disk ... just as I expected ... taking it out helps a lot.
> 
> Now, taking a disk out automatically when it raises too many errors is a 
> discussion I have had several times with GNR development.
> The issue really is: I/O errors on disks (as seen in the 
> mmlsrecoverygroupevents log) can be due to several components (the disk 
> itself, the expander, the IOM, the adapter, the cable ...).
> If a more general part serves, say, five or more pdisks, taking them out 
> automatically would risk the fault tolerance.
> Thus ... we don't do that ..
> 
> 

When smartctl for the disk says

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    33839        32         0          0     137434.705         32
write:         0       36         0         0          0     178408.893          0

Non-medium error count:        0
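
For anyone who wants to watch for this condition directly, the uncorrected-error column of that log is easy to parse. A minimal sketch in Python, assuming the standard SCSI "Error counter log" layout shown above, where the last field of each `read:`/`write:`/`verify:` row is the total uncorrected error count:

```python
import re

def uncorrected_errors(smartctl_output: str) -> dict:
    """Parse the SCSI 'Error counter log' section of smartctl output.

    Returns a dict mapping operation ('read', 'write', 'verify') to its
    total uncorrected error count, taken from the last column of each row.
    Assumes the standard layout printed by smartctl for SAS disks.
    """
    counts = {}
    for line in smartctl_output.splitlines():
        m = re.match(r'^(read|write|verify):\s+(.*)$', line.strip())
        if m:
            fields = m.group(2).split()
            # last field on the row is 'Total uncorrected errors'
            counts[m.group(1)] = int(fields[-1])
    return counts
```

Fed the output above, this returns `{'read': 32, 'write': 0}` - i.e. exactly the condition that should have tripped an alarm.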


A disk with 32 uncorrected read errors in smartctl is fubar, no ifs, no 
buts. Wherever the balance in ejecting bad disks currently sits, IMHO it 
is in the wrong place, because it failed to eject an actually bad disk.

At an absolute bare minimum, mmhealth should not be saying everything 
is fine and dandy when clearly it was not. That's the bigger issue. I 
can live with disks not being taken out automatically; it is unacceptable 
that mmhealth was giving false and inaccurate information about the 
state of the filesystem. Had it even just changed something to a 
"degraded" state, the problem could have been picked up much, much sooner.

Presumably the disk category was still good because the vdisks were 
theoretically good. I suggest renaming that category to VDISK to more 
accurately reflect what it covers, and adding a PDISK category. Then, 
when a pdisk starts showing I/O errors, the number of disks in a 
degraded state can be incremented and the problem picked up without end 
users having to roll their own monitoring.

> The idea is to improve the disk hospital more and more, so that the 
> decision to switch a disk back to OK becomes more accurate over time.
> 
> Until then ... it might always be a good idea to scan the event log for 
> pdisk errors ...
> 

That is my conclusion: mmhealth is as useful as a chocolate teapot, 
because you can't rely on it to provide correct information, and I need 
to do my own health monitoring of the system.
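
For the record, a rough sketch of what that roll-your-own check might look like as a cron job on the DSS-G servers. The device glob and the `smartctl -l error` invocation are site assumptions, not anything mmhealth or GNR provides; the only logic it adds is "flag any disk whose read row reports uncorrected errors":

```python
import glob
import re
import subprocess

# Matches the 'read:' row of the SCSI error counter log: five corrected/
# invocation columns, then gigabytes processed, then total uncorrected.
UNCORRECTED_RE = re.compile(r'^read:\s+(?:\S+\s+){5}(\S+)\s+(\d+)\s*$')

def disk_is_suspect(smartctl_output: str, max_uncorrected: int = 0) -> bool:
    """Return True if the 'read:' row reports more uncorrected errors
    than allowed. Absence of an error counter log is treated as healthy."""
    for line in smartctl_output.splitlines():
        m = UNCORRECTED_RE.match(line.strip())
        if m:
            return int(m.group(2)) > max_uncorrected
    return False

def scan_disks(devices=None):
    """Run smartctl over each device and return the suspect ones.
    NB: '/dev/sd?' is a placeholder glob; a real DSS-G has far more
    drives and you would enumerate them from the recovery group."""
    suspects = []
    for dev in devices or sorted(glob.glob('/dev/sd?')):
        out = subprocess.run(['smartctl', '-l', 'error', dev],
                             capture_output=True, text=True).stdout
        if disk_is_suspect(out):
            suspects.append(dev)
    return suspects
```

Pair that with a periodic grep of mmlsrecoverygroupevents for pdisk errors, as suggested above, and you cover both what the disk reports and what GNR saw.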


JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
