[gpfsug-discuss] Bad disk but not failed in DSS-G
Jonathan Buzzard
jonathan.buzzard at strath.ac.uk
Mon Jun 24 13:51:59 BST 2024
On 24/06/2024 13:16, Achim Rehor wrote:
> well ... not necessarily 😄
> but on the disk ... just as i expected ... taking it out helps a lot.
>
> Now, taking it out automatically when it raises too many errors was a
> discussion I had several times with GNR development.
> The issue really is .. I/O errors on disks (as seen in the
> mmlsrecoverygroupevents logs) can be due to several issues (the disk
> itself, the expander, the IOM, the adapter, the cable ...).
> In the case of a more general part serving 5 or more pdisks, that would
> risk the FT if we took them out automatically.
> Thus ... we don't do that ..
>
When smartctl for the disk says
Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:          0    33839        32         0          0     137434.705          32
write:         0       36         0         0          0     178408.893           0

Non-medium error count:        0

A disk with 32 uncorrected read errors in smartctl is fubar, no ifs, no buts.
Whatever the balance in ejecting bad disks is, IMHO currently it's in
the wrong place because it failed to eject an actual bad disk.
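
For anyone wanting to roll their own check in the meantime, here is a minimal sketch of flagging a drive from that smartctl table. It assumes the SCSI "Error counter log" layout shown above (seven numeric columns per row); the function names and the zero threshold are my own choices for illustration, not anything GPFS or smartmontools provides.

```python
import re

# Matches the "read:" / "write:" / "verify:" rows of a SCSI error counter log.
ROW = re.compile(r"^\s*(read|write|verify):\s+(.*)$")

# Column names are my own labels for the seven columns smartctl prints.
FIELDS = (
    "ecc_fast", "ecc_delayed", "rereads_rewrites", "total_corrected",
    "correction_invocations", "gigabytes_processed", "total_uncorrected",
)


def parse_error_counters(text):
    """Return {operation: {field: value}} parsed from smartctl output text."""
    counters = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if not match:
            continue
        values = match.group(2).split()
        if len(values) != len(FIELDS):
            continue  # unexpected layout: skip the row rather than guess
        try:
            numbers = [float(v) for v in values]
        except ValueError:
            continue  # non-numeric cell: skip the row
        counters[match.group(1)] = dict(zip(FIELDS, numbers))
    return counters


def is_suspect(counters, max_uncorrected=0):
    """True if any operation reports more uncorrected errors than allowed."""
    return any(
        row["total_uncorrected"] > max_uncorrected for row in counters.values()
    )


# The two data rows from the table above.
SAMPLE = """\
read:          0    33839        32         0          0     137434.705          32
write:         0       36         0         0          0     178408.893           0
"""

if __name__ == "__main__":
    parsed = parse_error_counters(SAMPLE)
    print("suspect" if is_suspect(parsed) else "ok")
```

Feeding it the output of smartctl for each drive and alerting on any suspect result would have caught this disk, whatever the disk hospital thought of it.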
At an absolute bare minimum mmhealth should not be saying everything
is fine and dandy because clearly it was not. That's the bigger issue. I
can live with them not being taken out automatically, it is unacceptable
that mmhealth was giving false and inaccurate information about the
state of the filesystem. Had it even just changed something to a
"degraded" state the problems could have been picked up much much sooner.
Presumably the disk category was still good because the vdisks were
theoretically good. I suggest renaming that to VDISK to more accurately
reflect what it is about and add a PDISK category. Then when a pdisk
starts showing IO errors you can increment the number of disks in a
degraded state and it can be picked up without end users having to roll
their own monitoring.
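
To make the suggestion concrete, here is a rough sketch of the kind of roll-up I mean. Everything in it is hypothetical: the State values, the Pdisk record, the io_errors counter and the threshold are placeholders for illustration and do not correspond to any existing mmhealth interface.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    HEALTHY = 0
    DEGRADED = 1


@dataclass
class Pdisk:
    name: str
    io_errors: int  # hypothetical counter, fed from whatever source you trust


def pdisk_state(pdisk, error_threshold=0):
    """A pdisk is degraded once its I/O error count exceeds the threshold."""
    if pdisk.io_errors > error_threshold:
        return State.DEGRADED
    return State.HEALTHY


def rollup(vdisks_ok, pdisks):
    """Report VDISK and PDISK as separate categories, plus a degraded count."""
    states = [pdisk_state(p) for p in pdisks]
    worst = max(states, key=lambda s: s.value, default=State.HEALTHY)
    return {
        "VDISK": State.HEALTHY if vdisks_ok else State.DEGRADED,
        "PDISK": worst,
        "pdisks_degraded": sum(s is State.DEGRADED for s in states),
    }


if __name__ == "__main__":
    # Made-up pdisk names; the second one carries the 32 errors from above.
    report = rollup(True, [Pdisk("pdisk-a", 0), Pdisk("pdisk-b", 32)])
    for key, value in report.items():
        print(key, value)
```

The point is that the vdisks can stay healthy while the PDISK category goes degraded, which is exactly the state this system was in.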
> The idea is to improve the disk hospital more and more, so that the
> decision to switch a disk back to OK is more accurate, over time.
>
> Until then .. it might always be a good idea to scan the event log for
> pdisk errors ...
>
That is my conclusion, that mmhealth is as useful as a chocolate teapot
because you can't rely on it to provide correct information and I need
to do my own health monitoring of the system.
JAB.
--
Jonathan A. Buzzard Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG