[gpfsug-discuss] Bad disk but not failed in DSS-G

Achim Rehor Achim.Rehor at de.ibm.com
Mon Jun 24 13:16:52 BST 2024


Well ... not necessarily 😄
But as for the disk: just as I expected, taking it out helps a lot.

Automatically taking a disk out when it raises too many errors is a discussion I have had several times with GNR development.
The real issue is that I/O errors on disks (as seen in the mmlsrecoverygroupevents logs) can have several causes: the disk itself,
the expander, the IOM, the adapter, the cable ...
If the faulty part is a more general one serving, say, 5 or more pdisks, taking all of them out automatically would put the fault tolerance (FT) at risk.
So we don't do that.

The idea is to keep improving the disk hospital, so that over time its decision to switch a disk back to OK becomes more accurate.

Until then, it is probably always a good idea to scan the event log for pdisk errors ...
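
A minimal sketch of such a scan, in Python, under stated assumptions: the event log text has been saved to a file, error events contain the word "error", and pdisk names look like e1d1s01 or e2s024. The regular expressions, the function names and the threshold of 10 are all hypothetical and need adjusting to what your own log actually shows.

```python
import re
import sys
from collections import Counter

# ASSUMPTION: pdisk names look like e1d1s01 or e2s024. Adjust to your site.
PDISK_NAME = re.compile(r"\b(e\d+(?:d\d+)?s\d+)\b")
# ASSUMPTION: lines of interest contain the word "error" (any case).
ERROR_HINT = re.compile(r"error", re.IGNORECASE)


def count_pdisk_errors(lines):
    """Count error lines per pdisk name found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        if not ERROR_HINT.search(line):
            continue  # not an error line
        match = PDISK_NAME.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def suspects(counts, threshold=10):
    """Return sorted pdisk names with at least `threshold` error lines."""
    return sorted(name for name, n in counts.items() if n >= threshold)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python3 scan_pdisk_errors.py saved_event_log.txt
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        totals = count_pdisk_errors(handle)
    for name, n in totals.most_common():
        print(f"{n:6d}  {name}")
```

Many pdisks showing errors at the same time would point at a shared part (expander, IOM, adapter, cable) rather than at one disk, which is why the sketch only reports counts and leaves the decision to the admin.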


--

Mit freundlichen Grüßen / Kind regards

Achim Rehor



-----Original Message-----
From: Jonathan Buzzard <jonathan.buzzard at strath.ac.uk>
Reply-To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
To: gpfsug-discuss at gpfsug.org
Subject: [EXTERNAL] Re: [gpfsug-discuss] Bad disk but not failed in DSS-G
Date: Mon, 24 Jun 2024 10:41:50 +0100

On 20/06/2024 23:32, Achim Rehor wrote:

[SNIP]

Fred is most probably correct here. The two errors are not necessarily
the same.


Turns out Fred was incorrect: having pushed the bad disk out of the file
system, the backups magically started working again. Not that that
should come as the slightest surprise to anyone.

However finding I have a bad disk because the backups are failing is not
good at all because it means I can't rely on GPFS's health monitoring to
accurately report the state of the file system :-(

It also raises the question: with hundreds of I/O errors on a disk, why
was it not failed by GPFS? What criteria does GPFS use for deciding if a
disk is bad? Clearly they are not accurate.


JAB.
