[gpfsug-discuss] Bad disk but not failed in DSS-G

Achim Rehor Achim.Rehor at de.ibm.com
Thu Jun 20 23:32:26 BST 2024


Fred is most probably correct here. The two errors are not necessarily the same issue.

I would guess that looking at

# mmlsrecoverygroupevents dssg2
or
# mmvdisk recoverygroup list --recovery-group dssg2  --events

you would see e1d2s25 listed multiple times, changing its state from ok to diagnosing and back to ok.
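For example, to pick out just that pdisk's entries, a plain grep over the event log output should do:

# mmlsrecoverygroupevents dssg2 | grep e1d2s25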

If you feel this is recurring too often (and I tend to agree, given the number of IOErrors),
you can always '--simulate-failing' this pdisk and then replace it:
# mmvdisk pdisk change --recovery-group dssg2 --pdisk e1d2s25 --simulate-failing
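Once the pdisk is marked as failing and shows up as replaceable, the usual replacement sequence should apply; roughly, as a sketch using the same pdisk:

# mmvdisk pdisk replace --prepare --recovery-group dssg2 --pdisk e1d2s25
(physically swap the drive in slot e1d2s25)
# mmvdisk pdisk replace --recovery-group dssg2 --pdisk e1d2s25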



--

Mit freundlichen Grüßen / Kind regards

Achim Rehor

Technical Support Specialist Spectrum Scale and ESS (SME)
Advisory Product Services Professional
IBM Systems Storage Support - EMEA

Achim.Rehor at de.ibm.com +49-170-4521194
IBM Deutschland GmbH
Chairman of the Supervisory Board: Sebastian Krause
Management: Gregor Pillen (Chairman), Nicole Reimer,
Gabriele Schwarenthorer, Christine Rupp, Frank Theisen
Registered office: Ehningen / Register court: Amtsgericht Stuttgart, HRB 14562 / WEEE Reg. No. DE 99369940


-----Original Message-----
From: Fred Stock <stockf at us.ibm.com>
Reply-To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
To: gpfsug main discussion list <gpfsug-discuss at gpfsug.org>
Subject: [EXTERNAL] Re: [gpfsug-discuss] Bad disk but not failed in DSS-G
Date: Thu, 20 Jun 2024 21:02:43 +0000

I think you are seeing two different errors.  The backup is failing due to a stale file handle error, which usually means the file system was unmounted while the file handle was open.  The write error on the physical disk may have contributed to the stale file handle, but I doubt that is the case.  As I understand it, a single I/O error on a physical disk in an ESS (DSS) system will not cause the disk to be considered bad, which is likely why the system considers the disk to be ok.  I suggest you track down the source of the stale file handle and correct that issue to see whether your backups are then successful again.
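As a first check (just a suggestion), mmlsmount will show whether the file system is actually mounted on every node that should have it:

# mmlsmount all -L

A node where the file system was unmounted or remounted while file handles were still open would fit the pattern described above.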

Fred

Fred Stock, Spectrum Scale Development Advocacy
stockf at us.ibm.com | 720-430-8821



From: gpfsug-discuss <gpfsug-discuss-bounces at gpfsug.org> on behalf of Jonathan Buzzard <jonathan.buzzard at strath.ac.uk>
Date: Thursday, June 20, 2024 at 4:16 PM
To: gpfsug-discuss at gpfsug.org <gpfsug-discuss at gpfsug.org>
Subject: [EXTERNAL] [gpfsug-discuss] Bad disk but not failed in DSS-G

So this came to light because I was checking the mmbackup logs and found that
we had not had a successful backup for several days, and were seeing lots of
errors like this:

Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E]
Error on gpfs_iopen([/gpfs/users/xxxyyyyy/.swr],68050746): Stale file handle
Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E]
Summary of errors:: _dirscan failures:3, _serious unclassified errors:3.

After some digging around wondering what was going on, I came across
these being logged on one of the DSS-G nodes:

[Wed Jun 12 22:22:05 2024] blk_update_request: I/O error, dev sdbv,
sector 9144672512 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0

Yikes, looks like I have a failed disk! However, if I do

[root at gpfs2 ~]# mmvdisk pdisk list --recovery-group all --not-ok
mmvdisk: All pdisks are ok.

Clearly that's a load of rubbish.

After a lot more prodding

[root at gpfs2 ~]# mmvdisk pdisk list --recovery-group dssg2 --pdisk e1d2s25 -L
pdisk:
    replacementPriority = 1000
    name = "e1d2s25"
    device =
"//gpfs1/dev/sdft(notEnabled),//gpfs1/dev/sdfu(notEnabled),//gpfs2/dev/sdfb,//gpfs2/dev/sdbv"
    recoveryGroup = "dssg2"
    declusteredArray = "DA1"
    state = "ok"
    IOErrors = 444
    IOTimeouts = 8958
    mediaErrors = 15


What on earth gives? Why has the disk not been failed? It's not great
that a clearly bad disk is allowed to stick around in the file system
and cause problems IMHO.

When I try to prepare the disk for removal I get:

[root at gpfs2 ~]# mmvdisk pdisk replace --prepare --rg dssg2 --pdisk e1d2s25
mmvdisk: Pdisk e1d2s25 of recovery group dssg2 is not currently
scheduled for replacement.
mmvdisk:
mmvdisk:
mmvdisk: Command failed. Examine previous error messages to determine cause.

Do I have to use the --force option? I would like to get this disk out of
the file system ASAP.



JAB.

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss_gpfsug.org

