[gpfsug-discuss] Bad disk but not failed in DSS-G

Jonathan Buzzard jonathan.buzzard at strath.ac.uk
Thu Jun 20 21:14:09 BST 2024


This came to light because I was checking the mmbackup logs and found 
that we had not had a successful backup for several days and were 
seeing lots of errors like this

Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Error on gpfs_iopen([/gpfs/users/xxxyyyyy/.swr],68050746): Stale file handle
Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Summary of errors:: _dirscan failures:3, _serious unclassified errors:3.

After some digging around, wondering what was going on, I came across 
messages like this being logged on one of the DSS-G nodes

[Wed Jun 12 22:22:05 2024] blk_update_request: I/O error, dev sdbv, sector 9144672512 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0
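
For anyone wanting to check their own nodes, here is a minimal sketch 
(Python 3 assumed) that tallies kernel I/O error lines like the one 
above per block device. The function name and the regular expression 
are illustrative assumptions based only on the message format shown 
here; other kernel versions may word the message differently.

```python
import re
from collections import Counter

# Assumed message format, taken from the kernel line quoted in this post:
#   blk_update_request: I/O error, dev sdbv, sector 9144672512 op 0x1:(WRITE) ...
IO_ERROR_RE = re.compile(r"blk_update_request: I/O error, dev (\w+)")


def count_io_errors(log_text):
    """Return a Counter mapping block device name to number of I/O error lines."""
    return Counter(m.group(1) for m in IO_ERROR_RE.finditer(log_text))


if __name__ == "__main__":
    # Sample input copied from this post; in practice feed in saved dmesg output.
    sample = (
        "[Wed Jun 12 22:22:05 2024] blk_update_request: I/O error, dev sdbv, "
        "sector 9144672512 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0\n"
    )
    for dev, hits in count_io_errors(sample).most_common():
        print(dev, hits)
```

A device that keeps turning up in that tally is the one to go and look 
for in the pdisk listing.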

Yikes, looks like I have a failed disk. However if I do

[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group all --not-ok
mmvdisk: All pdisks are ok.

Clearly that's a load of rubbish.

After a lot more prodding

[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group dssg2 --pdisk e1d2s25 -L
pdisk:
    replacementPriority = 1000
    name = "e1d2s25"
    device = "//gpfs1/dev/sdft(notEnabled),//gpfs1/dev/sdfu(notEnabled),//gpfs2/dev/sdfb,//gpfs2/dev/sdbv"
    recoveryGroup = "dssg2"
    declusteredArray = "DA1"
    state = "ok"
    IOErrors = 444
    IOTimeouts = 8958
    mediaErrors = 15
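
The mismatch is easy to check for mechanically. Below is a rough 
sketch (Python 3 assumed) that parses one pdisk stanza in the layout 
shown above and flags a pdisk that reports state "ok" while carrying 
nonzero error counters. The function names, the choice of counters and 
the "anything above zero" threshold are all my assumptions for 
illustration; this is not how GNR itself decides to fail a disk, and 
real -L output may contain more fields than appear here.

```python
import re

# Counter fields as they appear in the stanza quoted in this post.
COUNTERS = ("IOErrors", "IOTimeouts", "mediaErrors")


def parse_pdisk_stanza(text):
    """Parse 'key = value' lines of one pdisk stanza into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        m = re.match(r"\s*(\w+)\s*=\s*(.*)$", line)
        if m:
            # Drop surrounding whitespace and double quotes from the value.
            fields[m.group(1)] = m.group(2).strip().strip('"')
    return fields


def suspicious(fields):
    """True if the pdisk says state ok but any error counter is above zero."""
    if fields.get("state") != "ok":
        return False
    # Missing or empty counters are treated as zero (an assumption).
    return any(int(fields.get(name) or 0) > 0 for name in COUNTERS)
```

Fed the stanza above, parse_pdisk_stanza() gives a dict with name 
"e1d2s25" and state "ok", and suspicious() returns True for it because 
all three counters are nonzero.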


What on earth gives? Why has the disk not been failed? It's not great 
that a clearly bad disk is allowed to stick around in the file system 
and cause problems IMHO.

When I try and prepare the disk for removal I get

[root@gpfs2 ~]# mmvdisk pdisk replace --prepare --rg dssg2 --pdisk e1d2s25
mmvdisk: Pdisk e1d2s25 of recovery group dssg2 is not currently scheduled for replacement.
mmvdisk:
mmvdisk:
mmvdisk: Command failed. Examine previous error messages to determine cause.

Do I have to use the --force option? I would like to get this disk out 
of the file system ASAP.



JAB.

-- 
Jonathan A. Buzzard                         Tel: +44141-5483420
HPC System Administrator, ARCHIE-WeSt.
University of Strathclyde, John Anderson Building, Glasgow. G4 0NG


