<html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<meta name="Generator" content="Microsoft Word 15 (filtered medium)">


<style><!--


/* Font Definitions */


@font-face


        {font-family:"Cambria Math";


        panose-1:2 4 5 3 5 4 6 3 2 4;}


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


@font-face


        {font-family:Verdana;


        panose-1:2 11 6 4 3 5 4 4 2 4;}


@font-face


        {font-family:Aptos;


        panose-1:2 11 0 4 2 2 2 2 2 4;}


@font-face


        {font-family:"Times New Roman \(Body CS\)";


        panose-1:2 11 6 4 2 2 2 2 2 4;}


/* Style Definitions */


p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0in;


        font-size:12.0pt;


        font-family:"Aptos",sans-serif;}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:blue;


        text-decoration:underline;}


span.EmailStyle19


        {mso-style-type:personal-reply;


        font-family:"Arial",sans-serif;


        color:windowtext;


        font-weight:normal;


        font-style:normal;}


.MsoChpDefault


        {mso-style-type:export-only;


        font-size:10.0pt;


        mso-ligatures:none;}


@page WordSection1


        {size:8.5in 11.0in;


        margin:1.0in 1.0in 1.0in 1.0in;}


div.WordSection1


        {page:WordSection1;}


--></style>


</head>


<body lang="EN-US" link="blue" vlink="purple" style="word-wrap:break-word">


<div class="WordSection1">


<p class="MsoNormal"><span style="font-size:14.0pt;font-family:"Arial",sans-serif">I think you are seeing two different errors. The backup is failing because of a stale file handle error, which usually means the file system was unmounted while the file handle was open. The write error on the physical disk may have contributed to the stale file handle, but I doubt that is the case. As I understand it, a single I/O error on a physical disk in an ESS (DSS) system will not cause the disk to be considered bad, which is likely why the system considers the disk to be ok. I suggest you track down the source of the stale file handle and correct that issue to see if your backups then succeed again.<o:p></o:p></span></p>
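<p class="MsoNormal">As a starting point for tracking down the stale handle, here is a minimal Python sketch that checks a list of paths and reports those where stat() fails with ESTALE ("Stale file handle"). It assumes the problem also shows up on an ordinary stat() of the path, which may not hold for the gpfs_iopen call that mmbackup makes. The names is_stale and find_stale are made up for illustration and are not part of any GPFS tooling.</p>

<pre>
```python
import errno
import os


def is_stale(path, stat=os.stat):
    """Return True if stat() on path fails with ESTALE (stale file handle).

    Other failures (missing file, permission denied) return False; this
    sketch only looks for ESTALE. The stat parameter exists so the check
    can be exercised without a real stale mount.
    """
    try:
        stat(path)
    except OSError as exc:
        return exc.errno == errno.ESTALE
    return False


def find_stale(paths, stat=os.stat):
    """Return the subset of paths whose stat() reports a stale file handle."""
    return [p for p in paths if is_stale(p, stat=stat)]
```
</pre>

<p class="MsoNormal">It could be run on the node that does the policy scan, against the directories named in the mmbackup errors, for example find_stale(["/gpfs/users/xxxyyyyy/.swr"]).</p>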


<p class="MsoNormal"><span style="font-size:14.0pt;font-family:"Arial",sans-serif"><o:p> </o:p></span></p>


<div>


<div>


<p class="MsoNormal"><span style="font-family:"Arial",sans-serif">Fred<o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-family:"Arial",sans-serif"><o:p> </o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Verdana",sans-serif;color:#121212;background:white">Fred Stock, Spectrum Scale Development Advocacy<o:p></o:p></span></p>


<p class="MsoNormal" style="margin-bottom:12.0pt"><span style="font-size:11.0pt;font-family:"Verdana",sans-serif;color:#121212;background:white"><a href="mailto:stockf@us.ibm.com"><span style="color:#0563C1">stockf@us.ibm.com</span></a> | 720-430-8821</span><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"><o:p></o:p></span></p>


<p class="MsoNormal"><span style="font-size:11.0pt;font-family:"Calibri",sans-serif"> <o:p></o:p></span></p>


</div>


</div>


<p class="MsoNormal"><span style="font-size:14.0pt;font-family:"Arial",sans-serif"><o:p> </o:p></span></p>


<p class="MsoNormal"><span style="font-size:14.0pt;font-family:"Arial",sans-serif"><o:p> </o:p></span></p>


<div id="mail-editor-reference-message-container">


<div>


<div style="border:none;border-top:solid #B5C4DF 1.0pt;padding:3.0pt 0in 0in 0in">


<p class="MsoNormal" style="margin-bottom:12.0pt"><b><span style="color:black">From:


</span></b><span style="color:black">gpfsug-discuss &lt;gpfsug-discuss-bounces@gpfsug.org&gt; on behalf of Jonathan Buzzard &lt;jonathan.buzzard@strath.ac.uk&gt;<br>


<b>Date: </b>Thursday, June 20, 2024 at 4:16</span><span style="font-family:"Arial",sans-serif;color:black"> </span><span style="color:black">PM<br>


<b>To: </b>gpfsug-discuss@gpfsug.org &lt;gpfsug-discuss@gpfsug.org&gt;<br>


<b>Subject: </b>[EXTERNAL] [gpfsug-discuss] Bad disk but not failed in DSS-G<o:p></o:p></span></p>


</div>


<div>


<p class="MsoNormal"><span style="font-size:11.0pt"><br>


This came to light because I was checking the mmbackup logs and found that <br>


we had not had any successful backups for several days and were <br>


seeing lots of errors like this:<br>


<br>


Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Error on gpfs_iopen([/gpfs/users/xxxyyyyy/.swr],68050746): Stale file handle<br>


Wed Jun 19 21:45:28 2024 mmbackup:Error encountered in policy scan: [E] Summary of errors:: _dirscan failures:3, _serious unclassified errors:3.<br>


<br>


After some digging around, wondering what was going on, I came across <br>


these being logged on one of the DSS-G nodes:<br>


<br>


[Wed Jun 12 22:22:05 2024] blk_update_request: I/O error, dev sdbv, sector 9144672512 op 0x1:(WRITE) flags 0x700 phys_seg 17 prio class 0<br>


<br>


Yikes, looks like I have a failed disk. However, if I do<br>


<br>


[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group all --not-ok<br>


mmvdisk: All pdisks are ok.<br>


<br>


Clearly that's a load of rubbish.<br>


<br>


After a lot more prodding<br>


<br>


[root@gpfs2 ~]# mmvdisk pdisk list --recovery-group dssg2 --pdisk e1d2s25 -L<br>


pdisk:<br>


    replacementPriority = 1000<br>


    name = "e1d2s25"<br>


    device = "//gpfs1/dev/sdft(notEnabled),//gpfs1/dev/sdfu(notEnabled),//gpfs2/dev/sdfb,//gpfs2/dev/sdbv"<br>


    recoveryGroup = "dssg2"<br>


    declusteredArray = "DA1"<br>


    state = "ok"<br>


    IOErrors = 444<br>


    IOTimeouts = 8958<br>


    mediaErrors = 15<br>


<br>


<br>


What on earth gives? Why has the disk not been failed? It's not great <br>


that a clearly bad disk is allowed to stick around in the file system <br>


and cause problems IMHO.<br>


<br>


When I try and prepare the disk for removal I get<br>


<br>


[root@gpfs2 ~]# mmvdisk pdisk replace --prepare --rg dssg2 --pdisk e1d2s25<br>


mmvdisk: Pdisk e1d2s25 of recovery group dssg2 is not currently scheduled for replacement.<br>


mmvdisk:<br>


mmvdisk:<br>


mmvdisk: Command failed. Examine previous error messages to determine cause.<br>


<br>


Do I have to use the --force option? I would like to get this disk out <br>


of the file system ASAP.<br>


<br>


<br>


<br>


JAB.<br>


<br>


-- <br>


Jonathan A. Buzzard                         Tel: +44141-5483420<br>


HPC System Administrator, ARCHIE-WeSt.<br>


University of Strathclyde, John Anderson Building, Glasgow. G4 0NG<br>


<br>




<o:p></o:p></span></p>
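<p class="MsoNormal">The pdisk listing quoted above reports state = "ok" alongside nonzero IOErrors, IOTimeouts and mediaErrors. A rough Python sketch for spotting that combination follows. It assumes the -L output keeps the "key = value" layout shown in the message, with one pdisk stanza per call; parse_pdisk and suspicious are hypothetical helper names, not mmvdisk features.</p>

<pre>
```python
import re

# Counter names as they appear in the pdisk listing quoted in this thread.
COUNTERS = ("IOErrors", "IOTimeouts", "mediaErrors")


def parse_pdisk(text):
    """Parse 'key = value' lines from one pdisk stanza into a dict of strings."""
    fields = {}
    for line in text.splitlines():
        # Accept both quoted values (name = "e1d2s25") and bare ones (IOErrors = 444).
        match = re.match(r'\s*(\w+)\s*=\s*"?([^"]*)"?\s*$', line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


def suspicious(fields):
    """Return the nonzero error counters of a pdisk whose state is still 'ok'."""
    if fields.get("state") != "ok":
        return {}
    counts = {name: int(fields.get(name, "0")) for name in COUNTERS}
    return {name: n for name, n in counts.items() if n != 0}
```
</pre>

<p class="MsoNormal">Feeding it the captured text of one stanza returns the counters worth a closer look, or an empty dict when the pdisk is not "ok" or every counter is zero. Whether any particular count should be enough to fail a disk is a separate question that this sketch does not try to answer.</p>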


</div>


</div>


</div>


</div>


</body>


</html>