[gpfsug-discuss] Failed NSD - help appreciated

Wahl, Edward ewahl at osc.edu
Mon May 4 16:44:51 BST 2015


What does the system say when you try 'mmchdisk ... resume' or 'mmchdisk ... start' on 5T2?
What does /var/adm/ras/mmfs.log.latest say?
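For reference, the commands I mean would be something like the following (adjust the file system and disk names to yours; this is from memory, so check the man pages):

        mmchdisk gpfs1 start -d "nsd_home_5T2"
        mmchdisk gpfs1 resume -d "nsd_home_5T2"

'start' replays missed updates and tries to bring the disk's availability back up; 'resume' only clears a suspended status.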

Ed Wahl
OSC

________________________________________
From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Roman Baranowski [roman at chem.ubc.ca]
Sent: Monday, May 04, 2015 4:12 AM
To: gpfsug-discuss at gpfsug.org
Subject: [gpfsug-discuss] Failed NSD - help appreciated

        Dear All,

First of all, my apologies if this is not the appropriate place for
such a post. However .....

I have an old IBM cluster running GPFS version 3.2

mmlsconfig:
clusterName Moraines.westgrid.ubc
clusterType lc
autoload yes
minReleaseLevel 3.2.1.5
dmapiFileHandleSize 32
pagepool 128M
[moraine9]
pagepool 1536M
[moraine1,moraine2,moraine3,moraine4,moraine5,moraine6,moraine7,moraine8]
pagepool 2048M
[common]
dataStructureDump /var/tmp/mmfs
maxFilesToCache 10000
File systems in cluster Moraines.westgrid.ubc:
----------------------------------------------
/dev/gpfs1
/dev/gpfs2


Some time ago we had suffered a few double disk failures on our SAN and
/dev/gpfs1 cannot be mounted and the mmfsck on that fs fails with:

Error accessing inode file.

InodeProblemList: 4 entries
iNum           snapId     status keep delete noScan new error
-------------- ---------- ------ ---- ------ ------ --- ------------------
             0          0      3    0      0      0   1 0x10000010 AddrCorrupt IndblockBad
             1          0      3    0      0      0   1 0x00000010 AddrCorrupt
             2          0      3    0      0      0   1 0x00000010 AddrCorrupt
             3          0      1    0      0      0   1 0x00000010 AddrCorrupt

File system check has ended prematurely.
Errors were encountered which could not be corrected.
Exit status 22:2:26.
mmfsck: Command failed.  Examine previous error messages to determine
cause.

We have corrected all SAN disk failures. The "mmlsdisk gpfs1" output shows:
disk         driver   sector failure holds    holds                              storage
name         type       size   group metadata data  status        availability  pool
------------ -------- ------ ------- -------- ----- ------------- ------------  ------------
nsd_home_1T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_2T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_3T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_4T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_5T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_1T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_2T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_3T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_4T2 nsd         512      -1 yes      yes   ready         unrecovered   system
nsd_home_5T2 nsd         512      -1 yes      yes   ready         down          system
nsd_home_6T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_7T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_6T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_7T2 nsd         512      -1 yes      yes   ready         up            system
Attention: Due to an earlier configuration change the file system
may contain data that is at risk of being lost.


We are sure that nsd_home_4T2 is gone (double disk failure on the SAN).
nsd_home_5T2 (marked down) also suffered a failure, but using the SAN
storage manager we were able to revive the array, so it should contain
valid and good data. However, all attempts to start that NSD failed.
We decided to remove the 'bad' NSDs with:

        mmdeldisk gpfs1 nsd_home_4T2 -p

and
        mmdeldisk gpfs1 nsd_home_5T2 -c
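For clarity, my understanding of those two flags (from memory of the 3.2 docs, so please verify):

        mmdeldisk Device DiskName -p    # -p: the disk is permanently damaged; remove it without trying to read it
        mmdeldisk Device DiskName -c    # -c: continue even when some blocks are unreadable (accepts data loss)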

The current state is:

disk         driver   sector failure holds    holds                              storage
name         type       size   group metadata data  status        availability  pool
------------ -------- ------ ------- -------- ----- ------------- ------------  ------------
nsd_home_1T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_2T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_3T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_4T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_5T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_1T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_2T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_3T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_4T2 nsd         512      -1 yes      yes   allocmap delp down          system
nsd_home_5T2 nsd         512      -1 yes      yes   being emptied down          system
nsd_home_6T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_7T1 nsd         512      -1 yes      yes   ready         up            system
nsd_home_6T2 nsd         512      -1 yes      yes   ready         up            system
nsd_home_7T2 nsd         512      -1 yes      yes   ready         up            system
Attention: Due to an earlier configuration change the file system
may contain data that is at risk of being lost.

Note: at the moment of failure, the gpfs1 FS (home) was ~90% full

The question I am posing here (any help or suggestions appreciated) is
the following:

Is there anything we can do to recover at least partial data without
removing the fs and restoring from the backup (another long story, with
issues we are currently addressing)? We have some unused capacity (free NSDs):

(free disk)   nsd_home_8T2 moraine1.westgrid.ubc,moraine2.westgrid.ubc

which could eventually be used.
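For context, putting that free NSD into service would be something like the following (the disk-descriptor fields are from my recollection of the 3.2 syntax, so please verify against the docs before running anything):

        # add the free NSD to gpfs1 (dataAndMetadata, failure group -1)
        mmadddisk gpfs1 "nsd_home_8T2:::dataAndMetadata:-1"
        # then a read-only check first, before attempting any repair
        mmfsck gpfs1 -n
        mmfsck gpfs1 -y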

        All the best
        Roman

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


