[gpfsug-discuss] Failed NSD - help appreciated

Roman Baranowski roman at chem.ubc.ca
Mon May 4 09:12:37 BST 2015



 	Dear All,

First of all, my apologies if this is not the appropriate place for such a post. However...

I have an old IBM cluster running GPFS version 3.2.

mmlsconfig:
clusterName Moraines.westgrid.ubc
clusterType lc
autoload yes
minReleaseLevel 3.2.1.5
dmapiFileHandleSize 32
pagepool 128M
[moraine9]
pagepool 1536M
[moraine1,moraine2,moraine3,moraine4,moraine5,moraine6,moraine7,moraine8]
pagepool 2048M
[common]
dataStructureDump /var/tmp/mmfs
maxFilesToCache 10000
File systems in cluster Moraines.westgrid.ubc:
----------------------------------------------
/dev/gpfs1
/dev/gpfs2
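
(For context: the bracketed node lists in the mmlsconfig output above are per-node overrides. If memory serves, they were set with mmchconfig and a node list, along the lines of:

 	mmchconfig pagepool=1536M -N moraine9
 	mmchconfig pagepool=2048M -N moraine1,moraine2,moraine3,moraine4,moraine5,moraine6,moraine7,moraine8

Nothing unusual there; included only for completeness.)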


Some time ago we suffered a few double disk failures on our SAN. Since then
/dev/gpfs1 cannot be mounted, and mmfsck on that file system fails with:

Error accessing inode file.

InodeProblemList: 4 entries
iNum           snapId     status keep delete noScan new error
-------------- ---------- ------ ---- ------ ------ --- ------------------
              0          0      3    0      0      0   1 0x10000010 AddrCorrupt IndblockBad
              1          0      3    0      0      0   1 0x00000010 AddrCorrupt
              2          0      3    0      0      0   1 0x00000010 AddrCorrupt
              3          0      1    0      0      0   1 0x00000010 AddrCorrupt

File system check has ended prematurely.
Errors were encountered which could not be corrected.
Exit status 22:2:26.
mmfsck: Command failed.  Examine previous error messages to determine cause.
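
(For completeness, the check was run offline, roughly as follows; as I understand the flags, -n only reports problems without repairing them, while -y attempts repairs:

 	mmumount gpfs1 -a
 	mmfsck gpfs1 -n
 	mmfsck gpfs1 -y

The output above is what we get when the check aborts.)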

We have since corrected all SAN disk failures. The output of "mmlsdisk gpfs1" shows:
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
nsd_home_1T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_2T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_3T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_4T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_5T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_1T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_2T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_3T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_4T2 nsd         512      -1 yes      yes   ready         unrecovered  system
nsd_home_5T2 nsd         512      -1 yes      yes   ready         down         system
nsd_home_6T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_7T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_6T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_7T2 nsd         512      -1 yes      yes   ready         up           system
Attention: Due to an earlier configuration change the file system
may contain data that is at risk of being lost.


We are sure that nsd_home_4T2 is gone (double disk failure on the SAN).
nsd_home_5T2 (marked down) also suffered a failure, but using the SAN
storage manager we were able to revive the array, so it should contain
valid, good data. However, all attempts to start that NSD failed.
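
(The start attempts were variations of the standard invocation:

 	mmchdisk gpfs1 start -d nsd_home_5T2

none of which brought the disk back up.)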
We decided to remove the 'bad' nsd_home_4T2 with:

 	mmdeldisk gpfs1 nsd_home_4T2 -p

and the down nsd_home_5T2 with:

 	mmdeldisk gpfs1 nsd_home_5T2 -c
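
(Our understanding of the flags, which may well be wrong: -p tells mmdeldisk the disk is permanently damaged, so no attempt is made to copy its data off; -c continues the deletion even when some data on the disk cannot be read.)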

The current state is:

disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
nsd_home_1T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_2T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_3T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_4T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_5T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_1T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_2T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_3T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_4T2 nsd         512      -1 yes      yes   allocmap delp down         system
nsd_home_5T2 nsd         512      -1 yes      yes   being emptied down         system
nsd_home_6T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_7T1 nsd         512      -1 yes      yes   ready         up           system
nsd_home_6T2 nsd         512      -1 yes      yes   ready         up           system
nsd_home_7T2 nsd         512      -1 yes      yes   ready         up           system
Attention: Due to an earlier configuration change the file system
may contain data that is at risk of being lost.

Note: the gpfs1 file system (home) was ~90% full at the moment of failure.

The question I am posing here (any help or suggestions are appreciated) is
the following:

Is there anything we can do to recover at least partial data without removing
gpfs1 and restoring from backup (another long story, with issues we are
currently addressing)? We have some unused capacity (a free NSD):

(free disk)   nsd_home_8T2 moraine1.westgrid.ubc,moraine2.westgrid.ubc

which could eventually be used.
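
If the spare is usable, I imagine bringing it in would look something like this (disk-descriptor syntax from memory, so please correct me):

 	mmadddisk gpfs1 "nsd_home_8T2:::dataAndMetadata:-1"
 	mmrestripefs gpfs1 -r

but before touching anything further we would rather hear what the experts think.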

 	All the best
 	Roman



