[gpfsug-discuss] Failed NSD - help appreciated
Wahl, Edward
ewahl at osc.edu
Mon May 4 16:44:51 BST 2015
What does the system say when you try to mmchdisk blah blah 'resume or start' on 5T2?
What does /var/adm/ras/mmfs.log.latest say?
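For clarity, assuming the disk you want back is nsd_home_5T2, I mean something along the lines of:

    mmchdisk gpfs1 start -d nsd_home_5T2

The exact error that command prints, plus the tail of mmfs.log.latest on the node that ran it, would be the useful parts to post.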
Ed Wahl
OSC
________________________________________
From: gpfsug-discuss-bounces at gpfsug.org [gpfsug-discuss-bounces at gpfsug.org] on behalf of Roman Baranowski [roman at chem.ubc.ca]
Sent: Monday, May 04, 2015 4:12 AM
To: gpfsug-discuss at gpfsug.org
Subject: [gpfsug-discuss] Failed NSD - help appreciated
Dear All,
First of all, my apologies if this is not the appropriate place to post
such a question. However .....
I have an old IBM cluster running GPFS version 3.2
mmlsconfig:
clusterName Moraines.westgrid.ubc
clusterType lc
autoload yes
minReleaseLevel 3.2.1.5
dmapiFileHandleSize 32
pagepool 128M
[moraine9]
pagepool 1536M
[moraine1,moraine2,moraine3,moraine4,moraine5,moraine6,moraine7,moraine8]
pagepool 2048M
[common]
dataStructureDump /var/tmp/mmfs
maxFilesToCache 10000
File systems in cluster Moraines.westgrid.ubc:
----------------------------------------------
/dev/gpfs1
/dev/gpfs2
Some time ago we suffered a few double disk failures on our SAN;
/dev/gpfs1 can no longer be mounted, and mmfsck on that file system fails with:
Error accessing inode file.
InodeProblemList: 4 entries
iNum snapId status keep delete noScan new error
-------------- ---------- ------ ---- ------ ------ --- ------------------
0 0 3 0 0 0 1 0x10000010
AddrCorrupt IndblockBad
1 0 3 0 0 0 1 0x00000010
AddrCorrupt
2 0 3 0 0 0 1 0x00000010
AddrCorrupt
3 0 1 0 0 0 1 0x00000010
AddrCorrupt
File system check has ended prematurely.
Errors were encountered which could not be corrected.
Exit status 22:2:26.
mmfsck: Command failed. Examine previous error messages to determine
cause.
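For reference, the usual invocations of the offline check are of the form (not necessarily the exact flags we used):

    mmfsck gpfs1 -n    # report problems only, change nothing
    mmfsck gpfs1 -y    # attempt to repair what it finds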
We have corrected all SAN disk failures. The "mmlsdisk gpfs1" output shows:
disk         driver   sector failure holds    holds                              storage
name         type     size   group   metadata data  status        availability  pool
------------ -------- ------ ------- -------- ----- ------------- ------------- ------------
nsd_home_1T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_2T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_3T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_4T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_5T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_1T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_2T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_3T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_4T2 nsd      512    -1      yes      yes   ready         unrecovered   system
nsd_home_5T2 nsd      512    -1      yes      yes   ready         down          system
nsd_home_6T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_7T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_6T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_7T2 nsd      512    -1      yes      yes   ready         up            system
Attention: Due to an earlier configuration change the file system
may contain data that is at risk of being lost.
We are sure that nsd_home_4T2 is gone (double disk failure on the SAN).
nsd_home_5T2 (marked down) also suffered a failure, but using the SAN
storage manager we were able to revive the array, and it should contain
valid, good data. However, all attempts to start that NSD failed.
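The start attempts were essentially the standard commands, something like (a sketch, not a verbatim transcript):

    mmchdisk gpfs1 start -d nsd_home_5T2
    mmlsdisk gpfs1 -e    # list disks that are not ready/up

and mmchdisk fails each time.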
We decided to remove the 'bad' nsd_home_4T2 (and subsequently nsd_home_5T2) with:
mmdeldisk gpfs1 nsd_home_4T2 -p
and
mmdeldisk gpfs1 nsd_home_5T2 -c
The current state is:
disk         driver   sector failure holds    holds                              storage
name         type     size   group   metadata data  status        availability  pool
------------ -------- ------ ------- -------- ----- ------------- ------------- ------------
nsd_home_1T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_2T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_3T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_4T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_5T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_1T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_2T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_3T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_4T2 nsd      512    -1      yes      yes   allocmap delp down          system
nsd_home_5T2 nsd      512    -1      yes      yes   being emptied down          system
nsd_home_6T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_7T1 nsd      512    -1      yes      yes   ready         up            system
nsd_home_6T2 nsd      512    -1      yes      yes   ready         up            system
nsd_home_7T2 nsd      512    -1      yes      yes   ready         up            system
Attention: Due to an earlier configuration change the file system
may contain data that is at risk of being lost.
Note: the gpfs1 file system (home) was ~90% full at the moment of failure.
The question I am posing here (any help or suggestions are appreciated) is
the following:
Is there anything we can do to recover at least some of the data without
removing the file system (gpfs1) and restoring from backup (another long
story, with issues we are currently addressing)? We have some unused
capacity (free NSDs) which could eventually be used:
(free disk) nsd_home_8T2 moraine1.westgrid.ubc,moraine2.westgrid.ubc
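If it matters, the way that free NSD would be brought into the file system (once/if gpfs1 is repairable) would be something like:

    mmadddisk gpfs1 "nsd_home_8T2:::dataAndMetadata:-1" -r

i.e. add the already-defined NSD as dataAndMetadata in failure group -1 and restripe afterwards; whether anything like this is safe in the current state is part of what I am asking.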
All the best
Roman
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss