[gpfsug-discuss] Re: AFM Crashing the MDS
Radhika A Parameswaran
radhika.p at in.ibm.com
Thu Jul 28 06:43:13 BST 2016
Luke,
AFM is not tested in cascading configurations; this is being added to the
documentation for 4.2.1:
"Cascading of AFM caches is not tested."
Thanks and Regards
Radhika
From: gpfsug-discuss-request at spectrumscale.org
To: gpfsug-discuss at spectrumscale.org
Date: 07/27/2016 04:30 PM
Subject: gpfsug-discuss Digest, Vol 54, Issue 59
Sent by: gpfsug-discuss-bounces at spectrumscale.org
Today's Topics:
1. AFM Crashing the MDS (Luke Raimbach)
----------------------------------------------------------------------
Message: 1
Date: Tue, 26 Jul 2016 14:17:35 +0000
From: Luke Raimbach <Luke.Raimbach at crick.ac.uk>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: [gpfsug-discuss] AFM Crashing the MDS
Hi All,
Anyone seen GPFS barf like this before? I'll explain the setup:
RO AFM cache at the remote site (cache A) for reading remote datasets quickly;
LU AFM cache at the destination site (cache B) for caching data from cache A
(a local compute cluster mounts this over multi-cluster);
IW AFM cache at the destination site (cache C) for presenting cache B over NAS
protocols.
Reading files in cache C should pull data from the remote source through
cache A -> B -> C. Modifying files in cache C should pull data into cache B
and then break the cache relationship for that file, converting it to a local
copy. Those modifications should include metadata updates (e.g. chown).
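For reference, the chain is built from AFM filesets roughly as below. This is
a sketch only: the filesystem/fileset names, gateway hostnames and export
paths are made up, and some hops may really use the native GPFS backend
rather than NFS:

    # Cache A (remote site): read-only cache of the source export
    mmcrfileset fsA cacheA --inode-space new \
        -p afmMode=ro -p afmTarget=nfs://source-server/export/dataset
    mmlinkfileset fsA cacheA -J /gpfs/fsA/cacheA

    # Cache B (destination site): local-updates cache of cache A
    mmcrfileset fsB cacheB --inode-space new \
        -p afmMode=lu -p afmTarget=nfs://siteA-gateway/gpfs/fsA/cacheA
    mmlinkfileset fsB cacheB -J /gpfs/fsB/cacheB

    # Cache C (destination site): independent-writer cache of cache B,
    # exported to clients over NAS protocols
    mmcrfileset fsC cacheC --inode-space new \
        -p afmMode=iw -p afmTarget=nfs://siteB-gateway/gpfs/fsB/cacheB
    mmlinkfileset fsC cacheC -J /gpfs/fsC/cacheC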
To speed things up we prefetch files into cache B for datasets which are
undergoing migration and have entered a read-only state at the source.
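The prefetch itself is driven with mmafmctl, along these lines (again a
sketch; the real list generation is more involved, and the names are made up):

    # Build a list of files for one dataset and queue it on cache B's gateway
    find /gpfs/fsB/cacheB/dataset01 -type f > /tmp/dataset01.files
    mmafmctl fsB prefetch -j cacheB --list-file /tmp/dataset01.files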
When issuing chown on a directory in cache C containing ~4.5 million files,
the MDS for AFM cache C crashes badly.
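The trigger is nothing exotic; it is essentially this (the path and ownership
here are illustrative, not the real ones):

    chown -R projectuser:projectgroup /gpfs/fsC/cacheC/dataset01

Presumably each of those ~4.5 million inodes becomes a metadata update for
the gateway to queue, and shortly after the command starts mmfsd asserts and
goes down: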
Tue Jul 26 13:28:52.487 2016: [X] logAssertFailed: addr.isReserved() || addr.getClusterIdx() == clusterIdx
Tue Jul 26 13:28:52.488 2016: [X] return code 0, reason code 1, log record tag 0
Tue Jul 26 13:28:53.392 2016: [X] *** Assert exp(addr.isReserved() || addr.getClusterIdx() == clusterIdx) in line 1936 of file /project/sprelbmd0/build/rbmd0s003a/src/avs/fs/mmfs/ts/cfgmgr/cfgmgr.h
Tue Jul 26 13:28:53.393 2016: [E] *** Traceback:
Tue Jul 26 13:28:53.394 2016: [E] 2:0x7F6DC95444A6 logAssertFailed + 0x2D6 at ??:0
Tue Jul 26 13:28:53.395 2016: [E] 3:0x7F6DC95C7EF4 ClusterConfiguration::getGatewayNewHash(DiskUID, unsigned int, NodeAddr*) + 0x4B4 at ??:0
Tue Jul 26 13:28:53.396 2016: [E] 4:0x7F6DC95C8031 ClusterConfiguration::getGatewayNode(DiskUID, unsigned int, NodeAddr, NodeAddr*, unsigned int) + 0x91 at ??:0
Tue Jul 26 13:28:53.397 2016: [E] 5:0x7F6DC9DC7126 SFSPcache(StripeGroup*, FileUID, int, int, void*, int, voidXPtr*, int*) + 0x346 at ??:0
Tue Jul 26 13:28:53.398 2016: [E] 6:0x7F6DC9332494 HandleMBPcache(MBPcacheParms*) + 0xB4 at ??:0
Tue Jul 26 13:28:53.399 2016: [E] 7:0x7F6DC90A4A53 Mailbox::msgHandlerBody(void*) + 0x3C3 at ??:0
Tue Jul 26 13:28:53.400 2016: [E] 8:0x7F6DC908BC06 Thread::callBody(Thread*) + 0x46 at ??:0
Tue Jul 26 13:28:53.401 2016: [E] 9:0x7F6DC907A0D2 Thread::callBodyWrapper(Thread*) + 0xA2 at ??:0
Tue Jul 26 13:28:53.402 2016: [E] 10:0x7F6DC87A3AA1 start_thread + 0xD1 at ??:0
Tue Jul 26 13:28:53.403 2016: [E] 11:0x7F6DC794A93D clone + 0x6D at ??:0
mmfsd: /project/sprelbmd0/build/rbmd0s003a/src/avs/fs/mmfs/ts/cfgmgr/cfgmgr.h:1936: void logAssertFailed(UInt32, const char*, UInt32, Int32, Int32, UInt32, const char*, const char*): Assertion `addr.isReserved() || addr.getClusterIdx() == clusterIdx' failed.
Tue Jul 26 13:28:53.404 2016: [N] Signal 6 at location 0x7F6DC7894625 in process 6262, link reg 0xFFFFFFFFFFFFFFFF.
Tue Jul 26 13:28:53.405 2016: [I] rax 0x0000000000000000 rbx 0x00007F6DC8DCB000
Tue Jul 26 13:28:53.406 2016: [I] rcx 0xFFFFFFFFFFFFFFFF rdx 0x0000000000000006
Tue Jul 26 13:28:53.407 2016: [I] rsp 0x00007F6DAAEA01F8 rbp 0x00007F6DCA05C8B0
Tue Jul 26 13:28:53.408 2016: [I] rsi 0x00000000000018F8 rdi 0x0000000000001876
Tue Jul 26 13:28:53.409 2016: [I] r8 0xFEFEFEFEFEFEFEFF r9 0xFEFEFEFEFF092D63
Tue Jul 26 13:28:53.410 2016: [I] r10 0x0000000000000008 r11 0x0000000000000202
Tue Jul 26 13:28:53.411 2016: [I] r12 0x00007F6DC9FC5540 r13 0x00007F6DCA05C1C0
Tue Jul 26 13:28:53.412 2016: [I] r14 0x0000000000000000 r15 0x0000000000000000
Tue Jul 26 13:28:53.413 2016: [I] rip 0x00007F6DC7894625 eflags 0x0000000000000202
Tue Jul 26 13:28:53.414 2016: [I] csgsfs 0x0000000000000033 err 0x0000000000000000
Tue Jul 26 13:28:53.415 2016: [I] trapno 0x0000000000000000 oldmsk 0x0000000010017807
Tue Jul 26 13:28:53.416 2016: [I] cr2 0x0000000000000000
Tue Jul 26 13:28:54.225 2016: [D] Traceback:
Tue Jul 26 13:28:54.226 2016: [D] 0:00007F6DC7894625 raise + 35 at ??:0
Tue Jul 26 13:28:54.227 2016: [D] 1:00007F6DC7895E05 abort + 175 at ??:0
Tue Jul 26 13:28:54.228 2016: [D] 2:00007F6DC788D74E __assert_fail_base + 11E at ??:0
Tue Jul 26 13:28:54.229 2016: [D] 3:00007F6DC788D810 __assert_fail + 50 at ??:0
Tue Jul 26 13:28:54.230 2016: [D] 4:00007F6DC95444CA logAssertFailed + 2FA at ??:0
Tue Jul 26 13:28:54.231 2016: [D] 5:00007F6DC95C7EF4 ClusterConfiguration::getGatewayNewHash(DiskUID, unsigned int, NodeAddr*) + 4B4 at ??:0
Tue Jul 26 13:28:54.232 2016: [D] 6:00007F6DC95C8031 ClusterConfiguration::getGatewayNode(DiskUID, unsigned int, NodeAddr, NodeAddr*, unsigned int) + 91 at ??:0
Tue Jul 26 13:28:54.233 2016: [D] 7:00007F6DC9DC7126 SFSPcache(StripeGroup*, FileUID, int, int, void*, int, voidXPtr*, int*) + 346 at ??:0
Tue Jul 26 13:28:54.234 2016: [D] 8:00007F6DC9332494 HandleMBPcache(MBPcacheParms*) + B4 at ??:0
Tue Jul 26 13:28:54.235 2016: [D] 9:00007F6DC90A4A53 Mailbox::msgHandlerBody(void*) + 3C3 at ??:0
Tue Jul 26 13:28:54.236 2016: [D] 10:00007F6DC908BC06 Thread::callBody(Thread*) + 46 at ??:0
Tue Jul 26 13:28:54.237 2016: [D] 11:00007F6DC907A0D2 Thread::callBodyWrapper(Thread*) + A2 at ??:0
Tue Jul 26 13:28:54.238 2016: [D] 12:00007F6DC87A3AA1 start_thread + D1 at ??:0
Tue Jul 26 13:28:54.239 2016: [D] 13:00007F6DC794A93D clone + 6D at ??:0
Tue Jul 26 13:28:54.240 2016: [N] Restarting mmsdrserv
Tue Jul 26 13:28:55.535 2016: [N] Signal 6 at location 0x7F6DC790EA7D in process 6262, link reg 0xFFFFFFFFFFFFFFFF.
Tue Jul 26 13:28:55.536 2016: [N] mmfsd is shutting down.
Tue Jul 26 13:28:55.537 2016: [N] Reason for shutdown: Signal handler entered
Tue Jul 26 13:28:55 BST 2016: mmcommon mmfsdown invoked. Subsystem: mmfs Status: active
Tue Jul 26 13:28:55 BST 2016: /var/mmfs/etc/mmfsdown invoked
umount2: Device or resource busy
umount: /camp: device is busy.
        (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))
umount2: Device or resource busy
umount: /ingest: device is busy.
        (In some cases useful info about processes that use the device is found by lsof(8) or fuser(1))
Shutting down NFS daemon: [ OK ]
Shutting down NFS mountd: [ OK ]
Shutting down NFS quotas: [ OK ]
Shutting down NFS services: [ OK ]
Shutting down RPC idmapd: [ OK ]
Stopping NFS statd: [ OK ]
Ugly, right?
Cheers,
Luke.
Luke Raimbach
Senior HPC Data and Storage Systems Engineer,
The Francis Crick Institute,
Gibbs Building,
215 Euston Road,
London NW1 2BE.
E: luke.raimbach at crick.ac.uk
W: www.crick.ac.uk
The Francis Crick Institute Limited is a registered charity in England and
Wales no. 1140062 and a company registered in England and Wales no.
06885462, with its registered office at 215 Euston Road, London NW1 2BE.
------------------------------
End of gpfsug-discuss Digest, Vol 54, Issue 59
**********************************************