[gpfsug-discuss] AFM Crashing the MDS

Luke Raimbach Luke.Raimbach at crick.ac.uk
Thu Jul 28 09:30:59 BST 2016


Dear Radhika,

In the early days of AFM and at two separate GPFS UK User Group meetings, I discussed AFM cache chaining with IBM technical people plus at least one developer. My distinct recollection of the outcome was that cache chaining was supported.

Nevertheless, the difference between my recollection and what is being reported now is beside the point. We are stuck with large volumes of data being migrated in this fashion, so there is clearly a customer use case for chaining AFM caches.

It would be much more helpful if IBM could take on this case and look at the suspected bug that's been flushed out here.

The real-world observation in the field is that queuing large numbers of metadata updates on the MDS itself causes this crash, whereas issuing the same updates from another node in the cache cluster adds them to the MDS queue and the crash does not happen. My guess is that there is a bug whereby daemon-local additions to the MDS queue aren't handled correctly (I speculate further that there is a memory leak for local MDS operations, but that needs more testing, which I don't have time for; perhaps IBM could try it out?). When a metadata update arrives via an RPC from another node, however, it is added to the queue and handled correctly. A workaround, if you will.
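For anyone hitting the same assert, this is a minimal sketch of the workaround described above; all node names and paths are hypothetical, and "gw01" stands for whichever node is currently acting as the AFM gateway (MDS) for the fileset:

```shell
# Crashed the MDS in our experience: queuing the bulk metadata
# updates daemon-locally on the gateway node itself.
# ssh gw01 'chown -R projuser:projgrp /camp/dataset01'

# Workaround: issue the same recursive chown from any other node in
# the cache cluster, so each update reaches the MDS queue via RPC
# rather than through the daemon-local path.
ssh client01 'chown -R projuser:projgrp /camp/dataset01'
```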

A further minor observation is that the further down the chain of caches you are, the larger you should set afmDisconnectTimeout, since any intermediate cache's recovery time needs to be taken into account following a disconnect event. This was initially slightly counterintuitive, because caches B and C as described below are connected over multiple IB interfaces and shouldn't disconnect except under some other failure. Conversely, the connection between caches A and B runs over a very flaky wide-area network, and although we've managed to tune out a lot of the problems introduced by high and variable latency, the view of cache A from cache B's perspective still sometimes gets suspended.
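In concrete terms, that tuning might look something like the following; the actual timeout values are illustrative only and would need to be sized against your own measured recovery times:

```shell
# Each cache further down the chain gets a larger afmDisconnectTimeout,
# so the recovery time of the caches above it is absorbed before a
# disconnect is declared.

# On cache B's cluster (fetches from cache A over the flaky WAN):
mmchconfig afmDisconnectTimeout=300

# On cache C's cluster (must also ride out cache B's recovery):
mmchconfig afmDisconnectTimeout=600
```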

The failure observed above doesn't really feel like an artefact of cascading caches, but rather a bug in the MDS code, as described. Sharing background information about the cascading cache setup was in the spirit of the mailing list and might have led IBM or other customers attempting this kind of setup to share some of their experiences.

Hope you can help.

Luke.


From: gpfsug-discuss-bounces at spectrumscale.org [mailto:gpfsug-discuss-bounces at spectrumscale.org] On Behalf Of Radhika A Parameswaran
Sent: 28 July 2016 06:43
To: gpfsug-discuss at spectrumscale.org
Subject: [gpfsug-discuss] Re. AFM Crashing the MDS

Luke,

AFM is not tested in cascading configurations; this is being added to the documentation for 4.2.1:

"Cascading of AFM caches is not tested."




Thanks and Regards
Radhika





From:        gpfsug-discuss-request at spectrumscale.org
To:        gpfsug-discuss at spectrumscale.org
Date:        07/27/2016 04:30 PM
Subject:        gpfsug-discuss Digest, Vol 54, Issue 59
Sent by:        gpfsug-discuss-bounces at spectrumscale.org
________________________________





Today's Topics:

  1. AFM Crashing the MDS (Luke Raimbach)


----------------------------------------------------------------------

Message: 1
Date: Tue, 26 Jul 2016 14:17:35 +0000
From: Luke Raimbach <Luke.Raimbach at crick.ac.uk>
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: [gpfsug-discuss] AFM Crashing the MDS
Message-ID:
                <AMSPR03MB27605D717C5500D86F6ADEFB00E0 at AMSPR03MB276.eurprd03.prod.outlook.com>

Content-Type: text/plain; charset="utf-8"

Hi All,

Anyone seen GPFS barf like this before? I'll explain the setup:

RO AFM cache on remote site (cache A) for reading remote datasets quickly,
LU AFM cache at destination site (cache B) for caching data from cache A (has a local compute cluster mounting this over multi-cluster),
IW AFM cache at destination site (cache C) for presenting cache B over NAS protocols,

Reading files in cache C should pull data from the remote source through caches A->B->C.

Modifying files in cache C should pull data into cache B and then break the cache relationship for that file, converting it to a local copy. Those modifications should include metadata updates (e.g. chown).
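For context, the three-tier chain above could be created along roughly these lines; all file system, fileset, and target names here are hypothetical, and the afmTarget paths would need to match your own remote mounts:

```shell
# Cache A (remote site): read-only view of the source data
mmcrfileset fsA cacheA -p afmmode=ro,afmtarget=gpfs:///sourcefs/data --inode-space new

# Cache B (destination site): local-updates cache of cache A,
# mounted by the local compute cluster over multi-cluster
mmcrfileset fsB cacheB -p afmmode=lu,afmtarget=gpfs:///fsA/cacheA --inode-space new

# Cache C (destination site): independent-writer cache of cache B,
# presented to clients over NAS protocols
mmcrfileset fsC cacheC -p afmmode=iw,afmtarget=gpfs:///fsB/cacheB --inode-space new
```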

To speed things up we prefetch files into cache B for datasets which are undergoing migration and have entered a read-only state at the source.
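The prefetch step is along these lines; the file system name, fileset name, and list file are hypothetical placeholders:

```shell
# Build a list of files for a dataset that has gone read-only at the
# source, then warm cache B with it ahead of the migration cut-over.
find /cacheB/dataset01 -type f > /tmp/dataset01.list
mmafmctl fsB prefetch -j cacheB --list-file /tmp/dataset01.list
```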

When issuing chown on a directory in cache C containing ~4.5 million files, the MDS for AFM cache C crashes badly:


Tue Jul 26 13:28:52.487 2016: [X] logAssertFailed: addr.isReserved() || addr.getClusterIdx() == clusterIdx
Tue Jul 26 13:28:52.488 2016: [X] return code 0, reason code 1, log record tag 0
Tue Jul 26 13:28:53.392 2016: [X] *** Assert exp(addr.isReserved() || addr.getClusterIdx() == clusterIdx) in line 1936 of file /project/sprelbmd0/build/rbmd0s003a/src/avs/fs/mmfs/ts/cfgmgr/cfgmgr.h
Tue Jul 26 13:28:53.393 2016: [E] *** Traceback:
Tue Jul 26 13:28:53.394 2016: [E]         2:0x7F6DC95444A6 logAssertFailed + 0x2D6 at ??:0
Tue Jul 26 13:28:53.395 2016: [E]         3:0x7F6DC95C7EF4 ClusterConfiguration::getGatewayNewHash(DiskUID, unsigned int, NodeAddr*) + 0x4B4 at ??:0
Tue Jul 26 13:28:53.396 2016: [E]         4:0x7F6DC95C8031 ClusterConfiguration::getGatewayNode(DiskUID, unsigned int, NodeAddr, NodeAddr*, unsigned int) + 0x91 at ??:0
Tue Jul 26 13:28:53.397 2016: [E]         5:0x7F6DC9DC7126 SFSPcache(StripeGroup*, FileUID, int, int, void*, int, voidXPtr*, int*) + 0x346 at ??:0
Tue Jul 26 13:28:53.398 2016: [E]         6:0x7F6DC9332494 HandleMBPcache(MBPcacheParms*) + 0xB4 at ??:0
Tue Jul 26 13:28:53.399 2016: [E]         7:0x7F6DC90A4A53 Mailbox::msgHandlerBody(void*) + 0x3C3 at ??:0
Tue Jul 26 13:28:53.400 2016: [E]         8:0x7F6DC908BC06 Thread::callBody(Thread*) + 0x46 at ??:0
Tue Jul 26 13:28:53.401 2016: [E]         9:0x7F6DC907A0D2 Thread::callBodyWrapper(Thread*) + 0xA2 at ??:0
Tue Jul 26 13:28:53.402 2016: [E]         10:0x7F6DC87A3AA1 start_thread + 0xD1 at ??:0
Tue Jul 26 13:28:53.403 2016: [E]         11:0x7F6DC794A93D clone + 0x6D at ??:0
mmfsd: /project/sprelbmd0/build/rbmd0s003a/src/avs/fs/mmfs/ts/cfgmgr/cfgmgr.h:1936: void logAssertFailed(UInt32, const char*, UInt32, Int32, Int32, UInt32, const char*, const char*): Assertion `addr.isReserved() || addr.getClusterIdx() == clusterIdx' failed.
Tue Jul 26 13:28:53.404 2016: [N] Signal 6 at location 0x7F6DC7894625 in process 6262, link reg 0xFFFFFFFFFFFFFFFF.
Tue Jul 26 13:28:53.405 2016: [I] rax    0x0000000000000000  rbx    0x00007F6DC8DCB000
Tue Jul 26 13:28:53.406 2016: [I] rcx    0xFFFFFFFFFFFFFFFF  rdx    0x0000000000000006
Tue Jul 26 13:28:53.407 2016: [I] rsp    0x00007F6DAAEA01F8  rbp    0x00007F6DCA05C8B0
Tue Jul 26 13:28:53.408 2016: [I] rsi    0x00000000000018F8  rdi    0x0000000000001876
Tue Jul 26 13:28:53.409 2016: [I] r8     0xFEFEFEFEFEFEFEFF  r9     0xFEFEFEFEFF092D63
Tue Jul 26 13:28:53.410 2016: [I] r10    0x0000000000000008  r11    0x0000000000000202
Tue Jul 26 13:28:53.411 2016: [I] r12    0x00007F6DC9FC5540  r13    0x00007F6DCA05C1C0
Tue Jul 26 13:28:53.412 2016: [I] r14    0x0000000000000000  r15    0x0000000000000000
Tue Jul 26 13:28:53.413 2016: [I] rip    0x00007F6DC7894625  eflags 0x0000000000000202
Tue Jul 26 13:28:53.414 2016: [I] csgsfs 0x0000000000000033  err    0x0000000000000000
Tue Jul 26 13:28:53.415 2016: [I] trapno 0x0000000000000000  oldmsk 0x0000000010017807
Tue Jul 26 13:28:53.416 2016: [I] cr2    0x0000000000000000
Tue Jul 26 13:28:54.225 2016: [D] Traceback:
Tue Jul 26 13:28:54.226 2016: [D] 0:00007F6DC7894625 raise + 35 at ??:0
Tue Jul 26 13:28:54.227 2016: [D] 1:00007F6DC7895E05 abort + 175 at ??:0
Tue Jul 26 13:28:54.228 2016: [D] 2:00007F6DC788D74E __assert_fail_base + 11E at ??:0
Tue Jul 26 13:28:54.229 2016: [D] 3:00007F6DC788D810 __assert_fail + 50 at ??:0
Tue Jul 26 13:28:54.230 2016: [D] 4:00007F6DC95444CA logAssertFailed + 2FA at ??:0
Tue Jul 26 13:28:54.231 2016: [D] 5:00007F6DC95C7EF4 ClusterConfiguration::getGatewayNewHash(DiskUID, unsigned int, NodeAddr*) + 4B4 at ??:0
Tue Jul 26 13:28:54.232 2016: [D] 6:00007F6DC95C8031 ClusterConfiguration::getGatewayNode(DiskUID, unsigned int, NodeAddr, NodeAddr*, unsigned int) + 91 at ??:0
Tue Jul 26 13:28:54.233 2016: [D] 7:00007F6DC9DC7126 SFSPcache(StripeGroup*, FileUID, int, int, void*, int, voidXPtr*, int*) + 346 at ??:0
Tue Jul 26 13:28:54.234 2016: [D] 8:00007F6DC9332494 HandleMBPcache(MBPcacheParms*) + B4 at ??:0
Tue Jul 26 13:28:54.235 2016: [D] 9:00007F6DC90A4A53 Mailbox::msgHandlerBody(void*) + 3C3 at ??:0
Tue Jul 26 13:28:54.236 2016: [D] 10:00007F6DC908BC06 Thread::callBody(Thread*) + 46 at ??:0
Tue Jul 26 13:28:54.237 2016: [D] 11:00007F6DC907A0D2 Thread::callBodyWrapper(Thread*) + A2 at ??:0
Tue Jul 26 13:28:54.238 2016: [D] 12:00007F6DC87A3AA1 start_thread + D1 at ??:0
Tue Jul 26 13:28:54.239 2016: [D] 13:00007F6DC794A93D clone + 6D at ??:0
Tue Jul 26 13:28:54.240 2016: [N] Restarting mmsdrserv
Tue Jul 26 13:28:55.535 2016: [N] Signal 6 at location 0x7F6DC790EA7D in process 6262, link reg 0xFFFFFFFFFFFFFFFF.
Tue Jul 26 13:28:55.536 2016: [N] mmfsd is shutting down.
Tue Jul 26 13:28:55.537 2016: [N] Reason for shutdown: Signal handler entered
Tue Jul 26 13:28:55 BST 2016: mmcommon mmfsdown invoked.  Subsystem: mmfs Status: active
Tue Jul 26 13:28:55 BST 2016: /var/mmfs/etc/mmfsdown invoked
umount2: Device or resource busy
umount: /camp: device is busy.
       (In some cases useful info about processes that use
        the device is found by lsof(8) or fuser(1))
umount2: Device or resource busy
umount: /ingest: device is busy.
       (In some cases useful info about processes that use
        the device is found by lsof(8) or fuser(1))
Shutting down NFS daemon: [  OK  ]
Shutting down NFS mountd: [  OK  ]
Shutting down NFS quotas: [  OK  ]
Shutting down NFS services:  [  OK  ]
Shutting down RPC idmapd: [  OK  ]
Stopping NFS statd: [  OK  ]



Ugly, right?

Cheers,
Luke.


Luke Raimbach
Senior HPC Data and Storage Systems Engineer,
The Francis Crick Institute,
Gibbs Building,
215 Euston Road,
London NW1 2BE.

E: luke.raimbach at crick.ac.uk
W: www.crick.ac.uk


The Francis Crick Institute Limited is a registered charity in England and Wales no. 1140062 and a company registered in England and Wales no. 06885462, with its registered office at 215 Euston Road, London NW1 2BE.

------------------------------

_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


End of gpfsug-discuss Digest, Vol 54, Issue 59
**********************************************





