[gpfsug-discuss] GPFS Remote Mount Fails
Sanchez, Paul
Paul.Sanchez at deshaw.com
Wed May 7 11:59:30 BST 2014
Hi Luke,
When using RFC 1918 space among remote clusters, GPFS assumes that each cluster's privately addressed networks are not reachable from one another. You must add explicit shared subnets via mmchconfig. Try setting subnets as follows:
gpfs.oerc.local:
subnets="10.200.0.0 10.200.0.0/cpdn.oerc.local"
cpdn.oerc.local:
subnets="10.200.0.0 10.200.0.0/gpfs.oerc.local"
I think you may also need to set the cipher list locally on each cluster to AUTHONLY via mmauth. On my clusters, these match. (No cluster says "none specified".)
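An untested sketch of the commands (check these against your GPFS release's Advanced Administration Guide; note that changing the local cluster's cipher list with mmauth generally requires GPFS to be down on that cluster, and the subnets change takes effect on daemon restart):

  # On gpfs.oerc.local:
  mmchconfig subnets="10.200.0.0 10.200.0.0/cpdn.oerc.local"
  mmauth update . -l AUTHONLY    # set this cluster's own cipher list

  # On cpdn.oerc.local:
  mmchconfig subnets="10.200.0.0 10.200.0.0/gpfs.oerc.local"
  mmauth update . -l AUTHONLY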
Hope that helps,
Paul
________________________________
From: gpfsug-discuss-bounces at gpfsug.org on behalf of Luke Raimbach
Sent: Wednesday, May 07, 2014 5:28:59 AM
To: gpfsug-discuss at gpfsug.org
Subject: [gpfsug-discuss] GPFS Remote Mount Fails
Dear All,
I'm having a problem remote mounting a file system. I have two clusters:
gpfs.oerc.local which owns file system 'gpfs'
cpdn.oerc.local which owns no file systems
I want to remote mount file system 'gpfs' from cluster cpdn.oerc.local. I'll post the configuration for both clusters further down. The error I receive on a node in cluster cpdn.oerc.local is:
Wed May 7 10:05:19.595 2014: Waiting to join remote cluster gpfs.oerc.local
Wed May 7 10:05:20.598 2014: Remote mounts are not enabled within this cluster.
Wed May 7 10:05:20.599 2014: Remote mounts are not enabled within this cluster.
Wed May 7 10:05:20.598 2014: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found.
Wed May 7 10:05:20.600 2014: Failed to join remote cluster gpfs.oerc.local
Wed May 7 10:05:20.601 2014: Command: err 693: mount gpfs.oerc.local:gpfs
Wed May 7 10:05:20.600 2014: Message failed because the destination node refused the connection.
I'm concerned about the "Remote mounts are not enabled within this cluster" messages. Having followed the configuration steps in the GPFS Advanced Administration Guide, I end up with the following configurations:
## GPFS Cluster 'gpfs.oerc.local' ##
[root@gpfs01 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: gpfs.oerc.local
GPFS cluster id: 748734524680043237
GPFS UID domain: gpfs.oerc.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
GPFS cluster configuration servers:
-----------------------------------
Primary server: gpfs01.oerc.local
Secondary server: gpfs02.oerc.local
Node Daemon node name IP address Admin node name Designation
--------------------------------------------------------------------------
1 gpfs01.oerc.local 10.100.10.21 gpfs01.oerc.local quorum-manager
2 gpfs02.oerc.local 10.100.10.22 gpfs02.oerc.local quorum-manager
3 linux.oerc.local 10.100.10.1 linux.oerc.local
4 jupiter.oerc.local 10.100.10.2 jupiter.oerc.local
5 cnfs0.oerc.local 10.100.10.100 cnfs0.oerc.local
6 cnfs1.oerc.local 10.100.10.101 cnfs1.oerc.local
7 cnfs2.oerc.local 10.100.10.102 cnfs2.oerc.local
8 cnfs3.oerc.local 10.100.10.103 cnfs3.oerc.local
9 tsm01.oerc.local 10.100.10.51 tsm01.oerc.local quorum-manager
[root@gpfs01 ~]# mmremotecluster show all
Cluster name: cpdn.oerc.local
Contact nodes: 10.100.10.60,10.100.10.61,10.100.10.62
SHA digest: e9a2dc678a62d6c581de0b89b49a90f28f401327
File systems: (none defined)
[root@gpfs01 ~]# mmauth show all
Cluster name: cpdn.oerc.local
Cipher list: AUTHONLY
SHA digest: e9a2dc678a62d6c581de0b89b49a90f28f401327
File system access: gpfs (rw, root remapped to 99:99)
Cluster name: gpfs.oerc.local (this cluster)
Cipher list: (none specified)
SHA digest: e7a68ff688d6ef055eb40fe74677b272d6c60879
File system access: (all rw)
[root@gpfs01 ~]# mmlsconfig
Configuration data for cluster gpfs.oerc.local:
-----------------------------------------------
myNodeConfigNumber 1
clusterName gpfs.oerc.local
clusterId 748734524680043237
autoload yes
minReleaseLevel 3.4.0.7
dmapiFileHandleSize 32
maxMBpS 6400
maxblocksize 2M
pagepool 4G
[cnfs0,cnfs1,cnfs2,cnfs3]
pagepool 2G
[common]
tiebreakerDisks vd0_0;vd2_2;vd5_5
cnfsSharedRoot /gpfs/.ha
nfsPrefetchStrategy 1
cnfsVIP gpfs-nfs
subnets 10.200.0.0
cnfsMountdPort 4000
cnfsNFSDprocs 128
[common]
adminMode central
File systems in cluster gpfs.oerc.local:
----------------------------------------
/dev/gpfs
## GPFS Cluster 'cpdn.oerc.local' ##
[root@cpdn-ppc01 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: cpdn.oerc.local
GPFS cluster id: 10699506775530551223
GPFS UID domain: cpdn.oerc.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
GPFS cluster configuration servers:
-----------------------------------
Primary server: cpdn-ppc02.oerc.local
Secondary server: cpdn-ppc03.oerc.local
Node Daemon node name IP address Admin node name Designation
-------------------------------------------------------------------------------
1 cpdn-ppc01.oerc.local 10.100.10.60 cpdn-ppc01.oerc.local quorum
2 cpdn-ppc02.oerc.local 10.100.10.61 cpdn-ppc02.oerc.local quorum-manager
3 cpdn-ppc03.oerc.local 10.100.10.62 cpdn-ppc03.oerc.local quorum-manager
[root@cpdn-ppc01 ~]# mmremotecluster show all
Cluster name: gpfs.oerc.local
Contact nodes: 10.100.10.21,10.100.10.22
SHA digest: e7a68ff688d6ef055eb40fe74677b272d6c60879
File systems: gpfs (gpfs)
[root@cpdn-ppc01 ~]# mmauth show all
Cluster name: gpfs.oerc.local
Cipher list: AUTHONLY
SHA digest: e7a68ff688d6ef055eb40fe74677b272d6c60879
File system access: (none authorized)
Cluster name: cpdn.oerc.local (this cluster)
Cipher list: (none specified)
SHA digest: e9a2dc678a62d6c581de0b89b49a90f28f401327
File system access: (all rw)
[root@cpdn-ppc01 ~]# mmremotefs show all
Local Name Remote Name Cluster name Mount Point Mount Options Automount Drive Priority
gpfs gpfs gpfs.oerc.local /gpfs rw yes - 0
[root@cpdn-ppc01 ~]# mmlsconfig
Configuration data for cluster cpdn.oerc.local:
-----------------------------------------------
myNodeConfigNumber 1
clusterName cpdn.oerc.local
clusterId 10699506775530551223
autoload yes
dmapiFileHandleSize 32
minReleaseLevel 3.4.0.7
subnets 10.200.0.0
pagepool 4G
[cpdn-ppc02,cpdn-ppc03]
pagepool 2G
[common]
traceRecycle local
trace all 4 tm 2 thread 1 mutex 1 vnode 2 ksvfs 3 klockl 2 io 3 pgalloc 1 mb 1 lock 2 fsck 3
adminMode central
File systems in cluster cpdn.oerc.local:
----------------------------------------
(none)
As far as I can see I have everything set up: I've exchanged the public keys for each cluster and installed them using the -k switch for mmremotecluster and mmauth on the respective clusters. I've also tried reconfiguring the admin-interface and daemon-interface names on the cpdn.oerc.local cluster (a stab in the dark after looking at some trace dumps and seeing IP address inconsistencies), but I get the same error. Now I'm worried I've missed something really obvious! Any help greatly appreciated.

Here's some trace output from the mmmount gpfs command when run from the cpdn.oerc.local cluster:
35.736808 2506 TRACE_MUTEX: Thread 0x320031 (MountHandlerThread) signalling condvar 0x7F8968092D90 (0x7F8968092D90) (ThreadSuspendResumeCondvar) waitCount 1
35.736811 2506 TRACE_MUTEX: internalSignalSave: Created event word 0xFFFF88023AEE1108 for mutex ThreadSuspendResumeMutex
35.736812 2506 TRACE_MUTEX: Releasing mutex 0x1489F28 (0x1489F28) (ThreadSuspendResumeMutex) in daemon (threads waiting)
35.736894 2506 TRACE_BASIC: Wed May 7 08:24:15.991 2014: Waiting to join remote cluster gpfs.oerc.local
35.736927 2506 TRACE_MUTEX: Thread 0x320031 (MountHandlerThread) waiting on condvar 0x14BAB50 (0x14BAB50) (ClusterConfigurationBCHCond): waiting to join remote cluster
35.737369 2643 TRACE_SP: RunProbeCluster: enter. EligibleQuorumNode 0 maxPingIterations 10
35.737371 2643 TRACE_SP: RunProbeCluster: cl 1 gpnStatus none prevLeaseSeconds 0 loopIteration 1 pingIteration 1/10 nToTry 2 nResponses 0 nProbed 0
35.739561 2643 TRACE_DLEASE: Pinger::send: node <c1p2> err 0
35.739620 2643 TRACE_DLEASE: Pinger::send: node <c1p1> err 0
35.739624 2643 TRACE_THREAD: Thread 0x324050 (ProbeRemoteClusterThread) delaying until 1399447456.994516000: waiting for ProbeCluster ping response
35.739726 2579 TRACE_DLEASE: Pinger::receiveLoop: echoreply from <c1p2> 10.100.10.22
35.739728 2579 TRACE_DLEASE: Pinger::receiveLoop: echoreply from <c1p1> 10.100.10.21
35.739730 2579 TRACE_BASIC: cxiRecvfrom: sock 9 buf 0x7F896CB64960 len 128 flags 0 failed with err 11
35.824879 2596 TRACE_DLEASE: checkAndRenewLease: cluster 0 leader <c0n1> (me 0) remountRetryNeeded 0
35.824885 2596 TRACE_DLEASE: renewLease: leaseage 10 (100 ticks/sec) now 429499910 lastLeaseReplyReceived 429498823
35.824891 2596 TRACE_TS: tscSend: service 00010001 msg 'ccMsgDiskLease' n_dest 1 data_len 4 msg_id 94 msg 0x7F89500098B0 mr 0x7F89500096E0
35.824894 2596 TRACE_TS: acquireConn enter: addr <c0n1>
35.824895 2596 TRACE_TS: acquireConn exit: err 0 connP 0x7F8948025210
35.824898 2596 TRACE_TS: sendMessage dest <c0n1> 10.200.61.1 cpdn-ppc02: msg_id 94 type 14 tagP 0x7F8950009CB8 seq 89, state initial
35.824957 2596 TRACE_TS: llc_send_msg: returning 0
35.824958 2596 TRACE_TS: tscSend: replies[0] dest <c0n1>, status pending, err 0
35.824960 2596 TRACE_TS: tscSend: rc = 0x0
35.824961 2596 TRACE_DLEASE: checkAndRenewLease: cluster 0 nextLeaseCheck in 2 sec
35.824989 2596 TRACE_THREAD: Thread 0x20C04D (DiskLeaseThread) delaying until 1399447458.079879000: RunLeaseChecks waiting for next check time
35.825509 2642 TRACE_TS: socket_dequeue_next: returns 8
35.825511 2642 TRACE_TS: socket_dequeue_next: returns -1
35.825513 2642 TRACE_TS: receiverEvent enter: sock 8 event 0x5 state reading header
35.825527 2642 TRACE_TS: service_message: enter: msg 'reply', msg_id 94 seq 88 ackseq 89, from <c0n1> 10.200.61.1, active 0
35.825531 2642 TRACE_TS: tscHandleMsgDirectly: service 00010001, msg 'reply', msg_id 94, len 4, from <c0n1> 10.100.10.61
35.825533 2642 TRACE_TS: HandleReply: status success, err 0; 0 msgs pending after this reply
35.825534 2642 TRACE_MUTEX: Acquired mutex 0x7F896805AC68 (0x7F896805AC68) (PendMsgTabMutex) in daemon using trylock
35.825537 2642 TRACE_DLEASE: renewLease: ccMsgDiskLease reply.status 6 err 0 from <c0n1> (expected 10.100.10.61) current leader 10.100.10.61
35.825545 2642 TRACE_DLEASE: DMS timer [0] started, delay 58, time 4295652
35.825546 2642 TRACE_DLEASE: updateMyLease: oldLease 4294988 newLease 4294999 (35 sec left) leaseLost 0
35.825556 2642 TRACE_BASIC: cxiRecv: sock 8 buf 0x7F8954010BE8 len 32 flags 0 failed with err 11
35.825557 2642 TRACE_TS: receiverEvent exit: sock 8 err 54 newTypes 1 state reading header
36.739811 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.22 (primary listed 1 0)
36.739814 2643 TRACE_SP: RunProbeCluster: sending probe 1 to <c1p2> gid 00000000:00000000 flags 01
36.739824 2643 TRACE_TS: tscSend: service 00010001 msg 'ccMsgProbeCluster2' n_dest 1 data_len 100 msg_id 95 msg 0x7F8950009F20 mr 0x7F8950009D50
36.739829 2643 TRACE_TS: acquireConn enter: addr <c1p2>
36.739831 2643 TRACE_TS: acquireConn exit: err 0 connP 0x7F8964025040
36.739835 2643 TRACE_TS: sendMessage dest <c1p2> 10.100.10.22 10.100.10.22: msg_id 95 type 36 tagP 0x7F895000A328 seq 1, state initial
36.739838 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.22 (primary listed 1 0)
36.739914 2643 TRACE_BASIC: Wed May 7 08:24:16.994 2014: Remote mounts are not enabled within this cluster.
36.739963 2643 TRACE_TS: TcpConn::make_connection: status=init, err=720, dest 10.100.10.22
36.739965 2643 TRACE_TS: llc_send_msg: returning 693
36.739966 2643 TRACE_TS: tscSend: replies[0] dest <c1p2>, status node_failed, err 693
36.739968 2643 TRACE_MUTEX: Acquired mutex 0x7F896805AC90 (0x7F896805AC90) (PendMsgTabMutex) in daemon using trylock
36.739969 2643 TRACE_TS: tscSend: rc = 0x1
36.739970 2643 TRACE_SP: RunProbeCluster: reply rc 693 tryHere <none>, flags 0
36.739972 2643 TRACE_SP: RunProbeCluster: cl 1 gpnStatus none prevLeaseSeconds 0 loopIteration 1 pingIteration 2/10 nToTry 2 nResponses 2 nProbed 1
36.739973 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.21 (primary listed 1 0)
36.739974 2643 TRACE_SP: RunProbeCluster: sending probe 1 to <c1p1> gid 00000000:00000000 flags 01
36.739977 2643 TRACE_TS: tscSend: service 00010001 msg 'ccMsgProbeCluster2' n_dest 1 data_len 100 msg_id 96 msg 0x7F895000A590 mr 0x7F895000A3C0
36.739978 2643 TRACE_TS: acquireConn enter: addr <c1p1>
36.739979 2643 TRACE_TS: acquireConn exit: err 0 connP 0x7F89640258D0
36.739980 2643 TRACE_TS: sendMessage dest <c1p1> 10.100.10.21 10.100.10.21: msg_id 96 type 36 tagP 0x7F895000A998 seq 1, state initial
36.739982 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.21 (primary listed 1 0)
36.739993 2643 TRACE_BASIC: Wed May 7 08:24:16.995 2014: Remote mounts are not enabled within this cluster.
36.740003 2643 TRACE_TS: TcpConn::make_connection: status=init, err=720, dest 10.100.10.21
36.740005 2643 TRACE_TS: llc_send_msg: returning 693
36.740005 2643 TRACE_TS: tscSend: replies[0] dest <c1p1>, status node_failed, err 693
Sorry if the formatting above gets horribly screwed. Thanks for any assistance,
Luke
--
Luke Raimbach
IT Manager
Oxford e-Research Centre
7 Keble Road,
Oxford,
OX1 3QG
+44(0)1865 610639
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss