[gpfsug-discuss] GPFS Remote Mount Fails
Sanchez, Paul
Paul.Sanchez at deshaw.com
Wed May 7 11:59:30 BST 2014
Hi Luke,
When using RFC 1918 space among remote clusters, GPFS assumes that each cluster's privately addressed networks are not reachable from one another. You must add explicit shared subnets via mmchconfig. Try setting subnets as follows:
gpfs.oerc.local:
subnets="10.200.0.0 10.200.0.0/cpdn.oerc.local"
cpdn.oerc.local:
subnets="10.200.0.0 10.200.0.0/gpfs.oerc.local"
I think you may also need to set the cipher list locally on each cluster to AUTHONLY via mmauth. On my clusters, these match. (No cluster says "none specified".)
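An untested sketch of the commands (check these against your GPFS release's Advanced Administration Guide; note that changing the local cluster's cipher list with mmauth generally requires GPFS to be down on that cluster, and the subnets change takes effect on daemon restart):

  # On gpfs.oerc.local:
  mmchconfig subnets="10.200.0.0 10.200.0.0/cpdn.oerc.local"
  mmauth update . -l AUTHONLY    # set this cluster's own cipher list

  # On cpdn.oerc.local:
  mmchconfig subnets="10.200.0.0 10.200.0.0/gpfs.oerc.local"
  mmauth update . -l AUTHONLY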
Hope that helps,
Paul
________________________________
From: gpfsug-discuss-bounces at gpfsug.org on behalf of Luke Raimbach
Sent: Wednesday, May 07, 2014 5:28:59 AM
To: gpfsug-discuss at gpfsug.org
Subject: [gpfsug-discuss] GPFS Remote Mount Fails
Dear All,
I'm having a problem remote mounting a file system. I have two clusters:
gpfs.oerc.local which owns file system 'gpfs'
cpdn.oerc.local which owns no file systems
I want to remote mount file system 'gpfs' from cluster cpdn.oerc.local. I'll post the configuration for both clusters further down. The error I receive on a node in cluster cpdn.oerc.local is:
Wed May 7 10:05:19.595 2014: Waiting to join remote cluster gpfs.oerc.local
Wed May 7 10:05:20.598 2014: Remote mounts are not enabled within this cluster.
Wed May 7 10:05:20.599 2014: Remote mounts are not enabled within this cluster.
Wed May 7 10:05:20.598 2014: A node join was rejected. This could be due to incompatible daemon versions, failure to find the node in the configuration database, or no configuration manager found.
Wed May 7 10:05:20.600 2014: Failed to join remote cluster gpfs.oerc.local
Wed May 7 10:05:20.601 2014: Command: err 693: mount gpfs.oerc.local:gpfs
Wed May 7 10:05:20.600 2014: Message failed because the destination node refused the connection.
I'm concerned about the "Remote mounts are not enabled within this cluster" messages. Having followed the configuration steps in the GPFS Advanced Administration Guide, I end up with the following configurations:
## GPFS Cluster 'gpfs.oerc.local' ##
[root@gpfs01 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: gpfs.oerc.local
GPFS cluster id: 748734524680043237
GPFS UID domain: gpfs.oerc.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
GPFS cluster configuration servers:
-----------------------------------
Primary server: gpfs01.oerc.local
Secondary server: gpfs02.oerc.local
Node Daemon node name IP address Admin node name Designation
--------------------------------------------------------------------------
1 gpfs01.oerc.local 10.100.10.21 gpfs01.oerc.local quorum-manager
2 gpfs02.oerc.local 10.100.10.22 gpfs02.oerc.local quorum-manager
3 linux.oerc.local 10.100.10.1 linux.oerc.local
4 jupiter.oerc.local 10.100.10.2 jupiter.oerc.local
5 cnfs0.oerc.local 10.100.10.100 cnfs0.oerc.local
6 cnfs1.oerc.local 10.100.10.101 cnfs1.oerc.local
7 cnfs2.oerc.local 10.100.10.102 cnfs2.oerc.local
8 cnfs3.oerc.local 10.100.10.103 cnfs3.oerc.local
9 tsm01.oerc.local 10.100.10.51 tsm01.oerc.local quorum-manager
[root@gpfs01 ~]# mmremotecluster show all
Cluster name: cpdn.oerc.local
Contact nodes: 10.100.10.60,10.100.10.61,10.100.10.62
SHA digest: e9a2dc678a62d6c581de0b89b49a90f28f401327
File systems: (none defined)
[root@gpfs01 ~]# mmauth show all
Cluster name: cpdn.oerc.local
Cipher list: AUTHONLY
SHA digest: e9a2dc678a62d6c581de0b89b49a90f28f401327
File system access: gpfs (rw, root remapped to 99:99)
Cluster name: gpfs.oerc.local (this cluster)
Cipher list: (none specified)
SHA digest: e7a68ff688d6ef055eb40fe74677b272d6c60879
File system access: (all rw)
[root@gpfs01 ~]# mmlsconfig
Configuration data for cluster gpfs.oerc.local:
-----------------------------------------------
myNodeConfigNumber 1
clusterName gpfs.oerc.local
clusterId 748734524680043237
autoload yes
minReleaseLevel 3.4.0.7
dmapiFileHandleSize 32
maxMBpS 6400
maxblocksize 2M
pagepool 4G
[cnfs0,cnfs1,cnfs2,cnfs3]
pagepool 2G
[common]
tiebreakerDisks vd0_0;vd2_2;vd5_5
cnfsSharedRoot /gpfs/.ha
nfsPrefetchStrategy 1
cnfsVIP gpfs-nfs
subnets 10.200.0.0
cnfsMountdPort 4000
cnfsNFSDprocs 128
[common]
adminMode central
File systems in cluster gpfs.oerc.local:
----------------------------------------
/dev/gpfs
## GPFS Cluster 'cpdn.oerc.local' ##
[root@cpdn-ppc01 ~]# mmlscluster
GPFS cluster information
========================
GPFS cluster name: cpdn.oerc.local
GPFS cluster id: 10699506775530551223
GPFS UID domain: cpdn.oerc.local
Remote shell command: /usr/bin/ssh
Remote file copy command: /usr/bin/scp
GPFS cluster configuration servers:
-----------------------------------
Primary server: cpdn-ppc02.oerc.local
Secondary server: cpdn-ppc03.oerc.local
Node Daemon node name IP address Admin node name Designation
-------------------------------------------------------------------------------
1 cpdn-ppc01.oerc.local 10.100.10.60 cpdn-ppc01.oerc.local quorum
2 cpdn-ppc02.oerc.local 10.100.10.61 cpdn-ppc02.oerc.local quorum-manager
3 cpdn-ppc03.oerc.local 10.100.10.62 cpdn-ppc03.oerc.local quorum-manager
[root@cpdn-ppc01 ~]# mmremotecluster show all
Cluster name: gpfs.oerc.local
Contact nodes: 10.100.10.21,10.100.10.22
SHA digest: e7a68ff688d6ef055eb40fe74677b272d6c60879
File systems: gpfs (gpfs)
[root@cpdn-ppc01 ~]# mmauth show all
Cluster name: gpfs.oerc.local
Cipher list: AUTHONLY
SHA digest: e7a68ff688d6ef055eb40fe74677b272d6c60879
File system access: (none authorized)
Cluster name: cpdn.oerc.local (this cluster)
Cipher list: (none specified)
SHA digest: e9a2dc678a62d6c581de0b89b49a90f28f401327
File system access: (all rw)
[root@cpdn-ppc01 ~]# mmremotefs show all
Local Name Remote Name Cluster name Mount Point Mount Options Automount Drive Priority
gpfs gpfs gpfs.oerc.local /gpfs rw yes - 0
[root@cpdn-ppc01 ~]# mmlsconfig
Configuration data for cluster cpdn.oerc.local:
-----------------------------------------------
myNodeConfigNumber 1
clusterName cpdn.oerc.local
clusterId 10699506775530551223
autoload yes
dmapiFileHandleSize 32
minReleaseLevel 3.4.0.7
subnets 10.200.0.0
pagepool 4G
[cpdn-ppc02,cpdn-ppc03]
pagepool 2G
[common]
traceRecycle local
trace all 4 tm 2 thread 1 mutex 1 vnode 2 ksvfs 3 klockl 2 io 3 pgalloc 1 mb 1 lock 2 fsck 3
adminMode central
File systems in cluster cpdn.oerc.local:
----------------------------------------
(none)
As far as I can see I have everything set up: I've exchanged the public keys for each cluster and installed them using the -k switch for mmremotecluster and mmauth on the respective clusters. I've also tried reconfiguring the admin-interface and daemon-interface names on the cpdn.oerc.local cluster (a stab in the dark after looking at some trace dumps and seeing IP address inconsistencies), but I get the same error. Now I'm worried I've missed something really obvious! Any help greatly appreciated.

Here's some trace output from the mmmount gpfs command when run from the cpdn.oerc.local cluster:
35.736808 2506 TRACE_MUTEX: Thread 0x320031 (MountHandlerThread) signalling condvar 0x7F8968092D90 (0x7F8968092D90) (ThreadSuspendResumeCondvar) waitCount 1
35.736811 2506 TRACE_MUTEX: internalSignalSave: Created event word 0xFFFF88023AEE1108 for mutex ThreadSuspendResumeMutex
35.736812 2506 TRACE_MUTEX: Releasing mutex 0x1489F28 (0x1489F28) (ThreadSuspendResumeMutex) in daemon (threads waiting)
35.736894 2506 TRACE_BASIC: Wed May 7 08:24:15.991 2014: Waiting to join remote cluster gpfs.oerc.local
35.736927 2506 TRACE_MUTEX: Thread 0x320031 (MountHandlerThread) waiting on condvar 0x14BAB50 (0x14BAB50) (ClusterConfigurationBCHCond): waiting to join remote cluster
35.737369 2643 TRACE_SP: RunProbeCluster: enter. EligibleQuorumNode 0 maxPingIterations 10
35.737371 2643 TRACE_SP: RunProbeCluster: cl 1 gpnStatus none prevLeaseSeconds 0 loopIteration 1 pingIteration 1/10 nToTry 2 nResponses 0 nProbed 0
35.739561 2643 TRACE_DLEASE: Pinger::send: node <c1p2> err 0
35.739620 2643 TRACE_DLEASE: Pinger::send: node <c1p1> err 0
35.739624 2643 TRACE_THREAD: Thread 0x324050 (ProbeRemoteClusterThread) delaying until 1399447456.994516000: waiting for ProbeCluster ping response
35.739726 2579 TRACE_DLEASE: Pinger::receiveLoop: echoreply from <c1p2> 10.100.10.22
35.739728 2579 TRACE_DLEASE: Pinger::receiveLoop: echoreply from <c1p1> 10.100.10.21
35.739730 2579 TRACE_BASIC: cxiRecvfrom: sock 9 buf 0x7F896CB64960 len 128 flags 0 failed with err 11
35.824879 2596 TRACE_DLEASE: checkAndRenewLease: cluster 0 leader <c0n1> (me 0) remountRetryNeeded 0
35.824885 2596 TRACE_DLEASE: renewLease: leaseage 10 (100 ticks/sec) now 429499910 lastLeaseReplyReceived 429498823
35.824891 2596 TRACE_TS: tscSend: service 00010001 msg 'ccMsgDiskLease' n_dest 1 data_len 4 msg_id 94 msg 0x7F89500098B0 mr 0x7F89500096E0
35.824894 2596 TRACE_TS: acquireConn enter: addr <c0n1>
35.824895 2596 TRACE_TS: acquireConn exit: err 0 connP 0x7F8948025210
35.824898 2596 TRACE_TS: sendMessage dest <c0n1> 10.200.61.1 cpdn-ppc02: msg_id 94 type 14 tagP 0x7F8950009CB8 seq 89, state initial
35.824957 2596 TRACE_TS: llc_send_msg: returning 0
35.824958 2596 TRACE_TS: tscSend: replies[0] dest <c0n1>, status pending, err 0
35.824960 2596 TRACE_TS: tscSend: rc = 0x0
35.824961 2596 TRACE_DLEASE: checkAndRenewLease: cluster 0 nextLeaseCheck in 2 sec
35.824989 2596 TRACE_THREAD: Thread 0x20C04D (DiskLeaseThread) delaying until 1399447458.079879000: RunLeaseChecks waiting for next check time
35.825509 2642 TRACE_TS: socket_dequeue_next: returns 8
35.825511 2642 TRACE_TS: socket_dequeue_next: returns -1
35.825513 2642 TRACE_TS: receiverEvent enter: sock 8 event 0x5 state reading header
35.825527 2642 TRACE_TS: service_message: enter: msg 'reply', msg_id 94 seq 88 ackseq 89, from <c0n1> 10.200.61.1, active 0
35.825531 2642 TRACE_TS: tscHandleMsgDirectly: service 00010001, msg 'reply', msg_id 94, len 4, from <c0n1> 10.100.10.61
35.825533 2642 TRACE_TS: HandleReply: status success, err 0; 0 msgs pending after this reply
35.825534 2642 TRACE_MUTEX: Acquired mutex 0x7F896805AC68 (0x7F896805AC68) (PendMsgTabMutex) in daemon using trylock
35.825537 2642 TRACE_DLEASE: renewLease: ccMsgDiskLease reply.status 6 err 0 from <c0n1> (expected 10.100.10.61) current leader 10.100.10.61
35.825545 2642 TRACE_DLEASE: DMS timer [0] started, delay 58, time 4295652
35.825546 2642 TRACE_DLEASE: updateMyLease: oldLease 4294988 newLease 4294999 (35 sec left) leaseLost 0
35.825556 2642 TRACE_BASIC: cxiRecv: sock 8 buf 0x7F8954010BE8 len 32 flags 0 failed with err 11
35.825557 2642 TRACE_TS: receiverEvent exit: sock 8 err 54 newTypes 1 state reading header
36.739811 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.22 (primary listed 1 0)
36.739814 2643 TRACE_SP: RunProbeCluster: sending probe 1 to <c1p2> gid 00000000:00000000 flags 01
36.739824 2643 TRACE_TS: tscSend: service 00010001 msg 'ccMsgProbeCluster2' n_dest 1 data_len 100 msg_id 95 msg 0x7F8950009F20 mr 0x7F8950009D50
36.739829 2643 TRACE_TS: acquireConn enter: addr <c1p2>
36.739831 2643 TRACE_TS: acquireConn exit: err 0 connP 0x7F8964025040
36.739835 2643 TRACE_TS: sendMessage dest <c1p2> 10.100.10.22 10.100.10.22: msg_id 95 type 36 tagP 0x7F895000A328 seq 1, state initial
36.739838 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.22 (primary listed 1 0)
36.739914 2643 TRACE_BASIC: Wed May 7 08:24:16.994 2014: Remote mounts are not enabled within this cluster.
36.739963 2643 TRACE_TS: TcpConn::make_connection: status=init, err=720, dest 10.100.10.22
36.739965 2643 TRACE_TS: llc_send_msg: returning 693
36.739966 2643 TRACE_TS: tscSend: replies[0] dest <c1p2>, status node_failed, err 693
36.739968 2643 TRACE_MUTEX: Acquired mutex 0x7F896805AC90 (0x7F896805AC90) (PendMsgTabMutex) in daemon using trylock
36.739969 2643 TRACE_TS: tscSend: rc = 0x1
36.739970 2643 TRACE_SP: RunProbeCluster: reply rc 693 tryHere <none>, flags 0
36.739972 2643 TRACE_SP: RunProbeCluster: cl 1 gpnStatus none prevLeaseSeconds 0 loopIteration 1 pingIteration 2/10 nToTry 2 nResponses 2 nProbed 1
36.739973 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.21 (primary listed 1 0)
36.739974 2643 TRACE_SP: RunProbeCluster: sending probe 1 to <c1p1> gid 00000000:00000000 flags 01
36.739977 2643 TRACE_TS: tscSend: service 00010001 msg 'ccMsgProbeCluster2' n_dest 1 data_len 100 msg_id 96 msg 0x7F895000A590 mr 0x7F895000A3C0
36.739978 2643 TRACE_TS: acquireConn enter: addr <c1p1>
36.739979 2643 TRACE_TS: acquireConn exit: err 0 connP 0x7F89640258D0
36.739980 2643 TRACE_TS: sendMessage dest <c1p1> 10.100.10.21 10.100.10.21: msg_id 96 type 36 tagP 0x7F895000A998 seq 1, state initial
36.739982 2643 TRACE_TS: llc_pick_dest_addr: use default addrs from 10.100.10.60 to 10.100.10.21 (primary listed 1 0)
36.739993 2643 TRACE_BASIC: Wed May 7 08:24:16.995 2014: Remote mounts are not enabled within this cluster.
36.740003 2643 TRACE_TS: TcpConn::make_connection: status=init, err=720, dest 10.100.10.21
36.740005 2643 TRACE_TS: llc_send_msg: returning 693
36.740005 2643 TRACE_TS: tscSend: replies[0] dest <c1p1>, status node_failed, err 693
Sorry if the formatting above gets horribly screwed. Thanks for any assistance,
Luke
--
Luke Raimbach
IT Manager
Oxford e-Research Centre
7 Keble Road,
Oxford,
OX1 3QG
+44(0)1865 610639
_______________________________________________
gpfsug-discuss mailing list
gpfsug-discuss at gpfsug.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss