[gpfsug-discuss] unusual node expels?

Alex Chekholko chekh at stanford.edu
Tue Dec 15 23:11:39 GMT 2015


Hi,

In the end, the "no route to host" error message was correct and could be 
taken at face value.

Some iptables rules had accidentally been set up on some of the private 
network interfaces. As a result, a GPFS node that was already up was 
unreachable from the GPFS nodes coming up after it, so those nodes kept 
getting expelled.
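
For anyone who hits the same symptom: GPFS daemon-to-daemon traffic uses 
TCP port 1191 by default, so a stray firewall rule shows up quickly with a 
few standard commands. A minimal sketch, assuming a RHEL/CentOS 6 box like 
ours (hs-ln01 is one of our own hostnames):

[root@cn1 ~]# iptables -L -n -v            # look for DROP/REJECT rules on the private interface
[root@cn1 ~]# nc -z hs-ln01 1191; echo $?  # exit status 0 means the GPFS daemon port is reachable
[root@cn1 ~]# service iptables stop        # clear the stray rules for now
[root@cn1 ~]# chkconfig iptables off       # keep them from coming back on reboot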

Regards,
Alex

On 12/15/2015 12:34 PM, Alex Chekholko wrote:
> Hi all,
>
> I had a RHEL6.3 / MLNX OFED 1.5.3 / GPFS 3.5.0.10 cluster, which was
> working fine.
>
> We tried to upgrade several components (our mistake!): the Mellanox
> firmware and the OS, and we switched to the in-box CentOS OFED.
>
> So now I have a CentOS 6.7 / GPFS 3.5.0.29 cluster where the GPFS
> client nodes refuse to stay connected. Here is a typical log:
>
>
> [root@cn1 ~]# cat /var/adm/ras/mmfs.log.latest
> Tue Dec 15 12:21:38 PST 2015: runmmfs starting
> Removing old /var/adm/ras/mmfs.log.* files:
> Unloading modules from /lib/modules/2.6.32-573.8.1.el6.x86_64/extra
> Loading modules from /lib/modules/2.6.32-573.8.1.el6.x86_64/extra
> Module                  Size  Used by
> mmfs26               1836054  0
> mmfslinux             330095  1 mmfs26
> tracedev               43757  2 mmfs26,mmfslinux
> Tue Dec 15 12:21:39.230 2015: mmfsd initializing. {Version: 3.5.0.29
> Built: Nov  6 2015 15:28:46} ...
> Tue Dec 15 12:21:40.847 2015: VERBS RDMA starting.
> Tue Dec 15 12:21:40.849 2015: VERBS RDMA library libibverbs.so.1
> (version >= 1.1) loaded and initialized.
> Tue Dec 15 12:21:40.850 2015: VERBS RDMA verbsRdmasPerNode reduced from
> 128 to 98 to match (nsdMaxWorkerThreads 96 + (nspdThreadsPerQueue 2 *
> nspdQueues 1)).
> Tue Dec 15 12:21:41.122 2015: VERBS RDMA device mlx4_0 port 1 fabnum 0
> opened, lid 10, 4x FDR INFINIBAND.
> Tue Dec 15 12:21:41.123 2015: VERBS RDMA started.
> Tue Dec 15 12:21:41.626 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:21:41.627 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:21:41.628 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:21:41.629 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:21:41.630 2015: Node 10.210.16.41 (hs-gs-02) is now the
> Group Leader.
> Tue Dec 15 12:21:41.641 2015: mmfsd ready
> Tue Dec 15 12:21:41 PST 2015: mmcommon mmfsup invoked. Parameters:
> 10.210.17.1 10.210.16.41 all
> Tue Dec 15 12:21:41 PST 2015: mounting /dev/hsgs
> Tue Dec 15 12:21:41.918 2015: Command: mount hsgs
> Tue Dec 15 12:21:42.131 2015: Connecting to 10.210.16.42 hs-gs-03 <c0n2>
> Tue Dec 15 12:21:42.132 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:21:42.133 2015: Connected to 10.210.16.42 hs-gs-03 <c0n2>
> Tue Dec 15 12:21:42.134 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:21:42.148 2015: VERBS RDMA connecting to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
> Tue Dec 15 12:21:42.149 2015: VERBS RDMA connected to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 0
> Tue Dec 15 12:21:42.153 2015: VERBS RDMA connecting to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
> Tue Dec 15 12:21:42.154 2015: VERBS RDMA connected to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 1
> Tue Dec 15 12:21:42.171 2015: Connecting to 10.210.16.11 hs-ln01.local
> <c0n5>
> Tue Dec 15 12:21:42.173 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
> Tue Dec 15 12:21:42.174 2015: Retry connection to 10.210.16.11
> hs-ln01.local <c0n5>
> Tue Dec 15 12:21:42.173 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
> Tue Dec 15 12:22:55.322 2015: Request sent to 10.210.16.41 (hs-gs-02) to
> expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:22:55.323 2015: This node will be expelled from cluster
> HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
> Tue Dec 15 12:22:55.324 2015: This node is being expelled from the cluster.
> Tue Dec 15 12:22:55.323 2015: Lost membership in cluster
> HS-GS-Cluster.hs-gs-01. Unmounting file systems.
> Tue Dec 15 12:22:55.325 2015: VERBS RDMA closed connection to
> 10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
> Tue Dec 15 12:22:55.327 2015: Cluster Manager connection broke. Probing
> cluster HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:22:55.328 2015: VERBS RDMA closed connection to
> 10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
> Tue Dec 15 12:22:56.419 2015: Command: err 2: mount hsgs
> Tue Dec 15 12:22:56.420 2015: Specified entity, such as a disk or file
> system, does not exist.
> mount: No such file or directory
> Tue Dec 15 12:22:56 PST 2015: finished mounting /dev/hsgs
> Tue Dec 15 12:22:56.587 2015: Quorum loss. Probing cluster
> HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:22:57.087 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:22:57.088 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:22:57.089 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:22:57.090 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:23:02.090 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
> Tue Dec 15 12:23:02.092 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
> Tue Dec 15 12:23:49.604 2015: Node 10.210.16.41 (hs-gs-02) is now the
> Group Leader.
> Tue Dec 15 12:23:49.614 2015: mmfsd ready
> Tue Dec 15 12:23:49 PST 2015: mmcommon mmfsup invoked. Parameters:
> 10.210.17.1 10.210.16.41 all
> Tue Dec 15 12:23:49 PST 2015: mounting /dev/hsgs
> Tue Dec 15 12:23:49.866 2015: Command: mount hsgs
> Tue Dec 15 12:23:49.949 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:23:49.950 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:23:49.957 2015: VERBS RDMA connecting to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
> Tue Dec 15 12:23:49.958 2015: VERBS RDMA connected to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 1
> Tue Dec 15 12:23:49.962 2015: VERBS RDMA connecting to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
> Tue Dec 15 12:23:49.963 2015: VERBS RDMA connected to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 0
> Tue Dec 15 12:23:49.980 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
> Tue Dec 15 12:23:49.981 2015: Retry connection to 10.210.16.11
> hs-ln01.local <c0n5>
> Tue Dec 15 12:23:49.980 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
> Tue Dec 15 12:25:05.321 2015: Request sent to 10.210.16.41 (hs-gs-02) to
> expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:25:05.322 2015: This node will be expelled from cluster
> HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
> Tue Dec 15 12:25:05.323 2015: This node is being expelled from the cluster.
> Tue Dec 15 12:25:05.324 2015: Lost membership in cluster
> HS-GS-Cluster.hs-gs-01. Unmounting file systems.
> Tue Dec 15 12:25:05.325 2015: VERBS RDMA closed connection to
> 10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
> Tue Dec 15 12:25:05.326 2015: VERBS RDMA closed connection to
> 10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
> Tue Dec 15 12:25:05.327 2015: Cluster Manager connection broke. Probing
> cluster HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:25:06.413 2015: Command: err 2: mount hsgs
> Tue Dec 15 12:25:06.414 2015: Specified entity, such as a disk or file
> system, does not exist.
> mount: No such file or directory
> Tue Dec 15 12:25:06 PST 2015: finished mounting /dev/hsgs
> Tue Dec 15 12:25:06.569 2015: Quorum loss. Probing cluster
> HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:25:07.069 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:25:07.070 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:25:07.071 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:25:07.072 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:25:12.072 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
> Tue Dec 15 12:25:12.073 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
> Tue Dec 15 12:25:59.585 2015: Node 10.210.16.41 (hs-gs-02) is now the
> Group Leader.
> Tue Dec 15 12:25:59.596 2015: mmfsd ready
> Tue Dec 15 12:25:59 PST 2015: mmcommon mmfsup invoked. Parameters:
> 10.210.17.1 10.210.16.41 all
> Tue Dec 15 12:25:59 PST 2015: mounting /dev/hsgs
> Tue Dec 15 12:25:59.856 2015: Command: mount hsgs
> Tue Dec 15 12:25:59.934 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:25:59.935 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:25:59.941 2015: VERBS RDMA connecting to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
> Tue Dec 15 12:25:59.942 2015: VERBS RDMA connected to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 0
> Tue Dec 15 12:25:59.945 2015: VERBS RDMA connecting to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
> Tue Dec 15 12:25:59.947 2015: VERBS RDMA connected to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 1
> Tue Dec 15 12:25:59.963 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
> Tue Dec 15 12:25:59.964 2015: Retry connection to 10.210.16.11
> hs-ln01.local <c0n5>
> Tue Dec 15 12:25:59.965 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
> Tue Dec 15 12:27:15.457 2015: Request sent to 10.210.16.41 (hs-gs-02) to
> expel 10.210.16.11 (hs-ln01.local) from cluster HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:27:15.458 2015: This node will be expelled from cluster
> HS-GS-Cluster.hs-gs-01 due to expel msg from 10.210.17.1 (cn1.local)
> Tue Dec 15 12:27:15.459 2015: This node is being expelled from the cluster.
> Tue Dec 15 12:27:15.460 2015: Lost membership in cluster
> HS-GS-Cluster.hs-gs-01. Unmounting file systems.
> Tue Dec 15 12:27:15.461 2015: VERBS RDMA closed connection to
> 10.210.16.41 (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 0
> Tue Dec 15 12:27:15.462 2015: Cluster Manager connection broke. Probing
> cluster HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:27:15.463 2015: VERBS RDMA closed connection to
> 10.210.16.40 (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 1
> Tue Dec 15 12:27:16.578 2015: Command: err 2: mount hsgs
> Tue Dec 15 12:27:16.579 2015: Specified entity, such as a disk or file
> system, does not exist.
> mount: No such file or directory
> Tue Dec 15 12:27:16 PST 2015: finished mounting /dev/hsgs
> Tue Dec 15 12:27:16.938 2015: Quorum loss. Probing cluster
> HS-GS-Cluster.hs-gs-01
> Tue Dec 15 12:27:17.439 2015: Connecting to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:27:17.440 2015: Connected to 10.210.16.40 hs-gs-01 <c0p0>
> Tue Dec 15 12:27:17.441 2015: Connecting to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:27:17.442 2015: Connected to 10.210.16.41 hs-gs-02 <c0p1>
> Tue Dec 15 12:27:22.442 2015: Connecting to 10.210.16.42 hs-gs-03 <c0p2>
> Tue Dec 15 12:27:22.443 2015: Connected to 10.210.16.42 hs-gs-03 <c0p2>
> Tue Dec 15 12:28:09.955 2015: Node 10.210.16.41 (hs-gs-02) is now the
> Group Leader.
> Tue Dec 15 12:28:09.965 2015: mmfsd ready
> Tue Dec 15 12:28:10 PST 2015: mmcommon mmfsup invoked. Parameters:
> 10.210.17.1 10.210.16.41 all
> Tue Dec 15 12:28:10 PST 2015: mounting /dev/hsgs
> Tue Dec 15 12:28:10.222 2015: Command: mount hsgs
> Tue Dec 15 12:28:10.314 2015: Connecting to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:28:10.315 2015: Connected to 10.210.16.43 hs-gs-04 <c0n3>
> Tue Dec 15 12:28:10.322 2015: VERBS RDMA connecting to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 index 1
> Tue Dec 15 12:28:10.323 2015: VERBS RDMA connected to 10.210.16.41
> (hs-gs-02) on mlx4_0 port 1 fabnum 0 sl 0 index 1
> Tue Dec 15 12:28:10.326 2015: VERBS RDMA connecting to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 index 0
> Tue Dec 15 12:28:10.328 2015: VERBS RDMA connected to 10.210.16.40
> (hs-gs-01) on mlx4_0 port 1 fabnum 0 sl 0 index 0
> Tue Dec 15 12:28:10.344 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
> Tue Dec 15 12:28:10.345 2015: Retry connection to 10.210.16.11
> hs-ln01.local <c0n5>
> Tue Dec 15 12:28:10.346 2015: Close connection to 10.210.16.11
> hs-ln01.local <c0n5> (No route to host)
>
>
>
> All the IB / RDMA output looks OK to me, but as soon as the GPFS
> clients connect, they try to expel each other. The four NSD servers
> seem fine, though. Trying Mellanox OFED 3.x yields the same results,
> so I don't think it's an IB issue.
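>
> (For reference, a sketch of the checks that make me think IB is fine --
> ibstat and ibv_devinfo are standard OFED tools, and mmdiag ships with
> GPFS 3.4 and later:)
>
> [root@cn1 ~]# ibstat mlx4_0                       # port state Active, rate 56 (4x FDR)
> [root@cn1 ~]# ibv_devinfo -d mlx4_0 | grep state  # PORT_ACTIVE on the verbs side
> [root@cn1 ~]# mmdiag --network                    # per-node TCP connection status as GPFS sees it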
>
> [root@cn1 ~]# uname -r
> 2.6.32-573.8.1.el6.x86_64
> [root@cn1 ~]# rpm -qa | grep gpfs
> gpfs.gpl-3.5.0-29.noarch
> gpfs.docs-3.5.0-29.noarch
> gpfs.msg.en_US-3.5.0-29.noarch
> gpfs.base-3.5.0-29.x86_64
>
> Does anyone have any suggestions?
>
> Regards,

-- 
Alex Chekholko chekh at stanford.edu 347-401-4860



