[gpfsug-discuss] Inter-clusters issue with change of the subnet IP

Ivano Talamo Ivano.Talamo at psi.ch
Thu May 24 14:51:56 BST 2018


Hi all,

We currently have an issue with our GPFS clusters.
Shortly when we removed/added a node to a cluster we changed IP
address for the IPoIB subnet and this broke GPFS. The primary IP
didn't change.

In detail, our setup is quite standard: one GPFS cluster with CPU
nodes only, accessing (via remote cluster mount) several storage
clusters. The clusters are on an InfiniBand fabric and use IPoIB
for daemon communication via the subnets parameter.
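For reference, this is how such a configuration can be inspected; a
minimal sketch, assuming a standard subnets setup (the subnet value
shown is a placeholder, not our actual configuration):

```shell
# Show the subnets attribute currently set on the cluster
mmlsconfig subnets

# A subnets setting of this shape tells the daemons to prefer the
# IPoIB network for inter-node traffic (placeholder subnet):
#   subnets 192.168.0.0
```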

Yesterday some nodes were added to the CPU cluster with the correct
primary IP addresses but incorrect IPoIB ones. Incorrect in the
sense that the IPoIB addresses were already in use by other nodes
in the same CPU cluster.

This made all the clusters (not only the CPU one) suffer from a lot
of errors: GPFS restarting, file systems being unmounted. Removing
the wrong nodes brought the clusters back to a stable state.

But the really strange thing came when one of these nodes was
reinstalled, configured with the correct IPoIB address, and added
to the cluster again. At that point (when the node tried to mount
the remote file systems) the issue happened again.

In the log files we have lines like:
  2018-05-24_10:32:45.520+0200: [I] Accepted and connected to 
192.168.x.y <hostname> <c0n52>

where the IP address 192.168.x.y is the old/incorrect one.

And looking at mmdiag --network there are a bunch of lines like the
following:
     <hostname> <c039>  192.168.x.z broken 233 -1 0 0 L

with the wrong/old IPs. This appears on all clusters (CPU and
storage ones).
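To see how widespread the stale entries are, the broken connections
can be filtered out of the mmdiag output on each node; a minimal
sketch (the grep pattern matches the "broken" state shown above):

```shell
# List the connections the local daemon considers broken, together
# with the peer IP address it has cached for each of them
mmdiag --network | grep -i broken
```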

Is it possible that the other nodes in the clusters use this
outdated information when the reinstalled node is brought back into
the cluster? Is there any kind of timeout, so that after some time
this information is purged? Or is there any procedure we could use
to cleanly reintroduce the nodes?

Otherwise we see no option but to restart GPFS on all the nodes of
all clusters, one by one, to make sure that the incorrect
information goes away.
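One possible shape of such a rolling restart, sketched below; node
names are placeholders, and this assumes the file systems can
tolerate losing one node at a time:

```shell
# Restart the GPFS daemon on one node at a time so the file systems
# stay mounted on the remaining nodes (placeholder node names)
for node in node01 node02 node03; do
    mmshutdown -N "$node"
    mmstartup  -N "$node"
    # check that the node has rejoined before moving to the next one
    mmgetstate -N "$node"
done
```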

Thanks,
Ivano


