[gpfsug-discuss] Odd networking/name resolution issue

Jaime Pinto pinto at scinet.utoronto.ca
Sat May 9 12:06:44 BST 2020


A GPFS cluster shouldn't rely on DNS for internal communication, management or data traffic.

As a starting point, make sure the IPs and names of all manager/quorum nodes and clients have *unique* entries in the hosts files of all other nodes in the cluster, matching the names under which they were joined and licensed in the first place. If you issue 'mmlscluster' on the cluster manager, the servers and clients it lists should be used to build the common hosts file for all nodes involved.
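
For what it's worth, here is a rough Python sketch of that step. It assumes the node rows in the 'mmlscluster' output look like "<node number> <daemon node name> <IP address> <admin node name> [designation]", so check that against your own release before trusting it. The function names (parse_nodes, find_duplicates, render_hosts) are made up for this example and are not part of any GPFS tooling.

```python
import re
from collections import Counter

# ASSUMPTION: a node row in 'mmlscluster' output has the shape
#   <node number>  <daemon node name>  <IPv4 address>  <admin node name>  [designation]
# Verify this against the output of your own release.
NODE_ROW = re.compile(r"^\s*\d+\s+(\S+)\s+(\d{1,3}(?:\.\d{1,3}){3})\s+\S+")


def parse_nodes(mmlscluster_text):
    """Return a list of (ip, daemon_node_name) tuples, one per node row."""
    entries = []
    for line in mmlscluster_text.splitlines():
        match = NODE_ROW.match(line)
        if match is None:
            continue  # header, separator or cluster summary line
        daemon_name, ip = match.groups()
        entries.append((ip, daemon_name))
    return entries


def find_duplicates(entries):
    """Return (duplicate_ips, duplicate_names), each as a sorted list."""
    ip_counts = Counter(ip for ip, _ in entries)
    name_counts = Counter(name for _, name in entries)
    dup_ips = sorted(ip for ip, count in ip_counts.items() if count > 1)
    dup_names = sorted(name for name, count in name_counts.items() if count > 1)
    return dup_ips, dup_names


def render_hosts(entries):
    """Format the entries as hosts-file lines: '<ip><TAB><name>'."""
    return "\n".join(f"{ip}\t{name}" for ip, name in entries)
```

Save the 'mmlscluster' output to a file, pass its contents to parse_nodes(), make sure find_duplicates() returns two empty lists, and only then distribute what render_hosts() produces. The sketch only emits the daemon node names; if your admin node names live on a separate network they need their own entries.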

Also, all nodes should have a common NTP configuration, pointing to the same *internal* NTP server, which should itself be reachable by the name/IP listed in the hosts file.

And obviously, you need a stable network, Ethernet or IB. Have a good monitoring tool in place, so you can rule out the network as a possible culprit. In the particular case of IB, check that the fabric managers are doing their jobs properly.

And keep one eye on the 'tail -f /var/mmfs/gen/mmfslog' output of the managers and the nodes being expelled for other clues.
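
Since the symptom here is a name with the domain repeated, one thing you could look for in those logs is exactly that pattern. Below is a minimal Python sketch that makes no assumption about the log line format: it just pulls out anything that looks like a dotted hostname and flags names whose trailing labels repeat. The function names are made up for this example.

```python
import re

# Anything that looks like a dotted name: labels of letters, digits or hyphens.
HOSTNAME = re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def has_repeated_suffix(name):
    """True if the trailing k labels of 'name' repeat the k labels before them."""
    labels = name.lower().rstrip(".").split(".")
    if all(label.isdigit() for label in labels):
        return False  # IP addresses and version numbers are not hostnames
    for k in range(1, len(labels) // 2 + 1):
        if labels[-k:] == labels[-2 * k:-k]:
            return True
    return False


def suspicious_names(lines):
    """Return the sorted, de-duplicated names with a repeated domain suffix."""
    found = set()
    for line in lines:
        for candidate in HOSTNAME.findall(line):
            if has_repeated_suffix(candidate):
                found.add(candidate)
    return sorted(found)
```

Feed it the lines of the log from a manager and from an expelled node; any name it returns is worth comparing against the hosts file and the 'mmlscluster' output.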

Jaime



On 5/9/2020 06:25:28, TURNER Aaron wrote:
> Dear All,
> 
> We are getting, on an intermittent basis with currently no obvious pattern, an issue with GPFS nodes reporting rejecting nodes of the form:
> 
> nodename.domain.domain.domain....
> 
> DNS resolution using the standard command-line tools of the IP address present in the logs does not repeat the domain, and so far it seems isolated to GPFS.
> 
> Ultimately the nodes are rejected as not responding on the network.
> 
> Has anyone seen this sort of behaviour before?
> 
> Regards
> 
> Aaron Turner
> 

---
Jaime Pinto - Storage Analyst
SciNet HPC Consortium - Compute/Calcul Canada
www.scinet.utoronto.ca - www.computecanada.ca
University of Toronto
661 University Ave. (MaRS), Suite 1140
Toronto, ON, M5G1M1
P: 416-978-2755
C: 416-505-1477


