[gpfsug-discuss] Odd networking/name resolution issue
Jaime Pinto
pinto at scinet.utoronto.ca
Sat May 9 12:06:44 BST 2020
DNS shouldn't be relied upon on a GPFS cluster for internal communication/management or data.
As a starting point, make sure the IPs and names of all manager/quorum nodes and clients have *unique* entries in the hosts files of all other nodes in the clusters, matching the names and addresses under which they were joined and licensed in the first place. If you issue 'mmlscluster' on the cluster manager for the servers and clients, those results should be used to build the common hosts file for all nodes involved.
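As a rough illustration of the "unique entries" check, here is a minimal Python sketch that reads hosts-format text and reports IPs that appear on more than one line, and names that map to more than one IP. The function name and the sample addresses are made up for the example; it assumes the standard "IP name [aliases...]" hosts layout and knows nothing about GPFS itself.

```python
from collections import defaultdict


def find_hosts_conflicts(hosts_text):
    """Return (duplicate_ips, ambiguous_names) for hosts-format text.

    duplicate_ips: sorted list of IPs that appear on more than one line.
    ambiguous_names: dict mapping a name to the sorted list of IPs it
    resolves to, for names that appear with more than one IP.
    """
    ips_by_name = defaultdict(set)
    line_count_by_ip = defaultdict(int)

    for raw in hosts_text.splitlines():
        # Drop comments and blank lines.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        ip, names = fields[0], fields[1:]
        line_count_by_ip[ip] += 1
        for name in names:
            # Host names are case-insensitive.
            ips_by_name[name.lower()].add(ip)

    duplicate_ips = sorted(ip for ip, n in line_count_by_ip.items() if n > 1)
    ambiguous_names = {
        name: sorted(ips) for name, ips in ips_by_name.items() if len(ips) > 1
    }
    return duplicate_ips, ambiguous_names


if __name__ == "__main__":
    # Hypothetical path; point it at the common hosts file you distribute.
    with open("/etc/hosts") as fh:
        dup_ips, amb_names = find_hosts_conflicts(fh.read())
    print("IPs on more than one line:", dup_ips)
    print("Names with more than one IP:", amb_names)
```

Note that a stock hosts file usually lists "localhost" under both 127.0.0.1 and ::1, so expect that name to be reported; it is harmless.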
Also, all nodes should have a common NTP configuration, pointing to the same *internal* NTP server, reachable by a name/IP that is also in the hosts file.
And obviously, you need a stable network, eth or IB. Have a good monitoring tool in place, to rule out network as a possible culprit. In the particular case of IB, check that the fabric managers are doing their jobs properly.
And keep one eye on the 'tail -f /var/mmfs/gen/mmfslog' output of the managers and the nodes being expelled for other clues.
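If it helps when going through those logs, the following is a small Python sketch that flags host names in which a run of labels is immediately repeated (the nodename.domain.domain pattern from the original report). The function names and the sample log line in the test are invented for illustration; no particular GPFS log format is assumed, only that host names appear somewhere in the text.

```python
import re

# Dotted tokens made of letters, digits and hyphens, e.g. "node1.example.org".
HOSTNAME_RE = re.compile(r"\b[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+\b")


def has_repeated_suffix(fqdn):
    """True if some run of labels is immediately repeated, e.g. a.b.c.b.c."""
    labels = fqdn.rstrip(".").lower().split(".")
    n = len(labels)
    for size in range(1, n // 2 + 1):
        for start in range(0, n - 2 * size + 1):
            first = labels[start:start + size]
            second = labels[start + size:start + 2 * size]
            if first == second:
                return True
    return False


def suspicious_names(lines):
    """Return the sorted set of host-like tokens with a repeated label run."""
    found = set()
    for line in lines:
        for token in HOSTNAME_RE.findall(line):
            # Skip purely numeric tokens such as IPv4 addresses.
            if any(c.isalpha() for c in token) and has_repeated_suffix(token):
                found.add(token)
    return sorted(found)


if __name__ == "__main__":
    import sys
    # Usage: python check_names.py < some_log_file
    for name in suspicious_names(sys.stdin):
        print(name)
```

A legitimately repeated label (say, a host called "a.a.example") would also be flagged, so treat the output as a list of things to look at rather than a verdict.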
Jaime
On 5/9/2020 06:25:28, TURNER Aaron wrote:
> Dear All,
>
> We are seeing, intermittently and so far with no obvious pattern, GPFS nodes reporting that they are rejecting nodes with names of the form:
>
> nodename.domain.domain.domain....
>
> DNS resolution using the standard command-line tools of the IP address present in the logs does not repeat the domain, and so far it seems isolated to GPFS.
>
> Ultimately the nodes are rejected as not responding on the network.
>
> Has anyone seen this sort of behaviour before?
>
> Regards
>
> Aaron Turner
> The University of Edinburgh is a charitable body, registered in Scotland, with registration number SC005336.
---
Jaime Pinto - Storage Analyst
SciNet HPC Consortium - Compute/Calcul Canada
www.scinet.utoronto.ca - www.computecanada.ca
University of Toronto
661 University Ave. (MaRS), Suite 1140
Toronto, ON, M5G1M1
P: 416-978-2755
C: 416-505-1477