[gpfsug-discuss] Odd networking/name resolution issue

Josh Catana jcatana at gmail.com
Sun May 10 20:09:39 BST 2020


I've seen odd behavior like this before, and it comes down to name
resolution.
It might come from your local /etc/hosts entries, from the names used to
add the nodes to the cluster, or from DNS aliases that are configured
improperly.

In my case someone added DNS aliases for use in our cluster with the FQDN
instead of the short name, which caused the triple append to appear in the
logs you mentioned.

I don't think it hurts anything, since GPFS has its own name-to-IP table,
but you probably want to track it down and fix it to be safe.
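
One way to hunt for it is to scan the node names you collect (from
/etc/hosts, the cluster node list, or the logs) for a domain that has been
appended more than once, and compare against what the system resolver
returns. Below is a minimal Python sketch, not anything GPFS ships:
count_domain_repeats and resolve are hypothetical helper names, and
example.com is a placeholder for your own domain.

```python
import socket


def count_domain_repeats(name: str, domain: str) -> int:
    """Count how many times `domain` is stacked on the end of `name`.

    A healthy FQDN gives 1, a short name gives 0, and anything above 1
    suggests the domain suffix was appended more than once.
    """
    name = name.rstrip(".").lower()
    domain = domain.strip(".").lower()
    if not domain:
        return 0
    suffix = "." + domain
    count = 0
    while name.endswith(suffix):
        name = name[: -len(suffix)]
        count += 1
    return count


def resolve(name: str) -> list:
    """Return the sorted addresses the system resolver gives for `name`.

    An empty list means the name did not resolve. This goes through the
    normal resolver path (hosts file, DNS, search domains), which is an
    assumption about where the odd names come from.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})
```

For example, count_domain_repeats("node1.example.com.example.com", "example.com")
returns 2, which would flag that entry. Running resolve() on both the short
name and the FQDN of a node and comparing the results may show whether an
alias points somewhere unexpected.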




On Sun, May 10, 2020, 2:31 PM RICHARD RUPP <richard.rupp at us.ibm.com> wrote:

> *Normally the DNS server publishes a TTL and the client side caches the
> info until the TTL expires. Could the server side be misconfigured with a
> very short TTL?*
>
>
> Regards,
>
> *Richard Rupp*, Sales Specialist, *Phone:* *1-347-510-6746*
>
>
>
> From: Jaime Pinto <pinto at scinet.utoronto.ca>
> To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>,
> TURNER Aaron <aaron.turner at ed.ac.uk>
> Date: 05/10/2020 09:28 AM
> Subject: [EXTERNAL] Re: [gpfsug-discuss] Odd networking/name resolution
> issue
> Sent by: gpfsug-discuss-bounces at spectrumscale.org
> ------------------------------
>
>
>
> The rationale for my suggestion doesn't have much to do with the central
> DNS server, but everything to do with the DNS client side of the service.
> Suppose you have a very busy cluster at times, with a number of nodes
> pushing 1000+ IOPS, so busy that the OS on the client can barely spare a
> cycle to query the DNS server for the IP associated with the name of the
> interface leading to the GPFS infrastructure, or even to process the
> response when it returns, on the same interface where it is contending
> with all the GPFS data transactions. Then you can get temporary catch-22
> situations. This can generate a backlog of waiters, and eventually the
> expulsion of some nodes when the cluster managers don't hear from them in
> reasonable time.
>
> It doesn't really matter if you have a central DNS server on steroids.
>
> Jaime
>
> On 5/10/2020 03:35:29, TURNER Aaron wrote:
> > Following on from Jonathan Buzzard's comments, I'd also like to point
> > out that I've never known a central DNS failure in a UK HEI for as long
> > as I can remember, and it was certainly not my intention to suggest
> > that, as I think a central DNS issue is highly unlikely. And indeed, as
> > I originally noted, the standard command-line tools on the nodes resolve
> > the names as expected, so whatever is going on looks like it affects
> > GPFS only. It may even be that the repetition of the domain names in the
> > logs is just a side effect of how it logs when a node is failing to
> > connect for some other reason entirely. It's just not something I recall
> > having seen before and wanted to see if anyone else had seen it.
> >
> > ------------------------------
> > *From:* gpfsug-discuss-bounces at spectrumscale.org
> > <gpfsug-discuss-bounces at spectrumscale.org> on behalf of Jonathan Buzzard
> > <jonathan.buzzard at strath.ac.uk>
> > *Sent:* 09 May 2020 23:22
> > *To:* gpfsug-discuss at spectrumscale.org <gpfsug-discuss at spectrumscale.org>
> > *Subject:* Re: [gpfsug-discuss] Odd networking/name resolution issue
> > On 09/05/2020 12:06, Jaime Pinto wrote:
> >> DNS shouldn't be relied upon on a GPFS cluster for internal
> >> communication/management or data.
> >>
> >
> > The 1980s have called and want their lack of IP resolution protocols
> > back :-)
> >
> > I would kindly disagree. If your DNS is not working then your cluster is
> > fubar anyway and a zillion other things will also break very rapidly.
> > For us at least half of the running jobs would be dead in a few minutes
> > as failure to contact license servers would cause the software to stop.
> > All authentication and account lookups are going to fail as well.
> >
> > You could distribute a hosts file, but outside of a storage-only
> > cluster (as opposed to one with hundreds if not thousands of compute
> > nodes) that is frankly madness and will inevitably come to bite you in
> > the ass because they *will* get out of sync. The only hosts entry we
> > have is for the Salt Stack host because it tries to do things before the
> > DNS resolvers have been set up and consequently breaks otherwise. Which
> > IMHO is duff on its part.
> >
> > I would add I can't think of a time in the last 16 years where internal
> > DNS at any University I have worked at has stopped working for even one
> > millisecond. If DNS is that flaky at your institution then I suggest
> > sacking the people responsible for its maintenance as being incompetent
> > twits. It is just such a vanishingly remote possibility that it's not
> > worth bothering about. Frankly an aircraft falling out of the sky and
> > squishing your data centre seems more likely to me.
> >
> > Finally, in a world of IPv6, anything other than DNS is utter
> > madness IMHO.
> >
> >
> > JAB.
> >
> > --
> > Jonathan A. Buzzard                         Tel: +44141-5483420
> > HPC System Administrator, ARCHIE-WeSt.
> > University of Strathclyde, John Anderson Building, Glasgow. G4 0NG
> > _______________________________________________
> > gpfsug-discuss mailing list
> > gpfsug-discuss at spectrumscale.org
> > http://gpfsug.org/mailman/listinfo/gpfsug-discuss
> > The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> >
>
>

