[gpfsug-discuss] VerbsReconnectThread waiters

IBM Spectrum Scale scale at us.ibm.com
Mon Sep 16 10:33:58 BST 2019


Damir, Joseph,

> Is this something to pay attention to, and what does this waiter mean?
This waiter means that GPFS is failing to re-establish a broken verbs
connection, which can cause performance degradation.

> I have seen these on our cluster after the IB network goes down (GPFS
> still runs over ethernet) and then comes back up.  They will retry forever
> it seems, even after the IB is healthy again.
> Restarting GPFS on the nodes with waiters has fixed the issue for me, I
> don't know if IBM has any other tricks to fix this without a restart.

This is a code bug that has been fixed under internal defect 1090669. The
fix will be backported to service releases after verification.
There is a workaround that can fix this problem without a restart:
-   On each node that shows this waiter, run the command
    'mmfsadm test breakconn all 744'
    744 is E_RECONNECT, which triggers a TCP reconnect and does not cause a
    node leave/rejoin. As a side effect, it clears the RDMA connections and
    their incorrect status.
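
To make the workaround above easier to apply across several nodes, here is a
minimal Python sketch that picks out the nodes reporting a
VerbsReconnectThread waiter and pairs each one with the command quoted above.
It is only an illustration: the line shape it expects ("<node>: Waiting <secs>
sec ... VerbsReconnectThread ...", one waiter per line) is an assumption
modelled on the waiter quoted in this thread, and the function names are
hypothetical. It prints the command rather than running it, so how you execute
it on each node is left to you.

```python
import re
import shlex

# Assumed line shape, modelled on the waiter quoted in this thread:
#   "<node>: Waiting <secs> sec since <hh:mm:ss>, ... VerbsReconnectThread: ..."
WAITER_RE = re.compile(
    r"^\s*(?P<node>[\w.-]+):\s+Waiting\s+(?P<secs>\d+(?:\.\d+)?)\s+sec\b"
    r".*VerbsReconnectThread"
)

# Command quoted from the reply above; 744 is E_RECONNECT.
WORKAROUND = ["mmfsadm", "test", "breakconn", "all", "744"]


def nodes_with_reconnect_waiters(waiter_text):
    """Return the sorted node names that report a VerbsReconnectThread waiter."""
    nodes = set()
    for line in waiter_text.splitlines():
        match = WAITER_RE.match(line)
        if match:
            nodes.add(match.group("node"))
    return sorted(nodes)


def workaround_commands(waiter_text):
    """Map each affected node to the command line to run on that node."""
    cmd = " ".join(shlex.quote(part) for part in WORKAROUND)
    return {node: cmd for node in nodes_with_reconnect_waiters(waiter_text)}


if __name__ == "__main__":
    # The waiter from the original question, joined onto one line.
    sample = (
        "gss01: Waiting 16.8543 sec since 09:07:02, ignored, thread 46230 "
        "VerbsReconnectThread: delaying for 43.145624000 more seconds, "
        "reason: delaying for next reconnect attempt\n"
    )
    for node, cmd in workaround_commands(sample).items():
        print(f"{node}: {cmd}")
```

Feed it whatever waiter listing you already collect; if your listing does not
carry a node-name prefix, the regular expression will need adjusting.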

Regards, The Spectrum Scale (GPFS) team

------------------------------------------------------------------------------------------------------------------

If you feel that your question can benefit other users of Spectrum Scale
(GPFS), then please post it to the public IBM developerWorks Forum at
https://www.ibm.com/developerworks/community/forums/html/forum?id=11111111-0000-0000-0000-000000000479.


If your query concerns a potential software error in Spectrum Scale (GPFS)
and you have an IBM software maintenance contract, please contact
1-800-237-5511 in the United States or your local IBM Service Center in
other countries.

The forum is informally monitored as time permits and should not be used
for priority messages to the Spectrum Scale (GPFS) team.



From:	Joseph Mendoza <jam at ucar.edu>
To:	gpfsug-discuss at spectrumscale.org
Date:	2019/09/14 12:08 AM
Subject:	[EXTERNAL] Re: [gpfsug-discuss] VerbsReconnectThread waiters
Sent by:	gpfsug-discuss-bounces at spectrumscale.org



I have seen these on our cluster after the IB network goes down (GPFS still
runs over ethernet) and then comes back up.  They will retry forever it
seems, even after the IB is healthy again.  The effect they seem to have is
that verbs connections between some nodes break and GPFS uses
ethernet/IPoIB instead.  You may see messages in your mmfs.log.latest about
verbs being disabled "due to too many errors".  You can also see fewer
verbs connections between nodes in "mmfsadm test verbs conn" output.
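
As a rough illustration of the log check described above, the Python sketch
below scans log text for lines that mention verbs together with the phrase
"due to too many errors". Only that quoted fragment is matched, since the full
message text is not given in this thread. The log path is an assumption (the
usual default location) and the function name is hypothetical; adjust both to
your installation.

```python
from pathlib import Path

# Assumed default log location; adjust if your installation differs.
DEFAULT_LOG = Path("/var/adm/ras/mmfs.log.latest")

# Only the fragment quoted in this thread is matched.
PHRASE = "due to too many errors"


def verbs_disabled_lines(log_text):
    """Return log lines that mention verbs together with the quoted phrase."""
    hits = []
    for line in log_text.splitlines():
        lowered = line.lower()
        if "verbs" in lowered and PHRASE in lowered:
            hits.append(line.rstrip())
    return hits


if __name__ == "__main__":
    if DEFAULT_LOG.exists():
        text = DEFAULT_LOG.read_text(errors="replace")
        for hit in verbs_disabled_lines(text):
            print(hit)
    else:
        print(f"{DEFAULT_LOG} not found; pass log text to verbs_disabled_lines()")
```

An empty result does not prove the verbs connections are healthy; comparing
the "mmfsadm test verbs conn" output between nodes is still worth doing.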


Restarting GPFS on the nodes with waiters has fixed the issue for me, I
don't know if IBM has any other tricks to fix this without a restart.


--Joey





On 9/12/19 8:16 AM, Damir Krstic wrote:
      On my cluster I have seen a couple of long waiters such as this:

      gss01: Waiting 16.8543 sec since 09:07:02, ignored, thread 46230
      VerbsReconnectThread: delaying for 43.145624000 more seconds, reason:
      delaying for next reconnect attempt

      I tried searching the GPFS wiki for this type of waiter, but was
      unable to find anything of value.

      Is this something to pay attention to, and what does this waiter
      mean?

      Thank you.
      Damir

      _______________________________________________
      gpfsug-discuss mailing list
      gpfsug-discuss at spectrumscale.org
      http://gpfsug.org/mailman/listinfo/gpfsug-discuss