[gpfsug-discuss] VerbsReconnectThread waiters

Fri Sep 13 17:07:01 BST 2019

I have seen these on our cluster after the IB network goes down (GPFS keeps running over Ethernet) and then comes back up.
They seem to retry forever, even after the IB is healthy again.  The effect seems to be that verbs connections
between some nodes break and GPFS uses Ethernet/IPoIB instead.  You may see messages in your
mmfs.log.latest about verbs being disabled "due to too many errors".  You can also see fewer verbs connections between
nodes in the "mmfsadm test verbs conn" output.

Restarting GPFS on the nodes with waiters has fixed the issue for me.  I don't know whether IBM has any other tricks to
fix this without a restart.
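
If it helps to find which nodes need the restart, here is a rough Python sketch that picks the VerbsReconnectThread
waiters out of collected waiter output.  It assumes each waiter sits on one unwrapped line and looks like the one
Damir quoted below ("node: Waiting N sec ... thread ID VerbsReconnectThread ...").  The pattern and the function name
are my own guesses based on that single line, not anything documented by IBM, so other releases may need a tweak.

```python
import re

# Pattern modelled on the single waiter line quoted in this thread.
# This is an assumption: other GPFS releases may format waiters differently.
WAITER_RE = re.compile(
    r"^(?P<node>[^\s:]+):\s+Waiting\s+(?P<secs>\d+(?:\.\d+)?)\s+sec\b.*"
    r"\bthread\s+(?P<thread>\d+)\s+VerbsReconnectThread\b"
)


def nodes_with_reconnect_waiters(text):
    """Return {node: longest wait in seconds} for VerbsReconnectThread waiters.

    `text` is waiter output gathered from the cluster, one waiter per line,
    each line prefixed with "<node>:" (hypothetical helper, not a GPFS tool).
    """
    longest = {}
    for line in text.splitlines():
        match = WAITER_RE.match(line.strip())
        if match is None:
            continue  # not a VerbsReconnectThread waiter, skip it
        node = match.group("node")
        secs = float(match.group("secs"))
        # Keep only the longest wait seen for each node.
        longest[node] = max(secs, longest.get(node, 0.0))
    return longest
```

Feed it the saved waiter output as one string and it returns the node names to look at; any line that does not match
the pattern is simply ignored.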

--Joey

On 9/12/19 8:16 AM, Damir Krstic wrote:
> On my cluster I have seen a couple of long waiters such as this:
>
> gss01: Waiting 16.8543 sec since 09:07:02, ignored, thread 46230 VerbsReconnectThread: delaying for 43.145624000 more
> seconds, reason: delaying for next reconnect attempt
>
> I tried searching the GPFS wiki for this type of waiter, but was unable to find anything of value.
>
> Is this something to pay attention to, and what does this waiter mean?
>
> Thank you.
> Damir