[gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR

Simon Thompson S.J.Thompson at bham.ac.uk
Fri Jul 9 16:00:37 BST 2021


If you have multiple switches, this could be a faulty ISL (or a faulty link to your NSD servers). I would look for symbol errors on the ports; counters that keep climbing rapidly indicate a cable fault.
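
To make "counters that keep climbing" concrete, here is a minimal sketch that compares two snapshots of per-port symbol error counters taken some minutes apart. How the counters are collected is not shown; the port labels and the `min_delta` threshold are hypothetical and only there for illustration.

```python
def growing_symbol_errors(before, after, min_delta=1):
    """Return {port: increase} for ports whose symbol error counter grew.

    before/after map a port label to its symbol error counter value.
    A port missing from `before` is treated as starting from 0.
    """
    grown = {}
    for port, new_value in after.items():
        delta = new_value - before.get(port, 0)
        if delta >= min_delta:
            grown[port] = delta
    # Largest increase first, so the most suspicious cable is at the top.
    return dict(sorted(grown.items(), key=lambda kv: kv[1], reverse=True))


if __name__ == "__main__":
    # Hypothetical snapshots; the labels are made up for this example.
    snap1 = {"switch1/port17": 10, "switch1/port18": 0}
    snap2 = {"switch1/port17": 4210, "switch1/port18": 0}
    for port, delta in growing_symbol_errors(snap1, snap2).items():
        print(f"{port}: +{delta} symbol errors")
```

A port whose counter is non-zero but static is less interesting than one that grows between snapshots, which is why the sketch reports deltas rather than absolute values.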

Simon


From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of "olaf.weiser at de.ibm.com" <olaf.weiser at de.ibm.com>
Reply to: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Date: Friday, 9 July 2021 at 12:36
To: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Cc: "gpfsug-discuss at spectrumscale.org" <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR

Smells like a network problem.

IBV_WC_RETRY_EXC_ERR comes from OFED and clearly says that the data didn't get through successfully.

For further help, check:
ibstat
iblinkinfo
ibdiagnet
and sminfo (should be the same on all members)




----- Original message -----
From: "Iban Cabrillo" <cabrillo at ifca.unican.es>
Sent by: gpfsug-discuss-bounces at spectrumscale.org
To: "gpfsug-discuss" <gpfsug-discuss at spectrumscale.org>
Cc:
Subject: [EXTERNAL] [gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR
Date: Fri, 9 Jul 2021 13:29

Dear all,
    For the last couple of hours we have been seeing lots of IB errors in the GPFS logs, on every IB node (GPFS version is 5.0.4-3):

2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.73 (node157) on mlx5_0 port 1 fabnum 0 index 251 cookie 648 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.18 (node102) on mlx5_0 port 1 fabnum 0 index 227 cookie 687 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.17 (node101) on mlx5_0 port 1 fabnum 0 index 298 cookie 693 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.151.6 (node6) on mlx5_0 port 1 fabnum 0 index 18 cookie 696 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.152.46 (node130) on mlx5_0 port 1 fabnum 0 index 254 cookie 680 RDMA write error IBV_WC_RETRY_EXC_ERR
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.151.81 (node81) on mlx5_0 port 1 fabnum 0 index 289 cookie 679 RDMA read error IBV_WC_RETRY_EXC_ERR
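
When errors like these show up on many nodes at once, it can help to count them per peer to see whether they cluster behind one switch or link. The following is a rough sketch only: the regular expression is modelled on the log lines quoted above and may need adjusting for other GPFS versions, and the function name is made up for this example.

```python
import re
from collections import Counter

# Pattern modelled on the quoted log lines; not an official log format spec.
VERBS_ERR = re.compile(
    r"VERBS RDMA closed connection to (?P<ip>\S+) \((?P<node>[^)]+)\) "
    r"on (?P<dev>\S+) port (?P<port>\d+) fabnum (?P<fab>\d+) .*?"
    r"RDMA (?P<op>read|write) error (?P<err>IBV_WC_\w+)"
)


def summarize_verbs_errors(lines):
    """Count closed-connection errors per (peer node, local device, error code)."""
    counts = Counter()
    for line in lines:
        m = VERBS_ERR.search(line)
        if m:
            counts[(m["node"], m["dev"], m["err"])] += 1
    return counts


if __name__ == "__main__":
    sample = [
        "2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to "
        "10.10.152.73 (node157) on mlx5_0 port 1 fabnum 0 index 251 cookie 648 "
        "RDMA write error IBV_WC_RETRY_EXC_ERR",
        "2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to "
        "10.10.151.81 (node81) on mlx5_0 port 1 fabnum 0 index 289 cookie 679 "
        "RDMA read error IBV_WC_RETRY_EXC_ERR",
    ]
    for (node, dev, err), n in summarize_verbs_errors(sample).most_common():
        print(n, node, dev, err)
```

Feeding the full log from several nodes through this and sorting by count gives a quick list of peers to check first.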

and of course long waiters:

=== mmdiag: waiters ===
Waiting 34.8493 sec since 13:11:35, ignored, thread 2935 VerbsReconnectThread: delaying for 25.150686000 more seconds, reason: delaying for next reconnect attempt
Waiting 34.6249 sec since 13:11:35, ignored, thread 10198 VerbsReconnectThread: delaying for 25.375072000 more seconds, reason: delaying for next reconnect attempt
Waiting 27.0957 sec since 13:11:43, ignored, thread 10052 VerbsReconnectThread: delaying for 32.904264000 more seconds, reason: delaying for next reconnect attempt
Waiting 14.8909 sec since 13:11:55, monitored, thread 23135 NSDThread: for RDMA write completion fast on node 10.10.151.65 <c0n258>
Waiting 14.8891 sec since 13:11:55, monitored, thread 23109 NSDThread: for RDMA write completion fast on node 10.10.152.32 <c0n247>
Waiting 14.8865 sec since 13:11:55, monitored, thread 23302 NSDThread: for RDMA write completion fast on node 10.10.150.1 <c0n29>
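
To separate the waiters that matter (the monitored NSDThread entries) from the reconnect back-off entries marked "ignored", a small filter over the waiter lines can be useful. This is a sketch under assumptions: the regular expression follows the lines quoted above, and the 10-second threshold and the function name are arbitrary choices for illustration.

```python
import re

# Pattern modelled on the waiter lines quoted above.
WAITER = re.compile(
    r"Waiting (?P<secs>[\d.]+) sec since (?P<since>[\d:]+), (?P<kind>\w+), "
    r"thread (?P<tid>\d+) (?P<thread>\w+): (?P<what>.*)"
)


def long_waiters(lines, threshold=10.0, monitored_only=True):
    """Return (seconds, thread name, description) tuples, longest wait first."""
    found = []
    for line in lines:
        m = WAITER.match(line.strip())
        if not m:
            continue
        if monitored_only and m["kind"] != "monitored":
            continue
        secs = float(m["secs"])
        if secs >= threshold:
            found.append((secs, m["thread"], m["what"]))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    sample = [
        "Waiting 34.8493 sec since 13:11:35, ignored, thread 2935 "
        "VerbsReconnectThread: delaying for 25.150686000 more seconds, "
        "reason: delaying for next reconnect attempt",
        "Waiting 14.8909 sec since 13:11:55, monitored, thread 23135 "
        "NSDThread: for RDMA write completion fast on node 10.10.151.65 <c0n258>",
    ]
    for secs, thread, what in long_waiters(sample):
        print(f"{secs:8.3f}s {thread}: {what}")
```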

[common]
verbsRdma enable
verbsPorts mlx4_0/1/0
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08]
verbsPorts mlx5_0/1/0
[gpfs01]
verbsPorts mlx5_1/1/0
[gpfs03]
verbsPorts mlx5_0/1/0 mlx5_1/1/0


[common]
verbsRdma enable
verbsPorts mlx4_0/1/0
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08,wngpu001,wngpu002,wngpu003,wngpu004,wngpu005]
verbsPorts mlx5_0/1/0
[gpfs01]
verbsPorts mlx5_1/1/0
[gpfs03]
verbsPorts mlx5_0/1/0 mlx5_1/1/0
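
To check which verbsPorts value a given node ends up with, the stanza listing above can be parsed into a per-node view. This is only a sketch: it assumes that a node-specific stanza overrides the [common] value and that each verbsPorts entry reads device/port/fabric number (which matches the "mlx5_0 port 1 fabnum 0" wording in the log). The function names are made up for this example.

```python
def parse_stanzas(text):
    """Return {stanza name: {key: value}}; a header like [a,b] applies to a and b."""
    config = {}
    current = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            current = [name.strip() for name in line[1:-1].split(",")]
            for name in current:
                config.setdefault(name, {})
            continue
        key, _, value = line.partition(" ")
        for name in current:
            config[name][key] = value.strip()
    return config


def verbs_ports(config, node):
    """Effective verbsPorts for a node as (device, port, fabric) tuples.

    Assumption: a node-specific stanza overrides the [common] setting.
    """
    merged = {**config.get("common", {}), **config.get(node, {})}
    return [tuple(entry.split("/")) for entry in merged.get("verbsPorts", "").split()]


if __name__ == "__main__":
    listing = """
[common]
verbsRdma enable
verbsPorts mlx4_0/1/0
[gpfs01]
verbsPorts mlx5_1/1/0
[gpfs03]
verbsPorts mlx5_0/1/0 mlx5_1/1/0
"""
    cfg = parse_stanzas(listing)
    for node in ("gpfs01", "gpfs03", "node157"):
        print(node, verbs_ports(cfg, node))
```

Under these assumptions, a node without its own stanza falls back to the [common] device, so it is worth comparing the result with the device named in that node's log lines.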

Any advice is welcome.
regards, I

