[gpfsug-discuss] RDMA write error IBV_WC_RETRY_EXC_ERR

Iban Cabrillo cabrillo at ifca.unican.es
Fri Jul 9 12:19:07 BST 2021


Dear all, 
For the past couple of hours we have been seeing lots of IB errors in the GPFS logs, on every IB node (GPFS version is 5.0.4-3): 

2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.73 (node157) on mlx5_0 port 1 fabnum 0 index 251 cookie 648 RDMA write error IBV_WC_RETRY_EXC_ERR 
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.18 (node102) on mlx5_0 port 1 fabnum 0 index 227 cookie 687 RDMA write error IBV_WC_RETRY_EXC_ERR 
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.152.17 (node101) on mlx5_0 port 1 fabnum 0 index 298 cookie 693 RDMA write error IBV_WC_RETRY_EXC_ERR 
2021-07-09_13:11:40.600+0200: [E] VERBS RDMA closed connection to 10.10.151.6 (node6) on mlx5_0 port 1 fabnum 0 index 18 cookie 696 RDMA write error IBV_WC_RETRY_EXC_ERR 
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.152.46 (node130) on mlx5_0 port 1 fabnum 0 index 254 cookie 680 RDMA write error IBV_WC_RETRY_EXC_ERR 
2021-07-09_13:11:40.601+0200: [E] VERBS RDMA closed connection to 10.10.151.81 (node81) on mlx5_0 port 1 fabnum 0 index 289 cookie 679 RDMA read error IBV_WC_RETRY_EXC_ERR 
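
To see whether the errors cluster on particular nodes or adapters, the log lines above could be tallied with a short script. This is only a minimal sketch: it assumes the exact "[E] VERBS RDMA closed connection" line format shown above, and the names VERBS_ERR and tally_verbs_errors are hypothetical helpers, not part of GPFS.

```python
import re
from collections import Counter

# Matches the "[E] VERBS RDMA closed connection" lines as they appear above.
# Assumption: the field order (ip, node, device, port, ..., op, error) is stable.
VERBS_ERR = re.compile(
    r"VERBS RDMA closed connection to (?P<ip>\S+) \((?P<node>[^)]+)\) "
    r"on (?P<dev>\S+) port (?P<port>\d+) .*?"
    r"RDMA (?P<op>read|write) error (?P<err>IBV_WC_\w+)"
)


def tally_verbs_errors(lines):
    """Count closed-connection errors per (node, operation, error code)."""
    counts = Counter()
    for line in lines:
        m = VERBS_ERR.search(line)
        if m:
            counts[(m["node"], m["op"], m["err"])] += 1
    return counts
```

Feeding it the lines of the GPFS log (for example from an open file object) and printing counts.most_common() would list the peers with the most closed connections first.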

and of course long waiters: 

=== mmdiag: waiters === 
Waiting 34.8493 sec since 13:11:35, ignored, thread 2935 VerbsReconnectThread: delaying for 25.150686000 more seconds, reason: delaying for next reconnect attempt 
Waiting 34.6249 sec since 13:11:35, ignored, thread 10198 VerbsReconnectThread: delaying for 25.375072000 more seconds, reason: delaying for next reconnect attempt 
Waiting 27.0957 sec since 13:11:43, ignored, thread 10052 VerbsReconnectThread: delaying for 32.904264000 more seconds, reason: delaying for next reconnect attempt 
Waiting 14.8909 sec since 13:11:55, monitored, thread 23135 NSDThread: for RDMA write completion fast on node 10.10.151.65 <c0n258> 
Waiting 14.8891 sec since 13:11:55, monitored, thread 23109 NSDThread: for RDMA write completion fast on node 10.10.152.32 <c0n247> 
Waiting 14.8865 sec since 13:11:55, monitored, thread 23302 NSDThread: for RDMA write completion fast on node 10.10.150.1 <c0n29> 
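
The waiters above could be summarised per thread type in the same way, to separate the reconnect delays from the NSD threads stuck on RDMA completions. Again a rough sketch, assuming the "Waiting N sec since ..., state, thread ID Name: reason" layout shown above; WAITER and summarize_waiters are hypothetical names.

```python
import re

# Matches waiter lines in the layout shown above.
# Assumption: the thread name is a single word followed by a colon.
WAITER = re.compile(
    r"Waiting (?P<secs>\d+(?:\.\d+)?) sec since (?P<since>\S+), "
    r"(?P<state>\w+), thread (?P<tid>\d+) (?P<name>\w+): (?P<reason>.*)"
)


def summarize_waiters(lines):
    """Return {thread name: (number of waiters, longest wait in seconds)}."""
    summary = {}
    for line in lines:
        m = WAITER.search(line)
        if not m:
            continue  # skip headers such as "=== mmdiag: waiters ==="
        secs = float(m["secs"])
        count, longest = summary.get(m["name"], (0, 0.0))
        summary[m["name"]] = (count + 1, max(longest, secs))
    return summary
```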

[common] 
verbsRdma enable 
verbsPorts mlx4_0/1/0 
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08] 
verbsPorts mlx5_0/1/0 
[gpfs01] 
verbsPorts mlx5_1/1/0 
[gpfs03] 
verbsPorts mlx5_0/1/0 mlx5_1/1/0 


[common] 
verbsRdma enable 
verbsPorts mlx4_0/1/0 
[gpfs02,gpfs04,gpfs05,gpfs06,gpfs07,gpfs08,wngpu001,wngpu002,wngpu003,wngpu004,wngpu005] 
verbsPorts mlx5_0/1/0 
[gpfs01] 
verbsPorts mlx5_1/1/0 
[gpfs03] 
verbsPorts mlx5_0/1/0 mlx5_1/1/0 
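
To double-check which verbsPorts value each node ends up with, the stanzas above could be resolved with a small script. This is a sketch under two assumptions that are mine and not confirmed here: that a node-specific stanza overrides [common], and that the bracketed headers list plain node names rather than node classes. parse_stanzas and effective_setting are hypothetical helpers.

```python
def parse_stanzas(text):
    """Parse '[nodes]' headers followed by 'key value' lines.

    Returns a list of (node_list, settings_dict) in file order.
    """
    stanzas = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            nodes = [n.strip() for n in line[1:-1].split(",")]
            stanzas.append((nodes, {}))
        elif stanzas:
            key, _, value = line.partition(" ")
            stanzas[-1][1][key] = value.strip()
    return stanzas


def effective_setting(stanzas, node, key):
    """Return the value of key for node.

    Assumption: a stanza naming the node wins over [common].
    """
    value = None
    for nodes, settings in stanzas:
        if key not in settings:
            continue
        if node in nodes:
            value = settings[key]  # node-specific override
        elif "common" in nodes and value is None:
            value = settings[key]  # fall back to the common default
    return value
```

With the stanzas above, a node that is not listed by name would fall back to the [common] value mlx4_0/1/0, which is worth comparing against the device name that node actually reports in its log.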

Any advice is welcome. 
Regards, I 
