[gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR (Philipp Helo Rehs)

Wei Guo Wei1.Guo at UTSouthwestern.edu
Tue Mar 13 03:06:34 GMT 2018


Hi, Philipp,

FYI, we had exactly the same IBV_WC_RETRY_EXC_ERR error message in our GPFS client log, along with another client error in the syslog: kernel: ib0: ipoib_cm_handle_tx_wc_rss: failed cm send event (status=12, wrid=83 vend_err 81). The root cause was a bad IB cable on the link between a leaf switch and the core switch that the client's traffic was routed through. Once we replaced the cable, the problem was solved and no more errors appeared. We don't really have an IPoIB setup, so your problem may be different, but doesn't the error message suggest that when your GPFS daemon tries to use mlx5_1, the packets are discarded and no connection is made? Did you set up IB bonding?
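
If it helps: one way to spot a marginal link like that is to look at the per-port error counters on the fabric. A rough sketch, assuming the standard infiniband-diags tools (iblinkinfo, ibqueryerrors, perfquery) are available on a node attached to the fabric:

    # Show link state/width/speed for every link on the fabric;
    # a link trained down to 1x or a lower speed often points at a bad cable.
    iblinkinfo

    # Report ports whose error counters (SymbolErrorCounter, LinkDownedCounter,
    # PortRcvErrors, ...) exceed the default thresholds.
    ibqueryerrors

    # Query, then clear, the counters of one suspect port
    # (LID 42 / port 1 are placeholders here).
    perfquery 42 1
    perfquery -R 42 1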

Wei Guo
HPC Administrator
UTSW



Message: 1
Date: Mon, 12 Mar 2018 21:09:14 +0100
From: Philipp Helo Rehs <Philipp.Rehs at uni-duesseldorf.de>
To: gpfsug-discuss at spectrumscale.org
Subject: [gpfsug-discuss] RDMA read error IBV_WC_RETRY_EXC_ERR
Message-ID: <4c6f6655-5712-663e-d551-375a42d562d8 at uni-duesseldorf.de>
Content-Type: text/plain; charset=utf-8

Hello,
I am reading your mailing-list since some weeks and I am quiete impressed about the knowledge and shared information here.

We have a GPFS cluster with 4 NSD servers and 120 clients on InfiniBand.

Our NSD servers have two InfiniBand ports on separate cards, mlx5_0 and mlx5_1. We have RDMA-CM enabled and IPv6 enabled on all nodes, and we have added an IPoIB IP to all interfaces.

But when we enable the second interface, we get the following errors from all nodes:

2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.83 (hilbert83-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 45
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA closed connection to 10.100.0.129 (hilbert129-ib) on mlx5_1 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 31
2018-03-12_20:49:38.923+0100: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 10.100.0.134 (hilbert134-ib) on mlx5_1 port 1 fabnum 0 vendor_err 129
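
For completeness, the state of the two ports on an NSD server can be checked with the standard verbs utilities, roughly like this (ibstat and ibv_devinfo, nothing GPFS-specific):

    # Both ports should show State: Active and Physical state: LinkUp
    ibstat mlx5_0 1
    ibstat mlx5_1 1

    # ibv_devinfo also prints the port GUID and link layer of the second card
    ibv_devinfo -d mlx5_1 -i 1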

I have read that this issue can happen when verbsRdmasPerConnection is too low. We tried increasing the value and things got better, but the problem is not fixed.
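
For reference, increasing it is a plain mmchconfig call along these lines (the value here is just an example, not a recommendation):

    # Raise the per-connection RDMA limit (example value only)
    mmchconfig verbsRdmasPerConnection=32

    # Check what the cluster configuration now says
    mmlsconfig verbsRdmasPerConnection

    # As far as I know, verbs settings are only read when mmfsd starts,
    # so the daemon has to be restarted on the affected nodes (disruptive)
    mmshutdown -N nsdNodes
    mmstartup -N nsdNodes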


Current config:
minReleaseLevel 4.2.3.0
maxblocksize 16m
cipherList AUTHONLY
cesSharedRoot /ces
ccrEnabled yes
failureDetectionTime 40
leaseRecoveryWait 40
[hilbert1-ib,hilbert2-ib]
worker1Threads 256
maxReceiverThreads 256
[common]
tiebreakerDisks vd3;vd5;vd7
minQuorumNodes 2
verbsLibName libibverbs.so.1
verbsRdma enable
verbsRdmasPerNode 256
verbsRdmaSend no
scatterBufferSize 262144
pagepool 16g
verbsPorts mlx4_0/1
[nsdNodes]
verbsPorts mlx5_0/1 mlx5_1/1
[hilbert200-ib,hilbert201-ib,hilbert202-ib,hilbert203-ib,hilbert204-ib,hilbert205-ib,hilbert206-ib]
verbsPorts mlx4_0/1 mlx4_1/1
[common]
maxMBpS 11200
[common]
verbsRdmaCm enable
verbsRdmasPerConnection 14
adminMode central
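
One thing I am not sure about: all the errors above mention fabnum 0, but mlx5_0 and mlx5_1 sit on different cards. If the two ports are actually cabled to separate fabrics, would we need to give each verbsPorts entry its own fabric number (the device/port/fabric form), so the daemon does not try RDMA between ports that cannot reach each other? Something like:

    [nsdNodes]
    verbsPorts mlx5_0/1/0 mlx5_1/1/1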


Kind regards
 Philipp Rehs

---------------------------

Zentrum für Informations- und Medientechnologie
Kompetenzzentrum für wissenschaftliches Rechnen und Speichern

Heinrich-Heine-Universität Düsseldorf
Universitätsstr. 1
Raum 25.41.00.51
40225 Düsseldorf / Germany
Tel: +49-211-81-15557







