[gpfsug-discuss] verbs rdma errors in logs

Damir Krstic damir.krstic at gmail.com
Sun Jun 26 16:22:46 BST 2016


We recently enabled verbs/rdma on our IB network (previously we used IPoIB
exclusively), and now we are seeing all sorts of errors and warnings in the logs:

Jun 25 23:41:30 gssio2 mmfs: [E] VERBS RDMA rdma read error IBV_WC_RETRY_EXC_ERR to 172.41.125.27 (qnode4111-ib0.quest) on mlx5_0 port 1 fabnum 0 vendor_err 129
Jun 25 23:41:30 gssio2 mmfs: [E] VERBS RDMA closed connection to 172.41.125.27 (qnode4111-ib0.quest) on mlx5_0 port 1 fabnum 0 due to RDMA read error IBV_WC_RETRY_EXC_ERR index 1589

Jun 25 20:40:05 gssio2 mmfs: [N] VERBS RDMA closed connection to 172.41.124.12 (qnode4054-ib0.quest) on mlx5_0 port 1 fabnum 0 index 1417

Jun 25 18:30:01 ems1 root: mmfs: [N] VERBS RDMA closed connection to 172.41.130.131 (qnode6131-ib0.quest.it.northwestern.edu) on mlx5_0 port 1 fabnum 0 index 195

Jun 25 18:28:23 gssio2 mmfs: [N] VERBS RDMA closed connection to 172.41.130.131 (qnode6131-ib0.quest.it.northwestern.edu) on mlx5_0 port 1 fabnum 0 index 1044
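One way to see whether these messages are spread evenly or concentrated on a few compute nodes is to tally them per peer. The Python sketch below is a hypothetical helper, not part of GPFS; it assumes each syslog message sits on a single line and is worded like the mmfs "VERBS RDMA" messages quoted above. Other mmfs message formats will simply not be counted.

```python
import re
from collections import Counter

# Hypothetical helper, not part of GPFS. Assumes one syslog message per line,
# worded like the mmfs "VERBS RDMA" messages quoted in this post.
CLOSED_RE = re.compile(r"VERBS RDMA closed connection to (\S+) \(([^)]+)\)")
ERROR_RE = re.compile(r"VERBS RDMA rdma \w+ error (IBV_WC_\w+) to (\S+) \(([^)]+)\)")


def summarize(lines):
    """Tally closed connections per peer host and work-completion errors per status."""
    closed = Counter()  # peer hostname -> number of "closed connection" messages
    errors = Counter()  # IBV_WC_* status -> number of "rdma ... error" messages
    for line in lines:
        m = CLOSED_RE.search(line)
        if m:
            closed[m.group(2)] += 1
            continue
        m = ERROR_RE.search(line)
        if m:
            errors[m.group(1)] += 1
    return closed, errors
```

To use it, pass an open log file (for example `/var/log/messages`, though the actual path depends on the local syslog configuration) and inspect `closed.most_common(10)` to see which peers show up most often.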

One thing to note (I'm not sure whether it matters) is that our ESS
storage cluster and our login nodes are in connected mode with a 64K MTU, while
all compute nodes are in datagram mode with a 2.4K MTU.
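To confirm the mode and MTU each node actually reports, a minimal Python sketch can read the standard sysfs attributes exposed by the Linux IPoIB driver. The interface name `ib0` is an assumption based on the hostnames above. These values describe the IPoIB interface only and may not reflect the MTU used by the verbs RDMA connections themselves.

```python
from pathlib import Path


def ipoib_settings(iface="ib0", sysfs_root="/sys/class/net"):
    """Return (mode, mtu) for an IPoIB interface as exposed in sysfs.

    The ipoib driver reports "datagram" or "connected" in <iface>/mode;
    <iface>/mtu is the standard Linux network-device MTU attribute.
    """
    base = Path(sysfs_root) / iface
    mode = (base / "mode").read_text().strip()
    mtu = int((base / "mtu").read_text().strip())
    return mode, mtu
```

Running this on one ESS node, one login node, and one compute node is a quick way to check that the connected/datagram split is as intended.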

Are these messages something to be concerned about? The cluster seems to be
performing well, and although there are some node ejections, they do not
seem more frequent than before we turned on verbs/rdma.

Thanks,
Damir