[gpfsug-discuss] VERBS RDMA issue

Tushar Pathare tpathare at sidra.org
Sun May 21 10:19:23 BST 2017


Hello Aaron,
Yes, we recently saw an issue with

VERBS RDMA rdma send error IBV_WC_RETRY_EXC_ERR to 111.11.11.11 (sidra.nnode_group2.gpfs) on mlx5_0 port 2 fabnum 0 vendor_err 129
and

VERBS RDMA rdma write error IBV_WC_REM_ACCESS_ERR to 112.11.11.11 ( sidra.snode_group2.gpfs) on mlx5_0 port 2 fabnum 0 vendor_err 136

Thanks

Tushar B Pathare MBA IT,BE IT
Bigdata & GPFS
Software Development & Databases
Scientific Computing
Bioinformatics Division
Research

"What ever the mind of man can conceive and believe, drill can query"

Sidra Medical and Research Centre
Sidra OPC Building
Sidra Medical & Research Center
PO Box 26999
Al Luqta Street
Education City North Campus
Qatar Foundation, Doha, Qatar
Office 4003 3333 ext 37443 | M +974 74793547
tpathare at sidra.org | www.sidra.org


From: <gpfsug-discuss-bounces at spectrumscale.org> on behalf of "Knister, Aaron S. (GSFC-606.2)[COMPUTER SCIENCE CORP]" <aaron.s.knister at nasa.gov>
Reply-To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Date: Sunday, May 21, 2017 at 11:59 AM
To: gpfsug main discussion list <gpfsug-discuss at spectrumscale.org>
Subject: Re: [gpfsug-discuss] VERBS RDMA issue

Hi Tushar,

For me the issue was an underlying performance bottleneck (some CPU frequency scaling problems causing cores to throttle back when it wasn't appropriate).

I noticed you have verbsRdmaSend set to yes. I've seen suggestions in the past to turn this off under certain conditions, although I don't remember what those were. Hopefully others can chime in and qualify that.



Are you seeing any RDMA errors in your logs? (e.g. grep IBV_ out of the mmfs.log).
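
Going one step past a plain grep, here is a minimal Python sketch that tallies such errors by operation, completion status and peer. The regular expression is only modeled on the two error lines quoted in this thread, so other message variants may not match; the function name tally_verbs_errors is made up for illustration.

```python
import re
from collections import Counter

# Pattern modeled on the two "VERBS RDMA ... error" lines quoted in this
# thread. This is an assumption about the log format, not a full grammar.
VERBS_ERR = re.compile(
    r"VERBS RDMA rdma (?P<op>\w+) error (?P<status>IBV_WC_\w+) "
    r"to (?P<peer>\S+) .*? on (?P<dev>\S+) port (?P<port>\d+)"
    r".*?vendor_err (?P<vendor_err>\d+)"
)


def tally_verbs_errors(lines):
    """Count matching error lines per (operation, status, peer address)."""
    counts = Counter()
    for line in lines:
        m = VERBS_ERR.search(line)
        if m:
            counts[(m.group("op"), m.group("status"), m.group("peer"))] += 1
    return counts


# The two lines from this thread, used as sample input.
sample = [
    "VERBS RDMA rdma send error IBV_WC_RETRY_EXC_ERR to 111.11.11.11 "
    "(sidra.nnode_group2.gpfs) on mlx5_0 port 2 fabnum 0 vendor_err 129",
    "VERBS RDMA rdma write error IBV_WC_REM_ACCESS_ERR to 112.11.11.11 "
    "( sidra.snode_group2.gpfs) on mlx5_0 port 2 fabnum 0 vendor_err 136",
]

for (op, status, peer), n in tally_verbs_errors(sample).most_common():
    print(n, op, status, peer)
```

To run it against a real log, pass an open file object instead of the sample list, e.g. tally_verbs_errors(open(path, errors="replace")), where path points at the node's mmfs.log.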



-Aaron




On May 21, 2017 at 04:41:00 EDT, Tushar Pathare <tpathare at sidra.org> wrote:

Hello Team,



We are seeing a lot of waiters related to "waiting for conn rdmas < conn maxrdmas" <https://www.mail-archive.com/search?l=gpfsug-discuss@spectrumscale.org&q=subject:%22Re%5C%3A+%5C%5Bgpfsug%5C-discuss%5C%5D+waiting+for+conn+rdmas+%3C+conn+maxrdmas%22&o=newest>



Are there any recommended settings to resolve this issue?

Our RDMA config is as follows, for 140 nodes (32 cores each):





VERBS RDMA Configuration:
  Status                              : started
  Start time                          : Thu
  Stats reset time                    : Thu
  Dump time                           : Sun
  mmfs verbsRdma                      : enable
  mmfs verbsRdmaCm                    : disable
  mmfs verbsPorts                     : mlx4_0/1 mlx4_0/2
  mmfs verbsRdmasPerNode              : 3200
  mmfs verbsRdmasPerNode (max)        : 3200
  mmfs verbsRdmasPerNodeOptimize      : yes
  mmfs verbsRdmasPerConnection        : 16
  mmfs verbsRdmasPerConnection (max)  : 16
  mmfs verbsRdmaMinBytes              : 16384
  mmfs verbsRdmaRoCEToS               : -1
  mmfs verbsRdmaQpRtrMinRnrTimer      : 18
  mmfs verbsRdmaQpRtrPathMtu          : 2048
  mmfs verbsRdmaQpRtrSl               : 0
  mmfs verbsRdmaQpRtrSlDynamic        : no
  mmfs verbsRdmaQpRtrSlDynamicTimeout : 10
  mmfs verbsRdmaQpRtsRnrRetry         : 6
  mmfs verbsRdmaQpRtsRetryCnt         : 6
  mmfs verbsRdmaQpRtsTimeout          : 18
  mmfs verbsRdmaMaxSendBytes          : 16777216
  mmfs verbsRdmaMaxSendSge            : 27
  mmfs verbsRdmaSend                  : yes
  mmfs verbsRdmaSerializeRecv         : no
  mmfs verbsRdmaSerializeSend         : no
  mmfs verbsRdmaUseMultiCqThreads     : yes
  mmfs verbsSendBufferMemoryMB        : 1024
  mmfs verbsLibName                   : libibverbs.so
  mmfs verbsRdmaCmLibName             : librdmacm.so
  mmfs verbsRdmaMaxReconnectInterval  : 60
  mmfs verbsRdmaMaxReconnectRetries   : -1
  mmfs verbsRdmaReconnectAction       : disable
  mmfs verbsRdmaReconnectThreads      : 32
  mmfs verbsHungRdmaTimeout           : 90
  ibv_fork_support                    : true
  Max connections                     : 196608
  Max RDMA size                       : 16777216
  Target number of vsend buffs        : 16384
  Initial vsend buffs per conn        : 59
  nQPs                                : 140
  nCQs                                : 282
  nCMIDs                              : 0
  nDtoThreads                         : 2
  nextIndex                           : 141
  Number of Devices opened            : 1
    Device                            : mlx4_0
      vendor_id                       : 713
      Device vendor_part_id           : 4099
      Device mem register chunk       : 8589934592 (0x200000000)
      Device max_sge                  : 32
      Adjusted max_sge                : 0
      Adjusted max_sge vsend          : 30
      Device max_qp_wr                : 16351
      Device max_qp_rd_atom           : 16
      Open Connect Ports              : 1
        verbsConnectPorts[0]          : mlx4_0/1/0
          lid                         : 129
          state                       : IBV_PORT_ACTIVE
          path_mtu                    : 2048
          interface ID                : 0xe41d2d030073b9d1
          sendChannel.ib_channel      : 0x7FA6CB816200
          sendChannel.dtoThreadP      : 0x7FA6CB821870
          sendChannel.dtoThreadId     : 12540
          sendChannel.nFreeCq         : 1
          recvChannel.ib_channel      : 0x7FA6CB81D590
          recvChannel.dtoThreadP      : 0x7FA6CB822BA0
          recvChannel.dtoThreadId     : 12541
          recvChannel.nFreeCq         : 1
          ibv_cq                      : 0x7FA2724C81F8
          ibv_cq.cqP                  : 0x0
          ibv_cq.nEvents              : 0
          ibv_cq.contextP             : 0x0
          ibv_cq.ib_channel           : 0x0
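
For comparing dumps like this across nodes, a small Python sketch that turns the "key : value" lines into a dictionary may help. The function name parse_verbs_dump is made up for illustration, the sample is a few lines copied from the dump above, and the per-connection times nQPs product is plain arithmetic on the reported numbers; it is an assumption that this is a useful figure, not a statement about how the RDMA limits are actually accounted for.

```python
def parse_verbs_dump(text):
    """Parse 'key : value' lines from a pasted VERBS RDMA config dump.

    Leading whitespace and a leading 'mmfs ' prefix are dropped from keys.
    If a key appears more than once, the first occurrence is kept.
    """
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition(" : ")
        if not sep:
            continue  # headings and blank lines carry no ' : ' separator
        key = key.strip()
        if key.startswith("mmfs "):
            key = key[len("mmfs "):]
        settings.setdefault(key, value.strip())
    return settings


# A few lines copied from the dump above, used as sample input.
sample = """\
VERBS RDMA Configuration:
  mmfs verbsRdmasPerNode              : 3200
  mmfs verbsRdmasPerConnection        : 16
  mmfs verbsRdmasPerConnection (max)  : 16
  nQPs                                : 140
"""

cfg = parse_verbs_dump(sample)
per_conn = int(cfg["verbsRdmasPerConnection"])
per_node = int(cfg["verbsRdmasPerNode"])
n_qps = int(cfg["nQPs"])

print("verbsRdmasPerConnection:", per_conn)
print("verbsRdmasPerNode:", per_node)
print("nQPs:", n_qps)
print("per-connection limit x nQPs:", per_conn * n_qps)
```

The two settings printed first are the ones most obviously related to the "conn rdmas < conn maxrdmas" waiter text, which is why the sketch singles them out.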



Thanks







Disclaimer: This email and its attachments may be confidential and are intended solely for the use of the individual to whom it is addressed. If you are not the intended recipient, any reading, printing, storage, disclosure, copying or any other action taken in respect of this e-mail is prohibited and may be unlawful. If you are not the intended recipient, please notify the sender immediately by using the reply function and then permanently delete what you have received. Any views or opinions expressed are solely those of the author and do not necessarily represent those of Sidra Medical and Research Center.