Posted to dev@brpc.apache.org by GitBox <gi...@apache.org> on 2020/12/01 01:11:47 UTC

[GitHub] [incubator-brpc] Pating opened a new issue #1296: brpc RDMA error

Pating opened a new issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296


   **Describe the bug**
   
   When the client sends RDMA requests to the server, the following errors sometimes appear.
   E1127 17:51:28.570752 210341 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:28.819981 210349 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:29.223821 210342 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:30.028343 210349 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:37.485410 210342 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:37.590259 210344 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:37.793829 210341 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:38.240385 210345 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   E1127 17:51:39.044575 210342 /root/byw/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:123] Fail to intialize RdmaCompletionQueue: Invalid argument
   
   Once these errors appear, the client can no longer send data to the server.
   
   I1127 09:51:20.775034 941982 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2377] Checking Socket{id=9 addr=192.168.3.5:8001} (0x18cf530)
   I1127 09:51:20.775188 941981 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2377] Checking Socket{id=6 addr=192.168.3.5:8002} (0x18cef30)
   I1127 09:51:20.775263 941959 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2377] Checking Socket{id=4 addr=192.168.3.5:8000} (0x18ceb30)
   I1127 09:51:20.775347 941977 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2437] Revived Socket{id=9 addr=192.168.3.5:8001} (0x18cf530) (Connectable)
   I1127 09:51:20.775396 941974 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2377] Checking Socket{id=8 addr=192.168.3.6:8002} (0x18cf330)
   I1127 09:51:20.775412 941955 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2377] Checking Socket{id=7 addr=192.168.3.6:8000} (0x18cf130)
   I1127 09:51:20.775495 941968 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2437] Revived Socket{id=6 addr=192.168.3.5:8002} (0x18cef30) (Connectable)
   I1127 09:51:20.775557 941974 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2437] Revived Socket{id=4 addr=192.168.3.5:8000} (0x18ceb30) (Connectable)
   I1127 09:51:20.775696 941973 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2377] Checking Socket{id=5 addr=192.168.3.6:8001} (0x18ced30)
   I1127 09:51:20.775832 941967 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2437] Revived Socket{id=7 addr=192.168.3.6:8000} (0x18cf130) (Connectable)
   I1127 09:51:20.775841 941975 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2437] Revived Socket{id=8 addr=192.168.3.6:8002} (0x18cf330) (Connectable)
   I1127 09:51:20.775944 941984 /root/byw/changlox/software/brpc/src/brpc/socket.cpp:2437] Revived Socket{id=5 addr=192.168.3.6:8001} (0x18ced30) (Connectable)
   I1127 09:51:20.995304 941950 /home/changlox/blockmaster/essd/src/blockfs/blockmaster/test/test-perf.cc:317] Sending EchoRequest at qps=0 latency=0
   I1127 09:51:21.995442 941950 /home/changlox/blockmaster/essd/src/blockfs/blockmaster/test/test-perf.cc:317] Sending EchoRequest at qps=0 latency=0
   I1127 09:51:22.995585 941950 /home/changlox/blockmaster/essd/src/blockfs/blockmaster/test/test-perf.cc:317] Sending EchoRequest at qps=0 latency=0
   
   
   **To Reproduce**
   Run an RDMA stress test under heavy load.
   
   **Expected behavior**
   Communication should be able to continue after the error occurs.
   
   **Versions**
   OS: rhel7.6 3.10.0-957.el7.x86_64
   Compiler: g++ (GCC) 4.8.5 20150623 (Red Hat 4.8.5-39)
   brpc: rdma branch, commit 9028c39a0c54
   protobuf: 
   
   [root@node78 ~]# rpm -qa |grep protobuf
   protobuf-java-2.5.0-8.el7.x86_64
   protobuf-emacs-2.5.0-8.el7.x86_64
   protobuf-c-1.0.2-3.el7.x86_64
   protobuf-2.5.0-8.el7.x86_64
   protobuf-static-2.5.0-8.el7.x86_64
   protobuf-c-compiler-1.0.2-3.el7.x86_64
   protobuf-lite-devel-2.5.0-8.el7.x86_64
   protobuf-compiler-2.5.0-8.el7.x86_64
   protobuf-lite-2.5.0-8.el7.x86_64
   protobuf-javadoc-2.5.0-8.el7.x86_64
   protobuf-vim-2.5.0-8.el7.x86_64
   protobuf-lite-static-2.5.0-8.el7.x86_64
   protobuf-emacs-el-2.5.0-8.el7.x86_64
   protobuf-c-devel-1.0.2-3.el7.x86_64
   protobuf-devel-2.5.0-8.el7.x86_64
   protobuf-python-2.5.0-8.el7.x86_64
   
   
   **Additional context/screenshots**
   
   ![image](https://user-images.githubusercontent.com/3406829/100406382-9fdea000-30a0-11eb-9de7-342b3d342165.png)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@brpc.apache.org
For additional commands, e-mail: dev-help@brpc.apache.org


[GitHub] [incubator-brpc] Pating edited a comment on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Pating edited a comment on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-758449638


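   > @Pating The patch is merged into the rdma branch. You can try again and give me feedback.
   
   Hi Tuvie
      I tried the latest rdma branch you provided (commit id 62240fea92). The RDMA problem still occurs on this branch. It happens less often than with the old version, but it is still not hard to reproduce: under sustained heavy read/write load on my side, it reproduces within an hour.
      Below are the error logs from the current brpc version: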
     
   ![image](https://user-images.githubusercontent.com/3406829/104279841-eddf2400-54e5-11eb-82b2-7f21e21a23d1.png)
   ![image](https://user-images.githubusercontent.com/3406829/104279916-0bac8900-54e6-11eb-9258-ed0c9cec323d.png)
   ![image](https://user-images.githubusercontent.com/3406829/104279980-25e66700-54e6-11eb-9642-a604fc321c9e.png)
   
   E0112 14:18:15.942765 934036 /home/changlox/essd/src/megrez/extentsvr/handler/write_block.cc:252] error replica cntl[E3001]Fail to handle RDMA completion, error status: 12
   E0112 14:18:15.942794 934038 /home/changlox/essd/src/megrez/extentsvr/handler/write_block.cc:252] error replica cntl[E3001]Fail to handle RDMA completion, error status: 12
   E0112 14:18:15.942765 934037 /home/changlox/essd/src/megrez/extentsvr/handler/write_block.cc:252] error replica cntl[E3001]Fail to handle RDMA completion, error status: 12
   E0112 14:18:15.942765 934031 /home/changlox/essd/src/megrez/extentsvr/handler/write_block.cc:252] error replica cntl[E3001]Fail to handle RDMA completion, error status: 12
   E0112 14:18:15.942822 934036 /home/changlox/essd/src/megrez/extentsvr/handler/write_block.cc:252] error replica cntl[E3001]Fail to handle RDMA completion, error status: 12
   
   
   N0112 22:17:17.946238 238499 /home/changlox/sxj/src/blockfs/blockserver/core/plane_impl.cc:303] Seal extent in segment:0 success, clear error state
   W0112 22:17:24.573034 238500 /home/changlox/software/brpc/src/brpc/rdma/rdma_completion_queue.cpp:417] Fail to handle RDMA completion, error status(12): RDMA verbs error
   E0112 22:17:24.573174 238511 /home/changlox/sxj/src/blockfs/blockserver/rpc/write_block.cc:78] Write block failed:[E3001]Fail to handle RDMA completion, error status: 12
   E0112 22:17:24.573203 238510 /home/changlox/sxj/src/blockfs/blockserver/rpc/write_block.cc:78] Write block failed:[E3001]Fail to handle RDMA completion, error status: 12
   E0112 22:17:24.573188 238501 /home/changlox/sxj/src/blockfs/blockserver/rpc/write_block.cc:78] Write block failed:[E3001]Fail to handle RDMA completion, error status: 12
   E0112 22:17:24.573192 238509 /home/changlox/sxj/src/blockfs/blockserver/rpc/write_block.cc:78] Write block failed:[E3001]Fail to handle RDMA completion, error status: 12
   E0112 22:17:24.573314 238509 /home/changlox/sxj/src/blockfs/blockserver/core/stream_rw.cc:244] Write block ret:-12000 err:network cntl err
   E0112 22:17:24.573182 238500 /home/changlox/sxj/src/blockfs/blockserver/rpc/write_block.cc:78] Write block failed:[E3001]Fail to handle RDMA completion, error status: 12
   E0112 22:17:24.573218 238511 /home/changlox/sxj/src/blockfs/blockserver/core/stream_rw.cc:244] Write block ret:-12000 err:network cntl err
   E0112 22:17:24.573367 238500 /home/changlox/sxj/src/blockfs/blockserver/core/stream_rw.cc:244] Write block ret:-12000 err:network cntl err
   E0112 22:17:24.573295 238501 /home/changlox/sxj/src/blockfs/blockserver/core/stream_rw.cc:244] Write block ret:-12000 err:network cntl err
   E0112 22:17:24.573352 238506 /home/changlox/sxj/src/blockfs/blockserver/handle/write.cc:111] Write ret:-12000 err:network cntl err retry write lba:5076942848 logid:0
   E0112 22:17:24.573249 238510 /home/changlox/sxj/src/blockfs/blockserver/core/stream_rw.cc:244] Write block ret:-12000 err:network cntl err
   E0112 22:17:24.573480 238506 /home/changlox/sxj/src/blockfs/blockserver/handle/write.cc:111] Write ret:-12000 err:network cntl err retry write lba:5076029440 logid:0
   E0112 22:17:24.573546 238506 /home/changlox/sxj/src/blockfs/blockserver/handle/write.cc:111] Write ret:-12000 err:network cntl err retry write lba:5077991424 logid:0
   E0112 22:17:24.573565 238506 /home/changlox/sxj/src/blockfs/blockserver/handle/write.cc:111] Write ret:-12000 err:network cntl err retry write lba:5075894272 logid:0
   W0112 22:17:24.573583 238506 /home/changlox/sxj/src/blockfs/blockserver/core/plane_impl.cc:267] segment:0 meet error, we'll do seal
   E0112 22:17:24.573599 238506 /home/changlox/sxj/src/blockfs/blockserver/handle/write.cc:111] Write ret:-12000 err:network cntl err retry write lba:5079040000 logid:0
   
   
     If you need more information, feel free to @ me. Thanks.
   
   
   
         
   




[GitHub] [incubator-brpc] Tuvie commented on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Tuvie commented on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-734767268


   This looks like a bug where the RDMA CQ is initialized with an incorrect queue index ID.
   I will commit a patch for that soon so you can retry.
   
   Thanks for the feedback.
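
The thread does not show the patch or say how the queue index is used, so the following is only a hedged illustration. One way CQ creation in libibverbs can be rejected is when the `comp_vector` argument of `ibv_create_cq` falls outside `[0, context->num_comp_vectors)`. The sketch assumes that situation. `PickCompVector` is a hypothetical helper, not brpc code, and it takes plain integers so it compiles without libibverbs.

```cpp
#include <cassert>

// Hypothetical helper (not part of brpc): map an arbitrary queue index onto a
// completion vector that ibv_create_cq would accept, i.e. a value in
// [0, num_comp_vectors). num_comp_vectors stands in for
// ibv_context::num_comp_vectors of the opened device.
int PickCompVector(int queue_index, int num_comp_vectors) {
    assert(num_comp_vectors > 0);
    int v = queue_index % num_comp_vectors;
    // In C++ the remainder keeps the sign of the dividend, so fix negatives.
    return v < 0 ? v + num_comp_vectors : v;
}
```

With a device exposing 8 completion vectors, a queue index of 9 would map to vector 1 instead of being passed through unchanged.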




[GitHub] [incubator-brpc] Pating closed issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Pating closed issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296


   




[GitHub] [incubator-brpc] Tuvie commented on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Tuvie commented on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-758574671


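   From the logs, the reported error status is 12, which in RDMA means IBV_WC_RETRY_EXC_ERR, i.e. the retry count was exceeded. RDMA differs from TCP: TCP can retransmit lost packets almost indefinitely, whereas RDMA caps the number of timeout retransmissions after packet loss. Once that cap is exceeded, the whole QP reports the error above and the connection is broken.
   It looks like the physical network underlying your test has no special configuration. I recommend running RoCE on a physical network configured with PFC and ECN. Otherwise, heavy packet loss can indeed break connections.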
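
To make the status code and the retry cap concrete, here is a minimal sketch, not brpc code. The name table mirrors the first 22 values of `enum ibv_wc_status` from libibverbs' `<infiniband/verbs.h>` (real code would call `ibv_wc_status_str` instead). The timing helpers use the standard encoding of the RC QP attributes `timeout` (local ACK timeout of 4.096 µs × 2^timeout, 0 meaning infinite) and `retry_cnt` (0 to 7). The function names are hypothetical, and the time estimate is only a rough lower bound, since adapters may wait longer than the nominal timeout.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <limits>
#include <string>

// Status names in the order of `enum ibv_wc_status` (codes 0..21), so the
// array index equals the numeric code printed in the brpc log.
static const std::array<const char*, 22> kWcStatusNames = {{
    "IBV_WC_SUCCESS",           "IBV_WC_LOC_LEN_ERR",
    "IBV_WC_LOC_QP_OP_ERR",     "IBV_WC_LOC_EEC_OP_ERR",
    "IBV_WC_LOC_PROT_ERR",      "IBV_WC_WR_FLUSH_ERR",
    "IBV_WC_MW_BIND_ERR",       "IBV_WC_BAD_RESP_ERR",
    "IBV_WC_LOC_ACCESS_ERR",    "IBV_WC_REM_INV_REQ_ERR",
    "IBV_WC_REM_ACCESS_ERR",    "IBV_WC_REM_OP_ERR",
    "IBV_WC_RETRY_EXC_ERR",     "IBV_WC_RNR_RETRY_EXC_ERR",
    "IBV_WC_LOC_RDD_VIOL_ERR",  "IBV_WC_REM_INV_RD_REQ_ERR",
    "IBV_WC_REM_ABORT_ERR",     "IBV_WC_INV_EECN_ERR",
    "IBV_WC_INV_EEC_STATE_ERR", "IBV_WC_FATAL_ERR",
    "IBV_WC_RESP_TIMEOUT_ERR",  "IBV_WC_GENERAL_ERR",
}};

// Hypothetical helper: translate a numeric completion status into its name.
std::string WcStatusName(int status) {
    if (status < 0 || status >= static_cast<int>(kWcStatusNames.size())) {
        return "UNKNOWN";
    }
    return kWcStatusNames[status];
}

// Local ACK timeout in microseconds for the 5-bit QP attribute `timeout`:
// 4.096 us * 2^timeout, where 0 means "wait forever".
double LocalAckTimeoutUs(unsigned timeout) {
    assert(timeout <= 31);
    if (timeout == 0) return std::numeric_limits<double>::infinity();
    return 4.096 * std::ldexp(1.0, static_cast<int>(timeout));
}

// Rough lower bound on the time before the QP gives up: the first attempt
// plus `retry_cnt` retransmissions, each waiting one local ACK timeout.
double MinTimeToRetryExceededUs(unsigned timeout, unsigned retry_cnt) {
    assert(retry_cnt <= 7);
    return (retry_cnt + 1) * LocalAckTimeoutUs(timeout);
}
```

As an illustration with assumed values (not taken from brpc), `timeout = 14` and `retry_cnt = 7` give a local ACK timeout of about 67 ms and a budget of roughly 0.54 s of continuous loss before status 12 is raised, which is why a lossy fabric without PFC/ECN can tear down connections under load.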




[GitHub] [incubator-brpc] Pating commented on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Pating commented on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-758449638


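   > @Pating The patch is merged into the rdma branch. You can try again and give me feedback.
   
   Hi Tuvie
      I tried the latest rdma branch you provided (commit id 62240fea92). The RDMA problem still occurs on this branch. It happens less often than with the old version, but it is still not hard to reproduce: under sustained heavy read/write load on my side, it reproduces within an hour.
      Below are the error logs from the current brpc version: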
     
   ![image](https://user-images.githubusercontent.com/3406829/104279841-eddf2400-54e5-11eb-82b2-7f21e21a23d1.png)
   ![image](https://user-images.githubusercontent.com/3406829/104279916-0bac8900-54e6-11eb-9258-ed0c9cec323d.png)
   ![image](https://user-images.githubusercontent.com/3406829/104279980-25e66700-54e6-11eb-9642-a604fc321c9e.png)
   
     If you need more information, feel free to @ me. Thanks.
   
   
   
         
   




[GitHub] [incubator-brpc] Pating commented on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Pating commented on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-736151075


   
   
    
   
   > This looks like a bug where the RDMA CQ is initialized with an incorrect queue index ID.
   > I will commit a patch for that soon so you can retry.
   > 
   > Thanks for the feedback.
   
   




[GitHub] [incubator-brpc] Pating removed a comment on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Pating removed a comment on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-736151075


   
   
    
   
   > This looks like a bug where the RDMA CQ is initialized with an incorrect queue index ID.
   > I will commit a patch for that soon so you can retry.
   > 
   > Thanks for the feedback.
   
   




[GitHub] [incubator-brpc] Tuvie commented on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Tuvie commented on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-737917762


   @Pating The patch is merged into the rdma branch. You can try again and give me feedback.




[GitHub] [incubator-brpc] Pating commented on issue #1296: brpc RDMA error

Posted by GitBox <gi...@apache.org>.
Pating commented on issue #1296:
URL: https://github.com/apache/incubator-brpc/issues/1296#issuecomment-736151597


   > This looks like a bug where the RDMA CQ is initialized with an incorrect queue index ID.
   > I will commit a patch for that soon so you can retry.
   > 
   > Thanks for the feedback.
   
   Thanks, looking forward to your patch.

