Posted to dev@brpc.apache.org by GitBox <gi...@apache.org> on 2020/07/16 04:32:14 UTC

[GitHub] [incubator-brpc] gaodayue opened a new issue #1168: After a downstream node fails and restarts, an upstream node sometimes never reconnects to it

gaodayue opened a new issue #1168:
URL: https://github.com/apache/incubator-brpc/issues/1168


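   **Describe the bug**
   After a downstream node X fails and restarts, the cluster sometimes ends up with one upstream node Y that can never connect to X again, while the other upstream nodes re-establish their connections to X after health checking. For example:
   
   1) After downstream node 10.26.44.32 restarted due to a failure at 09:02:17, one upstream node Y did not re-establish its connection to 10.26.44.32, and its log kept printing "Not connected to 10.26.44.32:8060 yet":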
   
   ```
   W0716 09:02:17.695824 142210 input_messenger.cpp:212] Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0xa352000): Connection reset by peer [104]
   W0716 09:02:17.702852 141955 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E104]Fail to read from Socket{id=896 fd=4228 addr=10.26.44.32:8060:55309} (0x0xa352000): Connection reset by peer [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
   .... (similar lines omitted) ....
   W0716 09:09:58.714361 38298 data_stream_sender.cpp:138] failed to send brpc batch, error=Host is down, error_text=[E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R1][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R2][E112]Not connected to 10.26.44.32:8060 yet, server_id=896 [R3][E112]Not connected to 10.26.44.32:8060 yet, server_id=896
   ```
   
   2) netstat shows no TCP connection between Y and 10.26.44.32.
   3) Y's /connections page shows the Socket in the Broken state, with the following details:
   
   ```
   $ curl http://localhost:8060/connections | grep 10.26.44.32:8060
   Broken                    |10.26.44.32:8060   |55309|-  |-           |-    |-        |-     |-         |-       |-         |-     |-         |-       |-          |896
   
   $ curl http://localhost:8060/sockets/896
   # This is a broken Socket
   version=1
   shared_part={
     ref_count=1
     socket_pool=null
     creator_socket=896
     in_size=316616369
     in_num_messages=12120483
     out_size=114960271066
     out_num_messages=12120511
   }
   nref=1
   nevent=1
   fd=4228
   tos=0
   reset_fd_to_now=485975008182us
   remote_side=10.26.44.32:8060
   local_side=10.22.180.15:55309
   on_et_events=0x1bc5dd0
   user=(brpc::InputMessenger*)0x5c0ab40
   this_id=896
   preferred_index=1 (baidu_std)
   hc_count=0
   avg_input_msg_size=26
   read_buf=0
   last_read_to_now=960432766us
   last_write_to_now=960412394us
   overcrowded=0
   id_wait_list={}
   parsing_context=0
   pipeline_q=0
   hc_interval_s=3
   ninprocess=1
   auth_flag_error=0
   auth_id=177098681547473
   auth_context=0
   logoff_flag=0
   recycle_flag=1
   agent_socket_id=(none)
   cid=0
   write_head=0
   ssl_state=SSL_OFF
   tcpi={
     state=7
     ca_state=0
     retransmits=0
     probes=0
     backoff=0
     options=7
     snd_wscale=7
     rcv_wscale=7
     rto=205000
     ato=40000
     snd_mss=1448
     rcv_mss=736
     unacked=0
     sacked=0
     lost=0
     retrans=0
     fackets=0
     last_data_sent=960413
     last_ack_sent=0
     last_data_recv=960433
     last_ack_recv=960413
     pmtu=1500
     rcv_ssthresh=52260
     rtt=2750
     rttvar=3000
     snd_ssthresh=18
     snd_cwnd=18
     advmss=1448
     reordering=3
   }
   ```
   
   4) Y's log contains no "Checking Socket" or "Revived Socket" lines (health checking is enabled in the cluster with health_check_interval = 3, and the other upstream nodes do log both Checking and Revived).
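
The check described in steps 3) and 4) could be automated roughly as follows. This is only a sketch: `parse_socket_dump` and `health_check_never_ran` are hypothetical helper names, and reading `hc_count=0` together with `hc_interval_s=3` on a broken Socket as "health checking never started" is an assumption drawn from this report, not brpc's own logic.

```python
SAMPLE_DUMP = """\
# This is a broken Socket
version=1
shared_part={
  ref_count=1
}
fd=4228
remote_side=10.26.44.32:8060
hc_count=0
hc_interval_s=3
recycle_flag=1
"""


def parse_socket_dump(text):
    """Parse the key=value lines of a /sockets/<id> dump into a flat dict."""
    fields = {"broken": False}
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            # The dump in this issue starts with "# This is a broken Socket".
            if "broken" in line.lower():
                fields["broken"] = True
            continue
        if "=" not in line or line.endswith("{"):
            continue  # skip "}" and block openers such as "tcpi={"
        key, _, value = line.partition("=")
        # Nested keys are flattened; the first occurrence of a key wins.
        fields.setdefault(key.strip(), value.strip())
    return fields


def health_check_never_ran(fields):
    """Heuristic (assumed): broken Socket, health checking enabled,
    but the health-check counter is still zero."""
    return bool(
        fields["broken"]
        and int(fields.get("hc_interval_s", "0")) > 0
        and int(fields.get("hc_count", "0")) == 0
    )


if __name__ == "__main__":
    print(health_check_never_ran(parse_socket_dump(SAMPLE_DUMP)))
```

In practice the text would come from the output of `curl http://localhost:8060/sockets/896` shown above rather than from an inline sample.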
   
   **To Reproduce**
   
   Occurs with low probability in production. For now the only way to recover is to restart the upstream node.
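
Because the problem is rare and currently needs a manual restart, a log scan along these lines might help spot an affected upstream node. It is a sketch under assumptions: `stuck_servers` is a hypothetical helper, the "Not connected to ... yet, server_id=..." pattern is taken from the log excerpt above, and the exact shape of the "Revived Socket{id=...}" line is assumed to mirror the "Fail to read from Socket{id=896 ...}" line, since this report does not show one.

```python
import re

# Pattern copied from the error text in this issue.
NOT_CONNECTED = re.compile(r"Not connected to (\S+) yet, server_id=(\d+)")
# Assumed format of the revival log line; not shown in the report.
REVIVED = re.compile(r"Revived Socket\{id=(\d+)")


def stuck_servers(log_lines):
    """Return {server_id: endpoint} for sockets that reported
    'Not connected' and were not revived afterwards."""
    stuck = {}
    for line in log_lines:
        revived = REVIVED.search(line)
        if revived:
            # A later revival clears any earlier failure for that socket.
            stuck.pop(revived.group(1), None)
            continue
        for endpoint, server_id in NOT_CONNECTED.findall(line):
            stuck[server_id] = endpoint
    return stuck


if __name__ == "__main__":
    sample = [
        "W0716 09:09:58.714361 38298 data_stream_sender.cpp:138] failed to "
        "send brpc batch, error=Host is down, error_text=[E112]Not connected "
        "to 10.26.44.32:8060 yet, server_id=896",
    ]
    print(stuck_servers(sample))
```

A socket that stays in the result for much longer than the health-check interval would be a candidate for the stuck state described here.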
   
   **Versions**
   OS: CentOS Linux release 7.1.1503 (Core)
   Compiler: gcc (GCC) 7.2.0
   brpc: 0.9.5
   
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@brpc.apache.org
For additional commands, e-mail: dev-help@brpc.apache.org


[GitHub] [incubator-brpc] gaodayue commented on issue #1168: After a downstream node fails and restarts, an upstream node sometimes never reconnects to it

Posted by GitBox <gi...@apache.org>.
gaodayue commented on issue #1168:
URL: https://github.com/apache/incubator-brpc/issues/1168#issuecomment-663941178


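   > "The log kept printing Not connected to 10.26.44.32:8060 yet": does this refer to Y or to the other (healthy) upstream nodes? The server that Y runs in has not deadlocked, has it?
   
   It refers to node Y. There should be no deadlock: apart from being unable to connect to X (the health check was never triggered), everything else works normally.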




[GitHub] [incubator-brpc] jamesge commented on issue #1168: After a downstream node fails and restarts, an upstream node sometimes never reconnects to it

Posted by GitBox <gi...@apache.org>.
jamesge commented on issue #1168:
URL: https://github.com/apache/incubator-brpc/issues/1168#issuecomment-661936369


   "The log kept printing Not connected to 10.26.44.32:8060 yet": does this refer to Y or to the other (healthy) upstream nodes? The server that Y runs in has not deadlocked, has it?




[GitHub] [incubator-brpc] zhengchengyao commented on issue #1168: After a downstream node fails and restarts, an upstream node sometimes never reconnects to it

Posted by GitBox <gi...@apache.org>.
zhengchengyao commented on issue #1168:
URL: https://github.com/apache/incubator-brpc/issues/1168#issuecomment-861315334


   @gaodayue Hello, may I ask how this problem was eventually resolved? We are running into the same issue now. Thank you very much!

