Posted to jira@kafka.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/05/02 21:42:00 UTC

[jira] [Commented] (KAFKA-6846) Controller can spend long time in shutting down RequestSendThread when processing BrokerChange event

    [ https://issues.apache.org/jira/browse/KAFKA-6846?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16461649#comment-16461649 ] 

ASF GitHub Bot commented on KAFKA-6846:
---------------------------------------

hzxa21 opened a new pull request #4960: KAFKA-6846: Throw InterruptedException after poll in awaitReady() and sendAndReceive() if the interrupt flag is set
URL: https://github.com/apache/kafka/pull/4960
 
 
   This PR fixes the issue where the controller can spend a long time (more than 60s) processing a BrokerChange event when there are dead brokers. It throws InterruptedException right after the poll call in awaitReady() and sendAndReceive() if the RequestSendThread sees that its interrupt flag is set. The RequestSendThread can then break out of the poll loop before the timeout, finish its shutdown, and unblock the controller event thread, which waits for the RequestSendThread to shut down when removing the broker.
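
The pattern described above can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `Poller`, `InterruptAwarePollLoop`, and `shutdownLatencyMs` are hypothetical names, the simulated poll is a plain `Thread.sleep` rather than Kafka's NetworkClient, and using `Thread.interrupted()` (which clears the flag before throwing) is one conventional choice rather than something the PR text specifies.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

final class InterruptAwarePollLoop {

    /** Hypothetical stand-in for a blocking network poll; NOT Kafka's NetworkClient API. */
    interface Poller {
        void poll(long timeoutMs);
    }

    /**
     * Polls until isReady reports true or timeoutMs elapses.
     * Throws InterruptedException right after a poll if the interrupt flag is set.
     */
    static boolean awaitReady(Poller poller, BooleanSupplier isReady, long timeoutMs)
            throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        while (!isReady.getAsBoolean()) {
            long remainingMs = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
            if (remainingMs <= 0) {
                return false; // timed out without becoming ready
            }
            poller.poll(remainingMs);
            // Thread.interrupted() reads and clears the flag, as is usual before throwing.
            if (Thread.interrupted()) {
                throw new InterruptedException("interrupted while polling");
            }
        }
        return true;
    }

    /** Starts a sender thread, interrupts it, and returns how long its shutdown took in ms. */
    static long shutdownLatencyMs() throws InterruptedException {
        CountDownLatch started = new CountDownLatch(1);
        CountDownLatch shutdownComplete = new CountDownLatch(1);
        // Simulated poll (assumption): blocks up to 500 ms, returns early on interrupt,
        // and leaves the interrupt flag set for the caller to inspect.
        Poller poller = timeoutMs -> {
            try {
                Thread.sleep(Math.min(timeoutMs, 500));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };
        Thread sender = new Thread(() -> {
            started.countDown();
            try {
                // The simulated peer never becomes ready, like a dead broker.
                awaitReady(poller, () -> false, 60_000);
            } catch (InterruptedException e) {
                // Interrupted: fall through and finish the shutdown.
            } finally {
                shutdownComplete.countDown();
            }
        }, "send-thread-sketch");
        sender.start();
        started.await();
        long begin = System.nanoTime();
        sender.interrupt();       // shutdown is initiated here
        shutdownComplete.await(); // the caller blocks here until the sender exits
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - begin);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("shutdown took " + shutdownLatencyMs() + " ms");
    }
}
```

In this sketch, removing the `Thread.interrupted()` check would leave the loop polling until the 60 s deadline, so the thread awaiting `shutdownComplete` would stay blocked for roughly that long; with the check, the sender exits shortly after the interrupt.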
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   



> Controller can spend long time in shutting down RequestSendThread when processing BrokerChange event
> ----------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-6846
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6846
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>            Reporter: Zhanxiang (Patrick) Huang
>            Priority: Major
>
> The controller can spend a long time (more than 60s) processing a BrokerChange event when there are dead brokers. For example, we saw entries like these in the controller log:
>  
> {code:java}
> 2018/04/28 18:13:50.021 [KafkaController] [Controller 7586]: Newly added brokers: , deleted brokers: 5222, bounced Brokers: , all live brokers: 3238,3322,5134,5177,5213,5214,5217,5218,5219,5220,5221,5319,5652,5949,7569,7574,7577,7581,7586,7589,7594,7595,7601,7609,14838,14840,14848,14855,14882,14886,14889,14901,16033
> 2018/04/28 18:13:50.021 [RequestSendThread] [Controller-7586-to-broker-5222-send-thread]: Shutting down
> .
> .
> .
> 2018/04/28 18:14:49.196 [RequestSendThread] [Controller-7586-to-broker-5222-send-thread]: Shutdown completed
> 2018/04/28 18:14:49.196 [RequestSendThread] [Controller-7586-to-broker-5222-send-thread]: Stopped
> 2018/04/28 18:14:49.200 [KafkaController] [Controller 7586]: Broker failure callback for 5222{code}
>  
> This shows that 59s elapsed between the time the RequestSendThread shutdown was initiated (18:13:50) and the time it completed (18:14:49).
> The root cause is that RequestSendThread calls NetworkClient.poll() in a while loop in NetworkClientUtils.awaitReady() and NetworkClientUtils.sendAndReceive() without checking the interrupt flag. As a result, the interrupt triggered by the controller thread breaks out of poll() only once, and the RequestSendThread then blocks in the next poll() until it is notified of the disconnect or times out; only then can it finish the shutdown. During this period, the controller event thread is blocked waiting on the shutdownComplete latch, which is bad because there is only a single controller event thread.
> This issue can be resolved by making the thread throw InterruptedException right after each poll call in awaitReady() and sendAndReceive() if it sees that the interrupt flag has been set. I will create a PR for that.
>  


