Posted to dev@kafka.apache.org by "Neha Narkhede (JIRA)" <ji...@apache.org> on 2013/08/05 21:46:47 UTC

[jira] [Commented] (KAFKA-999) Controlled shutdown never succeeds until the broker is killed

    [ https://issues.apache.org/jira/browse/KAFKA-999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13729864#comment-13729864 ] 

Neha Narkhede commented on KAFKA-999:
-------------------------------------

I think the fix is to remove the smarts from the broker's handling of a become-follower request. Even if the leader is not alive, the broker should rely on the controller to send it another LeaderAndIsrRequest, telling it either to follow some other leader or to become the leader itself.
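Here is a minimal Java sketch of the proposed handling. It assumes a simplified request that carries only the chosen leader and the alive-broker list. The names `TrustControllerSketch`, its nested `LeaderAndIsr` type, and `handle` are hypothetical and are not Kafka's actual classes or API. The sketch shows the broker applying whatever the latest request says. Recovery from a dead leader comes from the controller sending a newer request, not from the broker rejecting the current one.

```java
import java.util.Set;

/**
 * Hypothetical model of a broker's become-follower handling under the
 * proposed fix. All names are illustrative, not Kafka's real classes.
 */
public class TrustControllerSketch {

    /** Simplified request (assumption): only the chosen leader and the alive-broker list. */
    static final class LeaderAndIsr {
        final int leaderId;
        final Set<Integer> aliveBrokers;

        LeaderAndIsr(int leaderId, Set<Integer> aliveBrokers) {
            this.leaderId = leaderId;
            this.aliveBrokers = aliveBrokers;
        }
    }

    private final int brokerId;
    private String role = "NONE";

    TrustControllerSketch(int brokerId) {
        this.brokerId = brokerId;
    }

    /**
     * Applies the controller's decision as-is. There is deliberately no check of
     * aliveBrokers against leaderId. If the leader really is dead, the controller
     * is expected to send a newer request.
     */
    void handle(LeaderAndIsr request) {
        if (request.leaderId == brokerId) {
            role = "LEADER";
        } else {
            role = "FOLLOWER of " + request.leaderId;
        }
    }

    String role() {
        return role;
    }

    public static void main(String[] args) {
        TrustControllerSketch broker2 = new TrustControllerSketch(2);

        // Leader is broker 1, but broker 1 is missing from the alive-broker list.
        broker2.handle(new LeaderAndIsr(1, Set.of(2)));
        System.out.println(broker2.role()); // FOLLOWER of 1

        // A later request from the controller makes broker 2 the leader.
        broker2.handle(new LeaderAndIsr(2, Set.of(2)));
        System.out.println(broker2.role()); // LEADER
    }
}
```

With this design the broker keeps no opinion of its own about leader liveness, so the controller remains the single source of truth for leadership changes.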
                
> Controlled shutdown never succeeds until the broker is killed
> -------------------------------------------------------------
>
>                 Key: KAFKA-999
>                 URL: https://issues.apache.org/jira/browse/KAFKA-999
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>            Priority: Critical
>
> A race condition between the broker's handling of LeaderAndIsr requests and controlled shutdown can lead to a situation where controlled shutdown never succeeds, and the only way to bounce the broker is to kill it.
> The root cause is that the broker uses a liveness check to avoid fetching from a leader that is not alive according to the controller. This makes the broker abort a become-follower request. When the replication factor is 2, leadership can never be transferred to the follower, because the follower keeps rejecting the become-follower request and stays out of the ISR. As a result, controlled shutdown fails forever.
> One sequence of events that led to this bug is as follows -
> - Broker 2 is leader and controller
> - Broker 2 is bounced (uncontrolled shutdown)
> - Controller fails over
> - Controlled shutdown is invoked on broker 1
> - Controller starts leader election for partitions that broker 2 led
> - Controller sends a become-follower request to broker 2 with broker 1 as the leader. However, it does not include broker 1 in the alive broker list sent as part of the LeaderAndIsr request
> - Broker 2 rejects the LeaderAndIsr request because the leader is not in the list of alive brokers
> - Broker 1 fails to transfer leadership to broker 2 because broker 2 is not in the ISR
> - Controlled shutdown can never succeed on broker 1
> Since controlled shutdown is a config option, if it has bugs there is no option but to kill the broker.
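The sequence of events in the report can be modeled as a small Java simulation. This is a sketch under stated assumptions, not Kafka code. `ControlledShutdownRace`, `simulate`, and the retry budget `MAX_RETRIES` are hypothetical. The model also makes two further assumptions. First, the request broker 2 receives carries the same stale alive-broker list on every attempt. Second, a follower that accepts the request catches up and rejoins the ISR immediately. With the liveness check enabled, broker 2 never enters the ISR, so leadership can never move off broker 1.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical simulation of the race for one partition with replicas {1, 2}
 * (replication factor 2). Names and the retry budget are illustrative assumptions.
 */
public class ControlledShutdownRace {

    // Hypothetical number of controlled-shutdown attempts to simulate.
    static final int MAX_RETRIES = 3;

    /** Returns true if controlled shutdown of broker 1 manages to move leadership. */
    static boolean simulate(boolean followerChecksLeaderLiveness) {
        final int leader = 1; // broker 1 took over leadership after broker 2 bounced
        Set<Integer> isr = new HashSet<>(Set.of(1)); // broker 2 dropped out of the ISR

        // The race: the request sent to broker 2 names broker 1 as leader but
        // omits broker 1 from its alive-broker list (assumed the same on every attempt).
        Set<Integer> aliveBrokersInRequest = Set.of(2);

        for (int attempt = 1; attempt <= MAX_RETRIES; attempt++) {
            // Broker 2 handles the become-follower request.
            boolean accepted = !followerChecksLeaderLiveness
                    || aliveBrokersInRequest.contains(leader);
            if (accepted) {
                // Assumption: broker 2 fetches from broker 1 and catches up at once.
                isr.add(2);
            }
            // Controlled shutdown can only hand leadership to another ISR member.
            if (isr.contains(2)) {
                return true; // leadership moves to broker 2; broker 1 can shut down
            }
        }
        return false; // every attempt failed; in the report, the broker must be killed
    }

    public static void main(String[] args) {
        System.out.println("with liveness check:    " + simulate(true));  // false
        System.out.println("without liveness check: " + simulate(false)); // true
    }
}
```

Dropping the check (passing `false`) corresponds to the proposed fix: broker 2 follows the leader the controller names, rejoins the ISR, and controlled shutdown can complete.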

--
This message is automatically generated by JIRA.