Posted to dev@kafka.apache.org by "James Cheng (JIRA)" <ji...@apache.org> on 2016/11/22 23:31:58 UTC

[jira] [Commented] (KAFKA-1120) Controller could miss a broker state change

    [ https://issues.apache.org/jira/browse/KAFKA-1120?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15688295#comment-15688295 ] 

James Cheng commented on KAFKA-1120:
------------------------------------

I believe we ran into this today.

{noformat}
core@core04 $ grep brokers controller.log.2016-11-22-22
[2016-11-22 22:50:32,883] INFO [Controller 4]: Currently active brokers in the cluster: Set(1, 3, 4, 5) (kafka.controller.KafkaController)
[2016-11-22 22:50:32,883] INFO [Controller 4]: Currently shutting brokers in the cluster: Set() (kafka.controller.KafkaController)
[2016-11-22 22:51:44,601] INFO [BrokerChangeListener on Controller 4]: Broker change listener fired for path /brokers/ids with children 1,2,3,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:51:44,607] INFO [BrokerChangeListener on Controller 4]: Newly added brokers: 2, deleted brokers: , all live brokers: 1,2,3,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:55:18,831] DEBUG [Controller 4]: All shutting down brokers: 1 (kafka.controller.KafkaController)
[2016-11-22 22:55:18,831] DEBUG [Controller 4]: Live brokers: 5,2,3,4 (kafka.controller.KafkaController)
[2016-11-22 22:57:11,791] INFO [BrokerChangeListener on Controller 4]: Broker change listener fired for path /brokers/ids with children 2,3,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:11,980] INFO [BrokerChangeListener on Controller 4]: Newly added brokers: , deleted brokers: 1, all live brokers: 2,3,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:11,985] INFO [Controller 4]: Removed ArrayBuffer(1) from list of shutting down brokers. (kafka.controller.KafkaController)
[2016-11-22 22:57:43,133] INFO [BrokerChangeListener on Controller 4]: Broker change listener fired for path /brokers/ids with children 1,2,3,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:43,359] INFO [BrokerChangeListener on Controller 4]: Newly added brokers: 1, deleted brokers: , all live brokers: 1,2,3,4,5 (kafka.controller.ReplicaStateMachine$BrokerChangeListener)
[2016-11-22 22:57:50,218] DEBUG [Controller 4]: All shutting down brokers: 1 (kafka.controller.KafkaController)
[2016-11-22 22:57:50,218] DEBUG [Controller 4]: Live brokers: 5,2,3,4 (kafka.controller.KafkaController)
[2016-11-22 22:58:01,668] DEBUG [Controller 4]: All shutting down brokers: 1 (kafka.controller.KafkaController)
[2016-11-22 22:58:01,668] DEBUG [Controller 4]: Live brokers: 5,2,3,4 (kafka.controller.KafkaController)
core@core04 $
{noformat}

At 2016-11-22 22:57:11,791, broker 1 went away, and the controller noticed it.
At 2016-11-22 22:57:43,133, broker 1 came back, and the controller noticed it.
At 2016-11-22 22:57:50,218, the controller logged its live brokers as 5,2,3,4 and still listed broker 1 as shutting down. It doesn't seem to know about broker 1, even though broker 1 is running.
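
The comparison I did by eye above could be scripted. This is only a rough sketch in Python, not part of Kafka: the function name and the regexes are my own and assume the log line wording shown in the grep output above. It remembers the most recent /brokers/ids children list and flags every "Live brokers" line that omits one of those ids.

```python
import re

# Patterns assume the exact wording of the controller log lines quoted above.
CHILDREN_RE = re.compile(r"fired for path /brokers/ids with children ([\d,]+)")
LIVE_RE = re.compile(r"\bLive brokers: ([\d,]+)")


def find_missed_brokers(lines):
    """Return (line, missing_ids) for each 'Live brokers' line that omits a
    broker present in the most recent /brokers/ids children list.

    Hypothetical helper for reading controller logs, not a Kafka API.
    """
    zk_children = None
    findings = []
    for line in lines:
        match = CHILDREN_RE.search(line)
        if match:
            # Remember what ZooKeeper last reported as registered brokers.
            zk_children = {int(x) for x in match.group(1).split(",")}
            continue
        match = LIVE_RE.search(line)
        if match and zk_children is not None:
            live = {int(x) for x in match.group(1).split(",")}
            missing = zk_children - live
            if missing:
                findings.append((line, sorted(missing)))
    return findings
```

A broker that is in the middle of a controlled shutdown will also be flagged by this check, so each finding still needs to be read against the timestamps. In the log above, the interesting hits are the ones after broker 1 re-registered at 22:57:43,133.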

> Controller could miss a broker state change 
> --------------------------------------------
>
>                 Key: KAFKA-1120
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1120
>             Project: Kafka
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 0.8.1
>            Reporter: Jun Rao
>
> When the controller is in the middle of processing a task (e.g., preferred leader election, broker change), it holds a controller lock. During this time, a broker could have de-registered and re-registered itself in ZK. After the controller finishes processing the current task, it will start processing the logic in the broker change listener. However, it will see no broker change and therefore won't do anything to the restarted broker. This broker will be in a weird state since the controller doesn't inform it to become the leader of any partition. Yet, the cached metadata in other brokers could still list that broker as the leader for some partitions. Client requests routed to that broker will then get a TopicOrPartitionNotExistException. This broker will continue to be in this bad state until it's restarted again.
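
To illustrate the race in the description above, here is a minimal sketch in Python. It is a toy model, not the controller's actual code: both function names and the per-registration "epoch" value are assumptions made for illustration. The first function diffs only broker ids, which is the situation the description says results in "no broker change". The second also compares a hypothetical value that changes on every registration, so a de-register/re-register between two observations becomes visible.

```python
def diff_by_id(cached_ids, current_ids):
    """Diff on broker ids only: returns (added, removed)."""
    return current_ids - cached_ids, cached_ids - current_ids


def diff_by_registration(cached, current):
    """Diff that also compares a per-registration epoch (hypothetical field).

    `cached` and `current` map broker id -> registration epoch. A broker whose
    id is unchanged but whose epoch differs has bounced in between.
    Returns (added, removed, bounced).
    """
    added = current.keys() - cached.keys()
    removed = cached.keys() - current.keys()
    bounced = {
        broker
        for broker in current.keys() & cached.keys()
        if current[broker] != cached[broker]
    }
    return added, removed, bounced


# What the controller cached before it became busy with another task.
cached = {1: 100, 2: 101, 3: 102}
# Broker 2 de-registered and re-registered while the controller held its lock,
# so its id is still present but its registration is new.
current = {1: 100, 2: 250, 3: 102}

added_ids, removed_ids = diff_by_id(set(cached), set(current))
added, removed, bounced = diff_by_registration(cached, current)
```

With these inputs the id-only diff is empty in both directions, while the registration-aware diff reports broker 2 as bounced. Whether something like this is the right fix for the controller is a separate question; the sketch only shows why an id comparison alone cannot see the restart.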


