Posted to dev@kafka.apache.org by "Ding Haifeng (JIRA)" <ji...@apache.org> on 2014/08/20 09:46:27 UTC

[jira] [Updated] (KAFKA-1600) Controller failover not working correctly.

     [ https://issues.apache.org/jira/browse/KAFKA-1600?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ding Haifeng updated KAFKA-1600:
--------------------------------

    Attachment: kafka_failure_logs.tar.gz

Guozhang and Neha, thanks for the reply.

The attachment contains controller.log and server.log from 2 of the 10 brokers. Broker 6 (broker.id=6) is the misbehaving controller.

controller.log on the other brokers was empty during that period, which also shows that controller failover did not happen. server.log on the other brokers is much the same as on these two, so it is not attached.

Some critical moments that may help in understanding the logs:
14:30:50 - a new topic "user_action_log_from_history" was created.
16:04:51 - topic "user_action_log_from_history" was deleted.
16:04:56 - the last line written to controller.log on broker 6. The ActiveControllerCount metric also dropped to 0 at this point and stayed there.
16:28:48 - another broker (broker.id=1) was restarted manually but failed to start. Some topic partitions on broker 1 lost their leader and have been neither readable nor writable since then.
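
The symptom in the timeline above (ActiveControllerCount at 0 while broker 6 still held the /kafka/controller node) could be checked for with something like the following. This is only a minimal sketch: the function name is hypothetical, the caller is assumed to have already read the znode contents and the per-broker metric values by whatever means are available, and the znode payload is assumed to be a JSON object with a "brokerid" field.

```python
import json


def find_stalled_controller(znode_data, active_controller_counts):
    """Return the id of a broker that is registered as controller but
    reports ActiveControllerCount == 0, or None if no such broker is seen.

    znode_data: raw string read from the controller znode, or None if absent.
    active_controller_counts: dict of broker id -> ActiveControllerCount.
    """
    if znode_data is None:
        # Nobody is registered, so a normal election can proceed.
        return None
    # Assumption: the payload is JSON with a "brokerid" field.
    registered = int(json.loads(znode_data)["brokerid"])
    # A missing metric is treated as unknown rather than as zero.
    if active_controller_counts.get(registered) == 0:
        return registered
    return None


# Illustrative values only; the timestamp is a placeholder.
znode = '{"version": 1, "brokerid": 6, "timestamp": "0"}'
print(find_stalled_controller(znode, {1: 0, 6: 0}))  # prints 6
```

A non-None result means the registered controller is not acting as one, which matches the state described here: no other broker can be elected while the registration remains.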

What happened later:
We did not fully understand what was wrong at the time. To bring the production system back to work ASAP, we created another Kafka cluster and switched to it. In the post-mortem analysis we found the clues above and opened this issue. I hope it helps. Please contact me if you need any other information.


> Controller failover not working correctly.
> ------------------------------------------
>
>                 Key: KAFKA-1600
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1600
>             Project: Kafka
>          Issue Type: Bug
>          Components: controller
>    Affects Versions: 0.8.1
>         Environment: Linux 3.2.0-4-amd64 #1 SMP Debian 3.2.46-1 x86_64 GNU/Linux
> java version "1.7.0_03"
>            Reporter: Ding Haifeng
>            Assignee: Neha Narkhede
>         Attachments: kafka_failure_logs.tar.gz
>
>
> We are running a 10-node Kafka 0.8.1 cluster and experienced the following failure.
> At some point, broker A stopped acting as controller. We saw this in the JMX metric kafka.controller - KafkaController - ActiveControllerCount, which dropped from 1 to 0.
> Meanwhile, broker A was still running and still registered in the ZooKeeper /kafka/controller node, so no other broker could be elected as the new controller.
> From then on the cluster ran without a controller. Producers and consumers still worked, but functions that require a controller, such as leader election for new topics and leader failover, no longer worked.
> A forced restart of broker A can trigger a controller election and bring the cluster back to a correct state.
> These are our brief observations. I can provide more information if needed.


