Posted to dev@kafka.apache.org by "Jiangjie Qin (JIRA)" <ji...@apache.org> on 2015/09/03 01:51:46 UTC

[jira] [Commented] (KAFKA-2437) Controller does not handle zk node deletion correctly.

    [ https://issues.apache.org/jira/browse/KAFKA-2437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14728248#comment-14728248 ] 

Jiangjie Qin commented on KAFKA-2437:
-------------------------------------

Debugged with [~jjkoshy] and found the following root cause.

zkClient decides whether to fire handleDataChange() or handleDataDeleted() in the following way. When it receives an event from zookeeper, it tries to read the data from the watched path. If the path no longer exists, handleDataDeleted() is fired. Otherwise, handleDataChange() is fired.

When we delete the /controller path, the zkClient watcher receives the zk event, but before zkClient reads the data from the watched path, the path may already have been created again by another broker. In this case only handleDataChange() fires, i.e. the broker misses the node deletion event. If the broker that missed the node deletion event happens to be the old controller, it will not resign, and the cluster ends up with more than one controller.
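
To make the interleaving concrete, here is a toy Python model of the dispatch rule described above. It is only a sketch: FakeZk, dispatch() and the before_read hook are hypothetical names invented for illustration, not zkClient or Kafka code, and the callbacks are recorded as plain tuples instead of being invoked on a real listener.

```python
# Toy model of the dispatch rule described in this comment.
# NOT zkClient code: FakeZk, dispatch and before_read are made up for illustration.

class FakeZk:
    """In-memory stand-in for zookeeper: maps path -> data."""

    def __init__(self):
        self.nodes = {}

    def read(self, path):
        # Returns None when the path does not exist.
        return self.nodes.get(path)


def dispatch(zk, path, fired, before_read=None):
    """Handle one watch event: re-read the path, then choose the callback
    based on what exists at read time (not on what triggered the event)."""
    if before_read is not None:
        before_read()  # simulates another broker acting in the gap
    data = zk.read(path)
    if data is None:
        fired.append(("handleDataDeleted", path))
    else:
        fired.append(("handleDataChange", path, data))


zk = FakeZk()

# Case 1: no interleaving, so the deletion is observed.
zk.nodes["/controller"] = "broker-1"
events_ok = []
del zk.nodes["/controller"]
dispatch(zk, "/controller", events_ok)

# Case 2: another broker recreates /controller before the read,
# so only a data change is reported and the deletion is lost.
zk.nodes["/controller"] = "broker-1"
events_race = []
del zk.nodes["/controller"]
dispatch(
    zk,
    "/controller",
    events_race,
    before_read=lambda: zk.nodes.__setitem__("/controller", "broker-2"),
)

print(events_ok)
print(events_race)
```

In the second case the only recorded callback is a data change carrying the new broker's data, which matches the situation where the old controller never sees the deletion and therefore never resigns.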

> Controller does not handle zk node deletion correctly.
> ------------------------------------------------------
>
>                 Key: KAFKA-2437
>                 URL: https://issues.apache.org/jira/browse/KAFKA-2437
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Jiangjie Qin
>            Assignee: Jiangjie Qin
>
> We see this issue occasionally. The symptom is that when the /controller path gets deleted, the old controller does not resign, so we end up with more than one controller in the cluster (although requests from the controller with the old epoch will not be accepted). After checking the zookeeper watchers using wchp, it looks like the zookeeper session that created the /controller path does not have a watcher on /controller. That causes the old controller not to resign.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)