Posted to dev@kafka.apache.org by "Json Tu (JIRA)" <ji...@apache.org> on 2016/12/01 06:24:58 UTC

[jira] [Comment Edited] (KAFKA-4447) Controller resigned but it also acts as a controller for a long time

    [ https://issues.apache.org/jira/browse/KAFKA-4447?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15702384#comment-15702384 ] 

Json Tu edited comment on KAFKA-4447 at 12/1/16 6:24 AM:
---------------------------------------------------------

[~hachikuji] thanks for your reply.
    The latest Kafka release is 0.10.1.0, and the Kafka controller's handleNewSession() is implemented as below:
    def handleNewSession() {
      info("ZK expired; shut down all controller components and try to re-elect")
      inLock(controllerContext.controllerLock) {
        onControllerResignation()
        controllerElector.elect
      }
    }

    So deregisterIsrChangeNotificationListener() also runs while the controller lock is held, because the lock is taken outside onControllerResignation(). This is the bug that was reported at https://issues.apache.org/jira/browse/KAFKA-4360.

    My version is 0.9.0.1, which does not include that fix, and we were reassigning partitions at the time, so maybe we can picture it as follows.
    There are many callbacks (for example 40, such as IsrChangeNotificationListener) to process before the ZK session-expired callback. So before the expired callback starts to be processed, there is a lot of time for more listeners to fire (for example 100 callbacks, also including IsrChangeNotificationListener).

    As we know, the zkclient callback thread is a single thread, so any listener fired after the ZK expired callback will only be executed after handleNewSession().
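
    The ordering described above can be sketched roughly as below. This is only a toy model, not zkclient or Kafka code: the object SingleThreadCallbackOrder, the Event class, and the event names are hypothetical, and the counts 40 and 100 are just the example numbers used above. It assumes a single FIFO queue drained by one thread.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a single callback thread: one FIFO queue,
// drained by one consumer, so events run strictly in arrival order.
object SingleThreadCallbackOrder {
  final case class Event(name: String)

  // Drain the queue in order and record which callbacks ran, and when.
  def drain(queue: mutable.Queue[Event]): List[String] = {
    val executed = mutable.ListBuffer.empty[String]
    while (queue.nonEmpty) executed += queue.dequeue().name
    executed.toList
  }

  // `before` listener events are queued ahead of the session-expired event,
  // and `after` more are queued behind it while it is still waiting.
  def simulate(before: Int, after: Int): List[String] = {
    val queue = mutable.Queue.empty[Event]
    (1 to before).foreach(i => queue.enqueue(Event(s"isrChange-$i")))
    queue.enqueue(Event("handleNewSession"))
    (1 to after).foreach(i => queue.enqueue(Event(s"isrChange-late-$i")))
    drain(queue)
  }

  def main(args: Array[String]): Unit = {
    val order = simulate(before = 40, after = 100)
    val position = order.indexOf("handleNewSession")
    println(s"callbacks run before handleNewSession: $position")
    println(s"callbacks run after handleNewSession: ${order.length - position - 1}")
  }
}
```

    In this model the late listener events still run after handleNewSession() has finished, which would match a controller that keeps handling callbacks for some time after it has resigned.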

    Maybe this makes sense.


was (Author: json tu):
[~skarface] thanks for your reply.
    The latest Kafka release is 0.10.1.0, and the Kafka controller's handleNewSession() is implemented as below:
    def handleNewSession() {
      info("ZK expired; shut down all controller components and try to re-elect")
      inLock(controllerContext.controllerLock) {
        onControllerResignation()
        controllerElector.elect
      }
    }

    So deregisterIsrChangeNotificationListener() also runs while the controller lock is held, because the lock is taken outside onControllerResignation(). This is the bug that was reported at https://issues.apache.org/jira/browse/KAFKA-4360.

    My version is 0.9.0.1, which does not include that fix, and we were reassigning partitions at the time, so maybe we can picture it as follows.
    There are many callbacks (for example 40, such as IsrChangeNotificationListener) to process before the ZK session-expired callback. So before the expired callback starts to be processed, there is a lot of time for more listeners to fire (for example 100 callbacks, also including IsrChangeNotificationListener).

    As we know, the zkclient callback thread is a single thread, so any listener fired after the ZK expired callback will only be executed after handleNewSession().

    Maybe this makes sense.

> Controller resigned but it also acts as a controller for a long time 
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-4447
>                 URL: https://issues.apache.org/jira/browse/KAFKA-4447
>             Project: Kafka
>          Issue Type: Improvement
>          Components: controller
>    Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>         Environment: Linux Os
>            Reporter: Json Tu
>              Labels: reliability
>         Attachments: log.tar.gz
>
>
> We have a cluster with 10 nodes, and we performed the following operations.
> 1. We reassigned some topic partitions from one node to the other 9 nodes in the cluster, which triggered the controller.
> 2. The controller invoked PartitionsReassignedListener's handleDataChange, read all partition reassignment rules from the ZK path, and executed onPartitionReassignment for every partition that matched the conditions.
> 3. But the controller's ZK session expired, after which the sessions of some of the other 9 nodes also expired.
> 5. Then the controller invoked onControllerResignation to resign as the controller.
> We found that after the controller resigned, it kept acting as the controller for about 3 minutes, which can be found in my attachment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)