Posted to dev@kafka.apache.org by "Neha Narkhede (JIRA)" <ji...@apache.org> on 2014/10/07 16:05:34 UTC

[jira] [Reopened] (KAFKA-1663) Controller unable to shutdown after a soft failure

     [ https://issues.apache.org/jira/browse/KAFKA-1663?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Neha Narkhede reopened KAFKA-1663:
----------------------------------

[~sriharsha], while talking to Jun, I realized that the patch may have introduced a regression by removing the deleteTopicStateChanged.set(true) call from startup(). The purpose of that call is to let the delete topic thread resume deletion, on startup, of topics for which deletion was initiated on the previous controller. During the review, I assumed that the controller signals the delete topic thread separately after startup, but that is not the case.
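To make the regression concrete, here is a hypothetical minimal sketch (Java for illustration, not the actual Scala code in TopicDeletionManager; the class name and hasPendingWork() are invented, only deleteTopicStateChanged and startup() come from the discussion above):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, NOT Kafka's actual code. The flag tells the
// delete-topics thread that there may be deletion work left over from
// a previous controller.
class DeleteTopicManagerSketch {
    final AtomicBoolean deleteTopicStateChanged = new AtomicBoolean(false);

    void startup() {
        // Without this line, deletion initiated on the previous controller
        // is never resumed until some unrelated event flips the flag.
        deleteTopicStateChanged.set(true);
    }

    // Simplified: the real delete-topics thread blocks on a condition
    // rather than polling a flag like this.
    boolean hasPendingWork() {
        return deleteTopicStateChanged.getAndSet(false);
    }
}
```

With the set(true) removed from startup(), hasPendingWork() stays false after a controller failover, which is the regression described above.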

However, while reading through the code, I think there is a bug in the above case where the controller needs to resume topic deletion on startup. The way the controller notifies the TopicDeletionManager to resume the thread is via the callers of resumeTopicDeletionThread(). Each of those caller APIs is protected by the controllerLock in KafkaController. However, awaitTopicDeletionNotification is not. So there is a window in which the controller might signal a thread that is not yet waiting on the same monitor, and the notification is lost. I think the main problem is having 2 locks - deleteLock and controllerLock. We might have to revisit that decision and see whether we can consolidate on a single lock (controllerLock). Since this is a different bug, can you file it and link it back to this issue?
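As a sketch of the single-lock consolidation suggested above (hypothetical Java, not Kafka's code; SingleLockNotifier and workPending are invented names), the usual pattern is one lock plus a state flag re-checked in a loop, so a signal that fires before the waiter reaches await() is never lost:

```java
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical sketch, NOT Kafka's actual code. With two separate locks,
// resumeTopicDeletionThread() can signal in the window between the waiter's
// check and its await(), losing the wakeup. One lock plus a flag fixes that.
class SingleLockNotifier {
    private final ReentrantLock lock = new ReentrantLock(); // plays the role of controllerLock
    private final Condition cond = lock.newCondition();
    private boolean workPending = false;

    void resumeTopicDeletionThread() {
        lock.lock();
        try {
            workPending = true;   // record the notification even if no thread is waiting yet
            cond.signalAll();
        } finally {
            lock.unlock();
        }
    }

    void awaitTopicDeletionNotification() {
        lock.lock();
        try {
            while (!workPending)  // the flag survives a signal that fired before we got here
                cond.awaitUninterruptibly();
            workPending = false;
        } finally {
            lock.unlock();
        }
    }
}
```

Because the flag is written and read under the same lock, a signal delivered before the delete thread starts waiting is simply observed as workPending == true, and the await returns immediately instead of blocking forever.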

> Controller unable to shutdown after a soft failure
> --------------------------------------------------
>
>                 Key: KAFKA-1663
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1663
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Sriharsha Chintalapani
>            Assignee: Sriharsha Chintalapani
>            Priority: Blocker
>             Fix For: 0.8.2
>
>         Attachments: KAFKA-1663.patch
>
>
> As part of testing KAFKA-1558 I came across a case where inducing a soft failure in the current controller elects a new controller, but the old controller doesn't shut down properly.
> steps to reproduce:
> 1) 5 broker cluster
> 2) high number of topics (I tested it with 1000 topics)
> 3) on the current controller, do kill -SIGSTOP pid (the broker's process id)
> 4) wait a bit longer than the zookeeper timeout (server.properties)
> 5) kill -SIGCONT pid
> 6) There will be a new controller elected. Check the old controller's log:
> [2014-09-30 15:59:53,398] INFO [SessionExpirationListener on 1], ZK expired; shut down all controller components and try to re-elect (kafka.controller.KafkaController$SessionExpirationListener)
> [2014-09-30 15:59:53,400] INFO [delete-topics-thread-1], Shutting down (kafka.controller.TopicDeletionManager$DeleteTopicsThread)
> If it stops there and the broker log keeps printing
> Cached zkVersion [0] not equal to that in zookeeper, skip updating ISR (kafka.cluster.Partition)
> then the controller shutdown never completes.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)