Posted to users@kafka.apache.org by Json Tu <ka...@126.com> on 2016/11/01 03:20:03 UTC
KAFKA-4360 issue

> On Nov 1, 2016, at 10:54 AM, huxi (JIRA) <ji...@apache.org> wrote:
> 
> 
>    [ https://issues.apache.org/jira/browse/KAFKA-4360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15624155#comment-15624155 ] 
> 
> huxi commented on KAFKA-4360:
> -----------------------------
> 
> Excellent analysis! What intrigues me is whether this is a deadlock issue or a liveness issue. Here is my analysis:
> 1. Say that at time T1 the ZooKeeper session expires, so the 'handleNewSession' method of SessionExpirationListener is executed and obtains the controller lock (controllerContext.controllerLock).
> 2. It then invokes the 'onControllerResignation' method to make the current controller resign, which shuts down the leader-rebalance scheduler by calling KafkaScheduler.shutdown.
> 3. The 'shutdown' method shuts down the ScheduledThreadPoolExecutor and then blocks until all tasks have completed execution after the shutdown request.
> 4. If any task was submitted before shutdown was called, the check-imbalance thread starts by checking isActive, which acquires the controller lock at the very beginning, and is soon blocked because the lock is already held by the main thread.
> 5. In that case, the main thread blocks in onControllerResignation until one day has elapsed (by default) or until you interrupt the check thread.
> 
> Does it make sense?
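The interleaving in steps 1 to 5 can be reproduced outside Kafka with plain java.util.concurrent classes. This is a minimal sketch, not Kafka code: the class ResignationHang, the methods checkImbalance and resignHoldingLock, and the 200 ms wait are hypothetical stand-ins for controllerContext.controllerLock, the rebalance check, and KafkaScheduler.shutdown (which, per the analysis above, waits up to one day). The task is queued while the lock is already held, simply to force the interleaving from step 4.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class ResignationHang {
    // Hypothetical stand-in for controllerContext.controllerLock.
    private static final ReentrantLock controllerLock = new ReentrantLock();

    // Hypothetical stand-in for the rebalance check: like isActive, it takes the lock first.
    private static void checkImbalance() {
        controllerLock.lock();
        try {
            // imbalance check would run here
        } finally {
            controllerLock.unlock();
        }
    }

    /**
     * Mimics the session-expiry path: hold the controller lock while shutting the
     * scheduler down and waiting for it to drain. Returns true if the scheduler
     * terminated within waitMillis.
     */
    static boolean resignHoldingLock(long waitMillis) throws InterruptedException {
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        boolean terminated;
        controllerLock.lock(); // handleNewSession takes the lock first
        try {
            // Queued while the lock is held, so the task must block on the lock (step 4).
            scheduler.execute(ResignationHang::checkImbalance);
            scheduler.shutdown(); // one-shot queued tasks still run after shutdown by default
            // KafkaScheduler waits up to one day here; this sketch waits only waitMillis.
            terminated = scheduler.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);
        } finally {
            controllerLock.unlock(); // releasing the lock lets the blocked task finish
        }
        scheduler.awaitTermination(5, TimeUnit.SECONDS); // let the worker thread exit
        return terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("terminated while holding lock: " + resignHoldingLock(200));
    }
}
```

In this sketch awaitTermination eventually returns false rather than waiting forever, which fits the reading of the problem as a very long stall (a liveness issue) rather than a strict lock cycle.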
> 
> 
>> Controller may deadLock when autoLeaderRebalance encounter zk expired
>> ---------------------------------------------------------------------
>> 
>>                Key: KAFKA-4360
>>                URL: https://issues.apache.org/jira/browse/KAFKA-4360
>>            Project: Kafka
>>         Issue Type: Bug
>>         Components: controller
>>   Affects Versions: 0.9.0.0, 0.9.0.1, 0.10.0.0, 0.10.0.1
>>           Reporter: Json Tu
>>             Labels: bugfix
>>        Attachments: yf-mafka2-common02_jstack.txt
>> 
>>  Original Estimate: 168h
>> Remaining Estimate: 168h
>> 
>> When the controller has a checkAndTriggerPartitionRebalance task in autoRebalanceScheduler and the ZK session expires at that moment, the controller can run into a deadlock.
>> The scenario can be reproduced as follows. When the ZK session expires, the ZK thread calls handleNewSession, defined in SessionExpirationListener, which acquires controllerContext.controllerLock and then calls autoRebalanceScheduler.shutdown(). That call waits for every task in autoRebalanceScheduler to complete, but those tasks also need controllerContext.controllerLock, which is already held by the ZK callback thread, so the two threads deadlock.
>> This causes at least two problems. First, the broker's id cannot be registered in ZooKeeper, so the new controller considers the broker dead. Second, the process cannot be stopped with kafka-server-stop.sh, because the shutdown function also cannot acquire controllerContext.controllerLock; the only way to stop Kafka is kill -9.
>> In my attachment I uploaded a jstack file, captured when my Kafka process could not be shut down with kafka-server-stop.sh.
>> I have hit this scenario several times, and I think it may be a bug that has not yet been fixed in Kafka.
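One way to break the cycle, sketched under the same assumptions as a plain java.util.concurrent model, is to stop the scheduler and wait for it before taking the controller lock, and to have the task skip its work once the controller is no longer active. This is a hypothetical illustration, not necessarily the fix that was merged for KAFKA-4360; ResignationFix, resignThenLock, checkImbalance and the local active flag are invented names.

```java
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

final class ResignationFix {
    // Hypothetical stand-in for controllerContext.controllerLock.
    private static final ReentrantLock controllerLock = new ReentrantLock();

    // Hypothetical rebalance task: does nothing once the controller has started resigning.
    private static void checkImbalance(AtomicBoolean active) {
        if (!active.get()) return;
        controllerLock.lock();
        try {
            if (active.get()) {
                // imbalance check would run here
            }
        } finally {
            controllerLock.unlock();
        }
    }

    /** Returns true if the scheduler drained within waitMillis. */
    static boolean resignThenLock(long waitMillis) throws InterruptedException {
        AtomicBoolean active = new AtomicBoolean(true);
        ScheduledThreadPoolExecutor scheduler = new ScheduledThreadPoolExecutor(1);
        scheduler.execute(() -> checkImbalance(active));

        active.set(false);    // pending task runs become no-ops
        scheduler.shutdown(); // drain the scheduler WITHOUT holding the controller lock
        boolean terminated = scheduler.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);

        controllerLock.lock(); // only now take the lock for the rest of resignation
        try {
            // remaining resignation steps would run here
        } finally {
            controllerLock.unlock();
        }
        return terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("terminated: " + resignThenLock(5000));
    }
}
```

The trade-off is that the scheduler is drained while the lock is not held, so a running task can overlap with the start of resignation; the active check is what keeps it from doing real work in that window.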
> 
> 
> 
> --
> This message was sent by Atlassian JIRA
> (v6.3.4#6332)