Posted to jira@kafka.apache.org by "Guozhang Wang (JIRA)" <ji...@apache.org> on 2017/07/18 23:32:00 UTC

[jira] [Commented] (KAFKA-5152) Kafka Streams keeps restoring state after shutdown is initiated during startup

    [ https://issues.apache.org/jira/browse/KAFKA-5152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16092330#comment-16092330 ] 

Guozhang Wang commented on KAFKA-5152:
--------------------------------------

Here are a couple of alternative solutions to resolve this issue:

1. Revert the task-suspension optimization. This is the fallback of last resort for working around the issue.

2. Move the restoration process completely out of the {{onPartitionsAssigned}} callback and into the thread's main while loop. More specifically, the workflow of the thread will be:

a) only create the active / standby tasks, without executing the restoration of state on the active tasks; release the dir file locks in {{onPartitionsAssigned}} for those suspended-and-not-reassigned tasks. Mark all the created active tasks whose state has not been restored up to date as not-ready, and pause the source topic-partitions of these tasks. (For suspended-and-reassigned tasks, double check that they are not included, since their state should already be up to date.)

b) in the main loop:

b.1) check all the not-ready tasks and see whether their stores have completed restoration (restored offset = log-end offset); if so, mark these tasks as active and resume their corresponding source topic-partitions.

b.2) process / punctuate / commit, where possible, for active tasks that already have some records fetched.

b.3) restore the state for both standby tasks and active-not-ready tasks.
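
To make the proposed loop concrete, here is a minimal Java sketch of steps b.1 to b.3. It is only an illustration under stated assumptions: {{RestoreLoopSketch}}, {{Task}}, {{runOnce}} and the {{restoreBatch}} parameter are hypothetical names, not Kafka Streams classes or APIs, and standby tasks are left out for brevity. The point it tries to show is that restoration proceeds in bounded steps inside the loop, so a shutdown request can be noticed between iterations.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these types exist in Kafka Streams.
final class RestoreLoopSketch {

    /** An active task whose store is restored from a changelog up to logEndOffset. */
    static final class Task {
        final String id;
        final long logEndOffset;
        long restoredOffset;
        boolean ready;   // false = not-ready, source partitions assumed paused
        long processed;  // stand-in for process / punctuate / commit work

        Task(String id, long restoredOffset, long logEndOffset) {
            this.id = id;
            this.restoredOffset = restoredOffset;
            this.logEndOffset = logEndOffset;
        }
    }

    /** Runs one loop iteration; returns false if shutdown was requested. */
    static boolean runOnce(List<Task> tasks, boolean shutdownRequested, long restoreBatch) {
        if (shutdownRequested) {
            return false; // restoration is abandoned between iterations
        }
        // b.1) promote tasks whose restoration has caught up with the log end
        for (Task t : tasks) {
            if (!t.ready && t.restoredOffset >= t.logEndOffset) {
                t.ready = true; // a real implementation would resume partitions here
            }
        }
        // b.2) only ready tasks process records
        for (Task t : tasks) {
            if (t.ready) {
                t.processed++;
            }
        }
        // b.3) restore a bounded batch for each task that is still not ready
        for (Task t : tasks) {
            if (!t.ready) {
                t.restoredOffset = Math.min(t.logEndOffset, t.restoredOffset + restoreBatch);
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<Task> tasks = new ArrayList<>();
        tasks.add(new Task("0_0", 0, 10));  // needs restoration
        tasks.add(new Task("0_1", 10, 10)); // already up to date
        for (int i = 0; i < 3; i++) {
            runOnce(tasks, false, 5);
        }
        for (Task t : tasks) {
            System.out.println(t.id + " ready=" + t.ready + " processed=" + t.processed);
        }
    }
}
```

In this sketch an up-to-date task starts processing on the first iteration, while a task that still needs restoration only joins once its restored offset reaches the log-end offset.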

One thing to take care of, though: tasks within the same thread can now start processing at different times, but for IQ (interactive queries) we still need to mark the instance / thread as RUNNING only after all of its tasks are running.
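
A small sketch of that rule, again with hypothetical names ({{ThreadStateSketch}}, {{threadState}} and the {{RESTORING}} state are assumptions for illustration, not the actual Kafka Streams state machine): the thread reports RUNNING only once every one of its tasks is ready.

```java
import java.util.List;

// Hypothetical sketch; the state names are illustrative only.
final class ThreadStateSketch {

    enum State { RESTORING, RUNNING }

    /** The thread is RUNNING only if every task has finished restoration. */
    static State threadState(List<Boolean> taskReady) {
        for (boolean ready : taskReady) {
            if (!ready) {
                return State.RESTORING; // at least one task is still restoring
            }
        }
        return State.RUNNING;
    }
}
```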

> Kafka Streams keeps restoring state after shutdown is initiated during startup
> ------------------------------------------------------------------------------
>
>                 Key: KAFKA-5152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-5152
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 0.10.2.1
>            Reporter: Xavier Léauté
>            Assignee: Matthias J. Sax
>             Fix For: 0.10.2.2, 0.11.0.1
>
>
> If streams shutdown is initiated during state restore (e.g. an uncaught exception is thrown) streams will not shut down until all stores are first finished restoring.
> As restore progresses, stream threads appear to be taken out of service as part of the shutdown sequence, causing rebalancing of tasks. This compounds the problem by slowing down the restore process even further, since the remaining threads now have to also restore the reassigned tasks before they can shut down.
> A more severe issue is that if a new rebalance is triggered during the end of the waitingSync phase (e.g. due to a new member joining the group, or some members timing out on the SyncGroup response), then some consumer clients of the group may already have proceeded with {{onPartitionsAssigned}} and be blocked trying to grab the file dir lock not yet released by other clients. Meanwhile, the clients holding the lock keep re-sending {{JoinGroup}} requests, but the rebalance cannot complete because the clients blocked on the file dir lock are not kicked out of the group: their heartbeat threads keep sending HBRequest. Hence this is a deadlock caused by not releasing the file dir locks in task suspension.


