Posted to jira@kafka.apache.org by "A. Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2021/03/19 04:28:00 UTC

[jira] [Commented] (KAFKA-10563) Make sure task directories don't remain locked by dead threads

    [ https://issues.apache.org/jira/browse/KAFKA-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17304637#comment-17304637 ] 

A. Sophie Blee-Goldman commented on KAFKA-10563:
------------------------------------------------

I partially addressed this on the side in another PR: https://github.com/apache/kafka/pull/10342#issuecomment-802542776

We clean up any orphaned task directories when the cleaner thread runs. As mentioned above, this is not perfect, since it could mean the task directory remains blocked for up to 10 minutes (by default), which in the current architecture would also block progress on any other task assigned to that StreamThread. It's better than nothing, but we should still follow up and make sure it's not possible to leave locked task directories behind.
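
To illustrate the idea of the cleaner releasing orphaned locks, here is a minimal sketch. It is not the actual Streams code: the class TaskLockRegistry and all of its methods are hypothetical names made up for this example, and it assumes locks are tracked in memory as a map from task id to the owning thread.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only; this is NOT the real Kafka Streams state directory code.
final class TaskLockRegistry {
    // Assumption: task id -> thread that currently holds the task directory lock.
    private final Map<String, Thread> lockedTasksToOwner = new ConcurrentHashMap<>();

    // Try to lock the task for the calling thread; re-locking by the owner succeeds.
    public boolean lock(String taskId) {
        Thread current = Thread.currentThread();
        Thread owner = lockedTasksToOwner.putIfAbsent(taskId, current);
        return owner == null || owner == current;
    }

    // Release the lock, but only if the calling thread is the owner.
    public void unlock(String taskId) {
        lockedTasksToOwner.remove(taskId, Thread.currentThread());
    }

    public boolean isLocked(String taskId) {
        return lockedTasksToOwner.containsKey(taskId);
    }

    // Meant to be called from the periodic cleaner: drop every lock whose
    // owning thread is no longer alive, and return how many were dropped.
    public int purgeLocksOfDeadThreads() {
        int purged = 0;
        for (Map.Entry<String, Thread> entry : lockedTasksToOwner.entrySet()) {
            Thread owner = entry.getValue();
            // remove(key, value) only removes if the owner has not changed meanwhile.
            if (!owner.isAlive() && lockedTasksToOwner.remove(entry.getKey(), owner)) {
                purged++;
            }
        }
        return purged;
    }
}
```

With this shape, a lock left behind by a dead thread stays in place until the next purge pass, which is exactly the delay described above.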

> Make sure task directories don't remain locked by dead threads
> --------------------------------------------------------------
>
>                 Key: KAFKA-10563
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10563
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: A. Sophie Blee-Goldman
>            Priority: Major
>
> Most common/expected exceptions within Streams are handled gracefully, and the thread will make sure to clean up all resources such as task locks during shutdown. However, there are some instances where an unexpected exception such as an IllegalStateException can leave some resources orphaned.
> We have seen this happen to task directories after an IllegalStateException is hit during the TaskManager's rebalance handling logic – the thread shuts down, but loses track of some tasks before unlocking them. This blocks any further work on that task by any other thread in the same instance.
> Previously we decided that this was "ok" because an IllegalStateException means all bets are off. But with the upcoming work of KIP-663 and KIP-671, users will be able to react smartly to dying threads and replace them with new ones, making it more important than ever to ensure that the application can continue on with no lasting repercussions of a thread death. If we allow users to revive/replace a thread that dies due to IllegalStateException, that thread should not be blocked from doing any work by the ghost of its predecessor.
> It might be easiest to just add some logic to the cleanup thread to verify all the existing locks against the list of live threads, and remove any zombie locks. But we probably want to do this purging more frequently than the cleanup thread runs (10min by default) – so maybe we can leverage the work in KIP-671 and have each thread purge any locks still owned by it after the uncaught exception handler runs, but before the thread dies.
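
The second option in the description (each thread purging its own locks after the exception handler runs, but before the thread dies) could look roughly like the sketch below. Everything here is an assumption for illustration: SelfPurgingWorker, withLockPurge and the in-memory lock map are hypothetical, and the handler is a plain Consumer rather than the handler interface from KIP-671.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical sketch only; names and structure are not taken from Kafka Streams.
final class SelfPurgingWorker {
    // Assumption: task id -> thread that currently holds the task directory lock.
    private static final Map<String, Thread> LOCKS = new ConcurrentHashMap<>();

    static boolean lock(String taskId) {
        Thread current = Thread.currentThread();
        Thread owner = LOCKS.putIfAbsent(taskId, current);
        return owner == null || owner == current;
    }

    static boolean isLocked(String taskId) {
        return LOCKS.containsKey(taskId);
    }

    // Remove every lock still held by the given thread.
    static void releaseAllOwnedBy(Thread owner) {
        LOCKS.values().removeIf(t -> t == owner);
    }

    // Wrap the thread body so that, whatever happens, the thread first lets the
    // handler see the exception and then purges its own locks before exiting.
    static Runnable withLockPurge(Runnable body, Consumer<Throwable> handler) {
        return () -> {
            try {
                body.run();
            } catch (RuntimeException e) {
                handler.accept(e);
            } finally {
                releaseAllOwnedBy(Thread.currentThread());
            }
        };
    }
}
```

Because the purge sits in a finally block on the dying thread itself, no lock outlives its owner, so a replacement thread would not have to wait for the next cleaner pass.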



--
This message was sent by Atlassian Jira
(v8.3.4#803005)