You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@kafka.apache.org by "A. Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2021/03/23 03:42:00 UTC

[jira] [Assigned] (KAFKA-12523) Need to improve handling of TimeoutException when committing offsets

     [ https://issues.apache.org/jira/browse/KAFKA-12523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

A. Sophie Blee-Goldman reassigned KAFKA-12523:
----------------------------------------------

    Assignee: A. Sophie Blee-Goldman

> Need to improve handling of TimeoutException when committing offsets
> --------------------------------------------------------------------
>
>                 Key: KAFKA-12523
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12523
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.8.0
>            Reporter: A. Sophie Blee-Goldman
>            Assignee: A. Sophie Blee-Goldman
>            Priority: Blocker
>             Fix For: 2.8.0
>
>
> Right now, in TaskManager#commitOffsetsOrTransaction if we  catch a TimeoutException then under ALOS we just rethrow it while in EOS we rethrow it as TaskCorruptedException. The problem is that commitOffsetsOrTransaction can be invoked from several places:
> # Commit within StreamThread main processing loop (either user requested or commit interval has elapsed: this is presumably the case we had in mind when deciding how to handle the TimeoutException in commitOffsetsOrTransaction , no problem here
> # Clean shutdown of application: a bit weird to throw a TaskCorruptedException in this case, but it’ll just end up being caught and forcing a closeDirty, so again no problem here
> # From TaskManager#handleRevocation: in this case, it’s possible we hit a TimeoutException on a task that’s actually being revoked. This exception will be saved and rethrown from poll, so under EOS we would catch a TaskCorruptedException and then try to revive this task that we actually no longer own. Pretty sure this will cause an NPE in the TaskManager. Under ALOS, the rethrown TimeoutException will be bubbled up through poll again, but unlike TaskCorruptedException we actually don’t catch TimeoutException anywhere in the StreamThread loop. This will trigger the uncaught exception handler
> # From TaskManager#handleTaskCorrupted:  this method is itself invoked from within the catch TaskCorruptedException block of the StreamThread’s runLoop. If we throw TaskCorruptedException again then I believe we won’t even catch this in the safety net catch Throwable block of the runLoop -- it’ll just be thrown directly up through run(). 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)