Posted to jira@kafka.apache.org by "Sudesh Wasnik (Jira)" <ji...@apache.org> on 2023/04/14 14:00:00 UTC

[jira] [Assigned] (KAFKA-14091) Suddenly-killed tasks can leave hanging transactions open

     [ https://issues.apache.org/jira/browse/KAFKA-14091?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sudesh Wasnik reassigned KAFKA-14091:
-------------------------------------

    Assignee: Sudesh Wasnik  (was: Sagar Rao)

> Suddenly-killed tasks can leave hanging transactions open
> ---------------------------------------------------------
>
>                 Key: KAFKA-14091
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14091
>             Project: Kafka
>          Issue Type: Improvement
>          Components: KafkaConnect
>            Reporter: Chris Egerton
>            Assignee: Sudesh Wasnik
>            Priority: Major
>
> Right now, if a task running with exactly-once support is killed ungracefully, it may leave a hanging transaction open. If that transaction included writes to the global offsets topic, then any worker that starts up afterward is blocked until the transaction expires.
> Ideally, we could identify these kinds of hanging transactions and proactively abort them.
> Unfortunately, two facts make this fairly complicated:
>  1. Workers read to the end of the offsets topic during startup, before joining the cluster.
>  2. Workers do not know which tasks they are assigned until they join the cluster.
> As a result, we cannot rely on workers that are restarted shortly after an ungraceful shutdown to fence out their own hanging transactions: any hanging transaction would prevent them from joining the group and receiving their task assignments in the first place.
> We could possibly accomplish this by having the leader proactively abort any open transactions for tasks on workers that appear to have left the cluster during a rebalance. This would not require us to wait for the scheduled rebalance delay to elapse. The intent of that delay is to provide a buffer between when workers leave and when their connectors/tasks are reallocated across the cluster (and, if a worker rejoins before the buffer is consumed, to give it back the same connectors/tasks it was running previously). Aborting transactions for tasks on these workers would not interfere with that goal.
>  
> We may also have to handle the case where a cancelled task (see https://github.com/apache/kafka/blob/badfbacdd09a9ee8821847f4b28d98625f354ed7/connect/runtime/src/main/java/org/apache/kafka/connect/runtime/AbstractWorkerSourceTask.java#L274-L287) leaves a transaction open, though I have yet to confirm whether this can actually happen.
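
The leader-side idea in the description can be sketched roughly as follows. This is only an illustration of the selection step (working out which tasks' transactions would need to be aborted), not the approach the issue settles on. The class HangingTransactionPlanner, the TaskId type, and the idsToAbort method are hypothetical names invented for this sketch. The transactional ID format "groupId-connector-taskIndex" is an assumption based on the convention described in KIP-618 and should be checked against the Connect runtime before relying on it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Hypothetical helper: picks the transactional IDs a leader might abort after a rebalance. */
public class HangingTransactionPlanner {

    /** Hypothetical task identifier: connector name plus task index. */
    public static final class TaskId {
        final String connector;
        final int index;

        public TaskId(String connector, int index) {
            this.connector = connector;
            this.index = index;
        }
    }

    /** ASSUMPTION: IDs follow "groupId-connector-taskIndex"; verify against the runtime. */
    public static String transactionalId(String groupId, TaskId task) {
        return groupId + "-" + task.connector + "-" + task.index;
    }

    /**
     * Returns the transactional IDs of all tasks that were previously assigned to
     * workers no longer present in the group. Nothing here waits on a rebalance
     * delay; the result depends only on who is missing right now.
     */
    public static List<String> idsToAbort(String groupId,
                                          Map<String, List<TaskId>> previousAssignment,
                                          Set<String> currentMembers) {
        List<String> ids = new ArrayList<>();
        // Copy into a TreeMap so workers are visited in a deterministic (sorted) order.
        for (Map.Entry<String, List<TaskId>> entry : new TreeMap<>(previousAssignment).entrySet()) {
            if (currentMembers.contains(entry.getKey())) {
                continue; // worker is still in the group; leave its tasks alone
            }
            for (TaskId task : entry.getValue()) {
                ids.add(transactionalId(groupId, task));
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, List<TaskId>> previous = Map.of(
                "worker-a", List.of(new TaskId("jdbc-source", 0)),
                "worker-b", List.of(new TaskId("jdbc-source", 1), new TaskId("file-source", 0)));
        // worker-b has left the group, so only its two tasks are selected.
        System.out.println(idsToAbort("connect-cluster", previous, Set.of("worker-a")));
    }
}
```

The sketch stops at producing the list of IDs. Actually aborting the transactions would presumably go through something like the producer-fencing call on the Kafka Admin client that KIP-618 introduced, which is deliberately not shown here. Whether the leader has enough information at rebalance time to build the previous assignment is also left open, as it is in the description above.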


