Posted to jira@kafka.apache.org by "John Gray (Jira)" <ji...@apache.org> on 2022/09/01 13:13:00 UTC

[jira] [Comment Edited] (KAFKA-14172) bug: State stores lose state when tasks are reassigned under EOS wit…

    [ https://issues.apache.org/jira/browse/KAFKA-14172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17598956#comment-17598956 ] 

John Gray edited comment on KAFKA-14172 at 9/1/22 1:12 PM:
-----------------------------------------------------------

My stateful/EOS Kafka apps also seem to be struggling on 3.0.0+, with a similar theme: it appears the restore consumers are not consuming all of their messages for a full restore before processing begins. This sad situation seems to happen consistently after Strimzi rolls out an upgrade to our cluster. Once the brokers are all rolled, if our stateful apps rebalance, we lose data. We do not have the extra disk space for standby replicas, so acceptable.recovery.lag and the related standby-replica settings are not at play for us. But the restore consumers fumbling data with EOS seems to be a big problem for us as well.


was (Author: gray.john):
My stateful/EOS Kafka apps also seem to be struggling on 3.0.0+, with a similar theme: it appears the restore consumers are not consuming all of their messages for a full restore before processing begins. This sad situation seems to happen consistently after Strimzi rolls out an upgrade to our cluster. Once the brokers are all rolled, if our stateful apps rebalance, we lose data. We do not have the extra disk space for standby replicas, so the acceptable.recovery.lag and related bits to the standby replicas are not at play for us. But the restore consumers fumbling data w/ EOS seems to be a big problem for us. 

> bug: State stores lose state when tasks are reassigned under EOS wit…
> ---------------------------------------------------------------------
>
>                 Key: KAFKA-14172
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14172
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 3.1.1
>            Reporter: Martin Hørslev
>            Priority: Major
>
> h1. State stores lose state when tasks are reassigned under EOS with standby replicas and default acceptable lag.
> I have observed that state stores used in a transform step under exactly-once semantics (EOS) end up losing state after a rebalancing event that reassigns tasks to a previous standby task within the acceptable standby lag.
>  
> The problem is reproducible, and an integration test has been created to showcase the [issue|https://github.com/apache/kafka/pull/12540].
> A detailed description of the observed issue is provided [here|https://github.com/apache/kafka/pull/12540/files?short_path=3ca480e#diff-3ca480ef093a1faa18912e1ebc679be492b341147b96d7a85bda59911228ef45]
> Similar issues have been observed and reported on Stack Overflow, for example [here|https://stackoverflow.com/questions/69038181/kafka-streams-aggregation-data-loss-between-instance-restarts-and-rebalances].
>  
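
For readers unfamiliar with the settings named in the issue title, the following is a minimal sketch of the kind of Kafka Streams configuration the report describes: EOS enabled, at least one standby replica, and acceptable.recovery.lag left at its documented default of 10000. It is not taken from the reporter's application or from the linked integration test. The application id, bootstrap address, standby count of 1, and the choice of exactly_once_v2 (rather than another EOS setting) are all assumptions made for illustration. Plain string keys are used so the sketch compiles without the Kafka client libraries.

```java
import java.util.Properties;

public class EosStandbyConfigSketch {

    // Builds illustrative Kafka Streams properties. All concrete values are
    // assumptions; only the config key names come from Kafka Streams itself.
    static Properties streamsProps(int standbyReplicas) {
        Properties props = new Properties();
        // Hypothetical application id and broker address.
        props.setProperty("application.id", "eos-standby-sketch");
        props.setProperty("bootstrap.servers", "localhost:9092");
        // Exactly-once processing (assumed variant: exactly_once_v2).
        props.setProperty("processing.guarantee", "exactly_once_v2");
        // Standby replicas keep warm copies of state stores on other instances.
        props.setProperty("num.standby.replicas", Integer.toString(standbyReplicas));
        // Documented default, written out explicitly for visibility.
        props.setProperty("acceptable.recovery.lag", "10000");
        return props;
    }

    public static void main(String[] args) {
        // Setup matching the issue title: EOS plus standby replicas.
        Properties withStandby = streamsProps(1);
        // Setup matching the comment above: EOS without standby replicas.
        Properties withoutStandby = streamsProps(0);
        System.out.println("with standby:    " + withStandby);
        System.out.println("without standby: " + withoutStandby);
    }
}
```

In a real application these properties would be passed to a KafkaStreams instance. Calling streamsProps(0) corresponds to the no-standby setup described in the comment, where full restoration from the changelog is the only recovery path.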



--
This message was sent by Atlassian Jira
(v8.20.10#820010)