You are viewing a plain text version of this content. The canonical link for it is here.

Posted to jira@kafka.apache.org by "A. Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2021/09/08 01:14:00 UTC

[jira] [Commented] (KAFKA-13269) Kafka Streams Aggregation data loss between instance restarts and rebalances

    [ https://issues.apache.org/jira/browse/KAFKA-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17411606#comment-17411606 ] 

A. Sophie Blee-Goldman commented on KAFKA-13269:
------------------------------------------------

Hey [~rohitbobade], thanks for the report. Unfortunately I don't think the acceptable recovery lag is directly responsible, as that config is only used within the assignor to figure out the placement of tasks. Assigning a task as "Active" just means that the instance should try to process it, the task still has to go through restoration if it's anything less than 100% caught up with the end of the changelog. 

Wonder if this might be due to https://issues.apache.org/jira/browse/KAFKA-13249?

> Kafka Streams Aggregation data loss between instance restarts and rebalances
> ----------------------------------------------------------------------------
>
>                 Key: KAFKA-13269
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13269
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>    Affects Versions: 2.6.2
>            Reporter: Rohit Bobade
>            Priority: Major
>
> Using Kafka Streams 2.6.2 and doing count based aggregation of messages. Also setting Processing Guarantee - EXACTLY_ONCE_BETA and NUM_STANDBY_REPLICAS_CONFIG = 1. Sending some messages and restarting instances in middle while processing to test fault tolerance. The output count is incorrect because of data loss while restoring state.
> It looks like the streams task becomes active and starts processing even when the state is not fully restored but is within the acceptable recovery lag (default is 10000) This results in data loss
> {quote}A stateful active task is assigned to an instance only when its state is within the configured acceptable.recovery.lag, if one exists
> {quote}
> [https://docs.confluent.io/platform/current/streams/developer-guide/running-app.html?_ga=2.33073014.912824567.1630441414-1598368976.1615841473#state-restoration-during-workload-rebalance]
> [https://docs.confluent.io/platform/current/installation/configuration/streams-configs.html#streamsconfigs_acceptable.recovery.lag]
> Setting acceptable.recovery.lag to 0 and re-running the chaos tests gives the correct result.
> Related KIP: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams#KIP441:SmoothScalingOutforKafkaStreams-Computingthemost-caught-upinstances]
> Just want to get some thoughts on this use case from the Kafka team or if anyone has encountered similar issue



--
This message was sent by Atlassian Jira
(v8.3.4#803005)