Posted to dev@kafka.apache.org by "Rohit Bobade (Jira)" <ji...@apache.org> on 2021/09/03 00:16:00 UTC

[jira] [Created] (KAFKA-13269) Kafka Streams Aggregation data loss between instance restarts and rebalances

Rohit Bobade created KAFKA-13269:
------------------------------------

             Summary: Kafka Streams Aggregation data loss between instance restarts and rebalances
                 Key: KAFKA-13269
                 URL: https://issues.apache.org/jira/browse/KAFKA-13269
             Project: Kafka
          Issue Type: Bug
          Components: streams
    Affects Versions: 2.6.2
            Reporter: Rohit Bobade


Using Kafka Streams 2.6.2 and doing count-based aggregation of messages, with processing guarantee EXACTLY_ONCE_BETA and NUM_STANDBY_REPLICAS_CONFIG = 1. Sending some messages and restarting instances mid-processing to test fault tolerance. The output count is incorrect because of data loss while restoring state.
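
For reference, a minimal sketch of the topology and config described above (the app id, broker address, topic names, and serdes are placeholder assumptions, not taken from the actual app):
{code:java}
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.Produced;

public class CountAggregationSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "count-aggregation-app"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        // Settings from the report:
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_BETA);
        props.put(StreamsConfig.NUM_STANDBY_REPLICAS_CONFIG, 1);

        StreamsBuilder builder = new StreamsBuilder();
        builder.stream("input-topic", Consumed.with(Serdes.String(), Serdes.String())) // placeholder topic
               .groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .count() // count-based aggregation, backed by a changelogged state store
               .toStream()
               .to("output-topic", Produced.with(Serdes.String(), Serdes.Long())); // placeholder topic

        new KafkaStreams(builder.build(), props).start();
    }
}
{code}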

It looks like a streams task becomes active and starts processing even when its state is not fully restored, as long as the remaining restore lag is within the acceptable recovery lag (default is 10000 records). This results in data loss.
{quote}A stateful active task is assigned to an instance only when its state is within the configured acceptable.recovery.lag, if one exists
{quote}
[https://docs.confluent.io/platform/current/streams/developer-guide/running-app.html?_ga=2.33073014.912824567.1630441414-1598368976.1615841473#state-restoration-during-workload-rebalance]

[https://docs.confluent.io/platform/current/installation/configuration/streams-configs.html#streamsconfigs_acceptable.recovery.lag]

Setting acceptable.recovery.lag to 0 and re-running the chaos tests gives the correct result.
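
In StreamsConfig terms, the workaround looks like this (continuing the props from the sketch above):
{code:java}
// Workaround: require a task's state to be fully caught up before it becomes
// active, instead of tolerating up to 10000 records of restore lag (the default).
// Trade-off: rebalances wait on full restoration rather than using warm replicas.
props.put(StreamsConfig.ACCEPTABLE_RECOVERY_LAG_CONFIG, 0L);
{code}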

Related KIP: [https://cwiki.apache.org/confluence/display/KAFKA/KIP-441%3A+Smooth+Scaling+Out+for+Kafka+Streams#KIP441:SmoothScalingOutforKafkaStreams-Computingthemost-caught-upinstances]

Just want to get some thoughts on this use case from the Kafka team, or hear whether anyone has encountered a similar issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)