Posted to jira@kafka.apache.org by "Guozhang Wang (Jira)" <ji...@apache.org> on 2021/03/17 21:28:00 UTC

[jira] [Commented] (KAFKA-12478) Consumer group may lose data for newly expanded partitions when add partitions for topic if the group is set to consume from the latest

    [ https://issues.apache.org/jira/browse/KAFKA-12478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303739#comment-17303739 ] 

Guozhang Wang commented on KAFKA-12478:
---------------------------------------

Hello [~hudeqi], this is a valid concern. One workaround for now is: 1) set auto.offset.reset to earliest, but note that this only takes effect if there are no committed offsets; 2) when starting a consumer for the first time on a new topic, manually reset to latest via consumer.seekToEnd() -> consumer.commitSync() (you can even skip the commit if you are not depending on the subscription group protocol to distribute partitions for you).

> Consumer group may lose data for newly expanded partitions when add partitions for topic if the group is set to consume from the latest
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KAFKA-12478
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12478
>             Project: Kafka
>          Issue Type: Improvement
>          Components: clients
>    Affects Versions: 2.7.0
>            Reporter: hudeqi
>            Priority: Blocker
>              Labels: patch
>   Original Estimate: 1,158h
>  Remaining Estimate: 1,158h
>
>   This problem showed up in our production environment: a topic is used to produce monitoring data. *After the partitions were expanded, the consuming side of the business reported that data was lost.*
>   After preliminary investigation, the lost data is all concentrated in the newly added partitions. The reason is: when the topic is expanded on the server, the producer notices the expansion first and writes some data to the new partitions. The consumer group notices the expansion later; once the rebalance completes, it starts consuming the new partitions from the latest offset if it is set to consume from the latest. The data written to the new partitions in the meantime is skipped and lost by the consumer.
>   However, it is not always appropriate to set the group to consume from the earliest at startup for a high-volume topic: that makes the group read a large amount of historical data from the brokers, which affects broker performance to some extent. Therefore, *it is necessary to consume only these newly added partitions from the earliest.*
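
The sequence described in the issue can be illustrated with a small in-memory model. This is only a sketch and makes no Kafka client calls: the class NewPartitionResetDemo, the enum ResetPolicy, and the method consumeNewPartition are hypothetical names introduced for illustration. It assumes that a newly added partition has no committed offset, so the reset policy alone decides where the group starts reading.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** In-memory model of one newly added partition; not the Kafka client API. */
class NewPartitionResetDemo {

    /** Stand-in for the two reset behaviours discussed in the issue. */
    enum ResetPolicy { EARLIEST, LATEST }

    /**
     * Returns the records the group reads from a newly added partition.
     *
     * @param writtenBeforeDiscovery records the producer appended before the
     *                               group's rebalance picked up the partition
     * @param writtenAfterDiscovery  records appended after the rebalance
     */
    static List<String> consumeNewPartition(ResetPolicy policy,
                                            List<String> writtenBeforeDiscovery,
                                            List<String> writtenAfterDiscovery) {
        List<String> log = new ArrayList<>(writtenBeforeDiscovery);

        // Assumption: no committed offset exists for the new partition,
        // so the starting position comes from the reset policy alone.
        int position = (policy == ResetPolicy.EARLIEST) ? 0 : log.size();

        log.addAll(writtenAfterDiscovery);
        return new ArrayList<>(log.subList(position, log.size()));
    }

    public static void main(String[] args) {
        List<String> before = Arrays.asList("m0", "m1", "m2");
        List<String> after = Arrays.asList("m3", "m4");

        // With LATEST, m0..m2 are never delivered to the group.
        System.out.println("latest:   "
                + consumeNewPartition(ResetPolicy.LATEST, before, after));
        // With EARLIEST, every record in the new partition is delivered.
        System.out.println("earliest: "
                + consumeNewPartition(ResetPolicy.EARLIEST, before, after));
    }
}
```

In this model the LATEST run returns only m3 and m4, which corresponds to the reported loss, while the EARLIEST run returns all five records. A real group with existing committed offsets on the old partitions would be unaffected by the policy there, which is why the request is limited to the newly added partitions.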


