Posted to user@flink.apache.org by 悟空 <wu...@foxmail.com> on 2022/08/25 10:13:19 UTC
Re: About the change of Kafka Consumer Lag between Flink 1.14 and 1.9

Hi dawnwords,
As you said, you changed the consumer's checkpoint (offset commit) interval from 10 min to 1 min, and the producer still appears to stop until the consumer catches up with the Kafka lag.
So the second graph may be correct. You could try increasing your parallelism; the lag in the graph may then drop faster.




------------------ Original Message ------------------
From: "dawnwords" <dawnwords@163.com>;
Sent: Thursday, August 11, 2022, 4:36 PM
To: "user" <user@flink.apache.org>;

Subject: About the change of Kafka Consumer Lag between Flink 1.14 and 1.9



Hi,
Recently, our team finished migrating our real-time processing job, which consumes a Kafka topic as its input, from Flink 1.9 to 1.14. The good news is that the migration gave us about a 10% reduction in CPU usage, together with good and steady performance while consuming lag. However, we ran into an unexplained change in the behavior of the input Kafka topic's lag graph, shown below. The purple plot belongs to version 1.9 and the green one to version 1.14. The PromQL for the graph is simply:
sum(kafka_consumergroup_lag{topic="$input_topic"}) by (consumergroup, topic) 



The 1.9 plot is easy to explain: a checkpoint occurs every 10 min, and the Kafka consumer commits its offsets at that point, which causes the sharp decrease in lag. The 1.14 plot, however, looks like a mirror image of the 1.9 plot: it seems as if, once the consumer offset catches up with the producer offset, the producer suddenly produces a very large batch of records and then stops producing until the next 'catch up'. What mechanism actually causes this strange Kafka lag graph? Thanks in advance.
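
To make the 1.9 explanation concrete, here is a minimal Python sketch (a toy model, not Flink or Kafka code) of how a lag metric computed from committed offsets turns into a sawtooth when offsets are committed only at checkpoints. The function name, its parameters, and the rates used are hypothetical and chosen only for illustration. It also assumes that the commit happens instantly at each checkpoint tick and that the job itself keeps up with the producer. It does not by itself explain the mirrored shape of the 1.14 plot.

```python
def committed_lag_series(produce_rate, consume_rate, checkpoint_interval_s,
                         duration_s, step_s=10):
    """Return [(t, lag)] with lag = log-end offset - last committed offset.

    Toy model: offsets are committed only when t hits a checkpoint tick,
    so the reported lag grows between checkpoints even if the job keeps up.
    """
    produced = 0    # log-end offset of the input topic
    consumed = 0    # position the job has actually read up to
    committed = 0   # offset last committed to Kafka (what a lag exporter sees)
    series = []
    for t in range(step_s, duration_s + step_s, step_s):
        produced += produce_rate * step_s
        # The job cannot read past the log end.
        consumed = min(produced, consumed + consume_rate * step_s)
        # Assumption: the commit is instantaneous at the checkpoint tick.
        if t % checkpoint_interval_s == 0:
            committed = consumed
        series.append((t, produced - committed))
    return series


# Steady producer at 100 records/s, job able to read 200 records/s.
ten_min = committed_lag_series(100, 200, checkpoint_interval_s=600, duration_s=1800)
one_min = committed_lag_series(100, 200, checkpoint_interval_s=60, duration_s=1800)
```

In this model, with a 600 s interval the reported lag climbs to 59000 just before each commit and drops to 0 at the commit, even though the job never actually falls behind. With a 60 s interval the peak shrinks to 5000.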


What we changed during the migration:
1. We migrated from the deprecated APIs in the Flink Kafka Connector, i.e. FlinkKafkaConsumer and FlinkKafkaProducer, to KafkaSource and KafkaSink.
2. We migrated from FsStateBackend to HashMapStateBackend.
3. We shortened the checkpoint interval from 10 min to 1 min (with a 10 min interval, a checkpoint took several minutes to complete while consuming lag, which seemed unacceptable).
What we did not change during the migration:
1. The producer for the input topic. It simply generates data at a steady rate.
2. The consumer properties. All configurations are default except 'group.id', 'client.id' and 'bootstrap.servers'.
3. The AT_LEAST_ONCE checkpoint mode.