Posted to users@kafka.apache.org by Marcus Horsley-Rai <ma...@gmail.com> on 2021/06/02 14:25:52 UTC
Increase in consumer lag
Hi all,
Hoping someone can sanity check my logic!
A cluster I'm working on went into production with some topics poorly
configured; a replication factor of 1 being the main issue.
To avoid downtime as much as possible, I used the
kafka-reassign-partitions.sh tool to add extra replicas to topic partitions.
This worked like a charm for the majority of topics; except when I got to
our highest throughput one.
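For anyone following along, the reassignment plan I used looked roughly like the sketch below (topic name, broker ids, paths, and the throttle value are invented for illustration). Applying a replication throttle is worth noting for high-throughput topics, so catch-up traffic can't starve normal produce/consume traffic; newer Kafka versions take --bootstrap-server, older ones --zookeeper:

```shell
# A reassignment plan that grows one partition's replica set from one
# broker to three (illustrative names and ids):
cat > /tmp/increase-rf.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "my-high-throughput-topic", "partition": 0, "replicas": [1, 2, 3] }
  ]
}
EOF

# Execute with a replication throttle (bytes/sec), then run --verify once
# it completes -- --verify also removes the throttle:
#   kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
#     --reassignment-json-file /tmp/increase-rf.json \
#     --execute --throttle 50000000
```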
The async execution of the re-assignment got stuck in a never-ending loop,
and I caused a minor live issue in that some of our consumer groups' lag
shot through the roof, meaning data was no longer real-time.
I backed some of the changes out - and went back to the drawing board.
After more reading, I came to know of monitoring ISR shrinks/expands, and
that settings like num.replica.fetchers probably needed tuning, since
replication was not keeping up.
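For context, these are the broker-side settings (server.properties) I was looking at; the values below are illustrative, not recommendations:

```properties
# Parallel replica fetcher threads per source broker
num.replica.fetchers=3
# Max bytes fetched per partition per fetch request
replica.fetch.max.bytes=1048576
# A follower drops out of the ISR after lagging this long
replica.lag.time.max.ms=30000
```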
A line of documentation "A message is committed only after it has been
successfully copied to all the in-sync replicas" led me to conclude that
consumer lag had increased because of this delay in replication.
I planned to ratchet up the num.replica.fetchers until I saw ISR
shrinks/expands diminish. In return I expected some extra CPU/Network/Disk
I/O on the brokers, but for consumer lag to decrease. Then I would go back
to increasing the RF on any remaining topics.
The first part went OK - increasing fetcher threads from 1 to 3; I saw
shrinks/expands *decrease*, although not entirely to 0.
Contrary to what I expected though, the consumer lag *increased* for some
of our apps.
I couldn't see any resource bottleneck on the hosts the apps are on; can
anyone suggest if there could be any resource contention otherwise in Kafka
itself?
Many thanks in advance,
Marcus
Re: Increase in consumer lag
Posted by Nikita Kretov <kr...@gmail.com>.
Hello! Basically, I don't think we can simply conclude that consumer lag
depends on the number of replica fetcher threads.
Maybe the first thing to double-check is to use the kafka-consumer-groups
CLI instead of a lag exporter (in case you are using that type of monitoring
for consumer lag). Next, I'd check bytes in / bytes out per topic, in case
there is a big imbalance between produce and consume. Third, check the
number and rate of rebalances for the consumer groups with high lag.
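To illustrate the first suggestion, here is a quick way to total up lag straight from the CLI output. The group/topic names and numbers below are invented, but the column layout matches what recent versions of kafka-consumer-groups.sh --describe print:

```shell
# Fake --describe output for illustration; the real thing comes from:
#   kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
#     --describe --group my-group
sample='GROUP     TOPIC     PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
my-group  my-topic  0          1000            1500            500
my-group  my-topic  1          2000            2100            100'

# Sum the LAG column (6th field), skipping the header row.
echo "$sample" | awk 'NR > 1 { total += $6 } END { print total }'  # -> 600
```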
For me, these steps can help clarify the nature of the issue: whether it is
a monitoring issue (some of your exporters are lying to you), a client issue
(a disproportionately heavy produce load, or constant rebalancing of
consumer groups), or a cluster performance issue.