Posted to users@kafka.apache.org by Marcus Horsley-Rai <ma...@gmail.com> on 2021/06/02 14:25:52 UTC

Increase in consumer lag

Hi all,

Hoping someone can sanity check my logic!
A cluster I'm working on went into production with some topics poorly
configured; a replication factor of 1 was the main issue.

To avoid downtime as much as possible, I used the
kafka-reassign-partitions.sh tool to add extra replicas to topic partitions.
This worked like a charm for the majority of topics; except when I got to
our highest throughput one.
The async execution of the reassignment got stuck in a never-ending loop,
and I caused a slight live issue in that some of our consumer groups' lag
shot through the roof, meaning data was no longer real-time.
I backed some of the changes out - and went back to the drawing board.
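
For reference, the reassignment was along the lines below (topic name,
broker IDs and partition count are placeholders; the connection flag is
--zookeeper or --bootstrap-server depending on broker version):

  # increase-rf.json - desired replica set per partition
  {
    "version": 1,
    "partitions": [
      {"topic": "example-topic", "partition": 0, "replicas": [1, 2]},
      {"topic": "example-topic", "partition": 1, "replicas": [2, 3]}
    ]
  }

  # kick off the reassignment, then poll until --verify reports completion
  kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
      --reassignment-json-file increase-rf.json --execute
  kafka-reassign-partitions.sh --bootstrap-server broker1:9092 \
      --reassignment-json-file increase-rf.json --verify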

After more reading, I came to know of monitoring ISR shrinks/expands, and
that settings like num.replica.fetchers probably needed tuning since
replication was not keeping up.
A line of documentation "A message is committed only after it has been
successfully copied to all the in-sync replicas" led me to conclude that
consumer lag had increased because of this delay in replication.
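
In case it's useful, the shrink/expand counts I've been watching are the
broker JMX meters below (metric names as I understand them, so worth
double-checking against your version), plus a quick CLI check for replicas
that have fallen out of sync:

  # per-broker JMX meters for ISR churn
  kafka.server:type=ReplicaManager,name=IsrShrinksPerSec
  kafka.server:type=ReplicaManager,name=IsrExpandsPerSec

  # list partitions whose ISR is currently smaller than the replica set
  kafka-topics.sh --bootstrap-server broker1:9092 --describe \
      --under-replicated-partitions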

I planned to ratchet up the num.replica.fetchers until I saw ISR
shrinks/expands diminish.  In return I expected some extra CPU/Network/Disk
I/O on the brokers, but for consumer lag to decrease. Then I would go back
to increasing the RF on any remaining topics.
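
Concretely, the change is just the broker setting below (3 being the value
I tried, not a recommendation); I believe on recent brokers it can also be
applied dynamically via kafka-configs.sh rather than a rolling restart:

  # server.properties - threads each broker uses to replicate from leaders
  num.replica.fetchers=3

  # dynamic, cluster-wide alternative (assuming your version supports it)
  kafka-configs.sh --bootstrap-server broker1:9092 --entity-type brokers \
      --entity-default --alter --add-config num.replica.fetchers=3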

The first part went OK - increasing fetcher threads from 1 to 3; I saw
Shrinks/Expands *decrease*, although not entirely to 0.
Contrary to what I expected though, the consumer lag *increased* for some
of our apps.
I couldn't see any resource bottleneck on the hosts the apps run on; can
anyone suggest whether there could be resource contention within Kafka
itself?

Many thanks in advance,

Marcus

Re: Increase in consumer lag

Posted by Nikita Kretov <kr...@gmail.com>.
Hello! Basically, I don't think we can simply conclude that consumer lag
depends on the number of replica fetcher threads.
Maybe the first thing to double-check is the lag reported by the
kafka-consumer-groups CLI instead of a lag exporter (in case you are using
that type of monitoring for consumer lag) - example command below. Next,
I would check bytes in / bytes out per topic, in case there is a large
imbalance between produce and consume. Third, check the number and rate of
rebalances for the consumer groups with high lag.
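
For example, something like this (group name is a placeholder) gives
per-partition CURRENT-OFFSET, LOG-END-OFFSET and LAG to compare against
what the exporter reports:

  kafka-consumer-groups.sh --bootstrap-server broker1:9092 \
      --describe --group my-consumer-group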

For me, these steps help clarify the nature of the issue: whether it is a
monitoring issue (some of your exporters are lying to you), a client issue
(a disproportionately large produce load or constant rebalancing of
consumer groups), or a cluster performance issue.
