Posted to dev@kafka.apache.org by Kai Huang <ka...@gmail.com> on 2021/08/12 20:25:37 UTC

Re: Propose a KIP to report "REAL" broker/consumer fetch latency?

Hi Israel,

Thanks for your interest in this capability. I would like to follow up on this discussion.

First let me answer your last question:
"Are you looking to modify the Admin API for this capability to be added?"
- No, this will be a broker-side metric added to the ReplicaFetcher thread.

Then, allow me to explain the problem we encountered while operating Kafka at Twitter:
In our production environment, we saw many Produce latency issues. The majority of them were caused by a follower being slow to catch up with its leader. Because of the limitations of the existing fetch metrics in Kafka, we currently have no good way to determine where in the pipeline the latency was introduced. Nor do we have a good way to monitor latency and detect network device grey failures between each pair of brokers in a cluster.

I’m curious whether you have seen similar issues when operating Kafka. How do you typically debug broker latency issues and detect network device grey failures or network congestion?

I’ve tested the proposed fetch latency metric in Twitter’s production environment and found it useful. Here is a summary of my findings:
- The metric makes it possible to monitor the latency between each pair of brokers in a cluster.
- The metric makes it easy for us to identify the “culprit” in a cluster. For example, a broker that causes all its followers' fetching to slow down, or a broker that is slow at fetching from all other brokers.
- In some cases, this metric allows us to determine where the slowness is introduced into the pipeline. For example, when the fetch latency between a (follower, leader) pair is high, it usually means either 1) the leader is slow at processing fetch requests, or 2) the network connection between leader and follower is slow. The former case can be confirmed by looking at the broker fetch latency and network processor/request processor metrics.
- There are certain cases where this metric doesn’t provide any insight, for example, when the follower itself is slow at issuing fetch requests or at processing fetch responses.
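
To make the "culprit" finding above more concrete, here is a minimal sketch of how per-pair fetch latencies could be scanned for a slow leader or a slow follower. It is only an illustration: the function name find_culprits, the input shape (a dict keyed by (follower, leader) broker ids), and the threshold are assumptions made for this example, not part of the KIP or of any Kafka API.

```python
from collections import defaultdict


def find_culprits(pair_latency_ms, threshold_ms):
    """Return (slow_leaders, slow_followers) as sorted lists of broker ids.

    pair_latency_ms: hypothetical input, a dict mapping
        (follower_id, leader_id) -> observed fetch latency in ms.
    A broker is flagged as a slow leader only if EVERY follower fetching
    from it is above the threshold, and as a slow follower only if ALL of
    its fetches from other brokers are above the threshold.
    """
    by_leader = defaultdict(list)
    by_follower = defaultdict(list)
    for (follower, leader), latency in pair_latency_ms.items():
        by_leader[leader].append(latency)
        by_follower[follower].append(latency)

    # min(...) > threshold means all pairs involving that broker are slow.
    slow_leaders = sorted(b for b, v in by_leader.items() if min(v) > threshold_ms)
    slow_followers = sorted(b for b, v in by_follower.items() if min(v) > threshold_ms)
    return slow_leaders, slow_followers


# Made-up numbers: every follower of broker 3 is slow, everything else is fine.
sample = {
    (1, 2): 5, (1, 3): 80,
    (2, 1): 4, (2, 3): 95,
    (3, 1): 6, (3, 2): 7,
}
slow_leaders, slow_followers = find_culprits(sample, threshold_ms=50)
print("slow leaders:", slow_leaders)
print("slow followers:", slow_followers)
```

With the made-up numbers above, only broker 3 is flagged as a slow leader, and no broker is flagged as a slow follower. A single slow (follower, leader) pair on its own would not single out either side, which matches the caveat that the pair metric alone cannot always tell where the slowness is introduced.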

I’ve shared some graphs that illustrate how this metric can be used at https://cwiki.apache.org/confluence/display/KAFKA/KIP-736%3A+Report+the+true+end+to+end+fetch+latency. Could you please take another look and let me know your thoughts or concerns?

Best regards,

Kai


On 2021/04/25 01:09:46, Israel Ekpo <is...@gmail.com> wrote: 
> Hi Ming
> 
> This would be a useful metric from a monitoring perspective especially when
> troubleshooting or diagnosing issues.
> 
> Are you looking to modify the Admin API for this capability to be added?
> The metrics for quorum controllers, brokers, replicas and consumers may
> need to be reported differently
> 
> I am interested in this capability as well.
> 
> Maybe there is something in the current Admin API that is not obvious yet
> so I will need to investigate first and will get back to you with my
> thoughts/suggestions.
> 
> Thanks for bringing this up
> 
> Cheers
> 
> 
> 
> On Sat, Apr 24, 2021 at 1:21 PM Ming Liu <mi...@gmail.com> wrote:
> 
> > Hi All,
> >      I am thinking about starting a KIP to report "REAL" broker/consumer
> > fetch latency. Before that, I would like to collect any ideas or
> > suggestions. I created https://issues.apache.org/jira/browse/KAFKA-12713.
> >      The fetch latency is an important metric to monitor for the cluster
> > performance. With ACK=ALL, the produce latency is affected primarily by
> > broker fetch latency. However, the currently reported fetch latency doesn't
> > reflect the true fetch latency, because a fetch request sometimes has to
> > stay in purgatory and wait for replica.fetch.wait.max.ms when data is not
> > available. This greatly affects the real P50, P99, etc.
> >
> > I would like to propose a KIP to track the real fetch latency for both
> > broker followers and consumers.
> >
> > Ming
> >
>