Posted to jira@kafka.apache.org by "Haruki Okada (Jira)" <ji...@apache.org> on 2020/11/06 03:45:00 UTC

[jira] [Updated] (KAFKA-10690) Produce-response delay caused by lagging replica fetch which affects in-sync one

     [ https://issues.apache.org/jira/browse/KAFKA-10690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haruki Okada updated KAFKA-10690:
---------------------------------
    Summary: Produce-response delay caused by lagging replica fetch which affects in-sync one  (was: Produce-response delay caused by lagging replica fetch which blocks in-sync one)

> Produce-response delay caused by lagging replica fetch which affects in-sync one
> --------------------------------------------------------------------------------
>
>                 Key: KAFKA-10690
>                 URL: https://issues.apache.org/jira/browse/KAFKA-10690
>             Project: Kafka
>          Issue Type: Improvement
>          Components: core
>    Affects Versions: 2.4.1
>            Reporter: Haruki Okada
>            Priority: Major
>         Attachments: image-2020-11-06-11-15-21-781.png, image-2020-11-06-11-15-38-390.png, image-2020-11-06-11-17-09-910.png
>
>
> h2. Our environment
>  * Kafka version: 2.4.1
> h2. Phenomenon
>  * The 99th-percentile produce response time (remote scope) degraded to 500 ms, about 20 times worse than usual
>  ** Meanwhile, the cluster was running a replica reassignment to service in a new machine, in order to recover the replicas held by a broker that had failed due to a hardware issue
> !image-2020-11-06-11-15-21-781.png|width=292,height=166!
> h2. Analysis
> Let's say
>  * broker-X: The broker we observed produce latency degradation
>  * broker-Y: The broker under servicing-in
> broker-Y was catching up replicas of partitions:
>  * partition-A: has relatively small log size
>  * partition-B: has large log size
> (Broker-Y was actually catching up many other partitions as well; only two are listed here to keep the explanation simple.)
> broker-X was the leader for both partition-A and partition-B.
> We found that both partition-A and partition-B were assigned to the same ReplicaFetcherThread on broker-Y (see the sketch below for how this assignment is decided), and produce latency started to degrade right after broker-Y finished catching up partition-A.
> !image-2020-11-06-11-17-09-910.png|width=476,height=174!
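> For reference, the mapping from a partition to a fetcher thread is decided roughly as below (a simplified, self-contained sketch, not the exact broker code; the real logic lives in AbstractFetcherManager, and num.replica.fetchers controls the number of fetcher threads):
> {code:scala}
> // Simplified sketch of how a leader partition is mapped to one of
> // num.replica.fetchers fetcher threads on the follower side.
> // Partitions that hash to the same id share a single ReplicaFetcherThread,
> // as partition-A and partition-B did in our case.
> object FetcherIdSketch {
>   def fetcherId(topic: String, partition: Int, numFetchers: Int): Int =
>     ((31 * topic.hashCode + partition) % numFetchers + numFetchers) % numFetchers
>
>   def main(args: Array[String]): Unit = {
>     val numFetchers = 2
>     println(fetcherId("topic-of-partition-A", 0, numFetchers)) // e.g. 0
>     println(fetcherId("topic-of-partition-B", 0, numFetchers)) // may also be 0 => same thread
>   }
> }
> {code}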
> In addition, we observed disk reads on broker-X during the service-in. (This is expected, since old segments are likely not in the page cache.)
> !image-2020-11-06-11-15-38-390.png|width=292,height=193!
> So we suspected that:
>  * The in-sync replica fetch (partition-A) was held up by the lagging replica fetch (partition-B), which is slow because it requires actual disk reads
>  ** Since ReplicaFetcherThread sends fetch requests in a blocking manner, the next fetch request can't be sent until the previous one completes (illustrated by the sketch after this list)
>  ** => This delays the in-sync replica fetch for partitions assigned to the same replica fetcher thread
>  ** => Which in turn degrades remote-scope produce latency
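> A minimal, self-contained sketch of this behavior (illustrative latencies only, not the actual ReplicaFetcherThread code): one fetch round covers every partition assigned to the thread, so the total wait is dominated by the slowest partition.
> {code:scala}
> // Minimal sketch of why one slow partition delays every partition that shares
> // its fetcher thread: the thread issues a single blocking fetch covering all of
> // its partitions and cannot start the next fetch until the whole response arrives.
> object BlockingFetchSketch {
>   // Pretend per-partition fetch latency; partition-B needs cold disk reads.
>   def fetchLatencyMs(partition: String): Long =
>     if (partition == "partition-B") 400L else 5L
>
>   // One combined fetch round: the wait is dominated by the slowest partition.
>   def doWork(partitions: Seq[String]): Unit = {
>     val waitMs = partitions.map(fetchLatencyMs).max
>     Thread.sleep(waitMs)
>     println(s"fetch round for ${partitions.mkString(", ")} took ${waitMs}ms")
>   }
>
>   def main(args: Array[String]): Unit = {
>     // partition-A (in-sync) and partition-B (lagging) share the same fetcher
>     // thread, so the in-sync fetch for partition-A waits ~400ms every round.
>     doWork(Seq("partition-A", "partition-B"))
>   }
> }
> {code}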
> h2. Possible fix
> We think this issue could be addressed by designating some of the ReplicaFetcherThreads (or creating a separate thread pool) for lagging-replica catch-up, but we are not sure this is the appropriate approach.
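> The sketch below shows one possible shape of this idea, not a worked-out patch; the lag threshold (lagBytesThreshold) is an assumed knob, not an existing config:
> {code:scala}
> // Rough sketch of the idea: split the partitions handled by one fetcher into
> // "in-sync" and "lagging" groups and fetch them from separate threads, so that
> // cold catch-up reads cannot delay in-sync fetches.
> object SplitFetcherSketch {
>   final case class PartitionState(name: String, lagBytes: Long)
>
>   // Partitions whose lag is at or below the (assumed) threshold stay on the
>   // in-sync fetcher; the rest move to a dedicated catch-up fetcher.
>   def split(partitions: Seq[PartitionState], lagBytesThreshold: Long): (Seq[PartitionState], Seq[PartitionState]) =
>     partitions.partition(_.lagBytes <= lagBytesThreshold)
>
>   def main(args: Array[String]): Unit = {
>     val partitions = Seq(
>       PartitionState("partition-A", 0L),                        // fully caught up
>       PartitionState("partition-B", 50L * 1024 * 1024 * 1024))  // large catch-up backlog
>     val (inSync, lagging) = split(partitions, lagBytesThreshold = 1024L * 1024 * 1024)
>     println(s"in-sync fetcher handles:  ${inSync.map(_.name).mkString(", ")}")
>     println(s"catch-up fetcher handles: ${lagging.map(_.name).mkString(", ")}")
>   }
> }
> {code}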
> Please give your opinions about this issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)