Posted to jira@kafka.apache.org by "Dong Lin (JIRA)" <ji...@apache.org> on 2018/07/22 18:07:00 UTC

[jira] [Comment Edited] (KAFKA-7152) replica should be in-sync if its LEO equals leader's LEO

    [ https://issues.apache.org/jira/browse/KAFKA-7152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552105#comment-16552105 ] 

Dong Lin edited comment on KAFKA-7152 at 7/22/18 6:06 PM:
----------------------------------------------------------

Thanks for the clarification [~ijuma]. Yeah, I understand the immediate result of this patch, i.e. the ISR won't shrink if a topic is rarely produced to and a replica fetcher thread dies. But I am trying to understand what the concern with this behavior is.

Let's assume replica.lag.time.max.ms uses the default value of 10 sec. In this case maybeShrinkIsr() will be called every 5 sec. We can assume for now that everything else is fast and takes negligible time.
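
To make the timing concrete, here is a minimal sketch (not the actual Kafka code) of the time-based out-of-sync check being discussed; the names and the half-interval scheduling below are just assumptions matching the numbers in this thread:

object IsrCheckSketch {
  // replica.lag.time.max.ms default assumed in this discussion
  val replicaLagTimeMaxMs = 10000L
  // the shrink check is assumed to run at half that interval, i.e. every 5 sec
  val checkIntervalMs = replicaLagTimeMaxMs / 2

  // A follower is treated as out of sync once it has not caught up recently.
  def isOutOfSync(nowMs: Long, lastCaughtUpTimeMs: Long): Boolean =
    nowMs - lastCaughtUpTimeMs > replicaLagTimeMaxMs

  def main(args: Array[String]): Unit = {
    val lastCaughtUpTimeMs = 0L
    // Checks fire at 5s, 10s, 15s, ...; the first check past the 10 sec lag
    // threshold is the one that would shrink the ISR.
    Seq(5000L, 10000L, 15000L).foreach { nowMs =>
      println(s"t=${nowMs}ms outOfSync=${isOutOfSync(nowMs, lastCaughtUpTimeMs)}")
    }
  }
}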

Regarding 1, you are right that for a ProduceRequest sent long after the thread dies, the produce delay (and end-to-end latency) will be higher if we don't shrink the ISR. More specifically, if we shrink the ISR, the leader won't wait for the problematic follower and the ProduceRequest can succeed with 0 sec delay, whereas if we don't shrink the ISR, the ProduceRequest can still succeed within 5 sec because the ISR will be shrunk at the next maybeShrinkIsr() call.

But if we look at an arbitrary ProduceRequest sent by the user, this includes a ProduceRequest that is sent immediately after the thread dies. For such a ProduceRequest, the max produce delay would also be 5 sec even with the old behavior, because the ISR has not been shrunk yet. So the overall max produce delay (and end-to-end latency) does not change with this patch. And since only the first ProduceRequest may be affected, it probably does not affect the 99th percentile or average produce delay. So this patch probably does not negatively affect the overall user experience regarding produce delay.
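
As a rough sketch of the worst-case arithmetic above (assuming acks=all so the leader waits for the ISR, and ignoring everything except the maybeShrinkIsr() cadence assumed earlier):

object MaxProduceDelaySketch {
  val replicaLagTimeMaxMs = 10000L
  val checkIntervalMs = replicaLagTimeMaxMs / 2

  // Upper bound on how long a ProduceRequest waits for the dead follower,
  // depending on whether the ISR was already shrunk while the partition sat
  // idle (old behavior) or not (with this patch).
  def maxDelayMs(isrAlreadyShrunk: Boolean): Long =
    if (isrAlreadyShrunk) 0L       // leader no longer waits for the dead follower
    else checkIntervalMs           // wait at most until the next maybeShrinkIsr() pass

  def main(args: Array[String]): Unit = {
    println(s"old behavior, request sent long after the death: ${maxDelayMs(isrAlreadyShrunk = true)} ms")
    println(s"either behavior, request sent right after the death: ${maxDelayMs(isrAlreadyShrunk = false)} ms")
    // In both cases the worst case over all requests is checkIntervalMs = 5000 ms.
  }
}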

Also, I personally feel that the ReplicaFetcherThread will die only if there is a bug. In this case, it is probably not a big deal to have an extra 5 seconds of delay, given that there may be a bigger issue caused by the bug, the partition has been inactive for much longer than 5 seconds, and it only affects the first ProduceRequest.

 

Regarding 2, yes we should be able to monitor whether the ReplicaFetcherThread is running. It can currently be achieved by monitoring the ISR, but that is probably not a good way for this purpose. In general, when the ISR shrinks, the system admin does not directly know whether it is because of lag or because the ReplicaFetcherThread died, which means it can take a long time before the system admin takes action. And it does not directly tell which broker has the issue.

 

In order to solve this problem, maybe it is better to add metrics in the broker to monitor NumOfflineThread, FetchRequestRate and FetchRequestLocalTime in the follower (not the leader). NumOfflineThread should increase whenever a thread has died abnormally in the broker, and the system admin can be alerted if this metric rises above 0. The other two metrics can be very useful for debugging replication performance issues in the follower.
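
To illustrate the NumOfflineThread idea, here is a hedged sketch; the counter wiring below is hypothetical and not Kafka's actual metrics plumbing, it just shows the signal the system admin could alert on:

import java.util.concurrent.atomic.AtomicInteger

object OfflineThreadTracking {
  // Counter that a broker-side gauge (the hypothetical NumOfflineThread) could expose.
  val numOfflineThreads = new AtomicInteger(0)

  // Run a fetcher-style loop so that an abnormal death bumps the counter an
  // operator can alert on (value > 0 means some tracked thread died).
  def runTracked(name: String)(loop: => Unit): Thread = {
    val runnable = new Runnable {
      override def run(): Unit =
        try loop
        catch {
          case e: Throwable =>
            numOfflineThreads.incrementAndGet()
            System.err.println(s"$name died abnormally: $e")
        }
    }
    val t = new Thread(runnable, name)
    t.start()
    t
  }

  def main(args: Array[String]): Unit = {
    runTracked("replica-fetcher-sketch") {
      throw new IllegalStateException("simulated fetcher bug")
    }
    Thread.sleep(100)
    println(s"NumOfflineThread=${numOfflineThreads.get}") // 1 -> alert the admin
  }
}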

 

So it seems that in the short term this patch won't change the performance guarantees (e.g. max, average, 99th percentile produce delay) for users, and it will only slightly increase the produce delay when there is a bug in Kafka. In the long term we probably want to add more metrics in Kafka to detect such bugs and let the system admin proactively resolve the issue. Does this sound good?

> replica should be in-sync if its LEO equals leader's LEO
> --------------------------------------------------------
>
>                 Key: KAFKA-7152
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7152
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Dong Lin
>            Assignee: Zhanxiang (Patrick) Huang
>            Priority: Major
>             Fix For: 2.1.0
>
>
> Currently a replica will be moved out of the ISR if the follower has not fetched from the leader for 10 sec (default replica.lag.time.max.ms). This causes a problem in the following scenario:
> Say a follower's ReplicaFetcherThread needs to fetch 2k partitions from the leader broker. Only 100 out of the 2k partitions are actively being produced to, and therefore the total bytes-in rate for those 2k partitions is small. The following will happen:
>  
> 1) The follower's ReplicaFetcherThread sends a FetchRequest for those 2k partitions.
> 2) Because the total bytes-in rate for those 2k partitions is very small, the follower is able to catch up and the leader broker adds these 2k partitions to the ISR. The follower's lastCaughtUpTimeMs for all partitions is updated to the current time T0.
> 3) Since the follower has caught up on all 2k partitions, the leader updates the 2k partition znodes to include the follower in the ISR. It may take 20 seconds to write the 2k partition znodes if each znode write operation takes 10 ms.
> 4) At T0 + 15, maybeShrinkIsr() is invoked on the leader broker. Since there has been no FetchRequest from the follower for more than 10 seconds after T0, all 2k partitions will be considered out of sync and the follower will be removed from the ISR.
> 5) The follower receives the FetchResponse at least 20 seconds after T0. That means the next FetchRequest from the follower to the leader will be sent after T0 + 20.
> The sequence of events described above will repeat over time. There will be constant churn of URPs in the cluster even if the follower can keep up with the leader's bytes-in rate. This reduces cluster availability.
>  
> In order to address this problem, one simple approach is to keep the follower in the ISR as long as the follower's LEO equals the leader's LEO, regardless of the follower's lastCaughtUpTimeMs. This is particularly useful if there are a lot of inactive partitions in the cluster.
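>
> A minimal sketch (not the actual Kafka code) of the proposed check; the parameter names are illustrative:
>
> object ProposedInSyncCheck {
>   val replicaLagTimeMaxMs = 10000L
>
>   // A replica with the same LEO as the leader stays in the ISR even if its
>   // lastCaughtUpTimeMs is stale, e.g. because the partition is inactive.
>   def isInSync(followerLeo: Long, leaderLeo: Long,
>                nowMs: Long, lastCaughtUpTimeMs: Long): Boolean =
>     followerLeo == leaderLeo ||
>       nowMs - lastCaughtUpTimeMs <= replicaLagTimeMaxMs
>
>   def main(args: Array[String]): Unit = {
>     // Inactive partition, no fetch for 15 sec, but LEOs match: stays in sync.
>     println(isInSync(followerLeo = 100L, leaderLeo = 100L, nowMs = 15000L, lastCaughtUpTimeMs = 0L))
>     // Lagging follower, LEOs differ and no recent catch-up: out of sync.
>     println(isInSync(followerLeo = 90L, leaderLeo = 100L, nowMs = 15000L, lastCaughtUpTimeMs = 0L))
>   }
> }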
>  


