Posted to dev@kafka.apache.org by "Aditya Auradkar (JIRA)" <ji...@apache.org> on 2015/02/20 02:30:12 UTC

[jira] [Comment Edited] (KAFKA-1546) Automate replica lag tuning

    [ https://issues.apache.org/jira/browse/KAFKA-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14328427#comment-14328427 ] 

Aditya Auradkar edited comment on KAFKA-1546 at 2/20/15 1:30 AM:
-----------------------------------------------------------------

I agree we should model this in terms of time rather than messages. While I think it is more natural to model replication lag as "will take more than N ms to catch up", I also agree that it is tricky to implement correctly.

One possible way to model it is to associate each offset with a commit timestamp at the source. For example, assume that a message with offset O is produced on the leader for partition X at timestamp T1. If the time now is T2 and a replica's log end offset is O (i.e. it has consumed up to O), then the lag is (T2-T1). Is there an easy way to obtain the source timestamp for a given offset?
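
To make the (T2-T1) idea concrete, here is a minimal sketch. It assumes the leader keeps some index from offsets to append timestamps; the OffsetTimeIndex class and the time_lag_ms function are hypothetical names used only for illustration, not existing Kafka code.

```python
from bisect import bisect_right


class OffsetTimeIndex:
    """Hypothetical leader-side index from offsets to append timestamps (ms)."""

    def __init__(self):
        self._offsets = []
        self._timestamps = []

    def record(self, offset, timestamp_ms):
        # Assumes offsets are recorded in increasing order, as in a log.
        self._offsets.append(offset)
        self._timestamps.append(timestamp_ms)

    def timestamp_for(self, offset):
        # Timestamp of the latest recorded entry at or before `offset`,
        # or None if nothing that old has been recorded.
        i = bisect_right(self._offsets, offset) - 1
        return self._timestamps[i] if i >= 0 else None


def time_lag_ms(index, replica_log_end_offset, now_ms):
    """Lag as (T2 - T1): now minus the source timestamp of the replica's offset."""
    t1 = index.timestamp_for(replica_log_end_offset)
    if t1 is None:
        return None  # no timestamp known for this offset
    return now_ms - t1
```

Read literally, this value keeps growing for a fully caught-up replica on an idle partition, so an actual check would probably also need to compare against the leader's log end offset.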

If this isn't feasible, then I do think that the heuristic proposed in Neha's comment is a good one, and I will submit a patch for it.

Also, there are currently 2 checks for replica lag (in ISR).
a. keepInSyncMessages - This tracks replica lag as a function of the number of messages it is trailing behind. I believe we will remove this entirely regardless of the approach we choose.
b. keepInSyncTimeMs - This tracks the amount of time between fetch requests. I think we can remove this as well.
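
For reference, the two existing checks described above can be sketched roughly as follows. This is a simplified illustration, not the broker's actual code: the function name, the parameter names, and the use of strict comparisons are all assumptions.

```python
def is_in_sync(leader_log_end_offset, replica_log_end_offset,
               last_fetch_ms, now_ms,
               keep_in_sync_messages, keep_in_sync_time_ms):
    """Return True if the replica passes both lag checks."""
    # Check (a): how many messages the replica trails the leader by.
    message_lag = leader_log_end_offset - replica_log_end_offset
    too_far_behind = message_lag > keep_in_sync_messages

    # Check (b): how long it has been since the replica last fetched.
    stuck = (now_ms - last_fetch_ms) > keep_in_sync_time_ms

    return not (too_far_behind or stuck)
```

With one fixed message threshold, a burst on a high-volume partition can trip check (a) even for a healthy replica, while a low-volume partition may take a very long time to reach it.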




was (Author: aauradkar):
I agree we should model this in terms of time rather than messages. While I think it is more natural to model replication lag as "will take more than N ms to catch up", I also agree that it is tricky to implement correctly.

One possible way to model it is to associate each offset with a commit timestamp at the source. For example, assume that a message with offset O is produced on the leader for partition X at timestamp T1. If the time now is T2 and a replica's log end offset is O (i.e. it has consumed up to O), then the lag is (T2-T1). Is there an easy way to obtain the source timestamp for a given offset?

If this isn't feasible, then I do think that the originally proposed heuristic is a good one, and I will submit a patch for it.

Also, there are currently 2 checks for replica lag (in ISR).
a. keepInSyncMessages - This tracks replica lag as a function of the number of messages it is trailing behind. I believe we will remove this entirely regardless of the approach we choose.
b. keepInSyncTimeMs - This tracks the amount of time between fetch requests. I think we can remove this as well.



> Automate replica lag tuning
> ---------------------------
>
>                 Key: KAFKA-1546
>                 URL: https://issues.apache.org/jira/browse/KAFKA-1546
>             Project: Kafka
>          Issue Type: Improvement
>          Components: replication
>    Affects Versions: 0.8.0, 0.8.1, 0.8.1.1
>            Reporter: Neha Narkhede
>            Assignee: Aditya Auradkar
>              Labels: newbie++
>
> Currently, there is no good way to tune the replica lag configs to automatically account for high and low volume topics on the same cluster. 
> For the low-volume topic it will take a very long time to detect a lagging
> replica, and for the high-volume topic it will have false-positives.
> One approach to making this easier would be to have the configuration
> be something like replica.lag.max.ms and translate this into a number
> of messages dynamically based on the throughput of the partition.
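
The translation suggested in the description could look roughly like the sketch below. It assumes a per-partition throughput estimate in messages per second is available; the function name and both parameter names are hypothetical, and only replica.lag.max.ms comes from the description.

```python
def max_lag_messages(replica_lag_max_ms, messages_per_sec):
    """Translate a time budget into a per-partition message threshold.

    `messages_per_sec` is an assumed throughput estimate for the partition.
    """
    return int(messages_per_sec * replica_lag_max_ms / 1000)
```

For example, with a 10 second budget a partition receiving 500 messages per second would tolerate 5000 messages of lag, while one receiving 2 messages per second would tolerate only 20.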


