Posted to jira@kafka.apache.org by "Nikolay Izhikov (Jira)" <ji...@apache.org> on 2022/09/21 10:02:00 UTC

[jira] [Assigned] (KAFKA-13866) Support more advanced time retention policies

     [ https://issues.apache.org/jira/browse/KAFKA-13866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nikolay Izhikov reassigned KAFKA-13866:
---------------------------------------

    Assignee: Nikolay Izhikov

> Support more advanced time retention policies
> ---------------------------------------------
>
>                 Key: KAFKA-13866
>                 URL: https://issues.apache.org/jira/browse/KAFKA-13866
>             Project: Kafka
>          Issue Type: Improvement
>          Components: config, core, log cleaner
>            Reporter: Matthias J. Sax
>            Assignee: Nikolay Izhikov
>            Priority: Major
>              Labels: needs-kip
>
> The time-based retention policy compares the record timestamp to broker wall-clock time. Those semantics are questionable and also lead to issues for data reprocessing: if one wants to re-process data older than the retention time, it's not possible, because the broker expires those records aggressively and users need to increase the retention time accordingly.
> Especially for Kafka Streams, we have seen many cases where users were bitten by the current behavior.
> It would be best if Kafka tracked _two_ timestamps per record: the record event-time (as brokers do currently), plus the log append-time (which is currently only tracked if the topic is configured with append-time tracking; the issue is that this overwrites the producer-provided record event-time).
> Tracking both timestamps would allow setting a pure wall-clock time retention policy plus a pure event-time retention policy:
>  * Wall-clock time: keep data for (at least) X days after writing
>  * Event-time: keep (at most) X days' worth of event-time data
> Comparing wall-clock time to wall-clock time and event-time to event-time provides much cleaner semantics. The idea is to combine both policies and only expire data if both policies trigger.
> For the event-time policy, the broker would need to track "stream time" as the maximum event-timestamp it has seen per partition (similar to how Kafka Streams tracks "stream time" client side).
> Note the difference between "at least" and "at most" above: the max-based event-time policy prevents the broker from keeping a huge history in the data-reprocessing case.
> The details of how wall-clock/event-time and min/max policies could be combined would be part of a KIP discussion. For example, it might also be useful to have the following policy: keep at least X days' worth of event-time history, no matter how long the data has already been stored (i.e., there would only be event-time based expiration, no wall-clock expiration). It could also be combined with a wall-clock time expiration: delete data only after it's at least X days old and has been stored for at least Y days.
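The combined policy described above can be sketched as follows. This is an illustrative sketch only, not Kafka code; all class and method names here are hypothetical, and the actual design would be settled in the KIP discussion. It tracks per-partition "stream time" as the maximum observed event timestamp and expires a record only when both the wall-clock policy and the event-time policy trigger:

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of the dual-retention idea from the ticket:
 * a record expires only when BOTH the wall-clock policy ("keep at
 * least N ms after append") and the event-time policy ("keep at most
 * M ms of event-time history") trigger.
 */
public class DualRetentionPolicy {

    private final long wallClockRetentionMs; // keep data at least this long after append
    private final long eventTimeRetentionMs; // keep at most this much event-time history

    // "Stream time": max event timestamp seen for this partition,
    // analogous to how Kafka Streams tracks stream time client-side.
    private long streamTimeMs = Long.MIN_VALUE;

    public DualRetentionPolicy(long wallClockRetentionMs, long eventTimeRetentionMs) {
        this.wallClockRetentionMs = wallClockRetentionMs;
        this.eventTimeRetentionMs = eventTimeRetentionMs;
    }

    /** Advance stream time monotonically as records arrive. */
    public void onRecord(long eventTimestampMs) {
        streamTimeMs = Math.max(streamTimeMs, eventTimestampMs);
    }

    /**
     * Expire only if the record was appended more than wallClockRetentionMs
     * ago AND its event timestamp lags stream time by more than
     * eventTimeRetentionMs.
     */
    public boolean isExpired(long appendTimestampMs, long eventTimestampMs, long nowMs) {
        boolean wallClockExpired = nowMs - appendTimestampMs > wallClockRetentionMs;
        boolean eventTimeExpired = streamTimeMs - eventTimestampMs > eventTimeRetentionMs;
        return wallClockExpired && eventTimeExpired;
    }

    public static void main(String[] args) {
        long day = TimeUnit.DAYS.toMillis(1);
        DualRetentionPolicy policy = new DualRetentionPolicy(7 * day, 7 * day);

        long now = 100 * day;
        policy.onRecord(now); // stream time = day 100

        // Appended 10 days ago but only 3 days behind stream time:
        // wall-clock policy triggers, event-time policy does not -> kept.
        System.out.println(policy.isExpired(now - 10 * day, now - 3 * day, now)); // false

        // Appended 10 days ago and 10 days behind stream time -> expired.
        System.out.println(policy.isExpired(now - 10 * day, now - 10 * day, now)); // true
    }
}
```

Note how the AND-combination gives the reprocessing behavior the ticket asks for: freshly reprocessed old data is never expired by wall-clock alone, while a long event-time lag alone is not enough either.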



--
This message was sent by Atlassian Jira
(v8.20.10#820010)