Posted to dev@kafka.apache.org by "Matthias J. Sax (Jira)" <ji...@apache.org> on 2022/05/02 18:23:00 UTC

[jira] [Created] (KAFKA-13866) Support more advanced time retention policies

Matthias J. Sax created KAFKA-13866:
---------------------------------------

             Summary: Support more advanced time retention policies
                 Key: KAFKA-13866
                 URL: https://issues.apache.org/jira/browse/KAFKA-13866
             Project: Kafka
          Issue Type: Improvement
          Components: config, core, log cleaner
            Reporter: Matthias J. Sax


The time-based retention policy compares the record timestamp to the broker's wall-clock time. Those semantics are questionable and also lead to issues for data reprocessing: if one wants to re-process data older than the retention time, it's not possible, because the broker expires those records aggressively and the user needs to increase the retention time accordingly.
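
For illustration, a simplified sketch of what the current time-based check effectively boils down to (this is not the actual broker code, just the semantics):

    // Simplified sketch of the current time-based retention check (not the actual broker code):
    // a segment becomes eligible for deletion once the broker's wall clock is more than
    // retention.ms ahead of the segment's largest record timestamp -- which is event-time by
    // default, or append-time if message.timestamp.type=LogAppendTime (which overwrites the
    // producer-provided timestamp).
    final class CurrentRetentionSketch {
        static boolean expired(long wallClockNowMs, long largestRecordTimestampMs, long retentionMs) {
            return wallClockNowMs - largestRecordTimestampMs > retentionMs;
        }
    }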

Especially for Kafka Streams, we have seen many cases in which users got bitten by the current behavior.

It would be best if Kafka tracked _two_ timestamps per record: the record event-time (as the broker does currently), plus the log append-time (which is currently only tracked if the topic is configured with "append-time" tracking; the issue is that this overwrites the producer-provided record event-time).

Tracking both timestamps would allow setting a pure wall-clock retention policy plus a pure event-time retention policy (a hypothetical config sketch follows the list below):
 * Wall-clock time: keep the data (at least) X days after it was written
 * Event-time: keep (at most) X days' worth of event-time data
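
As a purely hypothetical sketch (the config names below do not exist today and are made up for illustration), the two policies could surface as two topic-level settings:

    // Hypothetical topic-level settings for the two proposed policies; the names
    // "retention.wall-clock.ms" and "retention.event-time.ms" are invented for illustration.
    import java.util.Map;

    final class ProposedRetentionConfigSketch {
        static final Map<String, String> TOPIC_CONFIG = Map.of(
            "retention.wall-clock.ms", "604800000",  // keep data at least 7 days after it was written
            "retention.event-time.ms", "604800000"   // keep at most 7 days' worth of event-time history
        );
    }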

Comparing wall-clock time to wall-clock time and event-time to event-time provides much cleaner semantics. The idea is to combine both policies and only expire data if both policies trigger.
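
Assuming the broker tracked both timestamps, the combined check could look roughly like the following sketch (hypothetical, not actual broker code):

    // Hypothetical combined policy: a segment is only eligible for deletion if BOTH the
    // wall-clock policy (compared against append-time) and the event-time policy
    // (compared against the partition's "stream time") have been breached.
    final class CombinedRetentionSketch {
        static boolean expired(long wallClockNowMs, long segmentLargestAppendTimeMs,
                               long partitionStreamTimeMs, long segmentLargestEventTimeMs,
                               long wallClockRetentionMs, long eventTimeRetentionMs) {
            boolean wallClockBreached = wallClockNowMs - segmentLargestAppendTimeMs > wallClockRetentionMs;
            boolean eventTimeBreached = partitionStreamTimeMs - segmentLargestEventTimeMs > eventTimeRetentionMs;
            return wallClockBreached && eventTimeBreached;
        }
    }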

For the event-time policy, the broker would need to track "stream time" as the maximum event-timestamp it has seen per partition (similar to how Kafka Streams tracks "stream time" client-side).
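
Broker-side "stream time" tracking could be as simple as keeping a monotonically increasing maximum per partition, e.g. (again, just a sketch):

    // Sketch of per-partition "stream time" tracking: the maximum event-timestamp observed so far,
    // updated on append and never moving backwards (out-of-order records do not decrease it).
    final class StreamTimeTracker {
        private long streamTimeMs = Long.MIN_VALUE;

        void onAppend(long recordEventTimestampMs) {
            streamTimeMs = Math.max(streamTimeMs, recordEventTimestampMs);
        }

        long streamTimeMs() {
            return streamTimeMs;
        }
    }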

Note the difference between "at least" and "at most" above: for the data-reprocessing case, the "at most" event-time policy avoids the broker having to keep a huge history.

The details of how wall-clock/event-time and min/max policies could be combined would be part of a KIP discussion. For example, it might also be useful to have the following policy: keep at least X days' worth of event-time history, no matter how long the data has already been stored (i.e., there would only be an event-time based expiration, but no wall-clock time expiration). It could also be combined with a wall-clock time expiration: delete data only after it is at least X days old and has been stored for at least Y days.


