Posted to users@kafka.apache.org by Dmitry Minkovsky <dm...@gmail.com> on 2017/08/07 15:42:33 UTC

Kafka Streams: retention and stream replay

One of the most appealing features of the streams-based architecture is the
ability to replay history. This concept was highlighted in a blog post
[0] just the other day.

Practically, though, I am stuck on the mechanics of replaying data when
that data is also periodically expiring. If your logs expire after some
time, how can you replay state? This may not be a problem for certain kinds
of analysis, especially windowed analysis.

However, let's say your topic uses time-based retention and consists of
logical application events like "user-create" and "user-update". If the
"user-create" event is deleted, subsequent "user-update" events for that
user are no longer replayable. The streams application transforms the
"user-create" and "user-update" events into a compacted entity topic,
"user". That topic can be replayed, but that is different from replaying
the actual events that produced the compacted entity.
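
Concretely, the topology I have in mind is roughly the sketch below (topic
names, value types, and the merge logic are illustrative, not my actual
code):

import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;

public class UserEntityTopology {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-entity-builder");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // "user-events" holds user-create and user-update events keyed by
        // user id, and uses time-based retention.
        KStream<String, String> events = builder.stream("user-events");

        // Fold the event stream into the latest state per user. A real
        // application would merge fields; here the newest event simply wins.
        KTable<String, String> users = events
                .groupByKey()
                .reduce((previous, latest) -> latest);

        // "user" is a compacted topic that keeps one record per user id, so
        // it can always be replayed -- but only as entities, not as the
        // original create/update events.
        users.toStream().to("user");

        new KafkaStreams(builder.build(), props).start();
    }
}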

So how do I make sense of retention and replay?

Thank you,
Dmitry




[0] https://www.confluent.io/blog/messaging-single-source-truth/

Re: Kafka Streams: retention and stream replay

Posted by Guozhang Wang <wa...@gmail.com>.
Hello Dmitry,

The right way to think about reprocessing is this:

1) you can reprocess the source stream from a given Kafka topic only within
the source topic's retention period. For example, if an event is produced
to the source topic at time t0, and that topic's retention period is t1,
then that event can only be reprocessed until t0 + t1.

2) if your input topic's messages are not independent, i.e. it is not an
append-only record stream but its messages are logically dependent on each
other (you can read about this in
https://kafka.apache.org/0110/documentation/streams/developer-guide#streams_kstream_ktable),
then you cannot logically reprocess it from ANY given point in history. For
example, if your input source stream is actually a changelog-like stream,
such that a "user-create" event happens at t0 and a "user-update" event
happens at t2, then if you start reprocessing from a time t1 where
t0 < t1 < t2, you are restarting from an "inconsistent" state of the source
stream. So the right way is to configure your input source topic to use a
key-based compaction policy instead of time-based retention, as sketched
below.
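
A rough sketch of that configuration change using the Java AdminClient
could look like the following (the topic name and settings are just for
illustration):

import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.Config;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class CompactSourceTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            ConfigResource topic =
                new ConfigResource(ConfigResource.Type.TOPIC, "user-events");

            // With time-based retention (retention.ms), a record produced
            // at t0 is only replayable until t0 + retention.ms. Key-based
            // compaction instead keeps the latest record per key, so the
            // topic stays logically replayable.
            Config compacted = new Config(Collections.singletonList(
                new ConfigEntry("cleanup.policy", "compact")));

            admin.alterConfigs(Collections.singletonMap(topic, compacted))
                 .all()
                 .get();
        }
    }
}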

To keep such a compacted topic from growing indefinitely, we are
considering making a truncation point configurable (
https://cwiki.apache.org/confluence/display/KAFKA/KIP-47+-+Add+timestamp-based+log+deletion+policy):
users could specify up to which point processing has been done and is final
and will never need to be reprocessed again, so that brokers can truncate
the log up to that point.


Guozhang




-- 
-- Guozhang