You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@kafka.apache.org by Victoria Xia <vi...@confluent.io.INVALID> on 2023/05/02 21:06:19 UTC

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Cool KIP, Walker! Thanks for sharing this proposal.

A few clarifications:

1. Is the order that records exit the buffer in necessarily the same as the
order that records enter the buffer in, or no? Based on the description in
the KIP, it sounds like the answer is no, i.e., records will exit the
buffer in increasing timestamp order, which means that they may be ordered
(even for the same key) compared to the input order.

2. What happens if the join grace period is nonzero, and a stream-side
record arrives with a timestamp that is older than the current stream time
minus the grace period? Will this record trigger a join result, or will it
be dropped? Based on the description for what happens when the join grace
period is set to zero, it sounds like the late record will be dropped, even
if the join grace period is nonzero. Is that true?

3. What could cause stream time to advance, for purposes of removing
records from the join buffer? For example, will new records arriving on the
table side of the join cause stream time to advance? From the KIP it sounds
like only stream-side records will advance stream time -- does that mean
that the join processor itself will have to track this stream time?

Also +1 to Lucas's question about what options will be available for
configuring the join buffer. Will users have the option to choose whether
they want the buffer to be in-memory vs persistent?

- Victoria

On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
<lb...@confluent.io.invalid> wrote:

> HI Walker,
>
> thanks for the KIP! We definitely need this. I have two questions:
>
>  - Have you considered allowing the customization of the underlying
> buffer implementation? As I can see, `StreamJoined` lets you customize
> the underlying store via a `WindowStoreSupplier`. Would it make sense
> for `Joined` to have this as well? I can imagine one may want to limit
> the number of records in the buffer, for example. If we hit the
> maximum, the only option would be to drop semantic guarantees, but
> users may still want to do this.
>  - With "second option on the table side" you are referring to
> versioned tables, right? Will the buffer on the stream side behave any
> different whether the table side is versioned or not?
>
> Finally, I think a simple example in the motivation section could help
> non-experts understand the KIP.
>
> Best,
> Lucas
>
> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> <wc...@confluent.io.invalid> wrote:
> >
> > Hello everybody,
> >
> > I have a stream proposal to improve the stream table join by adding a
> grace
> > period and buffer to the stream side of the join to allow processing in
> > timestamp order matching the recent improvements of the versioned tables.
> >
> > Please take a look here <https://cwiki.apache.org/confluence/x/lAs0Dw>
> and
> > share your thoughts.
> >
> > best,
> > Walker
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Victoria Xia <vi...@confluent.io.INVALID>.

Handling the "changelog" topic for the buffer in the same way as
repartition topics makes sense. Thanks for clearing that up!

- Victoria

On Tue, Jun 6, 2023 at 10:22 AM Walker Carlson
<wc...@confluent.io.invalid> wrote:

> Good Point Victoria. I just removed the compacted topic mention from the
> KIP. I agree with Burno about using a normal topic and deleting records
> that have been processed.
>
> On Tue, Jun 6, 2023 at 2:28 AM Bruno Cadonna <ca...@apache.org> wrote:
>
> > Hi,
> >
> > another idea that came to my mind. Instead of using a compacted topic,
> > the buffer could use a non-compacted topic and regularly delete records
> > before a given offset as Streams does for repartition topics.
> >
> > Best,
> > Bruno
> >
> > On 05.06.23 21:48, Bruno Cadonna wrote:
> > > Hi Victoria,
> > >
> > > that is a good point!
> > >
> > > I think, the topic needs to be a compacted topic to be able to get rid
> > > of records that are evicted from the buffer. So the key might be
> > > something with the key, the timestamp, and a sequence number to
> > > distinguish between records with the same key and same timestamp.
> > >
> > > Just an idea! Maybe Walker comes up with something better.
> > >
> > > Best,
> > > Bruno
> > >
> > > On 05.06.23 20:38, Victoria Xia wrote:
> > >> Hi Walker,
> > >>
> > >> Thanks for the latest updates! The KIP looks great. Just one question
> > >> about
> > >> the changelog topic for the join buffer: The KIP says "When a failure
> > >> occurs the buffer will try to recover from an OffsetCheckpoint if
> > >> possible.
> > >> If not it will reload the buffer from a compacted change-log topic."
> > This
> > >> is a new changelog topic that will be introduced specifically for the
> > >> join
> > >> buffer, right? Why is the changelog topic compacted? What are the
> keys?
> > I
> > >> am confused because the buffer contains records from the stream-side
> > >> of the
> > >> join, for which multiple records with the same key should be treated
> as
> > >> separate updates will all must be tracked in the buffer, rather than
> > >> updates which replace each other.
> > >>
> > >> Thanks,
> > >> Victoria
> > >>
> > >> On Mon, Jun 5, 2023 at 1:47 AM Bruno Cadonna <ca...@apache.org>
> > wrote:
> > >>
> > >>> Hi Walker,
> > >>>
> > >>> Thanks once more for the updates to the KIP!
> > >>>
> > >>> Do you also plan to expose metrics for the buffer?
> > >>>
> > >>> Best,
> > >>> Bruno
> > >>>
> > >>> On 02.06.23 17:16, Walker Carlson wrote:
> > >>>> Hello Bruno,
> > >>>>
> > >>>> I think this covers your questions. Let me know what you think
> > >>>>
> > >>>> 2.
> > >>>> We can use a changelog topic. I think we can treat it like any other
> > >>> store
> > >>>> and recover in the usual manner. Also implementation is on disk
> > >>>>
> > >>>> 3.
> > >>>> The description is in the public interfaces description. I will copy
> > it
> > >>>> into the proposed changes as well.
> > >>>>
> > >>>> This is a bit of an implementation detail that I didn't want to add
> > >>>> into
> > >>>> the kip, but the record will be added to the buffer to keep the
> stream
> > >>> time
> > >>>> consistent, it will just be ejected immediately. If of course if
> this
> > >>>> causes performance issues we will skip this step and track stream
> time
> > >>>> separately. I will update the kip to say that stream time advances
> > >>>> when a
> > >>>> stream record enters the node.
> > >>>>
> > >>>> Also, yes, updated.
> > >>>>
> > >>>> 5.
> > >>>> No there is no difference right now, everything gets processed as it
> > >>> comes
> > >>>> in and tries to find a record for its time stamp.
> > >>>>
> > >>>> Walker
> > >>>>
> > >>>> On Fri, Jun 2, 2023 at 6:41 AM Bruno Cadonna <ca...@apache.org>
> > >>>> wrote:
> > >>>>
> > >>>>> Hi Walker,
> > >>>>>
> > >>>>> Thanks for the updates!
> > >>>>>
> > >>>>> 2.
> > >>>>> It is still not clear to me how a failure is handled. I do not
> > >>>>> understand what you mean by "recover from an OffsetCheckpoint".
> > >>>>>
> > >>>>> My understanding is that the buffer needs to be replicated into its
> > >>>>> own
> > >>>>> Kafka topic. The input topic is not enough. The offset of a record
> is
> > >>>>> added to the offsets to commit once the record is streamed through
> > the
> > >>>>> subtopology. That means once the record is added to the buffer its
> > >>>>> offset is added to the offsets to commit -- independently of
> > >>>>> whether the
> > >>>>> record was evicted from the buffer and sent to the join node or
> not.
> > >>>>> Now, let's assume the following scenario
> > >>>>> 1. a record is read from the input topic and added to the buffer,
> but
> > >>>>> not evicted to be processed by the join node.
> > >>>>> 2. When the processing of the subtopology finishes the offset of
> the
> > >>>>> record is added to the offsets to commit.
> > >>>>> 3. A commit happens.
> > >>>>> 4. A failure happens
> > >>>>>
> > >>>>> After the failure the buffer is empty but the record will not be
> read
> > >>>>> anymore from the input topic since its offset has been already
> > >>>>> committed. The record is lost.
> > >>>>> One solution to avoid the loss is to recreate the buffer from a
> > >>>>> compacted Kafka topic as we do for suppression buffers. I do not
> > >>>>> think,
> > >>>>> we need any offset checkpoint here since, we keep the buffer in
> > >>>>> memory,
> > >>>>> right? Or do you plan to back the buffer with a persistent store?
> > Even
> > >>>>> in that case, a compacted Kafka topic would be needed.
> > >>>>>
> > >>>>>
> > >>>>> 3.
> > >>>>>    From the KIP it is still not clear to me what happens if a
> > >>>>> record is
> > >>>>> outside of the grace period. I guess the record that falls outside
> of
> > >>>>> the grace period will not be added to the buffer, but will be send
> to
> > >>>>> the join node. Since it is outside of the grace period it will also
> > >>>>> not
> > >>>>> increase stream time and it will not trigger an eviction. Also the
> > >>>>> head
> > >>>>> of the buffer will not contain a record that needs to be evicted
> > since
> > >>>>> the the timestamp of the head record will be within the interval
> > >>>>> stream
> > >>>>> time minus grace period. Is this correct? Please add such a
> > >>>>> description
> > >>>>> to the KIP.
> > >>>>> Furthermore, I think there is a mistake in the text:
> > >>>>> "... will dequeue when the record timestamp is greater than stream
> > >>>>> time
> > >>>>> plus the grace period". I guess that should be "... will dequeue
> when
> > >>>>> the record timestamp is less than (or equal?) stream time minus the
> > >>>>> grace period"
> > >>>>>
> > >>>>>
> > >>>>> 5.
> > >>>>> What is the difference between not setting the grace period and
> > >>>>> setting
> > >>>>> it to zero? If there is a difference, why is there a difference?
> > >>>>>
> > >>>>>
> > >>>>> Best,
> > >>>>> Bruno
> > >>>>>
> > >>>>>
> > >>>>> On 01.06.23 23:58, Walker Carlson wrote:
> > >>>>>> Hey Bruno thanks for the feedback.
> > >>>>>>
> > >>>>>> 1)
> > >>>>>> I will add this to the kip, but stream time only advances as the
> > when
> > >>> the
> > >>>>>> buffer receives a new record.
> > >>>>>>
> > >>>>>> 2)
> > >>>>>> You are correct, I will add a failure section on to the kip. Since
> > >>>>>> the
> > >>>>>> records wont change in the buffer from when they are read from the
> > >>> topic
> > >>>>>> they are replicated already.
> > >>>>>>
> > >>>>>> 3)
> > >>>>>> I see that I'm out voted on the dropping of records thing. We will
> > >>>>>> pass
> > >>>>>> them on and try to join them if possible. This might cause some
> null
> > >>>>>> results, but increasing the table history retention should help
> > that.
> > >>>>>>
> > >>>>>> 4)
> > >>>>>> I can add some on the kip. But its pretty directly adding whatever
> > >>>>>> the
> > >>>>>> grace period is to the latency. I don't see a way around it.
> > >>>>>>
> > >>>>>> Walker
> > >>>>>>
> > >>>>>> On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org>
> > >>> wrote:
> > >>>>>>
> > >>>>>>> Hi Walker,
> > >>>>>>>
> > >>>>>>> thanks for the KIP!
> > >>>>>>>
> > >>>>>>> Here my feedback:
> > >>>>>>>
> > >>>>>>> 1.
> > >>>>>>> It is still not clear to me when stream time for the buffer
> > >>>>>>> advances.
> > >>>>>>> What is the event that let the stream time advance? In the
> > >>> discussion, I
> > >>>>>>> do not understand what you mean by "The segment store already has
> > an
> > >>>>>>> observed stream time, we advance based on that. That should only
> > >>> advance
> > >>>>>>> based on records that enter the store." Where does this segment
> > >>>>>>> store
> > >>>>>>> come from? Anyways, I think it would be great to also state how
> > >>>>>>> stream
> > >>>>>>> time advances in the KIP.
> > >>>>>>>
> > >>>>>>> 2.
> > >>>>>>> How does the buffer behave in case of a failure? I think I
> > >>>>>>> understand
> > >>>>>>> that the buffer will use an implementation of
> > >>> TimeOrderedKeyValueBuffer
> > >>>>>>> and therefore the records in the buffer will be replicated to a
> > >>>>>>> topic
> > >>> in
> > >>>>>>> Kafka, but I am not completely sure. Could you elaborate on this
> in
> > >>> the
> > >>>>>>> KIP?
> > >>>>>>>
> > >>>>>>> 3.
> > >>>>>>> I agree with Matthias about dropping late records. We use grace
> > >>> periods
> > >>>>>>> in scenarios where we records are grouped like in windowed
> > >>> aggregations
> > >>>>>>> and windowed joins. The stream buffer you propose does not really
> > >>> group
> > >>>>>>> any records. It rather delays records and reorders them. I am not
> > >>>>>>> sure
> > >>>>>>> if grace period is the right naming/concept to apply here.
> > >>>>>>> Instead of
> > >>>>>>> dropping records that fall outside of the buffer's time interval
> > the
> > >>>>>>> join should skip the buffer and try to join the record
> > >>>>>>> immediately. In
> > >>>>>>> the end, a stream-table join is a unwindowed join, i.e., no
> > grouping
> > >>> is
> > >>>>>>> applied to the records.
> > >>>>>>> What do you and other folks think about this proposal?
> > >>>>>>>
> > >>>>>>> 4.
> > >>>>>>> How does the proposed buffer, affects processing latency? Could
> you
> > >>>>>>> please add some words about this to the KIP?
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Best,
> > >>>>>>> Bruno
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On 31.05.23 01:49, Walker Carlson wrote:
> > >>>>>>>> Thanks for all the additional comments. I will either address
> them
> > >>> here
> > >>>>>>> or
> > >>>>>>>> update the kip accordingly.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> I mentioned a follow kip to add extra features before and in the
> > >>>>>>> responses.
> > >>>>>>>> I will try to briefly summarize what options and optimizations I
> > >>>>>>>> plan
> > >>>>> to
> > >>>>>>>> include. If a concern is not covered in this list I for sure
> talk
> > >>> about
> > >>>>>>> it
> > >>>>>>>> below.
> > >>>>>>>>
> > >>>>>>>> * Allowing non versioned tables to still use the stream buffer
> > >>>>>>>> * Automatically materializing tables instead of forcing the user
> > to
> > >>> do
> > >>>>> it
> > >>>>>>>> * Configurable for in memory buffer
> > >>>>>>>> * Order the records in offset order or in time order
> > >>>>>>>> * Non memory use buffer (offset order, delayed pull from
> stream.)
> > >>>>>>>> * Time synced between stream and table side (maybe)
> > >>>>>>>> * Do not drop late records and process them as they come in
> > >>>>>>>> instead.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> First, Victoria.
> > >>>>>>>>
> > >>>>>>>> 1) (One of your nits covers this, but you are correct it doesn't
> > >>>>>>>> make
> > >>>>>>>> sense. so I removed that part of the example.)
> > >>>>>>>> For those examples with the "bad" join results I said without
> > >>> buffering
> > >>>>>>> the
> > >>>>>>>> stream it would look like that, but that was incomplete. If the
> > >>>>>>>> look
> > >>> up
> > >>>>>>> was
> > >>>>>>>> simply looking at the latest version of the table when the
> stream
> > >>>>> records
> > >>>>>>>> came in then the results were possible. If we are using the
> > >>>>>>>> point in
> > >>>>> time
> > >>>>>>>> lookup that versioned tables let us then you are correct the
> > future
> > >>>>>>> results
> > >>>>>>>> are not possible.
> > >>>>>>>>
> > >>>>>>>> 2) I'll get to this later as Matthias brought up something
> > related.
> > >>>>>>>>
> > >>>>>>>> To your additional thoughts, I agree that we need to call those
> > >>> things
> > >>>>>>> out
> > >>>>>>>> in the documentation. I'm writing up a follow up kip with a lot
> of
> > >>> the
> > >>>>>>>> ideas we have discussed so that we can improve this feature
> beyond
> > >>> the
> > >>>>>>> base
> > >>>>>>>> implementation if it's needed.
> > >>>>>>>>
> > >>>>>>>> I addressed the nits in the kip. I somehow missed the table
> stream
> > >>>>> table
> > >>>>>>>> join processor improvement, it makes your first question make a
> > lot
> > >>>>> more
> > >>>>>>>> sense.  Table history retention is a much cleaner way to
> > >>>>>>>> describe it.
> > >>>>>>>>
> > >>>>>>>> As to your mention of the syncing the time for the table and
> > >>>>>>>> stream.
> > >>>>>>>> Matthias mentioned that as well. I will address both here. I
> > >>>>>>>> plan to
> > >>>>>>> bring
> > >>>>>>>> that up in the future, but for now we will leave it out. I
> > >>>>>>>> suppose it
> > >>>>>>> will
> > >>>>>>>> be more useful after the table history retention is separable
> from
> > >>> the
> > >>>>>>>> table grace period.
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> To address Matthias comments.
> > >>>>>>>>
> > >>>>>>>> You are correct by saying the in memory store shouldn't cause
> any
> > >>>>>>> semantic
> > >>>>>>>> concerns. My concern would be more with if we limited the number
> > of
> > >>>>>>> records
> > >>>>>>>> on the buffer and what we would do if we hit said limits,
> > (emitting
> > >>>>> those
> > >>>>>>>> records might be an issue, throwing an error and halting would
> > >>>>>>>> not).
> > >>> I
> > >>>>>>>> think we can leave this discussion to the follow up kip along
> > >>>>>>>> with a
> > >>>>> few
> > >>>>>>>> other options.
> > >>>>>>>>
> > >>>>>>>> I will go through your proposals now.
> > >>>>>>>>
> > >>>>>>>>       - don't support non-versioned KTables
> > >>>>>>>>
> > >>>>>>>> Sure, we can always expand this later on. Will include as part
> > >>>>>>>> of the
> > >>>>> of
> > >>>>>>>> the improvement kip
> > >>>>>>>>
> > >>>>>>>>       - if grace period is added, users need to explicitly
> > >>>>>>>> materialize
> > >>>>> the
> > >>>>>>>> table as version (either directly, or upstream. Upstream only
> > works
> > >>> if
> > >>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> > >>>>>>>>
> > >>>>>>>> again, that works for me for now, if we find a use we can always
> > >>>>>>>> add
> > >>>>>>> later.
> > >>>>>>>>
> > >>>>>>>>       - the table's history retention time must be larger than
> the
> > >>> grace
> > >>>>>>>> period (should be easy to check at runtime, when we build the
> > >>> topology)
> > >>>>>>>>
> > >>>>>>>> agreed
> > >>>>>>>>
> > >>>>>>>>       - because switching from non-versioned to version stores
> > >>>>>>>> is not
> > >>>>>>>> backward compatibly (cf KIP-914), users need to take care of
> this
> > >>>>>>>> themselves, and this also implies that adding grace period is
> not
> > a
> > >>>>>>>> backward compatible change (even only if via indirect means)
> > >>>>>>>>
> > >>>>>>>> sure, this works
> > >>>>>>>>
> > >>>>>>>> As to the dropping of late records, I'm not sure. One one hand I
> > >>>>>>>> like
> > >>>>> not
> > >>>>>>>> dropping things. But on the other I struggle to see how a user
> can
> > >>>>> filter
> > >>>>>>>> out late records that might have incomplete join results. The
> > point
> > >>> in
> > >>>>>>> time
> > >>>>>>>> look up will aggressively expire old data and if new data has
> been
> > >>>>>>> replaced
> > >>>>>>>> it will return null if outside of the retention. This seems like
> > it
> > >>>>> could
> > >>>>>>>> corrupt the integrity of the join output. Seeing that we drop
> late
> > >>>>>>> records
> > >>>>>>>> on the table side as well I would think it makes sense to drop
> > late
> > >>>>>>> records
> > >>>>>>>> on the stream buffer. I could be convinced otherwise I suppose,
> I
> > >>> could
> > >>>>>>> see
> > >>>>>>>> adding this as an option in a follow up kip. It would be very
> > >>>>>>>> easy to
> > >>>>>>>> implement either way. For now unless no one else objects I'm
> > >>>>>>>> going to
> > >>>>>>> stick
> > >>>>>>>> with dropping the records for the sake of getting this kip
> > >>>>>>>> passed. It
> > >>>>> is
> > >>>>>>>> functionally a small change to make and we can update later if
> you
> > >>> feel
> > >>>>>>>> strongly about it.
> > >>>>>>>>
> > >>>>>>>> For the ordering. I have to say that it would be more
> > >>>>>>>> complicated to
> > >>>>>>>> implement it to be in offset order, if the goal it to get as
> > >>>>>>>> many of
> > >>>>> the
> > >>>>>>>> records validly joined as possible. Because we would process as
> > >>> things
> > >>>>>>> left
> > >>>>>>>> the buffer a sufficiency early enough record could hold up
> records
> > >>> that
> > >>>>>>>> would otherwise be valid past the table history retention. To
> fix
> > >>> this
> > >>>>> we
> > >>>>>>>> could process by timestamp then store in a second queue and emit
> > by
> > >>>>>>> offset,
> > >>>>>>>> but that would be a lot more complicated. If we didn't care
> > >>>>>>>> about not
> > >>>>>>>> missing some valid joins we could just have no store and pull
> from
> > >>> the
> > >>>>>>>> topic at a delay only caring about the timestamp of the next
> > >>>>>>>> offset.
> > >>>>> For
> > >>>>>>>> now I want to stick with the timestamp ordering as it makes much
> > >>>>>>>> more
> > >>>>>>> sense
> > >>>>>>>> to me, but would propose we add both of the other options I have
> > >>>>>>>> laid
> > >>>>> out
> > >>>>>>>> here in the follow up kip.
> > >>>>>>>>
> > >>>>>>>> Lastly, I think having an empty store with zero grace period
> > >>>>>>>> would be
> > >>>>>>> super
> > >>>>>>>> simple and not costly, so we might as well make it even if
> nothing
> > >>> gets
> > >>>>>>>> entered.
> > >>>>>>>>
> > >>>>>>>> I hope that address all your concerns,
> > >>>>>>>>
> > >>>>>>>> Walker
> > >>>>>>>>
> > >>>>>>>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <
> mjsax@apache.org
> > >
> > >>>>>>> wrote:
> > >>>>>>>>
> > >>>>>>>>> Walker,
> > >>>>>>>>>
> > >>>>>>>>> thanks for the updates. The KIP itself reads fine (of course
> > >>> Victoria
> > >>>>>>>>> made good comments about some phrases), but there is a couple
> of
> > >>>>> things
> > >>>>>>>>> from your latest reply I don't understand, and that I still
> think
> > >>> need
> > >>>>>>>>> some more discussions.
> > >>>>>>>>>
> > >>>>>>>>> Lukas, asked about in-memory option and `WindowStoreSupplier`
> and
> > >>> you
> > >>>>>>>>> mention "semantic concerns". There should not be any semantic
> > >>>>> difference
> > >>>>>>>>> from the underlying buffer implementation, so I am not sure
> > >>>>>>>>> what you
> > >>>>>>>>> mean here (also the relationship to suppress() is unclear to
> me)?
> > >>> -- I
> > >>>>>>>>> am ok to not make it configurable for now. We can always do it
> > >>>>>>>>> via a
> > >>>>>>>>> follow up KIP, and keep interface changes limited for now.
> > >>>>>>>>>
> > >>>>>>>>> Does it really make sense to allow a grace period if the table
> is
> > >>>>>>>>> non-versioned? You also say: "If table is not materialized it
> > will
> > >>>>>>>>> materialize it as versioned." -- What history retention time
> > would
> > >>> we
> > >>>>>>>>> pick for this case (also asked by Victoria)? Or should we
> > >>>>>>>>> rather not
> > >>>>>>>>> support this and force the user to materialize the table
> > >>>>>>>>> explicitly,
> > >>>>> and
> > >>>>>>>>> thus explicitly picking a history retention time? It's tradeoff
> > >>>>> between
> > >>>>>>>>> usability and guiding uses that there will be a significant
> > impact
> > >>> on
> > >>>>>>>>> disk usage. There is also compatibility concerns: If the table
> is
> > >>> not
> > >>>>>>>>> explicitly materialized in the old program, we would already
> > >>>>>>>>> need to
> > >>>>>>>>> materialize it also in the old program (of course, we would
> use a
> > >>>>>>>>> non-versioned store so far). Thus, if somebody adds a grace
> > >>>>>>>>> period,
> > >>> we
> > >>>>>>>>> cannot just switch the store type, as it would be a breaking
> > >>>>>>>>> change,
> > >>>>>>>>> potentially required an application re-set, or following the
> > >>>>>>>>> upgrade
> > >>>>>>>>> path for versioned state stores, and also changing the program
> to
> > >>>>>>>>> explicitly materialize using a versioned store. Also note, that
> > we
> > >>>>> might
> > >>>>>>>>> not materialize the actual join table, but only an upstream
> > table,
> > >>> and
> > >>>>>>>>> use `ValueGetter` to access the upstream data.
> > >>>>>>>>>
> > >>>>>>>>> To this end, as you already mentioned, history retention of the
> > >>> table
> > >>>>>>>>> should be at least grace period. You proposed to include this
> in
> > a
> > >>>>>>>>> follow up KIP, but I am wondering if it's a fundamental
> > >>>>>>>>> requirement
> > >>>>> and
> > >>>>>>>>> thus we should put a check in place right away and reject an
> > >>>>>>>>> invalid
> > >>>>>>>>> configuration? (It always easier to lift restriction than to
> > >>> introduce
> > >>>>>>>>> them later.) This would also imply that a non-versioned table
> > >>>>>>>>> cannot
> > >>>>> be
> > >>>>>>>>> supported, because it does not have a history retention that is
> > >>> larger
> > >>>>>>>>> than grace period, and maybe also answer the requirement about
> > >>>>>>>>> materialization: as we already always materialize something on
> > the
> > >>>>>>>>> tablet side as non-versioned store right now, it seems
> > >>>>>>>>> difficult to
> > >>>>>>>>> migrate the store to a versioned store. Ie, it might be ok to
> > push
> > >>> the
> > >>>>>>>>> burden onto the user and say: if you start using grace period,
> > you
> > >>>>> also
> > >>>>>>>>> need to manually switch from non-versioned to versioned
> KTables.
> > >>> Doing
> > >>>>>>>>> stuff automatically under the hood if very complex for this
> > >>>>>>>>> case, we
> > >>>>> if
> > >>>>>>>>> we push the burden onto the user, it might be ok to not
> > complicate
> > >>>>> this
> > >>>>>>>>> KIP significantly.
> > >>>>>>>>>
> > >>>>>>>>> To summarize the last two paragraphs, I would propose to:
> > >>>>>>>>>       - don't support non-versioned KTables
> > >>>>>>>>>       - if grace period is added, users need to explicitly
> > >>> materialize
> > >>>>> the
> > >>>>>>>>> table as version (either directly, or upstream. Upstream only
> > >>>>>>>>> works
> > >>> if
> > >>>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> > >>>>>>>>>       - the table's history retention time must be larger than
> > the
> > >>> grace
> > >>>>>>>>> period (should be easy to check at runtime, when we build the
> > >>>>> topology)
> > >>>>>>>>>       - because switching from non-versioned to version stores
> > >>>>>>>>> is not
> > >>>>>>>>> backward compatibly (cf KIP-914), users need to take care of
> this
> > >>>>>>>>> themselves, and this also implies that adding grace period is
> > >>>>>>>>> not a
> > >>>>>>>>> backward compatible change (even only if via indirect means)
> > >>>>>>>>>
> > >>>>>>>>> About dropping late records: wondering if we should never drop
> a
> > >>>>>>>>> stream-side record for a left-join, even if it's late? In
> > general,
> > >>> one
> > >>>>>>>>> thing I observed over the years is, that it's easier to keep
> > stuff
> > >>> and
> > >>>>>>>>> let users filter explicitly downstream (or make it
> configurable),
> > >>>>>>>>> instead of dropping pro-actively, because users have no good
> > >>>>>>>>> way to
> > >>>>>>>>> resurrect record that got already dropped.
> > >>>>>>>>>
> > >>>>>>>>> For ordering, sounds reasonable to me only start with one
> > >>>>>>>>> implementation, and maybe make it configurable as a follow up.
> > >>>>> However,
> > >>>>>>>>> I am wondering if starting with offset order might be the
> better
> > >>>>> option
> > >>>>>>>>> as it seems to align more with what we do so far? So instead of
> > >>>>> storing
> > >>>>>>>>> record ordered by timestamp, we can just store them ordered by
> > >>> offset,
> > >>>>>>>>> and still "poll" from the buffer based on the head records
> > >>> timestamp.
> > >>>>> Or
> > >>>>>>>>> would this complicate the implementation significantly?
> > >>>>>>>>>
> > >>>>>>>>> I also think it's ok to not "sync" stream-time between the
> > >>>>>>>>> table and
> > >>>>> the
> > >>>>>>>>> stream in this KIP, but we should consider doing this as a
> > >>>>>>>>> follow up
> > >>>>>>>>> change (not sure if we would need a KIP or not for a change
> this
> > >>>>> this).
> > >>>>>>>>>
> > >>>>>>>>> About increasing/decreasing grace period: what you describe
> make
> > >>> sense
> > >>>>>>>>> to me. If decreased, the next record would just trigger
> emitting
> > a
> > >>> lot
> > >>>>>>>>> of records, and for increase, the buffer would just need to
> "fill
> > >>> up"
> > >>>>>>>>> again. For reprocessing getting a different result with a
> > >>>>>>>>> different
> > >>>>>>>>> grace period is expected, so that's ok IMHO. -- There seems to
> be
> > >>> one
> > >>>>>>>>> special corner case: grace period zero. For this case, we
> > actually
> > >>>>> don't
> > >>>>>>>>> need any store, and the stream-side could be stateless. I think
> > it
> > >>> can
> > >>>>>>>>> have the same behavior, but if we want to "add / remove" the
> > store
> > >>>>>>>>> dynamically, we need to add specific code for it. For example,
> > >>>>>>>>> even
> > >>> if
> > >>>>>>>>> we start up with a grace period of zero, we would need to check
> > if
> > >>>>> there
> > >>>>>>>>> is a local store, and still emit everything in it, before we
> can
> > >>> ditch
> > >>>>>>>>> the store (not sure if that's even easily done at all). Or: we
> > >>>>>>>>> would
> > >>>>>>>>> need to have a store for _all_ cases, even if grace period is
> > zero
> > >>>>> (the
> > >>>>>>>>> store would be empty all the time though), to avoid super
> complex
> > >>>>> code?
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> -Matthias
> > >>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
> > >>>>>>>>>> Hi Walker,
> > >>>>>>>>>>
> > >>>>>>>>>> thanks for your responses. That makes sense. I guess there is
> > >>> always
> > >>>>>>>>>> the option to make the implementation more configurable later
> > on,
> > >>> if
> > >>>>>>>>>> users request it. Also thanks for the clarifications. From my
> > >>>>>>>>>> side,
> > >>>>>>>>>> the KIP is good to go.
> > >>>>>>>>>>
> > >>>>>>>>>> Cheers,
> > >>>>>>>>>> Lucas
> > >>>>>>>>>>
> > >>>>>>>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
> > >>>>>>>>>> <vi...@confluent.io.invalid> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Thanks for the updates, Walker! Looks great, though I do
> have a
> > >>>>> couple
> > >>>>>>>>>>> questions about the latest updates:
> > >>>>>>>>>>>
> > >>>>>>>>>>>         1. The new example says that without stream-side
> > >>>>>>>>>>> buffering,
> > >>>>> "ex"
> > >>>>>>> and
> > >>>>>>>>>>>         "fy" are possible join results. How could those join
> > >>> results
> > >>>>>>>>> happen? The
> > >>>>>>>>>>>         example versioned table suggests that table record
> > >>>>>>>>>>> "x" has
> > >>>>>>>>> timestamp 2, and
> > >>>>>>>>>>>         table record "y" has timestamp 3. If stream record
> > >>>>>>>>>>> "e" has
> > >>>>>>>>> timestamp 1,
> > >>>>>>>>>>>         then it can never be joined against record "x", and
> > >>> similarly
> > >>>>> for
> > >>>>>>>>> stream
> > >>>>>>>>>>>         record "f" with timestamp 2 being joined against "y".
> > >>>>>>>>>>>         2. I see in your replies above that "If table is not
> > >>>>>>> materialized it
> > >>>>>>>>>>>         will materialize it as versioned" but I don't see
> this
> > >>> called
> > >>>>> out
> > >>>>>>>>> in the
> > >>>>>>>>>>>         KIP -- seems worth calling out. Also, what will the
> > >>>>>>>>>>> history
> > >>>>>>>>> retention for
> > >>>>>>>>>>>         the versioned table be? Will it be the same as the
> join
> > >>> grace
> > >>>>>>>>> period, or
> > >>>>>>>>>>>         will it be greater?
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> And some additional thoughts:
> > >>>>>>>>>>>
> > >>>>>>>>>>> Sounds like there are a few things users should watch out for
> > >>>>>>>>>>> when
> > >>>>>>>>> enabling
> > >>>>>>>>>>> the stream-side buffer:
> > >>>>>>>>>>>
> > >>>>>>>>>>>         - Records will get "stuck" if there are no newer
> > >>>>>>>>>>> records to
> > >>>>>>> advance
> > >>>>>>>>>>>         stream time.
> > >>>>>>>>>>>         - If there are large gaps between the timestamps of
> > >>>>> stream-side
> > >>>>>>>>> records,
> > >>>>>>>>>>>         then it's possible that versioned store history
> > >>>>>>>>>>> retention
> > >>> will
> > >>>>>>> have
> > >>>>>>>>> expired
> > >>>>>>>>>>>         by the time a record is evicted from the join buffer,
> > >>> leading
> > >>>>> to
> > >>>>>>> a
> > >>>>>>>>> join
> > >>>>>>>>>>>         "miss." For example, if the join grace period and
> table
> > >>>>> history
> > >>>>>>>>> retention
> > >>>>>>>>>>>         are both 10, and records come in the order:
> > >>>>>>>>>>>
> > >>>>>>>>>>>         table side t0 with ts=0
> > >>>>>>>>>>>         stream side s1 with ts=1 <-- enters buffer
> > >>>>>>>>>>>         table side t10 with ts=10
> > >>>>>>>>>>>         table side t20 with ts=20
> > >>>>>>>>>>>         stream side s21 with ts=21 <-- evicts record s1 from
> > >>> buffer,
> > >>>>> but
> > >>>>>>>>>>>         versioned store no longer contains data for ts=1 due
> to
> > >>>>> history
> > >>>>>>>>> retention
> > >>>>>>>>>>>         having elapsed
> > >>>>>>>>>>>
> > >>>>>>>>>>>         This will result in the join result (s1, null) even
> > >>>>>>>>>>> though
> > >>> it
> > >>>>>>>>> should've
> > >>>>>>>>>>>         been (s1, t0), due to t0 having been expired from the
> > >>>>> versioned
> > >>>>>>>>> store
> > >>>>>>>>>>>         already.
> > >>>>>>>>>>>         - Out-of-order records from the stream-side will be
> > >>> reordered,
> > >>>>>>> and
> > >>>>>>>>> late
> > >>>>>>>>>>>         records will be dropped.
> > >>>>>>>>>>>
> > >>>>>>>>>>> I don't think any of these are reasons to not go forward with
> > >>>>>>>>>>> this
> > >>>>>>> KIP,
> > >>>>>>>>> but
> > >>>>>>>>>>> it'd be good to call them out in the eventual documentation
> to
> > >>>>>>> decrease
> > >>>>>>>>> the
> > >>>>>>>>>>> chance users get tripped up.
> > >>>>>>>>>>>
> > >>>>>>>>>>>> We could maybe do an improvement later to advance stream
> time
> > >>> from
> > >>>>>>>>> table
> > >>>>>>>>>>> side as well, but that might be debatable as we might get
> more
> > >>> late
> > >>>>>>>>> records.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Yes, the likelihood of late records increases but also the
> > >>>>> likelihood
> > >>>>>>> of
> > >>>>>>>>>>> "join misses" due to versioned store history retention having
> > >>>>> elapsed
> > >>>>>>>>>>> decreases, which feels important for certain use cases.
> Either
> > >>> way,
> > >>>>>>>>> agreed
> > >>>>>>>>>>> that it can be a discussion for the future as incorporating
> > this
> > >>>>> would
> > >>>>>>>>>>> substantially complicate the implementation.
> > >>>>>>>>>>>
> > >>>>>>>>>>> Also a couple nits:
> > >>>>>>>>>>>
> > >>>>>>>>>>>         - The KIP currently says "We recently added versioned
> > >>> tables
> > >>>>>>> which
> > >>>>>>>>> allow
> > >>>>>>>>>>>         the table side of the a join [...] but it is not
> taken
> > >>>>> advantage
> > >>>>>>> of
> > >>>>>>>>> in
> > >>>>>>>>>>>         joins," but this doesn't seem true? If the table of a
> > >>>>>>> stream-table
> > >>>>>>>>> join is
> > >>>>>>>>>>>         versioned, then the DSL's stream-table join processor
> > >>>>>>>>>>> will
> > >>>>>>>>> automatically
> > >>>>>>>>>>>         perform timestamped lookups into the table, in order
> to
> > >>> take
> > >>>>>>>>> advantage of
> > >>>>>>>>>>>         the new timestamp-aware store to provide better join
> > >>>>> semantics.
> > >>>>>>>>>>>         - The KIP mentions "grace period" for versioned
> > >>>>>>>>>>> stores in a
> > >>>>>>> number
> > >>>>>>>>> of
> > >>>>>>>>>>>         places but I think you actually mean "history
> > >>>>>>>>>>> retention"?
> > >>> The
> > >>>>> two
> > >>>>>>>>> happen to
> > >>>>>>>>>>>         be the same today (it is not an option for users to
> > >>> configure
> > >>>>> the
> > >>>>>>>>> two
> > >>>>>>>>>>>         separately) but this need not be true in the future.
> > >>> "History
> > >>>>>>>>> retention"
> > >>>>>>>>>>>         governs how far back in time reads may occur, which
> > >>>>>>>>>>> is the
> > >>>>>>> relevant
> > >>>>>>>>>>>         parameter for performing lookups as part of the
> > >>> stream-table
> > >>>>>>> join.
> > >>>>>>>>> "Grace
> > >>>>>>>>>>>         period" in the context of versioned stores refers to
> > how
> > >>> far
> > >>>>> back
> > >>>>>>>>> in time
> > >>>>>>>>>>>         out-of-order writes may occur, which probably isn't
> > >>> directly
> > >>>>>>>>> relevant for
> > >>>>>>>>>>>         introducing a stream-side buffer, though it's also
> > >>>>>>>>>>> possible
> > >>>>> I've
> > >>>>>>>>> overlooked
> > >>>>>>>>>>>         something. (As a bonus, switching from "table grace
> > >>> period" in
> > >>>>>>> the
> > >>>>>>>>> KIP to
> > >>>>>>>>>>>         "table history retention" also helps to
> > >>>>>>>>>>> clarify/distinguish
> > >>>>> that
> > >>>>>>>>> it's a
> > >>>>>>>>>>>         different parameter from the "join grace period,"
> > >>>>>>>>>>> which I
> > >>>>> could
> > >>>>>>> see
> > >>>>>>>>> being
> > >>>>>>>>>>>         confusing to readers. :) )
> > >>>>>>>>>>>
> > >>>>>>>>>>>
> > >>>>>>>>>>> Cheers,
> > >>>>>>>>>>> Victoria
> > >>>>>>>>>>>
> > >>>>>>>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
> > >>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
> > >>>>>>>>>>>
> > >>>>>>>>>>>> Hey all,
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Thanks for the comments, they gave me a lot to think about.
> > >>>>>>>>>>>> I'll
> > >>>>> try
> > >>>>>>> to
> > >>>>>>>>>>>> address them all inorder. I have made some updates to the
> kip
> > >>>>> related
> > >>>>>>>>> to
> > >>>>>>>>>>>> them, but I mention where below.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Lucas
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Good idea about the example. I added a simple one.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1) I have thought about including options for the underlying
> > >>> buffer
> > >>>>>>>>>>>> configuration. One of which might be adding an in memory
> > >>>>>>>>>>>> option.
> > >>> My
> > >>>>>>>>> biggest
> > >>>>>>>>>>>> concern is about the semantic guarantees. This isn't like
> > >>> suppress
> > >>>>> or
> > >>>>>>>>> with
> > >>>>>>>>>>>> windows where producing incomplete results is repetitively
> > >>>>> harmless.
> > >>>>>>>>> Here
> > >>>>>>>>>>>> we would be possibly producing incorrect results. I also
> would
> > >>> like
> > >>>>>>> to
> > >>>>>>>>> keep
> > >>>>>>>>>>>> the interface changes as simple as I can. Making more than
> > this
> > >>>>>>> change
> > >>>>>>>>> to
> > >>>>>>>>>>>> Joined I feel could make this more complicated than it needs
> > to
> > >>> be.
> > >>>>>>> If
> > >>>>>>>>> we
> > >>>>>>>>>>>> really want to I could see adding a grace() option with a
> > >>>>>>> BufferConifg
> > >>>>>>>>> in
> > >>>>>>>>>>>> there or something, but I would rather not.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 2) The buffer will be independent of if the table is
> > >>>>>>>>>>>> versioned or
> > >>>>>>> not.
> > >>>>>>>>> If
> > >>>>>>>>>>>> table is not materialized it will materialize it as
> > >>>>>>>>>>>> versioned. It
> > >>>>>>> might
> > >>>>>>>>>>>> make sense to do a follow up kip where we force the
> retention
> > >>>>> period
> > >>>>>>>>> of
> > >>>>>>>>>>>> the versioned to be greater than whatever the max of the
> > stream
> > >>>>>>> buffer
> > >>>>>>>>> is.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Victoria
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1) Yes, records will exit in timestamp order not in offset
> > >>>>>>>>>>>> order.
> > >>>>>>>>>>>> 2) Late records will be dropped (Late as out of the grace
> > >>> period).
> > >>>>>>>>>     From my
> > >>>>>>>>>>>> understanding that is the point of a grace period, no?
> Doesn't
> > >>> the
> > >>>>>>> same
> > >>>>>>>>>>>> thing happen with versioned stores?
> > >>>>>>>>>>>> 3) The segment store already has an observed stream time, we
> > >>>>> advance
> > >>>>>>>>> based
> > >>>>>>>>>>>> on that. That should only advance based on records that
> > >>>>>>>>>>>> enter the
> > >>>>>>>>> store. So
> > >>>>>>>>>>>> yes, only stream side records. We could maybe do an
> > improvement
> > >>>>> later
> > >>>>>>>>> to
> > >>>>>>>>>>>> advance stream time from table side as well, but that might
> be
> > >>>>>>>>> debatable as
> > >>>>>>>>>>>> we might get more late records. Anyways I would rather have
> > >>>>>>>>>>>> that
> > >>>>> as a
> > >>>>>>>>>>>> separate discussion.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> in memory option? We can do that, for the buffer I plan to
> use
> > >>> the
> > >>>>>>>>>>>> TimeOrderedKeyValueBuffer interface which already has an in
> > >>> memory
> > >>>>>>>>>>>> implantation, so it would be simple.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I said more in my answer to Lucas's question. The concern I
> > >>>>>>>>>>>> have
> > >>>>> with
> > >>>>>>>>>>>> buffer configs or in memory is complicating the interface.
> > Also
> > >>>>>>>>> semantic
> > >>>>>>>>>>>> guarantees but in memory shouldn't effect that
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Matthias
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 1) fixed out of order vs late terminology in the kip.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 2) I was referring to having a stream. So after this kip we
> > can
> > >>>>> have
> > >>>>>>> a
> > >>>>>>>>>>>> buffered stream or a normal one. For the table we can use a
> > >>>>> versioned
> > >>>>>>>>> table
> > >>>>>>>>>>>> or a normal table.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 3 Good call out. I clarified this as "If the table side
> uses a
> > >>>>>>>>> materialized
> > >>>>>>>>>>>> version store, it can store multiple versions of each record
> > >>> within
> > >>>>>>> its
> > >>>>>>>>>>>> defined grace period." and modified the rest of the
> paragraph
> > a
> > >>>>> bit.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 4) I get the preserving off offset ordering, but if the
> > >>>>>>>>>>>> stream is
> > >>>>>>>>> buffered
> > >>>>>>>>>>>> to join on timestamp instead of offset doesn't it already
> seem
> > >>> like
> > >>>>>>> we
> > >>>>>>>>> care
> > >>>>>>>>>>>> more about time in this case?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> If we end up adding more options it might make sense to do
> > >>>>>>>>>>>> this.
> > >>>>>>> Maybe
> > >>>>>>>>>>>> offset order processing can be a follow up?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> I'll add a section for this in Rejected Alternatives. I
> > >>>>>>>>>>>> think it
> > >>>>>>> makes
> > >>>>>>>>>>>> sense to do something like this but maybe in a follow up.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> 5) I hadn't thought about this. I suppose if they changed
> > >>>>>>>>>>>> this in
> > >>>>> an
> > >>>>>>>>>>>> upgrade the next record would either evict a lot of records
> > (if
> > >>> the
> > >>>>>>>>> grace
> > >>>>>>>>>>>> period decreased) or there would be a pause until the new
> > grace
> > >>>>>>> period
> > >>>>>>>>>>>> reached. Increasing is a bit more problematic, especially if
> > >>>>>>>>>>>> the
> > >>>>>>> table
> > >>>>>>>>>>>> grace period and retention time stays the same. If the data
> is
> > >>>>>>>>> reprocessed
> > >>>>>>>>>>>> after a change like that then there would be different
> > results,
> > >>>>> but I
> > >>>>>>>>> feel
> > >>>>>>>>>>>> like that would be expected after such a change.
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> What do you think should happen?
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Hopefully this answers your questions!
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> Walker
> > >>>>>>>>>>>>
> > >>>>>>>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <
> > >>> mjsax@apache.org>
> > >>>>>>>>> wrote:
> > >>>>>>>>>>>>
> > >>>>>>>>>>>>> Thanks for the KIP! Also some question/comments from my
> side:
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 10) Notation: you use the term "late data" but I think you
> > >>>>>>>>>>>>> mean
> > >>>>>>>>>>>>> out-of-order. We reserve the term "late" to records that
> > >>>>>>>>>>>>> arrive
> > >>>>>>> after
> > >>>>>>>>>>>>> grace period passed, and thus, "late == out-of-order data
> > that
> > >>> is
> > >>>>>>>>>>>> dropped".
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 20) "There is only one option from the stream side and only
> > >>>>> recently
> > >>>>>>>>> is
> > >>>>>>>>>>>>> there a second option on the table side."
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> What are those options? Victoria already asked about the
> > table
> > >>>>> side,
> > >>>>>>>>> but
> > >>>>>>>>>>>>> I am also not sure what option you mean for the stream
> side?
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 30) "If the table side uses a materialized version store
> the
> > >>> value
> > >>>>>>> is
> > >>>>>>>>>>>>> the latest by stream time rather than by offset within its
> > >>> defined
> > >>>>>>>>> grace
> > >>>>>>>>>>>>> period."
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The phrase "the value is the latest by stream time" is
> > >>>>>>>>>>>>> confusing
> > >>>>> --
> > >>>>>>> in
> > >>>>>>>>>>>>> the end, a versioned stores multiple versions, not just
> one.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 40) I am also wondering about ordering. In general, KS
> > >>>>>>>>>>>>> tries to
> > >>>>>>>>> preserve
> > >>>>>>>>>>>>> offset-order during processing (with some exception, when
> > >>>>>>>>>>>>> offset
> > >>>>>>> order
> > >>>>>>>>>>>>> preservation is not clearly defined). Given that the
> > >>>>>>>>>>>>> stream-side
> > >>>>>>>>> buffer
> > >>>>>>>>>>>>> is really just a "linear buffer", we could easily preserve
> > >>>>>>>>> offset-order.
> > >>>>>>>>>>>>> But I also see a benefit of re-ordering and emitting
> > >>> out-of-order
> > >>>>>>> data
> > >>>>>>>>>>>>> right away when read (instead of blocking them behind
> > in-order
> > >>>>>>> records
> > >>>>>>>>>>>>> that are not ready yet). -- It might even be a possibility,
> > to
> > >>> let
> > >>>>>>>>> users
> > >>>>>>>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets"
> (name
> > >>> just
> > >>>>> a
> > >>>>>>>>>>>>> placeholder).
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> The KIP should explain this in more detail and also discuss
> > >>>>>>> different
> > >>>>>>>>>>>>> options and mention them in "Rejected alternatives" in case
> > we
> > >>>>> don't
> > >>>>>>>>>>>>> want to include them.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> 50) What happens when users change the grace period?
> > >>>>>>>>>>>>> Especially,
> > >>>>>>> when
> > >>>>>>>>>>>>> they turn it on/off (but also increasing/decreasing is an
> > >>>>>>> interesting
> > >>>>>>>>>>>>> point)? I think we should try to support this if possible;
> > the
> > >>>>>>>>>>>>> "Compatibility" section needs to cover switching on/off in
> > >>>>>>>>>>>>> more
> > >>>>>>>>> detail.
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> -Matthias
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
> > >>>>>>>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> A few clarifications:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 1. Is the order that records exit the buffer in
> > >>>>>>>>>>>>>> necessarily the
> > >>>>>>> same
> > >>>>>>>>> as
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>> order that records enter the buffer in, or no? Based on
> the
> > >>>>>>>>> description
> > >>>>>>>>>>>>> in
> > >>>>>>>>>>>>>> the KIP, it sounds like the answer is no, i.e., records
> will
> > >>> exit
> > >>>>>>> the
> > >>>>>>>>>>>>>> buffer in increasing timestamp order, which means that
> > >>>>>>>>>>>>>> they may
> > >>>>> be
> > >>>>>>>>>>>>> ordered
> > >>>>>>>>>>>>>> (even for the same key) compared to the input order.
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 2. What happens if the join grace period is nonzero, and a
> > >>>>>>>>> stream-side
> > >>>>>>>>>>>>>> record arrives with a timestamp that is older than the
> > >>>>>>>>>>>>>> current
> > >>>>>>> stream
> > >>>>>>>>>>>>> time
> > >>>>>>>>>>>>>> minus the grace period? Will this record trigger a join
> > >>>>>>>>>>>>>> result,
> > >>>>> or
> > >>>>>>>>> will
> > >>>>>>>>>>>>> it
> > >>>>>>>>>>>>>> be dropped? Based on the description for what happens when
> > >>>>>>>>>>>>>> the
> > >>>>> join
> > >>>>>>>>>>>> grace
> > >>>>>>>>>>>>>> period is set to zero, it sounds like the late record will
> > be
> > >>>>>>>>> dropped,
> > >>>>>>>>>>>>> even
> > >>>>>>>>>>>>>> if the join grace period is nonzero. Is that true?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> 3. What could cause stream time to advance, for purposes
> of
> > >>>>>>> removing
> > >>>>>>>>>>>>>> records from the join buffer? For example, will new
> records
> > >>>>>>> arriving
> > >>>>>>>>> on
> > >>>>>>>>>>>>> the
> > >>>>>>>>>>>>>> table side of the join cause stream time to advance? From
> > the
> > >>> KIP
> > >>>>>>> it
> > >>>>>>>>>>>>> sounds
> > >>>>>>>>>>>>>> like only stream-side records will advance stream time --
> > >>>>>>>>>>>>>> does
> > >>>>> that
> > >>>>>>>>>>>> mean
> > >>>>>>>>>>>>>> that the join processor itself will have to track this
> > stream
> > >>>>> time?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> Also +1 to Lucas's question about what options will be
> > >>> available
> > >>>>>>> for
> > >>>>>>>>>>>>>> configuring the join buffer. Will users have the option to
> > >>> choose
> > >>>>>>>>>>>> whether
> > >>>>>>>>>>>>>> they want the buffer to be in-memory vs persistent?
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> - Victoria
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> > >>>>>>>>>>>>>> <lb...@confluent.io.invalid> wrote:
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> HI Walker,
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> thanks for the KIP! We definitely need this. I have two
> > >>>>> questions:
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>        - Have you considered allowing the customization
> > >>>>>>>>>>>>>>> of the
> > >>>>>>>>> underlying
> > >>>>>>>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets
> > you
> > >>>>>>>>> customize
> > >>>>>>>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would
> it
> > >>> make
> > >>>>>>>>> sense
> > >>>>>>>>>>>>>>> for `Joined` to have this as well? I can imagine one may
> > >>>>>>>>>>>>>>> want
> > >>> to
> > >>>>>>>>> limit
> > >>>>>>>>>>>>>>> the number of records in the buffer, for example. If we
> hit
> > >>> the
> > >>>>>>>>>>>>>>> maximum, the only option would be to drop semantic
> > >>>>>>>>>>>>>>> guarantees,
> > >>>>> but
> > >>>>>>>>>>>>>>> users may still want to do this.
> > >>>>>>>>>>>>>>>        - With "second option on the table side" you are
> > >>> referring
> > >>>>> to
> > >>>>>>>>>>>>>>> versioned tables, right? Will the buffer on the stream
> side
> > >>>>> behave
> > >>>>>>>>> any
> > >>>>>>>>>>>>>>> different whether the table side is versioned or not?
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Finally, I think a simple example in the motivation
> section
> > >>>>> could
> > >>>>>>>>> help
> > >>>>>>>>>>>>>>> non-experts understand the KIP.
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> Best,
> > >>>>>>>>>>>>>>> Lucas
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> > >>>>>>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Hello everybody,
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> I have a stream proposal to improve the stream table
> > >>>>>>>>>>>>>>>> join by
> > >>>>>>>>> adding a
> > >>>>>>>>>>>>>>> grace
> > >>>>>>>>>>>>>>>> period and buffer to the stream side of the join to
> allow
> > >>>>>>>>> processing
> > >>>>>>>>>>>> in
> > >>>>>>>>>>>>>>>> timestamp order matching the recent improvements of the
> > >>>>> versioned
> > >>>>>>>>>>>>> tables.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> Please take a look here <
> > >>>>>>>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
> > >>>>>>>>>>>>>>> and
> > >>>>>>>>>>>>>>>> share your thoughts.
> > >>>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>>> best,
> > >>>>>>>>>>>>>>>> Walker
> > >>>>>>>>>>>>>>>
> > >>>>>>>>>>>>>>
> > >>>>>>>>>>>>>
> > >>>>>>>>>>>>
> > >>>>>>>>>
> > >>>>>>>>
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

Good Point Victoria. I just removed the compacted topic mention from the
KIP. I agree with Burno about using a normal topic and deleting records
that have been processed.

On Tue, Jun 6, 2023 at 2:28 AM Bruno Cadonna <ca...@apache.org> wrote:

> Hi,
>
> another idea that came to my mind. Instead of using a compacted topic,
> the buffer could use a non-compacted topic and regularly delete records
> before a given offset as Streams does for repartition topics.
>
> Best,
> Bruno
>
> On 05.06.23 21:48, Bruno Cadonna wrote:
> > Hi Victoria,
> >
> > that is a good point!
> >
> > I think, the topic needs to be a compacted topic to be able to get rid
> > of records that are evicted from the buffer. So the key might be
> > something with the key, the timestamp, and a sequence number to
> > distinguish between records with the same key and same timestamp.
> >
> > Just an idea! Maybe Walker comes up with something better.
> >
> > Best,
> > Bruno
> >
> > On 05.06.23 20:38, Victoria Xia wrote:
> >> Hi Walker,
> >>
> >> Thanks for the latest updates! The KIP looks great. Just one question
> >> about
> >> the changelog topic for the join buffer: The KIP says "When a failure
> >> occurs the buffer will try to recover from an OffsetCheckpoint if
> >> possible.
> >> If not it will reload the buffer from a compacted change-log topic."
> This
> >> is a new changelog topic that will be introduced specifically for the
> >> join
> >> buffer, right? Why is the changelog topic compacted? What are the keys?
> I
> >> am confused because the buffer contains records from the stream-side
> >> of the
> >> join, for which multiple records with the same key should be treated as
> >> separate updates will all must be tracked in the buffer, rather than
> >> updates which replace each other.
> >>
> >> Thanks,
> >> Victoria
> >>
> >> On Mon, Jun 5, 2023 at 1:47 AM Bruno Cadonna <ca...@apache.org>
> wrote:
> >>
> >>> Hi Walker,
> >>>
> >>> Thanks once more for the updates to the KIP!
> >>>
> >>> Do you also plan to expose metrics for the buffer?
> >>>
> >>> Best,
> >>> Bruno
> >>>
> >>> On 02.06.23 17:16, Walker Carlson wrote:
> >>>> Hello Bruno,
> >>>>
> >>>> I think this covers your questions. Let me know what you think
> >>>>
> >>>> 2.
> >>>> We can use a changelog topic. I think we can treat it like any other
> >>> store
> >>>> and recover in the usual manner. Also implementation is on disk
> >>>>
> >>>> 3.
> >>>> The description is in the public interfaces description. I will copy
> it
> >>>> into the proposed changes as well.
> >>>>
> >>>> This is a bit of an implementation detail that I didn't want to add
> >>>> into
> >>>> the kip, but the record will be added to the buffer to keep the stream
> >>> time
> >>>> consistent, it will just be ejected immediately. If of course if this
> >>>> causes performance issues we will skip this step and track stream time
> >>>> separately. I will update the kip to say that stream time advances
> >>>> when a
> >>>> stream record enters the node.
> >>>>
> >>>> Also, yes, updated.
> >>>>
> >>>> 5.
> >>>> No there is no difference right now, everything gets processed as it
> >>> comes
> >>>> in and tries to find a record for its time stamp.
> >>>>
> >>>> Walker
> >>>>
> >>>> On Fri, Jun 2, 2023 at 6:41 AM Bruno Cadonna <ca...@apache.org>
> >>>> wrote:
> >>>>
> >>>>> Hi Walker,
> >>>>>
> >>>>> Thanks for the updates!
> >>>>>
> >>>>> 2.
> >>>>> It is still not clear to me how a failure is handled. I do not
> >>>>> understand what you mean by "recover from an OffsetCheckpoint".
> >>>>>
> >>>>> My understanding is that the buffer needs to be replicated into its
> >>>>> own
> >>>>> Kafka topic. The input topic is not enough. The offset of a record is
> >>>>> added to the offsets to commit once the record is streamed through
> the
> >>>>> subtopology. That means once the record is added to the buffer its
> >>>>> offset is added to the offsets to commit -- independently of
> >>>>> whether the
> >>>>> record was evicted from the buffer and sent to the join node or not.
> >>>>> Now, let's assume the following scenario
> >>>>> 1. a record is read from the input topic and added to the buffer, but
> >>>>> not evicted to be processed by the join node.
> >>>>> 2. When the processing of the subtopology finishes the offset of the
> >>>>> record is added to the offsets to commit.
> >>>>> 3. A commit happens.
> >>>>> 4. A failure happens
> >>>>>
> >>>>> After the failure the buffer is empty but the record will not be read
> >>>>> anymore from the input topic since its offset has been already
> >>>>> committed. The record is lost.
> >>>>> One solution to avoid the loss is to recreate the buffer from a
> >>>>> compacted Kafka topic as we do for suppression buffers. I do not
> >>>>> think,
> >>>>> we need any offset checkpoint here since, we keep the buffer in
> >>>>> memory,
> >>>>> right? Or do you plan to back the buffer with a persistent store?
> Even
> >>>>> in that case, a compacted Kafka topic would be needed.
> >>>>>
> >>>>>
> >>>>> 3.
> >>>>>    From the KIP it is still not clear to me what happens if a
> >>>>> record is
> >>>>> outside of the grace period. I guess the record that falls outside of
> >>>>> the grace period will not be added to the buffer, but will be send to
> >>>>> the join node. Since it is outside of the grace period it will also
> >>>>> not
> >>>>> increase stream time and it will not trigger an eviction. Also the
> >>>>> head
> >>>>> of the buffer will not contain a record that needs to be evicted
> since
> >>>>> the the timestamp of the head record will be within the interval
> >>>>> stream
> >>>>> time minus grace period. Is this correct? Please add such a
> >>>>> description
> >>>>> to the KIP.
> >>>>> Furthermore, I think there is a mistake in the text:
> >>>>> "... will dequeue when the record timestamp is greater than stream
> >>>>> time
> >>>>> plus the grace period". I guess that should be "... will dequeue when
> >>>>> the record timestamp is less than (or equal?) stream time minus the
> >>>>> grace period"
> >>>>>
> >>>>>
> >>>>> 5.
> >>>>> What is the difference between not setting the grace period and
> >>>>> setting
> >>>>> it to zero? If there is a difference, why is there a difference?
> >>>>>
> >>>>>
> >>>>> Best,
> >>>>> Bruno
> >>>>>
> >>>>>
> >>>>> On 01.06.23 23:58, Walker Carlson wrote:
> >>>>>> Hey Bruno thanks for the feedback.
> >>>>>>
> >>>>>> 1)
> >>>>>> I will add this to the kip, but stream time only advances as the
> when
> >>> the
> >>>>>> buffer receives a new record.
> >>>>>>
> >>>>>> 2)
> >>>>>> You are correct, I will add a failure section on to the kip. Since
> >>>>>> the
> >>>>>> records wont change in the buffer from when they are read from the
> >>> topic
> >>>>>> they are replicated already.
> >>>>>>
> >>>>>> 3)
> >>>>>> I see that I'm out voted on the dropping of records thing. We will
> >>>>>> pass
> >>>>>> them on and try to join them if possible. This might cause some null
> >>>>>> results, but increasing the table history retention should help
> that.
> >>>>>>
> >>>>>> 4)
> >>>>>> I can add some on the kip. But its pretty directly adding whatever
> >>>>>> the
> >>>>>> grace period is to the latency. I don't see a way around it.
> >>>>>>
> >>>>>> Walker
> >>>>>>
> >>>>>> On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org>
> >>> wrote:
> >>>>>>
> >>>>>>> Hi Walker,
> >>>>>>>
> >>>>>>> thanks for the KIP!
> >>>>>>>
> >>>>>>> Here my feedback:
> >>>>>>>
> >>>>>>> 1.
> >>>>>>> It is still not clear to me when stream time for the buffer
> >>>>>>> advances.
> >>>>>>> What is the event that let the stream time advance? In the
> >>> discussion, I
> >>>>>>> do not understand what you mean by "The segment store already has
> an
> >>>>>>> observed stream time, we advance based on that. That should only
> >>> advance
> >>>>>>> based on records that enter the store." Where does this segment
> >>>>>>> store
> >>>>>>> come from? Anyways, I think it would be great to also state how
> >>>>>>> stream
> >>>>>>> time advances in the KIP.
> >>>>>>>
> >>>>>>> 2.
> >>>>>>> How does the buffer behave in case of a failure? I think I
> >>>>>>> understand
> >>>>>>> that the buffer will use an implementation of
> >>> TimeOrderedKeyValueBuffer
> >>>>>>> and therefore the records in the buffer will be replicated to a
> >>>>>>> topic
> >>> in
> >>>>>>> Kafka, but I am not completely sure. Could you elaborate on this in
> >>> the
> >>>>>>> KIP?
> >>>>>>>
> >>>>>>> 3.
> >>>>>>> I agree with Matthias about dropping late records. We use grace
> >>> periods
> >>>>>>> in scenarios where we records are grouped like in windowed
> >>> aggregations
> >>>>>>> and windowed joins. The stream buffer you propose does not really
> >>> group
> >>>>>>> any records. It rather delays records and reorders them. I am not
> >>>>>>> sure
> >>>>>>> if grace period is the right naming/concept to apply here.
> >>>>>>> Instead of
> >>>>>>> dropping records that fall outside of the buffer's time interval
> the
> >>>>>>> join should skip the buffer and try to join the record
> >>>>>>> immediately. In
> >>>>>>> the end, a stream-table join is a unwindowed join, i.e., no
> grouping
> >>> is
> >>>>>>> applied to the records.
> >>>>>>> What do you and other folks think about this proposal?
> >>>>>>>
> >>>>>>> 4.
> >>>>>>> How does the proposed buffer, affects processing latency? Could you
> >>>>>>> please add some words about this to the KIP?
> >>>>>>>
> >>>>>>>
> >>>>>>> Best,
> >>>>>>> Bruno
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 31.05.23 01:49, Walker Carlson wrote:
> >>>>>>>> Thanks for all the additional comments. I will either address them
> >>> here
> >>>>>>> or
> >>>>>>>> update the kip accordingly.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> I mentioned a follow kip to add extra features before and in the
> >>>>>>> responses.
> >>>>>>>> I will try to briefly summarize what options and optimizations I
> >>>>>>>> plan
> >>>>> to
> >>>>>>>> include. If a concern is not covered in this list I for sure talk
> >>> about
> >>>>>>> it
> >>>>>>>> below.
> >>>>>>>>
> >>>>>>>> * Allowing non versioned tables to still use the stream buffer
> >>>>>>>> * Automatically materializing tables instead of forcing the user
> to
> >>> do
> >>>>> it
> >>>>>>>> * Configurable for in memory buffer
> >>>>>>>> * Order the records in offset order or in time order
> >>>>>>>> * Non memory use buffer (offset order, delayed pull from stream.)
> >>>>>>>> * Time synced between stream and table side (maybe)
> >>>>>>>> * Do not drop late records and process them as they come in
> >>>>>>>> instead.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> First, Victoria.
> >>>>>>>>
> >>>>>>>> 1) (One of your nits covers this, but you are correct it doesn't
> >>>>>>>> make
> >>>>>>>> sense. so I removed that part of the example.)
> >>>>>>>> For those examples with the "bad" join results I said without
> >>> buffering
> >>>>>>> the
> >>>>>>>> stream it would look like that, but that was incomplete. If the
> >>>>>>>> look
> >>> up
> >>>>>>> was
> >>>>>>>> simply looking at the latest version of the table when the stream
> >>>>> records
> >>>>>>>> came in then the results were possible. If we are using the
> >>>>>>>> point in
> >>>>> time
> >>>>>>>> lookup that versioned tables let us then you are correct the
> future
> >>>>>>> results
> >>>>>>>> are not possible.
> >>>>>>>>
> >>>>>>>> 2) I'll get to this later as Matthias brought up something
> related.
> >>>>>>>>
> >>>>>>>> To your additional thoughts, I agree that we need to call those
> >>> things
> >>>>>>> out
> >>>>>>>> in the documentation. I'm writing up a follow up kip with a lot of
> >>> the
> >>>>>>>> ideas we have discussed so that we can improve this feature beyond
> >>> the
> >>>>>>> base
> >>>>>>>> implementation if it's needed.
> >>>>>>>>
> >>>>>>>> I addressed the nits in the kip. I somehow missed the table stream
> >>>>> table
> >>>>>>>> join processor improvement, it makes your first question make a
> lot
> >>>>> more
> >>>>>>>> sense.  Table history retention is a much cleaner way to
> >>>>>>>> describe it.
> >>>>>>>>
> >>>>>>>> As to your mention of the syncing the time for the table and
> >>>>>>>> stream.
> >>>>>>>> Matthias mentioned that as well. I will address both here. I
> >>>>>>>> plan to
> >>>>>>> bring
> >>>>>>>> that up in the future, but for now we will leave it out. I
> >>>>>>>> suppose it
> >>>>>>> will
> >>>>>>>> be more useful after the table history retention is separable from
> >>> the
> >>>>>>>> table grace period.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> To address Matthias comments.
> >>>>>>>>
> >>>>>>>> You are correct by saying the in memory store shouldn't cause any
> >>>>>>> semantic
> >>>>>>>> concerns. My concern would be more with if we limited the number
> of
> >>>>>>> records
> >>>>>>>> on the buffer and what we would do if we hit said limits,
> (emitting
> >>>>> those
> >>>>>>>> records might be an issue, throwing an error and halting would
> >>>>>>>> not).
> >>> I
> >>>>>>>> think we can leave this discussion to the follow up kip along
> >>>>>>>> with a
> >>>>> few
> >>>>>>>> other options.
> >>>>>>>>
> >>>>>>>> I will go through your proposals now.
> >>>>>>>>
> >>>>>>>>       - don't support non-versioned KTables
> >>>>>>>>
> >>>>>>>> Sure, we can always expand this later on. Will include as part
> >>>>>>>> of the
> >>>>> of
> >>>>>>>> the improvement kip
> >>>>>>>>
> >>>>>>>>       - if grace period is added, users need to explicitly
> >>>>>>>> materialize
> >>>>> the
> >>>>>>>> table as version (either directly, or upstream. Upstream only
> works
> >>> if
> >>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> >>>>>>>>
> >>>>>>>> again, that works for me for now, if we find a use we can always
> >>>>>>>> add
> >>>>>>> later.
> >>>>>>>>
> >>>>>>>>       - the table's history retention time must be larger than the
> >>> grace
> >>>>>>>> period (should be easy to check at runtime, when we build the
> >>> topology)
> >>>>>>>>
> >>>>>>>> agreed
> >>>>>>>>
> >>>>>>>>       - because switching from non-versioned to version stores
> >>>>>>>> is not
> >>>>>>>> backward compatibly (cf KIP-914), users need to take care of this
> >>>>>>>> themselves, and this also implies that adding grace period is not
> a
> >>>>>>>> backward compatible change (even only if via indirect means)
> >>>>>>>>
> >>>>>>>> sure, this works
> >>>>>>>>
> >>>>>>>> As to the dropping of late records, I'm not sure. One one hand I
> >>>>>>>> like
> >>>>> not
> >>>>>>>> dropping things. But on the other I struggle to see how a user can
> >>>>> filter
> >>>>>>>> out late records that might have incomplete join results. The
> point
> >>> in
> >>>>>>> time
> >>>>>>>> look up will aggressively expire old data and if new data has been
> >>>>>>> replaced
> >>>>>>>> it will return null if outside of the retention. This seems like
> it
> >>>>> could
> >>>>>>>> corrupt the integrity of the join output. Seeing that we drop late
> >>>>>>> records
> >>>>>>>> on the table side as well I would think it makes sense to drop
> late
> >>>>>>> records
> >>>>>>>> on the stream buffer. I could be convinced otherwise I suppose, I
> >>> could
> >>>>>>> see
> >>>>>>>> adding this as an option in a follow up kip. It would be very
> >>>>>>>> easy to
> >>>>>>>> implement either way. For now unless no one else objects I'm
> >>>>>>>> going to
> >>>>>>> stick
> >>>>>>>> with dropping the records for the sake of getting this kip
> >>>>>>>> passed. It
> >>>>> is
> >>>>>>>> functionally a small change to make and we can update later if you
> >>> feel
> >>>>>>>> strongly about it.
> >>>>>>>>
> >>>>>>>> For the ordering. I have to say that it would be more
> >>>>>>>> complicated to
> >>>>>>>> implement it to be in offset order, if the goal it to get as
> >>>>>>>> many of
> >>>>> the
> >>>>>>>> records validly joined as possible. Because we would process as
> >>> things
> >>>>>>> left
> >>>>>>>> the buffer a sufficiency early enough record could hold up records
> >>> that
> >>>>>>>> would otherwise be valid past the table history retention. To fix
> >>> this
> >>>>> we
> >>>>>>>> could process by timestamp then store in a second queue and emit
> by
> >>>>>>> offset,
> >>>>>>>> but that would be a lot more complicated. If we didn't care
> >>>>>>>> about not
> >>>>>>>> missing some valid joins we could just have no store and pull from
> >>> the
> >>>>>>>> topic at a delay only caring about the timestamp of the next
> >>>>>>>> offset.
> >>>>> For
> >>>>>>>> now I want to stick with the timestamp ordering as it makes much
> >>>>>>>> more
> >>>>>>> sense
> >>>>>>>> to me, but would propose we add both of the other options I have
> >>>>>>>> laid
> >>>>> out
> >>>>>>>> here in the follow up kip.
> >>>>>>>>
> >>>>>>>> Lastly, I think having an empty store with zero grace period
> >>>>>>>> would be
> >>>>>>> super
> >>>>>>>> simple and not costly, so we might as well make it even if nothing
> >>> gets
> >>>>>>>> entered.
> >>>>>>>>
> >>>>>>>> I hope that address all your concerns,
> >>>>>>>>
> >>>>>>>> Walker
> >>>>>>>>
> >>>>>>>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mjsax@apache.org
> >
> >>>>>>> wrote:
> >>>>>>>>
> >>>>>>>>> Walker,
> >>>>>>>>>
> >>>>>>>>> thanks for the updates. The KIP itself reads fine (of course
> >>> Victoria
> >>>>>>>>> made good comments about some phrases), but there is a couple of
> >>>>> things
> >>>>>>>>> from your latest reply I don't understand, and that I still think
> >>> need
> >>>>>>>>> some more discussions.
> >>>>>>>>>
> >>>>>>>>> Lukas, asked about in-memory option and `WindowStoreSupplier` and
> >>> you
> >>>>>>>>> mention "semantic concerns". There should not be any semantic
> >>>>> difference
> >>>>>>>>> from the underlying buffer implementation, so I am not sure
> >>>>>>>>> what you
> >>>>>>>>> mean here (also the relationship to suppress() is unclear to me)?
> >>> -- I
> >>>>>>>>> am ok to not make it configurable for now. We can always do it
> >>>>>>>>> via a
> >>>>>>>>> follow up KIP, and keep interface changes limited for now.
> >>>>>>>>>
> >>>>>>>>> Does it really make sense to allow a grace period if the table is
> >>>>>>>>> non-versioned? You also say: "If table is not materialized it
> will
> >>>>>>>>> materialize it as versioned." -- What history retention time
> would
> >>> we
> >>>>>>>>> pick for this case (also asked by Victoria)? Or should we
> >>>>>>>>> rather not
> >>>>>>>>> support this and force the user to materialize the table
> >>>>>>>>> explicitly,
> >>>>> and
> >>>>>>>>> thus explicitly picking a history retention time? It's tradeoff
> >>>>> between
> >>>>>>>>> usability and guiding uses that there will be a significant
> impact
> >>> on
> >>>>>>>>> disk usage. There is also compatibility concerns: If the table is
> >>> not
> >>>>>>>>> explicitly materialized in the old program, we would already
> >>>>>>>>> need to
> >>>>>>>>> materialize it also in the old program (of course, we would use a
> >>>>>>>>> non-versioned store so far). Thus, if somebody adds a grace
> >>>>>>>>> period,
> >>> we
> >>>>>>>>> cannot just switch the store type, as it would be a breaking
> >>>>>>>>> change,
> >>>>>>>>> potentially required an application re-set, or following the
> >>>>>>>>> upgrade
> >>>>>>>>> path for versioned state stores, and also changing the program to
> >>>>>>>>> explicitly materialize using a versioned store. Also note, that
> we
> >>>>> might
> >>>>>>>>> not materialize the actual join table, but only an upstream
> table,
> >>> and
> >>>>>>>>> use `ValueGetter` to access the upstream data.
> >>>>>>>>>
> >>>>>>>>> To this end, as you already mentioned, history retention of the
> >>> table
> >>>>>>>>> should be at least grace period. You proposed to include this in
> a
> >>>>>>>>> follow up KIP, but I am wondering if it's a fundamental
> >>>>>>>>> requirement
> >>>>> and
> >>>>>>>>> thus we should put a check in place right away and reject an
> >>>>>>>>> invalid
> >>>>>>>>> configuration? (It always easier to lift restriction than to
> >>> introduce
> >>>>>>>>> them later.) This would also imply that a non-versioned table
> >>>>>>>>> cannot
> >>>>> be
> >>>>>>>>> supported, because it does not have a history retention that is
> >>> larger
> >>>>>>>>> than grace period, and maybe also answer the requirement about
> >>>>>>>>> materialization: as we already always materialize something on
> the
> >>>>>>>>> tablet side as non-versioned store right now, it seems
> >>>>>>>>> difficult to
> >>>>>>>>> migrate the store to a versioned store. Ie, it might be ok to
> push
> >>> the
> >>>>>>>>> burden onto the user and say: if you start using grace period,
> you
> >>>>> also
> >>>>>>>>> need to manually switch from non-versioned to versioned KTables.
> >>> Doing
> >>>>>>>>> stuff automatically under the hood if very complex for this
> >>>>>>>>> case, we
> >>>>> if
> >>>>>>>>> we push the burden onto the user, it might be ok to not
> complicate
> >>>>> this
> >>>>>>>>> KIP significantly.
> >>>>>>>>>
> >>>>>>>>> To summarize the last two paragraphs, I would propose to:
> >>>>>>>>>       - don't support non-versioned KTables
> >>>>>>>>>       - if grace period is added, users need to explicitly
> >>> materialize
> >>>>> the
> >>>>>>>>> table as version (either directly, or upstream. Upstream only
> >>>>>>>>> works
> >>> if
> >>>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> >>>>>>>>>       - the table's history retention time must be larger than
> the
> >>> grace
> >>>>>>>>> period (should be easy to check at runtime, when we build the
> >>>>> topology)
> >>>>>>>>>       - because switching from non-versioned to version stores
> >>>>>>>>> is not
> >>>>>>>>> backward compatibly (cf KIP-914), users need to take care of this
> >>>>>>>>> themselves, and this also implies that adding grace period is
> >>>>>>>>> not a
> >>>>>>>>> backward compatible change (even only if via indirect means)
> >>>>>>>>>
> >>>>>>>>> About dropping late records: wondering if we should never drop a
> >>>>>>>>> stream-side record for a left-join, even if it's late? In
> general,
> >>> one
> >>>>>>>>> thing I observed over the years is, that it's easier to keep
> stuff
> >>> and
> >>>>>>>>> let users filter explicitly downstream (or make it configurable),
> >>>>>>>>> instead of dropping pro-actively, because users have no good
> >>>>>>>>> way to
> >>>>>>>>> resurrect record that got already dropped.
> >>>>>>>>>
> >>>>>>>>> For ordering, sounds reasonable to me only start with one
> >>>>>>>>> implementation, and maybe make it configurable as a follow up.
> >>>>> However,
> >>>>>>>>> I am wondering if starting with offset order might be the better
> >>>>> option
> >>>>>>>>> as it seems to align more with what we do so far? So instead of
> >>>>> storing
> >>>>>>>>> record ordered by timestamp, we can just store them ordered by
> >>> offset,
> >>>>>>>>> and still "poll" from the buffer based on the head records
> >>> timestamp.
> >>>>> Or
> >>>>>>>>> would this complicate the implementation significantly?
> >>>>>>>>>
> >>>>>>>>> I also think it's ok to not "sync" stream-time between the
> >>>>>>>>> table and
> >>>>> the
> >>>>>>>>> stream in this KIP, but we should consider doing this as a
> >>>>>>>>> follow up
> >>>>>>>>> change (not sure if we would need a KIP or not for a change this
> >>>>> this).
> >>>>>>>>>
> >>>>>>>>> About increasing/decreasing grace period: what you describe make
> >>> sense
> >>>>>>>>> to me. If decreased, the next record would just trigger emitting
> a
> >>> lot
> >>>>>>>>> of records, and for increase, the buffer would just need to "fill
> >>> up"
> >>>>>>>>> again. For reprocessing getting a different result with a
> >>>>>>>>> different
> >>>>>>>>> grace period is expected, so that's ok IMHO. -- There seems to be
> >>> one
> >>>>>>>>> special corner case: grace period zero. For this case, we
> actually
> >>>>> don't
> >>>>>>>>> need any store, and the stream-side could be stateless. I think
> it
> >>> can
> >>>>>>>>> have the same behavior, but if we want to "add / remove" the
> store
> >>>>>>>>> dynamically, we need to add specific code for it. For example,
> >>>>>>>>> even
> >>> if
> >>>>>>>>> we start up with a grace period of zero, we would need to check
> if
> >>>>> there
> >>>>>>>>> is a local store, and still emit everything in it, before we can
> >>> ditch
> >>>>>>>>> the store (not sure if that's even easily done at all). Or: we
> >>>>>>>>> would
> >>>>>>>>> need to have a store for _all_ cases, even if grace period is
> zero
> >>>>> (the
> >>>>>>>>> store would be empty all the time though), to avoid super complex
> >>>>> code?
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> -Matthias
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
> >>>>>>>>>> Hi Walker,
> >>>>>>>>>>
> >>>>>>>>>> thanks for your responses. That makes sense. I guess there is
> >>> always
> >>>>>>>>>> the option to make the implementation more configurable later
> on,
> >>> if
> >>>>>>>>>> users request it. Also thanks for the clarifications. From my
> >>>>>>>>>> side,
> >>>>>>>>>> the KIP is good to go.
> >>>>>>>>>>
> >>>>>>>>>> Cheers,
> >>>>>>>>>> Lucas
> >>>>>>>>>>
> >>>>>>>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
> >>>>>>>>>> <vi...@confluent.io.invalid> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks for the updates, Walker! Looks great, though I do have a
> >>>>> couple
> >>>>>>>>>>> questions about the latest updates:
> >>>>>>>>>>>
> >>>>>>>>>>>         1. The new example says that without stream-side
> >>>>>>>>>>> buffering,
> >>>>> "ex"
> >>>>>>> and
> >>>>>>>>>>>         "fy" are possible join results. How could those join
> >>> results
> >>>>>>>>> happen? The
> >>>>>>>>>>>         example versioned table suggests that table record
> >>>>>>>>>>> "x" has
> >>>>>>>>> timestamp 2, and
> >>>>>>>>>>>         table record "y" has timestamp 3. If stream record
> >>>>>>>>>>> "e" has
> >>>>>>>>> timestamp 1,
> >>>>>>>>>>>         then it can never be joined against record "x", and
> >>> similarly
> >>>>> for
> >>>>>>>>> stream
> >>>>>>>>>>>         record "f" with timestamp 2 being joined against "y".
> >>>>>>>>>>>         2. I see in your replies above that "If table is not
> >>>>>>> materialized it
> >>>>>>>>>>>         will materialize it as versioned" but I don't see this
> >>> called
> >>>>> out
> >>>>>>>>> in the
> >>>>>>>>>>>         KIP -- seems worth calling out. Also, what will the
> >>>>>>>>>>> history
> >>>>>>>>> retention for
> >>>>>>>>>>>         the versioned table be? Will it be the same as the join
> >>> grace
> >>>>>>>>> period, or
> >>>>>>>>>>>         will it be greater?
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> And some additional thoughts:
> >>>>>>>>>>>
> >>>>>>>>>>> Sounds like there are a few things users should watch out for
> >>>>>>>>>>> when
> >>>>>>>>> enabling
> >>>>>>>>>>> the stream-side buffer:
> >>>>>>>>>>>
> >>>>>>>>>>>         - Records will get "stuck" if there are no newer
> >>>>>>>>>>> records to
> >>>>>>> advance
> >>>>>>>>>>>         stream time.
> >>>>>>>>>>>         - If there are large gaps between the timestamps of
> >>>>> stream-side
> >>>>>>>>> records,
> >>>>>>>>>>>         then it's possible that versioned store history
> >>>>>>>>>>> retention
> >>> will
> >>>>>>> have
> >>>>>>>>> expired
> >>>>>>>>>>>         by the time a record is evicted from the join buffer,
> >>> leading
> >>>>> to
> >>>>>>> a
> >>>>>>>>> join
> >>>>>>>>>>>         "miss." For example, if the join grace period and table
> >>>>> history
> >>>>>>>>> retention
> >>>>>>>>>>>         are both 10, and records come in the order:
> >>>>>>>>>>>
> >>>>>>>>>>>         table side t0 with ts=0
> >>>>>>>>>>>         stream side s1 with ts=1 <-- enters buffer
> >>>>>>>>>>>         table side t10 with ts=10
> >>>>>>>>>>>         table side t20 with ts=20
> >>>>>>>>>>>         stream side s21 with ts=21 <-- evicts record s1 from
> >>> buffer,
> >>>>> but
> >>>>>>>>>>>         versioned store no longer contains data for ts=1 due to
> >>>>> history
> >>>>>>>>> retention
> >>>>>>>>>>>         having elapsed
> >>>>>>>>>>>
> >>>>>>>>>>>         This will result in the join result (s1, null) even
> >>>>>>>>>>> though
> >>> it
> >>>>>>>>> should've
> >>>>>>>>>>>         been (s1, t0), due to t0 having been expired from the
> >>>>> versioned
> >>>>>>>>> store
> >>>>>>>>>>>         already.
> >>>>>>>>>>>         - Out-of-order records from the stream-side will be
> >>> reordered,
> >>>>>>> and
> >>>>>>>>> late
> >>>>>>>>>>>         records will be dropped.
> >>>>>>>>>>>
> >>>>>>>>>>> I don't think any of these are reasons to not go forward with
> >>>>>>>>>>> this
> >>>>>>> KIP,
> >>>>>>>>> but
> >>>>>>>>>>> it'd be good to call them out in the eventual documentation to
> >>>>>>> decrease
> >>>>>>>>> the
> >>>>>>>>>>> chance users get tripped up.
> >>>>>>>>>>>
> >>>>>>>>>>>> We could maybe do an improvement later to advance stream time
> >>> from
> >>>>>>>>> table
> >>>>>>>>>>> side as well, but that might be debatable as we might get more
> >>> late
> >>>>>>>>> records.
> >>>>>>>>>>>
> >>>>>>>>>>> Yes, the likelihood of late records increases but also the
> >>>>> likelihood
> >>>>>>> of
> >>>>>>>>>>> "join misses" due to versioned store history retention having
> >>>>> elapsed
> >>>>>>>>>>> decreases, which feels important for certain use cases. Either
> >>> way,
> >>>>>>>>> agreed
> >>>>>>>>>>> that it can be a discussion for the future as incorporating
> this
> >>>>> would
> >>>>>>>>>>> substantially complicate the implementation.
> >>>>>>>>>>>
> >>>>>>>>>>> Also a couple nits:
> >>>>>>>>>>>
> >>>>>>>>>>>         - The KIP currently says "We recently added versioned
> >>> tables
> >>>>>>> which
> >>>>>>>>> allow
> >>>>>>>>>>>         the table side of the a join [...] but it is not taken
> >>>>> advantage
> >>>>>>> of
> >>>>>>>>> in
> >>>>>>>>>>>         joins," but this doesn't seem true? If the table of a
> >>>>>>> stream-table
> >>>>>>>>> join is
> >>>>>>>>>>>         versioned, then the DSL's stream-table join processor
> >>>>>>>>>>> will
> >>>>>>>>> automatically
> >>>>>>>>>>>         perform timestamped lookups into the table, in order to
> >>> take
> >>>>>>>>> advantage of
> >>>>>>>>>>>         the new timestamp-aware store to provide better join
> >>>>> semantics.
> >>>>>>>>>>>         - The KIP mentions "grace period" for versioned
> >>>>>>>>>>> stores in a
> >>>>>>> number
> >>>>>>>>> of
> >>>>>>>>>>>         places but I think you actually mean "history
> >>>>>>>>>>> retention"?
> >>> The
> >>>>> two
> >>>>>>>>> happen to
> >>>>>>>>>>>         be the same today (it is not an option for users to
> >>> configure
> >>>>> the
> >>>>>>>>> two
> >>>>>>>>>>>         separately) but this need not be true in the future.
> >>> "History
> >>>>>>>>> retention"
> >>>>>>>>>>>         governs how far back in time reads may occur, which
> >>>>>>>>>>> is the
> >>>>>>> relevant
> >>>>>>>>>>>         parameter for performing lookups as part of the
> >>> stream-table
> >>>>>>> join.
> >>>>>>>>> "Grace
> >>>>>>>>>>>         period" in the context of versioned stores refers to
> how
> >>> far
> >>>>> back
> >>>>>>>>> in time
> >>>>>>>>>>>         out-of-order writes may occur, which probably isn't
> >>> directly
> >>>>>>>>> relevant for
> >>>>>>>>>>>         introducing a stream-side buffer, though it's also
> >>>>>>>>>>> possible
> >>>>> I've
> >>>>>>>>> overlooked
> >>>>>>>>>>>         something. (As a bonus, switching from "table grace
> >>> period" in
> >>>>>>> the
> >>>>>>>>> KIP to
> >>>>>>>>>>>         "table history retention" also helps to
> >>>>>>>>>>> clarify/distinguish
> >>>>> that
> >>>>>>>>> it's a
> >>>>>>>>>>>         different parameter from the "join grace period,"
> >>>>>>>>>>> which I
> >>>>> could
> >>>>>>> see
> >>>>>>>>> being
> >>>>>>>>>>>         confusing to readers. :) )
> >>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Cheers,
> >>>>>>>>>>> Victoria
> >>>>>>>>>>>
> >>>>>>>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
> >>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> Hey all,
> >>>>>>>>>>>>
> >>>>>>>>>>>> Thanks for the comments, they gave me a lot to think about.
> >>>>>>>>>>>> I'll
> >>>>> try
> >>>>>>> to
> >>>>>>>>>>>> address them all inorder. I have made some updates to the kip
> >>>>> related
> >>>>>>>>> to
> >>>>>>>>>>>> them, but I mention where below.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Lucas
> >>>>>>>>>>>>
> >>>>>>>>>>>> Good idea about the example. I added a simple one.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) I have thought about including options for the underlying
> >>> buffer
> >>>>>>>>>>>> configuration. One of which might be adding an in memory
> >>>>>>>>>>>> option.
> >>> My
> >>>>>>>>> biggest
> >>>>>>>>>>>> concern is about the semantic guarantees. This isn't like
> >>> suppress
> >>>>> or
> >>>>>>>>> with
> >>>>>>>>>>>> windows where producing incomplete results is repetitively
> >>>>> harmless.
> >>>>>>>>> Here
> >>>>>>>>>>>> we would be possibly producing incorrect results. I also would
> >>> like
> >>>>>>> to
> >>>>>>>>> keep
> >>>>>>>>>>>> the interface changes as simple as I can. Making more than
> this
> >>>>>>> change
> >>>>>>>>> to
> >>>>>>>>>>>> Joined I feel could make this more complicated than it needs
> to
> >>> be.
> >>>>>>> If
> >>>>>>>>> we
> >>>>>>>>>>>> really want to I could see adding a grace() option with a
> >>>>>>> BufferConifg
> >>>>>>>>> in
> >>>>>>>>>>>> there or something, but I would rather not.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2) The buffer will be independent of if the table is
> >>>>>>>>>>>> versioned or
> >>>>>>> not.
> >>>>>>>>> If
> >>>>>>>>>>>> table is not materialized it will materialize it as
> >>>>>>>>>>>> versioned. It
> >>>>>>> might
> >>>>>>>>>>>> make sense to do a follow up kip where we force the retention
> >>>>> period
> >>>>>>>>> of
> >>>>>>>>>>>> the versioned to be greater than whatever the max of the
> stream
> >>>>>>> buffer
> >>>>>>>>> is.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Victoria
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) Yes, records will exit in timestamp order not in offset
> >>>>>>>>>>>> order.
> >>>>>>>>>>>> 2) Late records will be dropped (Late as out of the grace
> >>> period).
> >>>>>>>>>     From my
> >>>>>>>>>>>> understanding that is the point of a grace period, no? Doesn't
> >>> the
> >>>>>>> same
> >>>>>>>>>>>> thing happen with versioned stores?
> >>>>>>>>>>>> 3) The segment store already has an observed stream time, we
> >>>>> advance
> >>>>>>>>> based
> >>>>>>>>>>>> on that. That should only advance based on records that
> >>>>>>>>>>>> enter the
> >>>>>>>>> store. So
> >>>>>>>>>>>> yes, only stream side records. We could maybe do an
> improvement
> >>>>> later
> >>>>>>>>> to
> >>>>>>>>>>>> advance stream time from table side as well, but that might be
> >>>>>>>>> debatable as
> >>>>>>>>>>>> we might get more late records. Anyways I would rather have
> >>>>>>>>>>>> that
> >>>>> as a
> >>>>>>>>>>>> separate discussion.
> >>>>>>>>>>>>
> >>>>>>>>>>>> in memory option? We can do that, for the buffer I plan to use
> >>> the
> >>>>>>>>>>>> TimeOrderedKeyValueBuffer interface which already has an in
> >>> memory
> >>>>>>>>>>>> implantation, so it would be simple.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I said more in my answer to Lucas's question. The concern I
> >>>>>>>>>>>> have
> >>>>> with
> >>>>>>>>>>>> buffer configs or in memory is complicating the interface.
> Also
> >>>>>>>>> semantic
> >>>>>>>>>>>> guarantees but in memory shouldn't effect that
> >>>>>>>>>>>>
> >>>>>>>>>>>> Matthias
> >>>>>>>>>>>>
> >>>>>>>>>>>> 1) fixed out of order vs late terminology in the kip.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 2) I was referring to having a stream. So after this kip we
> can
> >>>>> have
> >>>>>>> a
> >>>>>>>>>>>> buffered stream or a normal one. For the table we can use a
> >>>>> versioned
> >>>>>>>>> table
> >>>>>>>>>>>> or a normal table.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 3 Good call out. I clarified this as "If the table side uses a
> >>>>>>>>> materialized
> >>>>>>>>>>>> version store, it can store multiple versions of each record
> >>> within
> >>>>>>> its
> >>>>>>>>>>>> defined grace period." and modified the rest of the paragraph
> a
> >>>>> bit.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 4) I get the preserving off offset ordering, but if the
> >>>>>>>>>>>> stream is
> >>>>>>>>> buffered
> >>>>>>>>>>>> to join on timestamp instead of offset doesn't it already seem
> >>> like
> >>>>>>> we
> >>>>>>>>> care
> >>>>>>>>>>>> more about time in this case?
> >>>>>>>>>>>>
> >>>>>>>>>>>> If we end up adding more options it might make sense to do
> >>>>>>>>>>>> this.
> >>>>>>> Maybe
> >>>>>>>>>>>> offset order processing can be a follow up?
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll add a section for this in Rejected Alternatives. I
> >>>>>>>>>>>> think it
> >>>>>>> makes
> >>>>>>>>>>>> sense to do something like this but maybe in a follow up.
> >>>>>>>>>>>>
> >>>>>>>>>>>> 5) I hadn't thought about this. I suppose if they changed
> >>>>>>>>>>>> this in
> >>>>> an
> >>>>>>>>>>>> upgrade the next record would either evict a lot of records
> (if
> >>> the
> >>>>>>>>> grace
> >>>>>>>>>>>> period decreased) or there would be a pause until the new
> grace
> >>>>>>> period
> >>>>>>>>>>>> reached. Increasing is a bit more problematic, especially if
> >>>>>>>>>>>> the
> >>>>>>> table
> >>>>>>>>>>>> grace period and retention time stays the same. If the data is
> >>>>>>>>> reprocessed
> >>>>>>>>>>>> after a change like that then there would be different
> results,
> >>>>> but I
> >>>>>>>>> feel
> >>>>>>>>>>>> like that would be expected after such a change.
> >>>>>>>>>>>>
> >>>>>>>>>>>> What do you think should happen?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Hopefully this answers your questions!
> >>>>>>>>>>>>
> >>>>>>>>>>>> Walker
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <
> >>> mjsax@apache.org>
> >>>>>>>>> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>>> Thanks for the KIP! Also some question/comments from my side:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 10) Notation: you use the term "late data" but I think you
> >>>>>>>>>>>>> mean
> >>>>>>>>>>>>> out-of-order. We reserve the term "late" to records that
> >>>>>>>>>>>>> arrive
> >>>>>>> after
> >>>>>>>>>>>>> grace period passed, and thus, "late == out-of-order data
> that
> >>> is
> >>>>>>>>>>>> dropped".
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 20) "There is only one option from the stream side and only
> >>>>> recently
> >>>>>>>>> is
> >>>>>>>>>>>>> there a second option on the table side."
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> What are those options? Victoria already asked about the
> table
> >>>>> side,
> >>>>>>>>> but
> >>>>>>>>>>>>> I am also not sure what option you mean for the stream side?
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 30) "If the table side uses a materialized version store the
> >>> value
> >>>>>>> is
> >>>>>>>>>>>>> the latest by stream time rather than by offset within its
> >>> defined
> >>>>>>>>> grace
> >>>>>>>>>>>>> period."
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The phrase "the value is the latest by stream time" is
> >>>>>>>>>>>>> confusing
> >>>>> --
> >>>>>>> in
> >>>>>>>>>>>>> the end, a versioned stores multiple versions, not just one.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 40) I am also wondering about ordering. In general, KS
> >>>>>>>>>>>>> tries to
> >>>>>>>>> preserve
> >>>>>>>>>>>>> offset-order during processing (with some exception, when
> >>>>>>>>>>>>> offset
> >>>>>>> order
> >>>>>>>>>>>>> preservation is not clearly defined). Given that the
> >>>>>>>>>>>>> stream-side
> >>>>>>>>> buffer
> >>>>>>>>>>>>> is really just a "linear buffer", we could easily preserve
> >>>>>>>>> offset-order.
> >>>>>>>>>>>>> But I also see a benefit of re-ordering and emitting
> >>> out-of-order
> >>>>>>> data
> >>>>>>>>>>>>> right away when read (instead of blocking them behind
> in-order
> >>>>>>> records
> >>>>>>>>>>>>> that are not ready yet). -- It might even be a possibility,
> to
> >>> let
> >>>>>>>>> users
> >>>>>>>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name
> >>> just
> >>>>> a
> >>>>>>>>>>>>> placeholder).
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The KIP should explain this in more detail and also discuss
> >>>>>>> different
> >>>>>>>>>>>>> options and mention them in "Rejected alternatives" in case
> we
> >>>>> don't
> >>>>>>>>>>>>> want to include them.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> 50) What happens when users change the grace period?
> >>>>>>>>>>>>> Especially,
> >>>>>>> when
> >>>>>>>>>>>>> they turn it on/off (but also increasing/decreasing is an
> >>>>>>> interesting
> >>>>>>>>>>>>> point)? I think we should try to support this if possible;
> the
> >>>>>>>>>>>>> "Compatibility" section needs to cover switching on/off in
> >>>>>>>>>>>>> more
> >>>>>>>>> detail.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> -Matthias
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
> >>>>>>>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> A few clarifications:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 1. Is the order that records exit the buffer in
> >>>>>>>>>>>>>> necessarily the
> >>>>>>> same
> >>>>>>>>> as
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> order that records enter the buffer in, or no? Based on the
> >>>>>>>>> description
> >>>>>>>>>>>>> in
> >>>>>>>>>>>>>> the KIP, it sounds like the answer is no, i.e., records will
> >>> exit
> >>>>>>> the
> >>>>>>>>>>>>>> buffer in increasing timestamp order, which means that
> >>>>>>>>>>>>>> they may
> >>>>> be
> >>>>>>>>>>>>> ordered
> >>>>>>>>>>>>>> (even for the same key) compared to the input order.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 2. What happens if the join grace period is nonzero, and a
> >>>>>>>>> stream-side
> >>>>>>>>>>>>>> record arrives with a timestamp that is older than the
> >>>>>>>>>>>>>> current
> >>>>>>> stream
> >>>>>>>>>>>>> time
> >>>>>>>>>>>>>> minus the grace period? Will this record trigger a join
> >>>>>>>>>>>>>> result,
> >>>>> or
> >>>>>>>>> will
> >>>>>>>>>>>>> it
> >>>>>>>>>>>>>> be dropped? Based on the description for what happens when
> >>>>>>>>>>>>>> the
> >>>>> join
> >>>>>>>>>>>> grace
> >>>>>>>>>>>>>> period is set to zero, it sounds like the late record will
> be
> >>>>>>>>> dropped,
> >>>>>>>>>>>>> even
> >>>>>>>>>>>>>> if the join grace period is nonzero. Is that true?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> 3. What could cause stream time to advance, for purposes of
> >>>>>>> removing
> >>>>>>>>>>>>>> records from the join buffer? For example, will new records
> >>>>>>> arriving
> >>>>>>>>> on
> >>>>>>>>>>>>> the
> >>>>>>>>>>>>>> table side of the join cause stream time to advance? From
> the
> >>> KIP
> >>>>>>> it
> >>>>>>>>>>>>> sounds
> >>>>>>>>>>>>>> like only stream-side records will advance stream time --
> >>>>>>>>>>>>>> does
> >>>>> that
> >>>>>>>>>>>> mean
> >>>>>>>>>>>>>> that the join processor itself will have to track this
> stream
> >>>>> time?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Also +1 to Lucas's question about what options will be
> >>> available
> >>>>>>> for
> >>>>>>>>>>>>>> configuring the join buffer. Will users have the option to
> >>> choose
> >>>>>>>>>>>> whether
> >>>>>>>>>>>>>> they want the buffer to be in-memory vs persistent?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> - Victoria
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> >>>>>>>>>>>>>> <lb...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> HI Walker,
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> thanks for the KIP! We definitely need this. I have two
> >>>>> questions:
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>        - Have you considered allowing the customization
> >>>>>>>>>>>>>>> of the
> >>>>>>>>> underlying
> >>>>>>>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets
> you
> >>>>>>>>> customize
> >>>>>>>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it
> >>> make
> >>>>>>>>> sense
> >>>>>>>>>>>>>>> for `Joined` to have this as well? I can imagine one may
> >>>>>>>>>>>>>>> want
> >>> to
> >>>>>>>>> limit
> >>>>>>>>>>>>>>> the number of records in the buffer, for example. If we hit
> >>> the
> >>>>>>>>>>>>>>> maximum, the only option would be to drop semantic
> >>>>>>>>>>>>>>> guarantees,
> >>>>> but
> >>>>>>>>>>>>>>> users may still want to do this.
> >>>>>>>>>>>>>>>        - With "second option on the table side" you are
> >>> referring
> >>>>> to
> >>>>>>>>>>>>>>> versioned tables, right? Will the buffer on the stream side
> >>>>> behave
> >>>>>>>>> any
> >>>>>>>>>>>>>>> different whether the table side is versioned or not?
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Finally, I think a simple example in the motivation section
> >>>>> could
> >>>>>>>>> help
> >>>>>>>>>>>>>>> non-experts understand the KIP.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Best,
> >>>>>>>>>>>>>>> Lucas
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> >>>>>>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Hello everybody,
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> I have a stream proposal to improve the stream table
> >>>>>>>>>>>>>>>> join by
> >>>>>>>>> adding a
> >>>>>>>>>>>>>>> grace
> >>>>>>>>>>>>>>>> period and buffer to the stream side of the join to allow
> >>>>>>>>> processing
> >>>>>>>>>>>> in
> >>>>>>>>>>>>>>>> timestamp order matching the recent improvements of the
> >>>>> versioned
> >>>>>>>>>>>>> tables.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Please take a look here <
> >>>>>>>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
> >>>>>>>>>>>>>>> and
> >>>>>>>>>>>>>>>> share your thoughts.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> best,
> >>>>>>>>>>>>>>>> Walker
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Bruno Cadonna <ca...@apache.org>.

Hi,

another idea that came to my mind. Instead of using a compacted topic, 
the buffer could use a non-compacted topic and regularly delete records 
before a given offset as Streams does for repartition topics.

Best,
Bruno

On 05.06.23 21:48, Bruno Cadonna wrote:
> Hi Victoria,
> 
> that is a good point!
> 
> I think, the topic needs to be a compacted topic to be able to get rid 
> of records that are evicted from the buffer. So the key might be 
> something with the key, the timestamp, and a sequence number to 
> distinguish between records with the same key and same timestamp.
> 
> Just an idea! Maybe Walker comes up with something better.
> 
> Best,
> Bruno
> 
> On 05.06.23 20:38, Victoria Xia wrote:
>> Hi Walker,
>>
>> Thanks for the latest updates! The KIP looks great. Just one question 
>> about
>> the changelog topic for the join buffer: The KIP says "When a failure
>> occurs the buffer will try to recover from an OffsetCheckpoint if 
>> possible.
>> If not it will reload the buffer from a compacted change-log topic." This
>> is a new changelog topic that will be introduced specifically for the 
>> join
>> buffer, right? Why is the changelog topic compacted? What are the keys? I
>> am confused because the buffer contains records from the stream-side 
>> of the
>> join, for which multiple records with the same key should be treated as
>> separate updates will all must be tracked in the buffer, rather than
>> updates which replace each other.
>>
>> Thanks,
>> Victoria
>>
>> On Mon, Jun 5, 2023 at 1:47 AM Bruno Cadonna <ca...@apache.org> wrote:
>>
>>> Hi Walker,
>>>
>>> Thanks once more for the updates to the KIP!
>>>
>>> Do you also plan to expose metrics for the buffer?
>>>
>>> Best,
>>> Bruno
>>>
>>> On 02.06.23 17:16, Walker Carlson wrote:
>>>> Hello Bruno,
>>>>
>>>> I think this covers your questions. Let me know what you think
>>>>
>>>> 2.
>>>> We can use a changelog topic. I think we can treat it like any other
>>> store
>>>> and recover in the usual manner. Also implementation is on disk
>>>>
>>>> 3.
>>>> The description is in the public interfaces description. I will copy it
>>>> into the proposed changes as well.
>>>>
>>>> This is a bit of an implementation detail that I didn't want to add 
>>>> into
>>>> the kip, but the record will be added to the buffer to keep the stream
>>> time
>>>> consistent, it will just be ejected immediately. If of course if this
>>>> causes performance issues we will skip this step and track stream time
>>>> separately. I will update the kip to say that stream time advances 
>>>> when a
>>>> stream record enters the node.
>>>>
>>>> Also, yes, updated.
>>>>
>>>> 5.
>>>> No there is no difference right now, everything gets processed as it
>>> comes
>>>> in and tries to find a record for its time stamp.
>>>>
>>>> Walker
>>>>
>>>> On Fri, Jun 2, 2023 at 6:41 AM Bruno Cadonna <ca...@apache.org> 
>>>> wrote:
>>>>
>>>>> Hi Walker,
>>>>>
>>>>> Thanks for the updates!
>>>>>
>>>>> 2.
>>>>> It is still not clear to me how a failure is handled. I do not
>>>>> understand what you mean by "recover from an OffsetCheckpoint".
>>>>>
>>>>> My understanding is that the buffer needs to be replicated into its 
>>>>> own
>>>>> Kafka topic. The input topic is not enough. The offset of a record is
>>>>> added to the offsets to commit once the record is streamed through the
>>>>> subtopology. That means once the record is added to the buffer its
>>>>> offset is added to the offsets to commit -- independently of 
>>>>> whether the
>>>>> record was evicted from the buffer and sent to the join node or not.
>>>>> Now, let's assume the following scenario
>>>>> 1. a record is read from the input topic and added to the buffer, but
>>>>> not evicted to be processed by the join node.
>>>>> 2. When the processing of the subtopology finishes the offset of the
>>>>> record is added to the offsets to commit.
>>>>> 3. A commit happens.
>>>>> 4. A failure happens
>>>>>
>>>>> After the failure the buffer is empty but the record will not be read
>>>>> anymore from the input topic since its offset has been already
>>>>> committed. The record is lost.
>>>>> One solution to avoid the loss is to recreate the buffer from a
>>>>> compacted Kafka topic as we do for suppression buffers. I do not 
>>>>> think,
>>>>> we need any offset checkpoint here since, we keep the buffer in 
>>>>> memory,
>>>>> right? Or do you plan to back the buffer with a persistent store? Even
>>>>> in that case, a compacted Kafka topic would be needed.
>>>>>
>>>>>
>>>>> 3.
>>>>>    From the KIP it is still not clear to me what happens if a 
>>>>> record is
>>>>> outside of the grace period. I guess the record that falls outside of
>>>>> the grace period will not be added to the buffer, but will be send to
>>>>> the join node. Since it is outside of the grace period it will also 
>>>>> not
>>>>> increase stream time and it will not trigger an eviction. Also the 
>>>>> head
>>>>> of the buffer will not contain a record that needs to be evicted since
>>>>> the the timestamp of the head record will be within the interval 
>>>>> stream
>>>>> time minus grace period. Is this correct? Please add such a 
>>>>> description
>>>>> to the KIP.
>>>>> Furthermore, I think there is a mistake in the text:
>>>>> "... will dequeue when the record timestamp is greater than stream 
>>>>> time
>>>>> plus the grace period". I guess that should be "... will dequeue when
>>>>> the record timestamp is less than (or equal?) stream time minus the
>>>>> grace period"
>>>>>
>>>>>
>>>>> 5.
>>>>> What is the difference between not setting the grace period and 
>>>>> setting
>>>>> it to zero? If there is a difference, why is there a difference?
>>>>>
>>>>>
>>>>> Best,
>>>>> Bruno
>>>>>
>>>>>
>>>>> On 01.06.23 23:58, Walker Carlson wrote:
>>>>>> Hey Bruno thanks for the feedback.
>>>>>>
>>>>>> 1)
>>>>>> I will add this to the kip, but stream time only advances as the when
>>> the
>>>>>> buffer receives a new record.
>>>>>>
>>>>>> 2)
>>>>>> You are correct, I will add a failure section on to the kip. Since 
>>>>>> the
>>>>>> records wont change in the buffer from when they are read from the
>>> topic
>>>>>> they are replicated already.
>>>>>>
>>>>>> 3)
>>>>>> I see that I'm out voted on the dropping of records thing. We will 
>>>>>> pass
>>>>>> them on and try to join them if possible. This might cause some null
>>>>>> results, but increasing the table history retention should help that.
>>>>>>
>>>>>> 4)
>>>>>> I can add some on the kip. But its pretty directly adding whatever 
>>>>>> the
>>>>>> grace period is to the latency. I don't see a way around it.
>>>>>>
>>>>>> Walker
>>>>>>
>>>>>> On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org>
>>> wrote:
>>>>>>
>>>>>>> Hi Walker,
>>>>>>>
>>>>>>> thanks for the KIP!
>>>>>>>
>>>>>>> Here my feedback:
>>>>>>>
>>>>>>> 1.
>>>>>>> It is still not clear to me when stream time for the buffer 
>>>>>>> advances.
>>>>>>> What is the event that let the stream time advance? In the
>>> discussion, I
>>>>>>> do not understand what you mean by "The segment store already has an
>>>>>>> observed stream time, we advance based on that. That should only
>>> advance
>>>>>>> based on records that enter the store." Where does this segment 
>>>>>>> store
>>>>>>> come from? Anyways, I think it would be great to also state how 
>>>>>>> stream
>>>>>>> time advances in the KIP.
>>>>>>>
>>>>>>> 2.
>>>>>>> How does the buffer behave in case of a failure? I think I 
>>>>>>> understand
>>>>>>> that the buffer will use an implementation of
>>> TimeOrderedKeyValueBuffer
>>>>>>> and therefore the records in the buffer will be replicated to a 
>>>>>>> topic
>>> in
>>>>>>> Kafka, but I am not completely sure. Could you elaborate on this in
>>> the
>>>>>>> KIP?
>>>>>>>
>>>>>>> 3.
>>>>>>> I agree with Matthias about dropping late records. We use grace
>>> periods
>>>>>>> in scenarios where we records are grouped like in windowed
>>> aggregations
>>>>>>> and windowed joins. The stream buffer you propose does not really
>>> group
>>>>>>> any records. It rather delays records and reorders them. I am not 
>>>>>>> sure
>>>>>>> if grace period is the right naming/concept to apply here. 
>>>>>>> Instead of
>>>>>>> dropping records that fall outside of the buffer's time interval the
>>>>>>> join should skip the buffer and try to join the record 
>>>>>>> immediately. In
>>>>>>> the end, a stream-table join is a unwindowed join, i.e., no grouping
>>> is
>>>>>>> applied to the records.
>>>>>>> What do you and other folks think about this proposal?
>>>>>>>
>>>>>>> 4.
>>>>>>> How does the proposed buffer, affects processing latency? Could you
>>>>>>> please add some words about this to the KIP?
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>> Bruno
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 31.05.23 01:49, Walker Carlson wrote:
>>>>>>>> Thanks for all the additional comments. I will either address them
>>> here
>>>>>>> or
>>>>>>>> update the kip accordingly.
>>>>>>>>
>>>>>>>>
>>>>>>>> I mentioned a follow kip to add extra features before and in the
>>>>>>> responses.
>>>>>>>> I will try to briefly summarize what options and optimizations I 
>>>>>>>> plan
>>>>> to
>>>>>>>> include. If a concern is not covered in this list I for sure talk
>>> about
>>>>>>> it
>>>>>>>> below.
>>>>>>>>
>>>>>>>> * Allowing non versioned tables to still use the stream buffer
>>>>>>>> * Automatically materializing tables instead of forcing the user to
>>> do
>>>>> it
>>>>>>>> * Configurable for in memory buffer
>>>>>>>> * Order the records in offset order or in time order
>>>>>>>> * Non memory use buffer (offset order, delayed pull from stream.)
>>>>>>>> * Time synced between stream and table side (maybe)
>>>>>>>> * Do not drop late records and process them as they come in 
>>>>>>>> instead.
>>>>>>>>
>>>>>>>>
>>>>>>>> First, Victoria.
>>>>>>>>
>>>>>>>> 1) (One of your nits covers this, but you are correct it doesn't 
>>>>>>>> make
>>>>>>>> sense. so I removed that part of the example.)
>>>>>>>> For those examples with the "bad" join results I said without
>>> buffering
>>>>>>> the
>>>>>>>> stream it would look like that, but that was incomplete. If the 
>>>>>>>> look
>>> up
>>>>>>> was
>>>>>>>> simply looking at the latest version of the table when the stream
>>>>> records
>>>>>>>> came in then the results were possible. If we are using the 
>>>>>>>> point in
>>>>> time
>>>>>>>> lookup that versioned tables let us then you are correct the future
>>>>>>> results
>>>>>>>> are not possible.
>>>>>>>>
>>>>>>>> 2) I'll get to this later as Matthias brought up something related.
>>>>>>>>
>>>>>>>> To your additional thoughts, I agree that we need to call those
>>> things
>>>>>>> out
>>>>>>>> in the documentation. I'm writing up a follow up kip with a lot of
>>> the
>>>>>>>> ideas we have discussed so that we can improve this feature beyond
>>> the
>>>>>>> base
>>>>>>>> implementation if it's needed.
>>>>>>>>
>>>>>>>> I addressed the nits in the kip. I somehow missed the table stream
>>>>> table
>>>>>>>> join processor improvement, it makes your first question make a lot
>>>>> more
>>>>>>>> sense.  Table history retention is a much cleaner way to 
>>>>>>>> describe it.
>>>>>>>>
>>>>>>>> As to your mention of the syncing the time for the table and 
>>>>>>>> stream.
>>>>>>>> Matthias mentioned that as well. I will address both here. I 
>>>>>>>> plan to
>>>>>>> bring
>>>>>>>> that up in the future, but for now we will leave it out. I 
>>>>>>>> suppose it
>>>>>>> will
>>>>>>>> be more useful after the table history retention is separable from
>>> the
>>>>>>>> table grace period.
>>>>>>>>
>>>>>>>>
>>>>>>>> To address Matthias comments.
>>>>>>>>
>>>>>>>> You are correct by saying the in memory store shouldn't cause any
>>>>>>> semantic
>>>>>>>> concerns. My concern would be more with if we limited the number of
>>>>>>> records
>>>>>>>> on the buffer and what we would do if we hit said limits, (emitting
>>>>> those
>>>>>>>> records might be an issue, throwing an error and halting would 
>>>>>>>> not).
>>> I
>>>>>>>> think we can leave this discussion to the follow up kip along 
>>>>>>>> with a
>>>>> few
>>>>>>>> other options.
>>>>>>>>
>>>>>>>> I will go through your proposals now.
>>>>>>>>
>>>>>>>>       - don't support non-versioned KTables
>>>>>>>>
>>>>>>>> Sure, we can always expand this later on. Will include as part 
>>>>>>>> of the
>>>>> of
>>>>>>>> the improvement kip
>>>>>>>>
>>>>>>>>       - if grace period is added, users need to explicitly 
>>>>>>>> materialize
>>>>> the
>>>>>>>> table as version (either directly, or upstream. Upstream only works
>>> if
>>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>>>>>>
>>>>>>>> again, that works for me for now, if we find a use we can always 
>>>>>>>> add
>>>>>>> later.
>>>>>>>>
>>>>>>>>       - the table's history retention time must be larger than the
>>> grace
>>>>>>>> period (should be easy to check at runtime, when we build the
>>> topology)
>>>>>>>>
>>>>>>>> agreed
>>>>>>>>
>>>>>>>>       - because switching from non-versioned to version stores 
>>>>>>>> is not
>>>>>>>> backward compatibly (cf KIP-914), users need to take care of this
>>>>>>>> themselves, and this also implies that adding grace period is not a
>>>>>>>> backward compatible change (even only if via indirect means)
>>>>>>>>
>>>>>>>> sure, this works
>>>>>>>>
>>>>>>>> As to the dropping of late records, I'm not sure. One one hand I 
>>>>>>>> like
>>>>> not
>>>>>>>> dropping things. But on the other I struggle to see how a user can
>>>>> filter
>>>>>>>> out late records that might have incomplete join results. The point
>>> in
>>>>>>> time
>>>>>>>> look up will aggressively expire old data and if new data has been
>>>>>>> replaced
>>>>>>>> it will return null if outside of the retention. This seems like it
>>>>> could
>>>>>>>> corrupt the integrity of the join output. Seeing that we drop late
>>>>>>> records
>>>>>>>> on the table side as well I would think it makes sense to drop late
>>>>>>> records
>>>>>>>> on the stream buffer. I could be convinced otherwise I suppose, I
>>> could
>>>>>>> see
>>>>>>>> adding this as an option in a follow up kip. It would be very 
>>>>>>>> easy to
>>>>>>>> implement either way. For now unless no one else objects I'm 
>>>>>>>> going to
>>>>>>> stick
>>>>>>>> with dropping the records for the sake of getting this kip 
>>>>>>>> passed. It
>>>>> is
>>>>>>>> functionally a small change to make and we can update later if you
>>> feel
>>>>>>>> strongly about it.
>>>>>>>>
>>>>>>>> For the ordering. I have to say that it would be more 
>>>>>>>> complicated to
>>>>>>>> implement it to be in offset order, if the goal it to get as 
>>>>>>>> many of
>>>>> the
>>>>>>>> records validly joined as possible. Because we would process as
>>> things
>>>>>>> left
>>>>>>>> the buffer a sufficiency early enough record could hold up records
>>> that
>>>>>>>> would otherwise be valid past the table history retention. To fix
>>> this
>>>>> we
>>>>>>>> could process by timestamp then store in a second queue and emit by
>>>>>>> offset,
>>>>>>>> but that would be a lot more complicated. If we didn't care 
>>>>>>>> about not
>>>>>>>> missing some valid joins we could just have no store and pull from
>>> the
>>>>>>>> topic at a delay only caring about the timestamp of the next 
>>>>>>>> offset.
>>>>> For
>>>>>>>> now I want to stick with the timestamp ordering as it makes much 
>>>>>>>> more
>>>>>>> sense
>>>>>>>> to me, but would propose we add both of the other options I have 
>>>>>>>> laid
>>>>> out
>>>>>>>> here in the follow up kip.
>>>>>>>>
>>>>>>>> Lastly, I think having an empty store with zero grace period 
>>>>>>>> would be
>>>>>>> super
>>>>>>>> simple and not costly, so we might as well make it even if nothing
>>> gets
>>>>>>>> entered.
>>>>>>>>
>>>>>>>> I hope that address all your concerns,
>>>>>>>>
>>>>>>>> Walker
>>>>>>>>
>>>>>>>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Walker,
>>>>>>>>>
>>>>>>>>> thanks for the updates. The KIP itself reads fine (of course
>>> Victoria
>>>>>>>>> made good comments about some phrases), but there is a couple of
>>>>> things
>>>>>>>>> from your latest reply I don't understand, and that I still think
>>> need
>>>>>>>>> some more discussions.
>>>>>>>>>
>>>>>>>>> Lukas, asked about in-memory option and `WindowStoreSupplier` and
>>> you
>>>>>>>>> mention "semantic concerns". There should not be any semantic
>>>>> difference
>>>>>>>>> from the underlying buffer implementation, so I am not sure 
>>>>>>>>> what you
>>>>>>>>> mean here (also the relationship to suppress() is unclear to me)?
>>> -- I
>>>>>>>>> am ok to not make it configurable for now. We can always do it 
>>>>>>>>> via a
>>>>>>>>> follow up KIP, and keep interface changes limited for now.
>>>>>>>>>
>>>>>>>>> Does it really make sense to allow a grace period if the table is
>>>>>>>>> non-versioned? You also say: "If table is not materialized it will
>>>>>>>>> materialize it as versioned." -- What history retention time would
>>> we
>>>>>>>>> pick for this case (also asked by Victoria)? Or should we 
>>>>>>>>> rather not
>>>>>>>>> support this and force the user to materialize the table 
>>>>>>>>> explicitly,
>>>>> and
>>>>>>>>> thus explicitly picking a history retention time? It's tradeoff
>>>>> between
>>>>>>>>> usability and guiding uses that there will be a significant impact
>>> on
>>>>>>>>> disk usage. There is also compatibility concerns: If the table is
>>> not
>>>>>>>>> explicitly materialized in the old program, we would already 
>>>>>>>>> need to
>>>>>>>>> materialize it also in the old program (of course, we would use a
>>>>>>>>> non-versioned store so far). Thus, if somebody adds a grace 
>>>>>>>>> period,
>>> we
>>>>>>>>> cannot just switch the store type, as it would be a breaking 
>>>>>>>>> change,
>>>>>>>>> potentially required an application re-set, or following the 
>>>>>>>>> upgrade
>>>>>>>>> path for versioned state stores, and also changing the program to
>>>>>>>>> explicitly materialize using a versioned store. Also note, that we
>>>>> might
>>>>>>>>> not materialize the actual join table, but only an upstream table,
>>> and
>>>>>>>>> use `ValueGetter` to access the upstream data.
>>>>>>>>>
>>>>>>>>> To this end, as you already mentioned, history retention of the
>>> table
>>>>>>>>> should be at least grace period. You proposed to include this in a
>>>>>>>>> follow up KIP, but I am wondering if it's a fundamental 
>>>>>>>>> requirement
>>>>> and
>>>>>>>>> thus we should put a check in place right away and reject an 
>>>>>>>>> invalid
>>>>>>>>> configuration? (It always easier to lift restriction than to
>>> introduce
>>>>>>>>> them later.) This would also imply that a non-versioned table 
>>>>>>>>> cannot
>>>>> be
>>>>>>>>> supported, because it does not have a history retention that is
>>> larger
>>>>>>>>> than grace period, and maybe also answer the requirement about
>>>>>>>>> materialization: as we already always materialize something on the
>>>>>>>>> tablet side as non-versioned store right now, it seems 
>>>>>>>>> difficult to
>>>>>>>>> migrate the store to a versioned store. Ie, it might be ok to push
>>> the
>>>>>>>>> burden onto the user and say: if you start using grace period, you
>>>>> also
>>>>>>>>> need to manually switch from non-versioned to versioned KTables.
>>> Doing
>>>>>>>>> stuff automatically under the hood if very complex for this 
>>>>>>>>> case, we
>>>>> if
>>>>>>>>> we push the burden onto the user, it might be ok to not complicate
>>>>> this
>>>>>>>>> KIP significantly.
>>>>>>>>>
>>>>>>>>> To summarize the last two paragraphs, I would propose to:
>>>>>>>>>       - don't support non-versioned KTables
>>>>>>>>>       - if grace period is added, users need to explicitly
>>> materialize
>>>>> the
>>>>>>>>> table as version (either directly, or upstream. Upstream only 
>>>>>>>>> works
>>> if
>>>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>>>>>>>       - the table's history retention time must be larger than the
>>> grace
>>>>>>>>> period (should be easy to check at runtime, when we build the
>>>>> topology)
>>>>>>>>>       - because switching from non-versioned to version stores 
>>>>>>>>> is not
>>>>>>>>> backward compatibly (cf KIP-914), users need to take care of this
>>>>>>>>> themselves, and this also implies that adding grace period is 
>>>>>>>>> not a
>>>>>>>>> backward compatible change (even only if via indirect means)
>>>>>>>>>
>>>>>>>>> About dropping late records: wondering if we should never drop a
>>>>>>>>> stream-side record for a left-join, even if it's late? In general,
>>> one
>>>>>>>>> thing I observed over the years is, that it's easier to keep stuff
>>> and
>>>>>>>>> let users filter explicitly downstream (or make it configurable),
>>>>>>>>> instead of dropping pro-actively, because users have no good 
>>>>>>>>> way to
>>>>>>>>> resurrect record that got already dropped.
>>>>>>>>>
>>>>>>>>> For ordering, sounds reasonable to me only start with one
>>>>>>>>> implementation, and maybe make it configurable as a follow up.
>>>>> However,
>>>>>>>>> I am wondering if starting with offset order might be the better
>>>>> option
>>>>>>>>> as it seems to align more with what we do so far? So instead of
>>>>> storing
>>>>>>>>> record ordered by timestamp, we can just store them ordered by
>>> offset,
>>>>>>>>> and still "poll" from the buffer based on the head records
>>> timestamp.
>>>>> Or
>>>>>>>>> would this complicate the implementation significantly?
>>>>>>>>>
>>>>>>>>> I also think it's ok to not "sync" stream-time between the 
>>>>>>>>> table and
>>>>> the
>>>>>>>>> stream in this KIP, but we should consider doing this as a 
>>>>>>>>> follow up
>>>>>>>>> change (not sure if we would need a KIP or not for a change this
>>>>> this).
>>>>>>>>>
>>>>>>>>> About increasing/decreasing grace period: what you describe make
>>> sense
>>>>>>>>> to me. If decreased, the next record would just trigger emitting a
>>> lot
>>>>>>>>> of records, and for increase, the buffer would just need to "fill
>>> up"
>>>>>>>>> again. For reprocessing getting a different result with a 
>>>>>>>>> different
>>>>>>>>> grace period is expected, so that's ok IMHO. -- There seems to be
>>> one
>>>>>>>>> special corner case: grace period zero. For this case, we actually
>>>>> don't
>>>>>>>>> need any store, and the stream-side could be stateless. I think it
>>> can
>>>>>>>>> have the same behavior, but if we want to "add / remove" the store
>>>>>>>>> dynamically, we need to add specific code for it. For example, 
>>>>>>>>> even
>>> if
>>>>>>>>> we start up with a grace period of zero, we would need to check if
>>>>> there
>>>>>>>>> is a local store, and still emit everything in it, before we can
>>> ditch
>>>>>>>>> the store (not sure if that's even easily done at all). Or: we 
>>>>>>>>> would
>>>>>>>>> need to have a store for _all_ cases, even if grace period is zero
>>>>> (the
>>>>>>>>> store would be empty all the time though), to avoid super complex
>>>>> code?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -Matthias
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
>>>>>>>>>> Hi Walker,
>>>>>>>>>>
>>>>>>>>>> thanks for your responses. That makes sense. I guess there is
>>> always
>>>>>>>>>> the option to make the implementation more configurable later on,
>>> if
>>>>>>>>>> users request it. Also thanks for the clarifications. From my 
>>>>>>>>>> side,
>>>>>>>>>> the KIP is good to go.
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Lucas
>>>>>>>>>>
>>>>>>>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
>>>>>>>>>> <vi...@confluent.io.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the updates, Walker! Looks great, though I do have a
>>>>> couple
>>>>>>>>>>> questions about the latest updates:
>>>>>>>>>>>
>>>>>>>>>>>         1. The new example says that without stream-side 
>>>>>>>>>>> buffering,
>>>>> "ex"
>>>>>>> and
>>>>>>>>>>>         "fy" are possible join results. How could those join
>>> results
>>>>>>>>> happen? The
>>>>>>>>>>>         example versioned table suggests that table record 
>>>>>>>>>>> "x" has
>>>>>>>>> timestamp 2, and
>>>>>>>>>>>         table record "y" has timestamp 3. If stream record 
>>>>>>>>>>> "e" has
>>>>>>>>> timestamp 1,
>>>>>>>>>>>         then it can never be joined against record "x", and
>>> similarly
>>>>> for
>>>>>>>>> stream
>>>>>>>>>>>         record "f" with timestamp 2 being joined against "y".
>>>>>>>>>>>         2. I see in your replies above that "If table is not
>>>>>>> materialized it
>>>>>>>>>>>         will materialize it as versioned" but I don't see this
>>> called
>>>>> out
>>>>>>>>> in the
>>>>>>>>>>>         KIP -- seems worth calling out. Also, what will the 
>>>>>>>>>>> history
>>>>>>>>> retention for
>>>>>>>>>>>         the versioned table be? Will it be the same as the join
>>> grace
>>>>>>>>> period, or
>>>>>>>>>>>         will it be greater?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> And some additional thoughts:
>>>>>>>>>>>
>>>>>>>>>>> Sounds like there are a few things users should watch out for 
>>>>>>>>>>> when
>>>>>>>>> enabling
>>>>>>>>>>> the stream-side buffer:
>>>>>>>>>>>
>>>>>>>>>>>         - Records will get "stuck" if there are no newer 
>>>>>>>>>>> records to
>>>>>>> advance
>>>>>>>>>>>         stream time.
>>>>>>>>>>>         - If there are large gaps between the timestamps of
>>>>> stream-side
>>>>>>>>> records,
>>>>>>>>>>>         then it's possible that versioned store history 
>>>>>>>>>>> retention
>>> will
>>>>>>> have
>>>>>>>>> expired
>>>>>>>>>>>         by the time a record is evicted from the join buffer,
>>> leading
>>>>> to
>>>>>>> a
>>>>>>>>> join
>>>>>>>>>>>         "miss." For example, if the join grace period and table
>>>>> history
>>>>>>>>> retention
>>>>>>>>>>>         are both 10, and records come in the order:
>>>>>>>>>>>
>>>>>>>>>>>         table side t0 with ts=0
>>>>>>>>>>>         stream side s1 with ts=1 <-- enters buffer
>>>>>>>>>>>         table side t10 with ts=10
>>>>>>>>>>>         table side t20 with ts=20
>>>>>>>>>>>         stream side s21 with ts=21 <-- evicts record s1 from
>>> buffer,
>>>>> but
>>>>>>>>>>>         versioned store no longer contains data for ts=1 due to
>>>>> history
>>>>>>>>> retention
>>>>>>>>>>>         having elapsed
>>>>>>>>>>>
>>>>>>>>>>>         This will result in the join result (s1, null) even 
>>>>>>>>>>> though
>>> it
>>>>>>>>> should've
>>>>>>>>>>>         been (s1, t0), due to t0 having been expired from the
>>>>> versioned
>>>>>>>>> store
>>>>>>>>>>>         already.
>>>>>>>>>>>         - Out-of-order records from the stream-side will be
>>> reordered,
>>>>>>> and
>>>>>>>>> late
>>>>>>>>>>>         records will be dropped.
>>>>>>>>>>>
>>>>>>>>>>> I don't think any of these are reasons to not go forward with 
>>>>>>>>>>> this
>>>>>>> KIP,
>>>>>>>>> but
>>>>>>>>>>> it'd be good to call them out in the eventual documentation to
>>>>>>> decrease
>>>>>>>>> the
>>>>>>>>>>> chance users get tripped up.
>>>>>>>>>>>
>>>>>>>>>>>> We could maybe do an improvement later to advance stream time
>>> from
>>>>>>>>> table
>>>>>>>>>>> side as well, but that might be debatable as we might get more
>>> late
>>>>>>>>> records.
>>>>>>>>>>>
>>>>>>>>>>> Yes, the likelihood of late records increases but also the
>>>>> likelihood
>>>>>>> of
>>>>>>>>>>> "join misses" due to versioned store history retention having
>>>>> elapsed
>>>>>>>>>>> decreases, which feels important for certain use cases. Either
>>> way,
>>>>>>>>> agreed
>>>>>>>>>>> that it can be a discussion for the future as incorporating this
>>>>> would
>>>>>>>>>>> substantially complicate the implementation.
>>>>>>>>>>>
>>>>>>>>>>> Also a couple nits:
>>>>>>>>>>>
>>>>>>>>>>>         - The KIP currently says "We recently added versioned
>>> tables
>>>>>>> which
>>>>>>>>> allow
>>>>>>>>>>>         the table side of the a join [...] but it is not taken
>>>>> advantage
>>>>>>> of
>>>>>>>>> in
>>>>>>>>>>>         joins," but this doesn't seem true? If the table of a
>>>>>>> stream-table
>>>>>>>>> join is
>>>>>>>>>>>         versioned, then the DSL's stream-table join processor 
>>>>>>>>>>> will
>>>>>>>>> automatically
>>>>>>>>>>>         perform timestamped lookups into the table, in order to
>>> take
>>>>>>>>> advantage of
>>>>>>>>>>>         the new timestamp-aware store to provide better join
>>>>> semantics.
>>>>>>>>>>>         - The KIP mentions "grace period" for versioned 
>>>>>>>>>>> stores in a
>>>>>>> number
>>>>>>>>> of
>>>>>>>>>>>         places but I think you actually mean "history 
>>>>>>>>>>> retention"?
>>> The
>>>>> two
>>>>>>>>> happen to
>>>>>>>>>>>         be the same today (it is not an option for users to
>>> configure
>>>>> the
>>>>>>>>> two
>>>>>>>>>>>         separately) but this need not be true in the future.
>>> "History
>>>>>>>>> retention"
>>>>>>>>>>>         governs how far back in time reads may occur, which 
>>>>>>>>>>> is the
>>>>>>> relevant
>>>>>>>>>>>         parameter for performing lookups as part of the
>>> stream-table
>>>>>>> join.
>>>>>>>>> "Grace
>>>>>>>>>>>         period" in the context of versioned stores refers to how
>>> far
>>>>> back
>>>>>>>>> in time
>>>>>>>>>>>         out-of-order writes may occur, which probably isn't
>>> directly
>>>>>>>>> relevant for
>>>>>>>>>>>         introducing a stream-side buffer, though it's also 
>>>>>>>>>>> possible
>>>>> I've
>>>>>>>>> overlooked
>>>>>>>>>>>         something. (As a bonus, switching from "table grace
>>> period" in
>>>>>>> the
>>>>>>>>> KIP to
>>>>>>>>>>>         "table history retention" also helps to 
>>>>>>>>>>> clarify/distinguish
>>>>> that
>>>>>>>>> it's a
>>>>>>>>>>>         different parameter from the "join grace period," 
>>>>>>>>>>> which I
>>>>> could
>>>>>>> see
>>>>>>>>> being
>>>>>>>>>>>         confusing to readers. :) )
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Cheers,
>>>>>>>>>>> Victoria
>>>>>>>>>>>
>>>>>>>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
>>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey all,
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the comments, they gave me a lot to think about. 
>>>>>>>>>>>> I'll
>>>>> try
>>>>>>> to
>>>>>>>>>>>> address them all inorder. I have made some updates to the kip
>>>>> related
>>>>>>>>> to
>>>>>>>>>>>> them, but I mention where below.
>>>>>>>>>>>>
>>>>>>>>>>>> Lucas
>>>>>>>>>>>>
>>>>>>>>>>>> Good idea about the example. I added a simple one.
>>>>>>>>>>>>
>>>>>>>>>>>> 1) I have thought about including options for the underlying
>>> buffer
>>>>>>>>>>>> configuration. One of which might be adding an in memory 
>>>>>>>>>>>> option.
>>> My
>>>>>>>>> biggest
>>>>>>>>>>>> concern is about the semantic guarantees. This isn't like
>>> suppress
>>>>> or
>>>>>>>>> with
>>>>>>>>>>>> windows where producing incomplete results is repetitively
>>>>> harmless.
>>>>>>>>> Here
>>>>>>>>>>>> we would be possibly producing incorrect results. I also would
>>> like
>>>>>>> to
>>>>>>>>> keep
>>>>>>>>>>>> the interface changes as simple as I can. Making more than this
>>>>>>> change
>>>>>>>>> to
>>>>>>>>>>>> Joined I feel could make this more complicated than it needs to
>>> be.
>>>>>>> If
>>>>>>>>> we
>>>>>>>>>>>> really want to I could see adding a grace() option with a
>>>>>>> BufferConifg
>>>>>>>>> in
>>>>>>>>>>>> there or something, but I would rather not.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) The buffer will be independent of if the table is 
>>>>>>>>>>>> versioned or
>>>>>>> not.
>>>>>>>>> If
>>>>>>>>>>>> table is not materialized it will materialize it as 
>>>>>>>>>>>> versioned. It
>>>>>>> might
>>>>>>>>>>>> make sense to do a follow up kip where we force the retention
>>>>> period
>>>>>>>>> of
>>>>>>>>>>>> the versioned to be greater than whatever the max of the stream
>>>>>>> buffer
>>>>>>>>> is.
>>>>>>>>>>>>
>>>>>>>>>>>> Victoria
>>>>>>>>>>>>
>>>>>>>>>>>> 1) Yes, records will exit in timestamp order not in offset 
>>>>>>>>>>>> order.
>>>>>>>>>>>> 2) Late records will be dropped (Late as out of the grace
>>> period).
>>>>>>>>>     From my
>>>>>>>>>>>> understanding that is the point of a grace period, no? Doesn't
>>> the
>>>>>>> same
>>>>>>>>>>>> thing happen with versioned stores?
>>>>>>>>>>>> 3) The segment store already has an observed stream time, we
>>>>> advance
>>>>>>>>> based
>>>>>>>>>>>> on that. That should only advance based on records that 
>>>>>>>>>>>> enter the
>>>>>>>>> store. So
>>>>>>>>>>>> yes, only stream side records. We could maybe do an improvement
>>>>> later
>>>>>>>>> to
>>>>>>>>>>>> advance stream time from table side as well, but that might be
>>>>>>>>> debatable as
>>>>>>>>>>>> we might get more late records. Anyways I would rather have 
>>>>>>>>>>>> that
>>>>> as a
>>>>>>>>>>>> separate discussion.
>>>>>>>>>>>>
>>>>>>>>>>>> in memory option? We can do that, for the buffer I plan to use
>>> the
>>>>>>>>>>>> TimeOrderedKeyValueBuffer interface which already has an in
>>> memory
>>>>>>>>>>>> implantation, so it would be simple.
>>>>>>>>>>>>
>>>>>>>>>>>> I said more in my answer to Lucas's question. The concern I 
>>>>>>>>>>>> have
>>>>> with
>>>>>>>>>>>> buffer configs or in memory is complicating the interface. Also
>>>>>>>>> semantic
>>>>>>>>>>>> guarantees but in memory shouldn't effect that
>>>>>>>>>>>>
>>>>>>>>>>>> Matthias
>>>>>>>>>>>>
>>>>>>>>>>>> 1) fixed out of order vs late terminology in the kip.
>>>>>>>>>>>>
>>>>>>>>>>>> 2) I was referring to having a stream. So after this kip we can
>>>>> have
>>>>>>> a
>>>>>>>>>>>> buffered stream or a normal one. For the table we can use a
>>>>> versioned
>>>>>>>>> table
>>>>>>>>>>>> or a normal table.
>>>>>>>>>>>>
>>>>>>>>>>>> 3 Good call out. I clarified this as "If the table side uses a
>>>>>>>>> materialized
>>>>>>>>>>>> version store, it can store multiple versions of each record
>>> within
>>>>>>> its
>>>>>>>>>>>> defined grace period." and modified the rest of the paragraph a
>>>>> bit.
>>>>>>>>>>>>
>>>>>>>>>>>> 4) I get the preserving off offset ordering, but if the 
>>>>>>>>>>>> stream is
>>>>>>>>> buffered
>>>>>>>>>>>> to join on timestamp instead of offset doesn't it already seem
>>> like
>>>>>>> we
>>>>>>>>> care
>>>>>>>>>>>> more about time in this case?
>>>>>>>>>>>>
>>>>>>>>>>>> If we end up adding more options it might make sense to do 
>>>>>>>>>>>> this.
>>>>>>> Maybe
>>>>>>>>>>>> offset order processing can be a follow up?
>>>>>>>>>>>>
>>>>>>>>>>>> I'll add a section for this in Rejected Alternatives. I 
>>>>>>>>>>>> think it
>>>>>>> makes
>>>>>>>>>>>> sense to do something like this but maybe in a follow up.
>>>>>>>>>>>>
>>>>>>>>>>>> 5) I hadn't thought about this. I suppose if they changed 
>>>>>>>>>>>> this in
>>>>> an
>>>>>>>>>>>> upgrade the next record would either evict a lot of records (if
>>> the
>>>>>>>>> grace
>>>>>>>>>>>> period decreased) or there would be a pause until the new grace
>>>>>>> period
>>>>>>>>>>>> reached. Increasing is a bit more problematic, especially if 
>>>>>>>>>>>> the
>>>>>>> table
>>>>>>>>>>>> grace period and retention time stays the same. If the data is
>>>>>>>>> reprocessed
>>>>>>>>>>>> after a change like that then there would be different results,
>>>>> but I
>>>>>>>>> feel
>>>>>>>>>>>> like that would be expected after such a change.
>>>>>>>>>>>>
>>>>>>>>>>>> What do you think should happen?
>>>>>>>>>>>>
>>>>>>>>>>>> Hopefully this answers your questions!
>>>>>>>>>>>>
>>>>>>>>>>>> Walker
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <
>>> mjsax@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for the KIP! Also some question/comments from my side:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 10) Notation: you use the term "late data" but I think you 
>>>>>>>>>>>>> mean
>>>>>>>>>>>>> out-of-order. We reserve the term "late" to records that 
>>>>>>>>>>>>> arrive
>>>>>>> after
>>>>>>>>>>>>> grace period passed, and thus, "late == out-of-order data that
>>> is
>>>>>>>>>>>> dropped".
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 20) "There is only one option from the stream side and only
>>>>> recently
>>>>>>>>> is
>>>>>>>>>>>>> there a second option on the table side."
>>>>>>>>>>>>>
>>>>>>>>>>>>> What are those options? Victoria already asked about the table
>>>>> side,
>>>>>>>>> but
>>>>>>>>>>>>> I am also not sure what option you mean for the stream side?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 30) "If the table side uses a materialized version store the
>>> value
>>>>>>> is
>>>>>>>>>>>>> the latest by stream time rather than by offset within its
>>> defined
>>>>>>>>> grace
>>>>>>>>>>>>> period."
>>>>>>>>>>>>>
>>>>>>>>>>>>> The phrase "the value is the latest by stream time" is 
>>>>>>>>>>>>> confusing
>>>>> -- 
>>>>>>> in
>>>>>>>>>>>>> the end, a versioned stores multiple versions, not just one.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 40) I am also wondering about ordering. In general, KS 
>>>>>>>>>>>>> tries to
>>>>>>>>> preserve
>>>>>>>>>>>>> offset-order during processing (with some exception, when 
>>>>>>>>>>>>> offset
>>>>>>> order
>>>>>>>>>>>>> preservation is not clearly defined). Given that the 
>>>>>>>>>>>>> stream-side
>>>>>>>>> buffer
>>>>>>>>>>>>> is really just a "linear buffer", we could easily preserve
>>>>>>>>> offset-order.
>>>>>>>>>>>>> But I also see a benefit of re-ordering and emitting
>>> out-of-order
>>>>>>> data
>>>>>>>>>>>>> right away when read (instead of blocking them behind in-order
>>>>>>> records
>>>>>>>>>>>>> that are not ready yet). -- It might even be a possibility, to
>>> let
>>>>>>>>> users
>>>>>>>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name
>>> just
>>>>> a
>>>>>>>>>>>>> placeholder).
>>>>>>>>>>>>>
>>>>>>>>>>>>> The KIP should explain this in more detail and also discuss
>>>>>>> different
>>>>>>>>>>>>> options and mention them in "Rejected alternatives" in case we
>>>>> don't
>>>>>>>>>>>>> want to include them.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> 50) What happens when users change the grace period? 
>>>>>>>>>>>>> Especially,
>>>>>>> when
>>>>>>>>>>>>> they turn it on/off (but also increasing/decreasing is an
>>>>>>> interesting
>>>>>>>>>>>>> point)? I think we should try to support this if possible; the
>>>>>>>>>>>>> "Compatibility" section needs to cover switching on/off in 
>>>>>>>>>>>>> more
>>>>>>>>> detail.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
>>>>>>>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> A few clarifications:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 1. Is the order that records exit the buffer in 
>>>>>>>>>>>>>> necessarily the
>>>>>>> same
>>>>>>>>> as
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> order that records enter the buffer in, or no? Based on the
>>>>>>>>> description
>>>>>>>>>>>>> in
>>>>>>>>>>>>>> the KIP, it sounds like the answer is no, i.e., records will
>>> exit
>>>>>>> the
>>>>>>>>>>>>>> buffer in increasing timestamp order, which means that 
>>>>>>>>>>>>>> they may
>>>>> be
>>>>>>>>>>>>> ordered
>>>>>>>>>>>>>> (even for the same key) compared to the input order.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 2. What happens if the join grace period is nonzero, and a
>>>>>>>>> stream-side
>>>>>>>>>>>>>> record arrives with a timestamp that is older than the 
>>>>>>>>>>>>>> current
>>>>>>> stream
>>>>>>>>>>>>> time
>>>>>>>>>>>>>> minus the grace period? Will this record trigger a join 
>>>>>>>>>>>>>> result,
>>>>> or
>>>>>>>>> will
>>>>>>>>>>>>> it
>>>>>>>>>>>>>> be dropped? Based on the description for what happens when 
>>>>>>>>>>>>>> the
>>>>> join
>>>>>>>>>>>> grace
>>>>>>>>>>>>>> period is set to zero, it sounds like the late record will be
>>>>>>>>> dropped,
>>>>>>>>>>>>> even
>>>>>>>>>>>>>> if the join grace period is nonzero. Is that true?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> 3. What could cause stream time to advance, for purposes of
>>>>>>> removing
>>>>>>>>>>>>>> records from the join buffer? For example, will new records
>>>>>>> arriving
>>>>>>>>> on
>>>>>>>>>>>>> the
>>>>>>>>>>>>>> table side of the join cause stream time to advance? From the
>>> KIP
>>>>>>> it
>>>>>>>>>>>>> sounds
>>>>>>>>>>>>>> like only stream-side records will advance stream time -- 
>>>>>>>>>>>>>> does
>>>>> that
>>>>>>>>>>>> mean
>>>>>>>>>>>>>> that the join processor itself will have to track this stream
>>>>> time?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also +1 to Lucas's question about what options will be
>>> available
>>>>>>> for
>>>>>>>>>>>>>> configuring the join buffer. Will users have the option to
>>> choose
>>>>>>>>>>>> whether
>>>>>>>>>>>>>> they want the buffer to be in-memory vs persistent?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - Victoria
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
>>>>>>>>>>>>>> <lb...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> HI Walker,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> thanks for the KIP! We definitely need this. I have two
>>>>> questions:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>        - Have you considered allowing the customization 
>>>>>>>>>>>>>>> of the
>>>>>>>>> underlying
>>>>>>>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
>>>>>>>>> customize
>>>>>>>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it
>>> make
>>>>>>>>> sense
>>>>>>>>>>>>>>> for `Joined` to have this as well? I can imagine one may 
>>>>>>>>>>>>>>> want
>>> to
>>>>>>>>> limit
>>>>>>>>>>>>>>> the number of records in the buffer, for example. If we hit
>>> the
>>>>>>>>>>>>>>> maximum, the only option would be to drop semantic 
>>>>>>>>>>>>>>> guarantees,
>>>>> but
>>>>>>>>>>>>>>> users may still want to do this.
>>>>>>>>>>>>>>>        - With "second option on the table side" you are
>>> referring
>>>>> to
>>>>>>>>>>>>>>> versioned tables, right? Will the buffer on the stream side
>>>>> behave
>>>>>>>>> any
>>>>>>>>>>>>>>> different whether the table side is versioned or not?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Finally, I think a simple example in the motivation section
>>>>> could
>>>>>>>>> help
>>>>>>>>>>>>>>> non-experts understand the KIP.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>>> Lucas
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
>>>>>>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hello everybody,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have a stream proposal to improve the stream table 
>>>>>>>>>>>>>>>> join by
>>>>>>>>> adding a
>>>>>>>>>>>>>>> grace
>>>>>>>>>>>>>>>> period and buffer to the stream side of the join to allow
>>>>>>>>> processing
>>>>>>>>>>>> in
>>>>>>>>>>>>>>>> timestamp order matching the recent improvements of the
>>>>> versioned
>>>>>>>>>>>>> tables.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Please take a look here <
>>>>>>>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
>>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>>> share your thoughts.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>>> Walker
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Victoria,

that is a good point!

I think, the topic needs to be a compacted topic to be able to get rid 
of records that are evicted from the buffer. So the key might be 
something with the key, the timestamp, and a sequence number to 
distinguish between records with the same key and same timestamp.

Just an idea! Maybe Walker comes up with something better.

Best,
Bruno

On 05.06.23 20:38, Victoria Xia wrote:
> Hi Walker,
> 
> Thanks for the latest updates! The KIP looks great. Just one question about
> the changelog topic for the join buffer: The KIP says "When a failure
> occurs the buffer will try to recover from an OffsetCheckpoint if possible.
> If not it will reload the buffer from a compacted change-log topic." This
> is a new changelog topic that will be introduced specifically for the join
> buffer, right? Why is the changelog topic compacted? What are the keys? I
> am confused because the buffer contains records from the stream-side of the
> join, for which multiple records with the same key should be treated as
> separate updates will all must be tracked in the buffer, rather than
> updates which replace each other.
> 
> Thanks,
> Victoria
> 
> On Mon, Jun 5, 2023 at 1:47 AM Bruno Cadonna <ca...@apache.org> wrote:
> 
>> Hi Walker,
>>
>> Thanks once more for the updates to the KIP!
>>
>> Do you also plan to expose metrics for the buffer?
>>
>> Best,
>> Bruno
>>
>> On 02.06.23 17:16, Walker Carlson wrote:
>>> Hello Bruno,
>>>
>>> I think this covers your questions. Let me know what you think
>>>
>>> 2.
>>> We can use a changelog topic. I think we can treat it like any other
>> store
>>> and recover in the usual manner. Also implementation is on disk
>>>
>>> 3.
>>> The description is in the public interfaces description. I will copy it
>>> into the proposed changes as well.
>>>
>>> This is a bit of an implementation detail that I didn't want to add into
>>> the kip, but the record will be added to the buffer to keep the stream
>> time
>>> consistent, it will just be ejected immediately. If of course if this
>>> causes performance issues we will skip this step and track stream time
>>> separately. I will update the kip to say that stream time advances when a
>>> stream record enters the node.
>>>
>>> Also, yes, updated.
>>>
>>> 5.
>>> No there is no difference right now, everything gets processed as it
>> comes
>>> in and tries to find a record for its time stamp.
>>>
>>> Walker
>>>
>>> On Fri, Jun 2, 2023 at 6:41 AM Bruno Cadonna <ca...@apache.org> wrote:
>>>
>>>> Hi Walker,
>>>>
>>>> Thanks for the updates!
>>>>
>>>> 2.
>>>> It is still not clear to me how a failure is handled. I do not
>>>> understand what you mean by "recover from an OffsetCheckpoint".
>>>>
>>>> My understanding is that the buffer needs to be replicated into its own
>>>> Kafka topic. The input topic is not enough. The offset of a record is
>>>> added to the offsets to commit once the record is streamed through the
>>>> subtopology. That means once the record is added to the buffer its
>>>> offset is added to the offsets to commit -- independently of whether the
>>>> record was evicted from the buffer and sent to the join node or not.
>>>> Now, let's assume the following scenario
>>>> 1. a record is read from the input topic and added to the buffer, but
>>>> not evicted to be processed by the join node.
>>>> 2. When the processing of the subtopology finishes the offset of the
>>>> record is added to the offsets to commit.
>>>> 3. A commit happens.
>>>> 4. A failure happens
>>>>
>>>> After the failure the buffer is empty but the record will not be read
>>>> anymore from the input topic since its offset has been already
>>>> committed. The record is lost.
>>>> One solution to avoid the loss is to recreate the buffer from a
>>>> compacted Kafka topic as we do for suppression buffers. I do not think,
>>>> we need any offset checkpoint here since, we keep the buffer in memory,
>>>> right? Or do you plan to back the buffer with a persistent store? Even
>>>> in that case, a compacted Kafka topic would be needed.
>>>>
>>>>
>>>> 3.
>>>>    From the KIP it is still not clear to me what happens if a record is
>>>> outside of the grace period. I guess the record that falls outside of
>>>> the grace period will not be added to the buffer, but will be send to
>>>> the join node. Since it is outside of the grace period it will also not
>>>> increase stream time and it will not trigger an eviction. Also the head
>>>> of the buffer will not contain a record that needs to be evicted since
>>>> the the timestamp of the head record will be within the interval stream
>>>> time minus grace period. Is this correct? Please add such a description
>>>> to the KIP.
>>>> Furthermore, I think there is a mistake in the text:
>>>> "... will dequeue when the record timestamp is greater than stream time
>>>> plus the grace period". I guess that should be "... will dequeue when
>>>> the record timestamp is less than (or equal?) stream time minus the
>>>> grace period"
>>>>
>>>>
>>>> 5.
>>>> What is the difference between not setting the grace period and setting
>>>> it to zero? If there is a difference, why is there a difference?
>>>>
>>>>
>>>> Best,
>>>> Bruno
>>>>
>>>>
>>>> On 01.06.23 23:58, Walker Carlson wrote:
>>>>> Hey Bruno thanks for the feedback.
>>>>>
>>>>> 1)
>>>>> I will add this to the kip, but stream time only advances as the when
>> the
>>>>> buffer receives a new record.
>>>>>
>>>>> 2)
>>>>> You are correct, I will add a failure section on to the kip. Since the
>>>>> records wont change in the buffer from when they are read from the
>> topic
>>>>> they are replicated already.
>>>>>
>>>>> 3)
>>>>> I see that I'm out voted on the dropping of records thing. We will pass
>>>>> them on and try to join them if possible. This might cause some null
>>>>> results, but increasing the table history retention should help that.
>>>>>
>>>>> 4)
>>>>> I can add some on the kip. But its pretty directly adding whatever the
>>>>> grace period is to the latency. I don't see a way around it.
>>>>>
>>>>> Walker
>>>>>
>>>>> On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org>
>> wrote:
>>>>>
>>>>>> Hi Walker,
>>>>>>
>>>>>> thanks for the KIP!
>>>>>>
>>>>>> Here my feedback:
>>>>>>
>>>>>> 1.
>>>>>> It is still not clear to me when stream time for the buffer advances.
>>>>>> What is the event that let the stream time advance? In the
>> discussion, I
>>>>>> do not understand what you mean by "The segment store already has an
>>>>>> observed stream time, we advance based on that. That should only
>> advance
>>>>>> based on records that enter the store." Where does this segment store
>>>>>> come from? Anyways, I think it would be great to also state how stream
>>>>>> time advances in the KIP.
>>>>>>
>>>>>> 2.
>>>>>> How does the buffer behave in case of a failure? I think I understand
>>>>>> that the buffer will use an implementation of
>> TimeOrderedKeyValueBuffer
>>>>>> and therefore the records in the buffer will be replicated to a topic
>> in
>>>>>> Kafka, but I am not completely sure. Could you elaborate on this in
>> the
>>>>>> KIP?
>>>>>>
>>>>>> 3.
>>>>>> I agree with Matthias about dropping late records. We use grace
>> periods
>>>>>> in scenarios where we records are grouped like in windowed
>> aggregations
>>>>>> and windowed joins. The stream buffer you propose does not really
>> group
>>>>>> any records. It rather delays records and reorders them. I am not sure
>>>>>> if grace period is the right naming/concept to apply here. Instead of
>>>>>> dropping records that fall outside of the buffer's time interval the
>>>>>> join should skip the buffer and try to join the record immediately. In
>>>>>> the end, a stream-table join is a unwindowed join, i.e., no grouping
>> is
>>>>>> applied to the records.
>>>>>> What do you and other folks think about this proposal?
>>>>>>
>>>>>> 4.
>>>>>> How does the proposed buffer, affects processing latency? Could you
>>>>>> please add some words about this to the KIP?
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>> Bruno
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 31.05.23 01:49, Walker Carlson wrote:
>>>>>>> Thanks for all the additional comments. I will either address them
>> here
>>>>>> or
>>>>>>> update the kip accordingly.
>>>>>>>
>>>>>>>
>>>>>>> I mentioned a follow kip to add extra features before and in the
>>>>>> responses.
>>>>>>> I will try to briefly summarize what options and optimizations I plan
>>>> to
>>>>>>> include. If a concern is not covered in this list I for sure talk
>> about
>>>>>> it
>>>>>>> below.
>>>>>>>
>>>>>>> * Allowing non versioned tables to still use the stream buffer
>>>>>>> * Automatically materializing tables instead of forcing the user to
>> do
>>>> it
>>>>>>> * Configurable for in memory buffer
>>>>>>> * Order the records in offset order or in time order
>>>>>>> * Non memory use buffer (offset order, delayed pull from stream.)
>>>>>>> * Time synced between stream and table side (maybe)
>>>>>>> * Do not drop late records and process them as they come in instead.
>>>>>>>
>>>>>>>
>>>>>>> First, Victoria.
>>>>>>>
>>>>>>> 1) (One of your nits covers this, but you are correct it doesn't make
>>>>>>> sense. so I removed that part of the example.)
>>>>>>> For those examples with the "bad" join results I said without
>> buffering
>>>>>> the
>>>>>>> stream it would look like that, but that was incomplete. If the look
>> up
>>>>>> was
>>>>>>> simply looking at the latest version of the table when the stream
>>>> records
>>>>>>> came in then the results were possible. If we are using the point in
>>>> time
>>>>>>> lookup that versioned tables let us then you are correct the future
>>>>>> results
>>>>>>> are not possible.
>>>>>>>
>>>>>>> 2) I'll get to this later as Matthias brought up something related.
>>>>>>>
>>>>>>> To your additional thoughts, I agree that we need to call those
>> things
>>>>>> out
>>>>>>> in the documentation. I'm writing up a follow up kip with a lot of
>> the
>>>>>>> ideas we have discussed so that we can improve this feature beyond
>> the
>>>>>> base
>>>>>>> implementation if it's needed.
>>>>>>>
>>>>>>> I addressed the nits in the kip. I somehow missed the table stream
>>>> table
>>>>>>> join processor improvement, it makes your first question make a lot
>>>> more
>>>>>>> sense.  Table history retention is a much cleaner way to describe it.
>>>>>>>
>>>>>>> As to your mention of the syncing the time for the table and stream.
>>>>>>> Matthias mentioned that as well. I will address both here. I plan to
>>>>>> bring
>>>>>>> that up in the future, but for now we will leave it out. I suppose it
>>>>>> will
>>>>>>> be more useful after the table history retention is separable from
>> the
>>>>>>> table grace period.
>>>>>>>
>>>>>>>
>>>>>>> To address Matthias comments.
>>>>>>>
>>>>>>> You are correct by saying the in memory store shouldn't cause any
>>>>>> semantic
>>>>>>> concerns. My concern would be more with if we limited the number of
>>>>>> records
>>>>>>> on the buffer and what we would do if we hit said limits, (emitting
>>>> those
>>>>>>> records might be an issue, throwing an error and halting would not).
>> I
>>>>>>> think we can leave this discussion to the follow up kip along with a
>>>> few
>>>>>>> other options.
>>>>>>>
>>>>>>> I will go through your proposals now.
>>>>>>>
>>>>>>>       - don't support non-versioned KTables
>>>>>>>
>>>>>>> Sure, we can always expand this later on. Will include as part of the
>>>> of
>>>>>>> the improvement kip
>>>>>>>
>>>>>>>       - if grace period is added, users need to explicitly materialize
>>>> the
>>>>>>> table as version (either directly, or upstream. Upstream only works
>> if
>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>>>>>
>>>>>>> again, that works for me for now, if we find a use we can always add
>>>>>> later.
>>>>>>>
>>>>>>>       - the table's history retention time must be larger than the
>> grace
>>>>>>> period (should be easy to check at runtime, when we build the
>> topology)
>>>>>>>
>>>>>>> agreed
>>>>>>>
>>>>>>>       - because switching from non-versioned to version stores is not
>>>>>>> backward compatibly (cf KIP-914), users need to take care of this
>>>>>>> themselves, and this also implies that adding grace period is not a
>>>>>>> backward compatible change (even only if via indirect means)
>>>>>>>
>>>>>>> sure, this works
>>>>>>>
>>>>>>> As to the dropping of late records, I'm not sure. One one hand I like
>>>> not
>>>>>>> dropping things. But on the other I struggle to see how a user can
>>>> filter
>>>>>>> out late records that might have incomplete join results. The point
>> in
>>>>>> time
>>>>>>> look up will aggressively expire old data and if new data has been
>>>>>> replaced
>>>>>>> it will return null if outside of the retention. This seems like it
>>>> could
>>>>>>> corrupt the integrity of the join output. Seeing that we drop late
>>>>>> records
>>>>>>> on the table side as well I would think it makes sense to drop late
>>>>>> records
>>>>>>> on the stream buffer. I could be convinced otherwise I suppose, I
>> could
>>>>>> see
>>>>>>> adding this as an option in a follow up kip. It would be very easy to
>>>>>>> implement either way. For now unless no one else objects I'm going to
>>>>>> stick
>>>>>>> with dropping the records for the sake of getting this kip passed. It
>>>> is
>>>>>>> functionally a small change to make and we can update later if you
>> feel
>>>>>>> strongly about it.
>>>>>>>
>>>>>>> For the ordering. I have to say that it would be more complicated to
>>>>>>> implement it to be in offset order, if the goal it to get as many of
>>>> the
>>>>>>> records validly joined as possible. Because we would process as
>> things
>>>>>> left
>>>>>>> the buffer a sufficiency early enough record could hold up records
>> that
>>>>>>> would otherwise be valid past the table history retention. To fix
>> this
>>>> we
>>>>>>> could process by timestamp then store in a second queue and emit by
>>>>>> offset,
>>>>>>> but that would be a lot more complicated. If we didn't care about not
>>>>>>> missing some valid joins we could just have no store and pull from
>> the
>>>>>>> topic at a delay only caring about the timestamp of the next offset.
>>>> For
>>>>>>> now I want to stick with the timestamp ordering as it makes much more
>>>>>> sense
>>>>>>> to me, but would propose we add both of the other options I have laid
>>>> out
>>>>>>> here in the follow up kip.
>>>>>>>
>>>>>>> Lastly, I think having an empty store with zero grace period would be
>>>>>> super
>>>>>>> simple and not costly, so we might as well make it even if nothing
>> gets
>>>>>>> entered.
>>>>>>>
>>>>>>> I hope that address all your concerns,
>>>>>>>
>>>>>>> Walker
>>>>>>>
>>>>>>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org>
>>>>>> wrote:
>>>>>>>
>>>>>>>> Walker,
>>>>>>>>
>>>>>>>> thanks for the updates. The KIP itself reads fine (of course
>> Victoria
>>>>>>>> made good comments about some phrases), but there is a couple of
>>>> things
>>>>>>>> from your latest reply I don't understand, and that I still think
>> need
>>>>>>>> some more discussions.
>>>>>>>>
>>>>>>>> Lukas, asked about in-memory option and `WindowStoreSupplier` and
>> you
>>>>>>>> mention "semantic concerns". There should not be any semantic
>>>> difference
>>>>>>>> from the underlying buffer implementation, so I am not sure what you
>>>>>>>> mean here (also the relationship to suppress() is unclear to me)?
>> -- I
>>>>>>>> am ok to not make it configurable for now. We can always do it via a
>>>>>>>> follow up KIP, and keep interface changes limited for now.
>>>>>>>>
>>>>>>>> Does it really make sense to allow a grace period if the table is
>>>>>>>> non-versioned? You also say: "If table is not materialized it will
>>>>>>>> materialize it as versioned." -- What history retention time would
>> we
>>>>>>>> pick for this case (also asked by Victoria)? Or should we rather not
>>>>>>>> support this and force the user to materialize the table explicitly,
>>>> and
>>>>>>>> thus explicitly picking a history retention time? It's tradeoff
>>>> between
>>>>>>>> usability and guiding uses that there will be a significant impact
>> on
>>>>>>>> disk usage. There is also compatibility concerns: If the table is
>> not
>>>>>>>> explicitly materialized in the old program, we would already need to
>>>>>>>> materialize it also in the old program (of course, we would use a
>>>>>>>> non-versioned store so far). Thus, if somebody adds a grace period,
>> we
>>>>>>>> cannot just switch the store type, as it would be a breaking change,
>>>>>>>> potentially required an application re-set, or following the upgrade
>>>>>>>> path for versioned state stores, and also changing the program to
>>>>>>>> explicitly materialize using a versioned store. Also note, that we
>>>> might
>>>>>>>> not materialize the actual join table, but only an upstream table,
>> and
>>>>>>>> use `ValueGetter` to access the upstream data.
>>>>>>>>
>>>>>>>> To this end, as you already mentioned, history retention of the
>> table
>>>>>>>> should be at least grace period. You proposed to include this in a
>>>>>>>> follow up KIP, but I am wondering if it's a fundamental requirement
>>>> and
>>>>>>>> thus we should put a check in place right away and reject an invalid
>>>>>>>> configuration? (It always easier to lift restriction than to
>> introduce
>>>>>>>> them later.) This would also imply that a non-versioned table cannot
>>>> be
>>>>>>>> supported, because it does not have a history retention that is
>> larger
>>>>>>>> than grace period, and maybe also answer the requirement about
>>>>>>>> materialization: as we already always materialize something on the
>>>>>>>> tablet side as non-versioned store right now, it seems difficult to
>>>>>>>> migrate the store to a versioned store. Ie, it might be ok to push
>> the
>>>>>>>> burden onto the user and say: if you start using grace period, you
>>>> also
>>>>>>>> need to manually switch from non-versioned to versioned KTables.
>> Doing
>>>>>>>> stuff automatically under the hood if very complex for this case, we
>>>> if
>>>>>>>> we push the burden onto the user, it might be ok to not complicate
>>>> this
>>>>>>>> KIP significantly.
>>>>>>>>
>>>>>>>> To summarize the last two paragraphs, I would propose to:
>>>>>>>>       - don't support non-versioned KTables
>>>>>>>>       - if grace period is added, users need to explicitly
>> materialize
>>>> the
>>>>>>>> table as version (either directly, or upstream. Upstream only works
>> if
>>>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>>>>>>       - the table's history retention time must be larger than the
>> grace
>>>>>>>> period (should be easy to check at runtime, when we build the
>>>> topology)
>>>>>>>>       - because switching from non-versioned to version stores is not
>>>>>>>> backward compatibly (cf KIP-914), users need to take care of this
>>>>>>>> themselves, and this also implies that adding grace period is not a
>>>>>>>> backward compatible change (even only if via indirect means)
>>>>>>>>
>>>>>>>> About dropping late records: wondering if we should never drop a
>>>>>>>> stream-side record for a left-join, even if it's late? In general,
>> one
>>>>>>>> thing I observed over the years is, that it's easier to keep stuff
>> and
>>>>>>>> let users filter explicitly downstream (or make it configurable),
>>>>>>>> instead of dropping pro-actively, because users have no good way to
>>>>>>>> resurrect record that got already dropped.
>>>>>>>>
>>>>>>>> For ordering, sounds reasonable to me only start with one
>>>>>>>> implementation, and maybe make it configurable as a follow up.
>>>> However,
>>>>>>>> I am wondering if starting with offset order might be the better
>>>> option
>>>>>>>> as it seems to align more with what we do so far? So instead of
>>>> storing
>>>>>>>> record ordered by timestamp, we can just store them ordered by
>> offset,
>>>>>>>> and still "poll" from the buffer based on the head records
>> timestamp.
>>>> Or
>>>>>>>> would this complicate the implementation significantly?
>>>>>>>>
>>>>>>>> I also think it's ok to not "sync" stream-time between the table and
>>>> the
>>>>>>>> stream in this KIP, but we should consider doing this as a follow up
>>>>>>>> change (not sure if we would need a KIP or not for a change this
>>>> this).
>>>>>>>>
>>>>>>>> About increasing/decreasing grace period: what you describe make
>> sense
>>>>>>>> to me. If decreased, the next record would just trigger emitting a
>> lot
>>>>>>>> of records, and for increase, the buffer would just need to "fill
>> up"
>>>>>>>> again. For reprocessing getting a different result with a different
>>>>>>>> grace period is expected, so that's ok IMHO. -- There seems to be
>> one
>>>>>>>> special corner case: grace period zero. For this case, we actually
>>>> don't
>>>>>>>> need any store, and the stream-side could be stateless. I think it
>> can
>>>>>>>> have the same behavior, but if we want to "add / remove" the store
>>>>>>>> dynamically, we need to add specific code for it. For example, even
>> if
>>>>>>>> we start up with a grace period of zero, we would need to check if
>>>> there
>>>>>>>> is a local store, and still emit everything in it, before we can
>> ditch
>>>>>>>> the store (not sure if that's even easily done at all). Or: we would
>>>>>>>> need to have a store for _all_ cases, even if grace period is zero
>>>> (the
>>>>>>>> store would be empty all the time though), to avoid super complex
>>>> code?
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>>
>>>>>>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
>>>>>>>>> Hi Walker,
>>>>>>>>>
>>>>>>>>> thanks for your responses. That makes sense. I guess there is
>> always
>>>>>>>>> the option to make the implementation more configurable later on,
>> if
>>>>>>>>> users request it. Also thanks for the clarifications. From my side,
>>>>>>>>> the KIP is good to go.
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Lucas
>>>>>>>>>
>>>>>>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
>>>>>>>>> <vi...@confluent.io.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>> Thanks for the updates, Walker! Looks great, though I do have a
>>>> couple
>>>>>>>>>> questions about the latest updates:
>>>>>>>>>>
>>>>>>>>>>         1. The new example says that without stream-side buffering,
>>>> "ex"
>>>>>> and
>>>>>>>>>>         "fy" are possible join results. How could those join
>> results
>>>>>>>> happen? The
>>>>>>>>>>         example versioned table suggests that table record "x" has
>>>>>>>> timestamp 2, and
>>>>>>>>>>         table record "y" has timestamp 3. If stream record "e" has
>>>>>>>> timestamp 1,
>>>>>>>>>>         then it can never be joined against record "x", and
>> similarly
>>>> for
>>>>>>>> stream
>>>>>>>>>>         record "f" with timestamp 2 being joined against "y".
>>>>>>>>>>         2. I see in your replies above that "If table is not
>>>>>> materialized it
>>>>>>>>>>         will materialize it as versioned" but I don't see this
>> called
>>>> out
>>>>>>>> in the
>>>>>>>>>>         KIP -- seems worth calling out. Also, what will the history
>>>>>>>> retention for
>>>>>>>>>>         the versioned table be? Will it be the same as the join
>> grace
>>>>>>>> period, or
>>>>>>>>>>         will it be greater?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> And some additional thoughts:
>>>>>>>>>>
>>>>>>>>>> Sounds like there are a few things users should watch out for when
>>>>>>>> enabling
>>>>>>>>>> the stream-side buffer:
>>>>>>>>>>
>>>>>>>>>>         - Records will get "stuck" if there are no newer records to
>>>>>> advance
>>>>>>>>>>         stream time.
>>>>>>>>>>         - If there are large gaps between the timestamps of
>>>> stream-side
>>>>>>>> records,
>>>>>>>>>>         then it's possible that versioned store history retention
>> will
>>>>>> have
>>>>>>>> expired
>>>>>>>>>>         by the time a record is evicted from the join buffer,
>> leading
>>>> to
>>>>>> a
>>>>>>>> join
>>>>>>>>>>         "miss." For example, if the join grace period and table
>>>> history
>>>>>>>> retention
>>>>>>>>>>         are both 10, and records come in the order:
>>>>>>>>>>
>>>>>>>>>>         table side t0 with ts=0
>>>>>>>>>>         stream side s1 with ts=1 <-- enters buffer
>>>>>>>>>>         table side t10 with ts=10
>>>>>>>>>>         table side t20 with ts=20
>>>>>>>>>>         stream side s21 with ts=21 <-- evicts record s1 from
>> buffer,
>>>> but
>>>>>>>>>>         versioned store no longer contains data for ts=1 due to
>>>> history
>>>>>>>> retention
>>>>>>>>>>         having elapsed
>>>>>>>>>>
>>>>>>>>>>         This will result in the join result (s1, null) even though
>> it
>>>>>>>> should've
>>>>>>>>>>         been (s1, t0), due to t0 having been expired from the
>>>> versioned
>>>>>>>> store
>>>>>>>>>>         already.
>>>>>>>>>>         - Out-of-order records from the stream-side will be
>> reordered,
>>>>>> and
>>>>>>>> late
>>>>>>>>>>         records will be dropped.
>>>>>>>>>>
>>>>>>>>>> I don't think any of these are reasons to not go forward with this
>>>>>> KIP,
>>>>>>>> but
>>>>>>>>>> it'd be good to call them out in the eventual documentation to
>>>>>> decrease
>>>>>>>> the
>>>>>>>>>> chance users get tripped up.
>>>>>>>>>>
>>>>>>>>>>> We could maybe do an improvement later to advance stream time
>> from
>>>>>>>> table
>>>>>>>>>> side as well, but that might be debatable as we might get more
>> late
>>>>>>>> records.
>>>>>>>>>>
>>>>>>>>>> Yes, the likelihood of late records increases but also the
>>>> likelihood
>>>>>> of
>>>>>>>>>> "join misses" due to versioned store history retention having
>>>> elapsed
>>>>>>>>>> decreases, which feels important for certain use cases. Either
>> way,
>>>>>>>> agreed
>>>>>>>>>> that it can be a discussion for the future as incorporating this
>>>> would
>>>>>>>>>> substantially complicate the implementation.
>>>>>>>>>>
>>>>>>>>>> Also a couple nits:
>>>>>>>>>>
>>>>>>>>>>         - The KIP currently says "We recently added versioned
>> tables
>>>>>> which
>>>>>>>> allow
>>>>>>>>>>         the table side of the a join [...] but it is not taken
>>>> advantage
>>>>>> of
>>>>>>>> in
>>>>>>>>>>         joins," but this doesn't seem true? If the table of a
>>>>>> stream-table
>>>>>>>> join is
>>>>>>>>>>         versioned, then the DSL's stream-table join processor will
>>>>>>>> automatically
>>>>>>>>>>         perform timestamped lookups into the table, in order to
>> take
>>>>>>>> advantage of
>>>>>>>>>>         the new timestamp-aware store to provide better join
>>>> semantics.
>>>>>>>>>>         - The KIP mentions "grace period" for versioned stores in a
>>>>>> number
>>>>>>>> of
>>>>>>>>>>         places but I think you actually mean "history retention"?
>> The
>>>> two
>>>>>>>> happen to
>>>>>>>>>>         be the same today (it is not an option for users to
>> configure
>>>> the
>>>>>>>> two
>>>>>>>>>>         separately) but this need not be true in the future.
>> "History
>>>>>>>> retention"
>>>>>>>>>>         governs how far back in time reads may occur, which is the
>>>>>> relevant
>>>>>>>>>>         parameter for performing lookups as part of the
>> stream-table
>>>>>> join.
>>>>>>>> "Grace
>>>>>>>>>>         period" in the context of versioned stores refers to how
>> far
>>>> back
>>>>>>>> in time
>>>>>>>>>>         out-of-order writes may occur, which probably isn't
>> directly
>>>>>>>> relevant for
>>>>>>>>>>         introducing a stream-side buffer, though it's also possible
>>>> I've
>>>>>>>> overlooked
>>>>>>>>>>         something. (As a bonus, switching from "table grace
>> period" in
>>>>>> the
>>>>>>>> KIP to
>>>>>>>>>>         "table history retention" also helps to clarify/distinguish
>>>> that
>>>>>>>> it's a
>>>>>>>>>>         different parameter from the "join grace period," which I
>>>> could
>>>>>> see
>>>>>>>> being
>>>>>>>>>>         confusing to readers. :) )
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Cheers,
>>>>>>>>>> Victoria
>>>>>>>>>>
>>>>>>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hey all,
>>>>>>>>>>>
>>>>>>>>>>> Thanks for the comments, they gave me a lot to think about. I'll
>>>> try
>>>>>> to
>>>>>>>>>>> address them all inorder. I have made some updates to the kip
>>>> related
>>>>>>>> to
>>>>>>>>>>> them, but I mention where below.
>>>>>>>>>>>
>>>>>>>>>>> Lucas
>>>>>>>>>>>
>>>>>>>>>>> Good idea about the example. I added a simple one.
>>>>>>>>>>>
>>>>>>>>>>> 1) I have thought about including options for the underlying
>> buffer
>>>>>>>>>>> configuration. One of which might be adding an in memory option.
>> My
>>>>>>>> biggest
>>>>>>>>>>> concern is about the semantic guarantees. This isn't like
>> suppress
>>>> or
>>>>>>>> with
>>>>>>>>>>> windows where producing incomplete results is repetitively
>>>> harmless.
>>>>>>>> Here
>>>>>>>>>>> we would be possibly producing incorrect results. I also would
>> like
>>>>>> to
>>>>>>>> keep
>>>>>>>>>>> the interface changes as simple as I can. Making more than this
>>>>>> change
>>>>>>>> to
>>>>>>>>>>> Joined I feel could make this more complicated than it needs to
>> be.
>>>>>> If
>>>>>>>> we
>>>>>>>>>>> really want to I could see adding a grace() option with a
>>>>>> BufferConifg
>>>>>>>> in
>>>>>>>>>>> there or something, but I would rather not.
>>>>>>>>>>>
>>>>>>>>>>> 2) The buffer will be independent of if the table is versioned or
>>>>>> not.
>>>>>>>> If
>>>>>>>>>>> table is not materialized it will materialize it as versioned. It
>>>>>> might
>>>>>>>>>>> make sense to do a follow up kip where we force the retention
>>>> period
>>>>>>>> of
>>>>>>>>>>> the versioned to be greater than whatever the max of the stream
>>>>>> buffer
>>>>>>>> is.
>>>>>>>>>>>
>>>>>>>>>>> Victoria
>>>>>>>>>>>
>>>>>>>>>>> 1) Yes, records will exit in timestamp order not in offset order.
>>>>>>>>>>> 2) Late records will be dropped (Late as out of the grace
>> period).
>>>>>>>>     From my
>>>>>>>>>>> understanding that is the point of a grace period, no? Doesn't
>> the
>>>>>> same
>>>>>>>>>>> thing happen with versioned stores?
>>>>>>>>>>> 3) The segment store already has an observed stream time, we
>>>> advance
>>>>>>>> based
>>>>>>>>>>> on that. That should only advance based on records that enter the
>>>>>>>> store. So
>>>>>>>>>>> yes, only stream side records. We could maybe do an improvement
>>>> later
>>>>>>>> to
>>>>>>>>>>> advance stream time from table side as well, but that might be
>>>>>>>> debatable as
>>>>>>>>>>> we might get more late records. Anyways I would rather have that
>>>> as a
>>>>>>>>>>> separate discussion.
>>>>>>>>>>>
>>>>>>>>>>> in memory option? We can do that, for the buffer I plan to use
>> the
>>>>>>>>>>> TimeOrderedKeyValueBuffer interface which already has an in
>> memory
>>>>>>>>>>> implantation, so it would be simple.
>>>>>>>>>>>
>>>>>>>>>>> I said more in my answer to Lucas's question. The concern I have
>>>> with
>>>>>>>>>>> buffer configs or in memory is complicating the interface. Also
>>>>>>>> semantic
>>>>>>>>>>> guarantees but in memory shouldn't effect that
>>>>>>>>>>>
>>>>>>>>>>> Matthias
>>>>>>>>>>>
>>>>>>>>>>> 1) fixed out of order vs late terminology in the kip.
>>>>>>>>>>>
>>>>>>>>>>> 2) I was referring to having a stream. So after this kip we can
>>>> have
>>>>>> a
>>>>>>>>>>> buffered stream or a normal one. For the table we can use a
>>>> versioned
>>>>>>>> table
>>>>>>>>>>> or a normal table.
>>>>>>>>>>>
>>>>>>>>>>> 3 Good call out. I clarified this as "If the table side uses a
>>>>>>>> materialized
>>>>>>>>>>> version store, it can store multiple versions of each record
>> within
>>>>>> its
>>>>>>>>>>> defined grace period." and modified the rest of the paragraph a
>>>> bit.
>>>>>>>>>>>
>>>>>>>>>>> 4) I get the preserving off offset ordering, but if the stream is
>>>>>>>> buffered
>>>>>>>>>>> to join on timestamp instead of offset doesn't it already seem
>> like
>>>>>> we
>>>>>>>> care
>>>>>>>>>>> more about time in this case?
>>>>>>>>>>>
>>>>>>>>>>> If we end up adding more options it might make sense to do this.
>>>>>> Maybe
>>>>>>>>>>> offset order processing can be a follow up?
>>>>>>>>>>>
>>>>>>>>>>> I'll add a section for this in Rejected Alternatives. I think it
>>>>>> makes
>>>>>>>>>>> sense to do something like this but maybe in a follow up.
>>>>>>>>>>>
>>>>>>>>>>> 5) I hadn't thought about this. I suppose if they changed this in
>>>> an
>>>>>>>>>>> upgrade the next record would either evict a lot of records (if
>> the
>>>>>>>> grace
>>>>>>>>>>> period decreased) or there would be a pause until the new grace
>>>>>> period
>>>>>>>>>>> reached. Increasing is a bit more problematic, especially if the
>>>>>> table
>>>>>>>>>>> grace period and retention time stays the same. If the data is
>>>>>>>> reprocessed
>>>>>>>>>>> after a change like that then there would be different results,
>>>> but I
>>>>>>>> feel
>>>>>>>>>>> like that would be expected after such a change.
>>>>>>>>>>>
>>>>>>>>>>> What do you think should happen?
>>>>>>>>>>>
>>>>>>>>>>> Hopefully this answers your questions!
>>>>>>>>>>>
>>>>>>>>>>> Walker
>>>>>>>>>>>
>>>>>>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <
>> mjsax@apache.org>
>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Thanks for the KIP! Also some question/comments from my side:
>>>>>>>>>>>>
>>>>>>>>>>>> 10) Notation: you use the term "late data" but I think you mean
>>>>>>>>>>>> out-of-order. We reserve the term "late" to records that arrive
>>>>>> after
>>>>>>>>>>>> grace period passed, and thus, "late == out-of-order data that
>> is
>>>>>>>>>>> dropped".
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 20) "There is only one option from the stream side and only
>>>> recently
>>>>>>>> is
>>>>>>>>>>>> there a second option on the table side."
>>>>>>>>>>>>
>>>>>>>>>>>> What are those options? Victoria already asked about the table
>>>> side,
>>>>>>>> but
>>>>>>>>>>>> I am also not sure what option you mean for the stream side?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 30) "If the table side uses a materialized version store the
>> value
>>>>>> is
>>>>>>>>>>>> the latest by stream time rather than by offset within its
>> defined
>>>>>>>> grace
>>>>>>>>>>>> period."
>>>>>>>>>>>>
>>>>>>>>>>>> The phrase "the value is the latest by stream time" is confusing
>>>> --
>>>>>> in
>>>>>>>>>>>> the end, a versioned stores multiple versions, not just one.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 40) I am also wondering about ordering. In general, KS tries to
>>>>>>>> preserve
>>>>>>>>>>>> offset-order during processing (with some exception, when offset
>>>>>> order
>>>>>>>>>>>> preservation is not clearly defined). Given that the stream-side
>>>>>>>> buffer
>>>>>>>>>>>> is really just a "linear buffer", we could easily preserve
>>>>>>>> offset-order.
>>>>>>>>>>>> But I also see a benefit of re-ordering and emitting
>> out-of-order
>>>>>> data
>>>>>>>>>>>> right away when read (instead of blocking them behind in-order
>>>>>> records
>>>>>>>>>>>> that are not ready yet). -- It might even be a possibility, to
>> let
>>>>>>>> users
>>>>>>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name
>> just
>>>> a
>>>>>>>>>>>> placeholder).
>>>>>>>>>>>>
>>>>>>>>>>>> The KIP should explain this in more detail and also discuss
>>>>>> different
>>>>>>>>>>>> options and mention them in "Rejected alternatives" in case we
>>>> don't
>>>>>>>>>>>> want to include them.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> 50) What happens when users change the grace period? Especially,
>>>>>> when
>>>>>>>>>>>> they turn it on/off (but also increasing/decreasing is an
>>>>>> interesting
>>>>>>>>>>>> point)? I think we should try to support this if possible; the
>>>>>>>>>>>> "Compatibility" section needs to cover switching on/off in more
>>>>>>>> detail.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> -Matthias
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
>>>>>>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
>>>>>>>>>>>>>
>>>>>>>>>>>>> A few clarifications:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1. Is the order that records exit the buffer in necessarily the
>>>>>> same
>>>>>>>> as
>>>>>>>>>>>> the
>>>>>>>>>>>>> order that records enter the buffer in, or no? Based on the
>>>>>>>> description
>>>>>>>>>>>> in
>>>>>>>>>>>>> the KIP, it sounds like the answer is no, i.e., records will
>> exit
>>>>>> the
>>>>>>>>>>>>> buffer in increasing timestamp order, which means that they may
>>>> be
>>>>>>>>>>>> ordered
>>>>>>>>>>>>> (even for the same key) compared to the input order.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 2. What happens if the join grace period is nonzero, and a
>>>>>>>> stream-side
>>>>>>>>>>>>> record arrives with a timestamp that is older than the current
>>>>>> stream
>>>>>>>>>>>> time
>>>>>>>>>>>>> minus the grace period? Will this record trigger a join result,
>>>> or
>>>>>>>> will
>>>>>>>>>>>> it
>>>>>>>>>>>>> be dropped? Based on the description for what happens when the
>>>> join
>>>>>>>>>>> grace
>>>>>>>>>>>>> period is set to zero, it sounds like the late record will be
>>>>>>>> dropped,
>>>>>>>>>>>> even
>>>>>>>>>>>>> if the join grace period is nonzero. Is that true?
>>>>>>>>>>>>>
>>>>>>>>>>>>> 3. What could cause stream time to advance, for purposes of
>>>>>> removing
>>>>>>>>>>>>> records from the join buffer? For example, will new records
>>>>>> arriving
>>>>>>>> on
>>>>>>>>>>>> the
>>>>>>>>>>>>> table side of the join cause stream time to advance? From the
>> KIP
>>>>>> it
>>>>>>>>>>>> sounds
>>>>>>>>>>>>> like only stream-side records will advance stream time -- does
>>>> that
>>>>>>>>>>> mean
>>>>>>>>>>>>> that the join processor itself will have to track this stream
>>>> time?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also +1 to Lucas's question about what options will be
>> available
>>>>>> for
>>>>>>>>>>>>> configuring the join buffer. Will users have the option to
>> choose
>>>>>>>>>>> whether
>>>>>>>>>>>>> they want the buffer to be in-memory vs persistent?
>>>>>>>>>>>>>
>>>>>>>>>>>>> - Victoria
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
>>>>>>>>>>>>> <lb...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> HI Walker,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> thanks for the KIP! We definitely need this. I have two
>>>> questions:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>        - Have you considered allowing the customization of the
>>>>>>>> underlying
>>>>>>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
>>>>>>>> customize
>>>>>>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it
>> make
>>>>>>>> sense
>>>>>>>>>>>>>> for `Joined` to have this as well? I can imagine one may want
>> to
>>>>>>>> limit
>>>>>>>>>>>>>> the number of records in the buffer, for example. If we hit
>> the
>>>>>>>>>>>>>> maximum, the only option would be to drop semantic guarantees,
>>>> but
>>>>>>>>>>>>>> users may still want to do this.
>>>>>>>>>>>>>>        - With "second option on the table side" you are
>> referring
>>>> to
>>>>>>>>>>>>>> versioned tables, right? Will the buffer on the stream side
>>>> behave
>>>>>>>> any
>>>>>>>>>>>>>> different whether the table side is versioned or not?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Finally, I think a simple example in the motivation section
>>>> could
>>>>>>>> help
>>>>>>>>>>>>>> non-experts understand the KIP.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Lucas
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
>>>>>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hello everybody,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have a stream proposal to improve the stream table join by
>>>>>>>> adding a
>>>>>>>>>>>>>> grace
>>>>>>>>>>>>>>> period and buffer to the stream side of the join to allow
>>>>>>>> processing
>>>>>>>>>>> in
>>>>>>>>>>>>>>> timestamp order matching the recent improvements of the
>>>> versioned
>>>>>>>>>>>> tables.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Please take a look here <
>>>>>>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
>>>>>>>>>>>>>> and
>>>>>>>>>>>>>>> share your thoughts.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> best,
>>>>>>>>>>>>>>> Walker
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Victoria Xia <vi...@confluent.io.INVALID>.

Hi Walker,

Thanks for the latest updates! The KIP looks great. Just one question about
the changelog topic for the join buffer: The KIP says "When a failure
occurs the buffer will try to recover from an OffsetCheckpoint if possible.
If not it will reload the buffer from a compacted change-log topic." This
is a new changelog topic that will be introduced specifically for the join
buffer, right? Why is the changelog topic compacted? What are the keys? I
am confused because the buffer contains records from the stream-side of the
join, for which multiple records with the same key should be treated as
separate updates will all must be tracked in the buffer, rather than
updates which replace each other.

Thanks,
Victoria

On Mon, Jun 5, 2023 at 1:47 AM Bruno Cadonna <ca...@apache.org> wrote:

> Hi Walker,
>
> Thanks once more for the updates to the KIP!
>
> Do you also plan to expose metrics for the buffer?
>
> Best,
> Bruno
>
> On 02.06.23 17:16, Walker Carlson wrote:
> > Hello Bruno,
> >
> > I think this covers your questions. Let me know what you think
> >
> > 2.
> > We can use a changelog topic. I think we can treat it like any other
> store
> > and recover in the usual manner. Also implementation is on disk
> >
> > 3.
> > The description is in the public interfaces description. I will copy it
> > into the proposed changes as well.
> >
> > This is a bit of an implementation detail that I didn't want to add into
> > the kip, but the record will be added to the buffer to keep the stream
> time
> > consistent, it will just be ejected immediately. If of course if this
> > causes performance issues we will skip this step and track stream time
> > separately. I will update the kip to say that stream time advances when a
> > stream record enters the node.
> >
> > Also, yes, updated.
> >
> > 5.
> > No there is no difference right now, everything gets processed as it
> comes
> > in and tries to find a record for its time stamp.
> >
> > Walker
> >
> > On Fri, Jun 2, 2023 at 6:41 AM Bruno Cadonna <ca...@apache.org> wrote:
> >
> >> Hi Walker,
> >>
> >> Thanks for the updates!
> >>
> >> 2.
> >> It is still not clear to me how a failure is handled. I do not
> >> understand what you mean by "recover from an OffsetCheckpoint".
> >>
> >> My understanding is that the buffer needs to be replicated into its own
> >> Kafka topic. The input topic is not enough. The offset of a record is
> >> added to the offsets to commit once the record is streamed through the
> >> subtopology. That means once the record is added to the buffer its
> >> offset is added to the offsets to commit -- independently of whether the
> >> record was evicted from the buffer and sent to the join node or not.
> >> Now, let's assume the following scenario
> >> 1. a record is read from the input topic and added to the buffer, but
> >> not evicted to be processed by the join node.
> >> 2. When the processing of the subtopology finishes the offset of the
> >> record is added to the offsets to commit.
> >> 3. A commit happens.
> >> 4. A failure happens
> >>
> >> After the failure the buffer is empty but the record will not be read
> >> anymore from the input topic since its offset has been already
> >> committed. The record is lost.
> >> One solution to avoid the loss is to recreate the buffer from a
> >> compacted Kafka topic as we do for suppression buffers. I do not think,
> >> we need any offset checkpoint here since, we keep the buffer in memory,
> >> right? Or do you plan to back the buffer with a persistent store? Even
> >> in that case, a compacted Kafka topic would be needed.
> >>
> >>
> >> 3.
> >>   From the KIP it is still not clear to me what happens if a record is
> >> outside of the grace period. I guess the record that falls outside of
> >> the grace period will not be added to the buffer, but will be send to
> >> the join node. Since it is outside of the grace period it will also not
> >> increase stream time and it will not trigger an eviction. Also the head
> >> of the buffer will not contain a record that needs to be evicted since
> >> the the timestamp of the head record will be within the interval stream
> >> time minus grace period. Is this correct? Please add such a description
> >> to the KIP.
> >> Furthermore, I think there is a mistake in the text:
> >> "... will dequeue when the record timestamp is greater than stream time
> >> plus the grace period". I guess that should be "... will dequeue when
> >> the record timestamp is less than (or equal?) stream time minus the
> >> grace period"
> >>
> >>
> >> 5.
> >> What is the difference between not setting the grace period and setting
> >> it to zero? If there is a difference, why is there a difference?
> >>
> >>
> >> Best,
> >> Bruno
> >>
> >>
> >> On 01.06.23 23:58, Walker Carlson wrote:
> >>> Hey Bruno thanks for the feedback.
> >>>
> >>> 1)
> >>> I will add this to the kip, but stream time only advances as the when
> the
> >>> buffer receives a new record.
> >>>
> >>> 2)
> >>> You are correct, I will add a failure section on to the kip. Since the
> >>> records wont change in the buffer from when they are read from the
> topic
> >>> they are replicated already.
> >>>
> >>> 3)
> >>> I see that I'm out voted on the dropping of records thing. We will pass
> >>> them on and try to join them if possible. This might cause some null
> >>> results, but increasing the table history retention should help that.
> >>>
> >>> 4)
> >>> I can add some on the kip. But its pretty directly adding whatever the
> >>> grace period is to the latency. I don't see a way around it.
> >>>
> >>> Walker
> >>>
> >>> On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org>
> wrote:
> >>>
> >>>> Hi Walker,
> >>>>
> >>>> thanks for the KIP!
> >>>>
> >>>> Here my feedback:
> >>>>
> >>>> 1.
> >>>> It is still not clear to me when stream time for the buffer advances.
> >>>> What is the event that let the stream time advance? In the
> discussion, I
> >>>> do not understand what you mean by "The segment store already has an
> >>>> observed stream time, we advance based on that. That should only
> advance
> >>>> based on records that enter the store." Where does this segment store
> >>>> come from? Anyways, I think it would be great to also state how stream
> >>>> time advances in the KIP.
> >>>>
> >>>> 2.
> >>>> How does the buffer behave in case of a failure? I think I understand
> >>>> that the buffer will use an implementation of
> TimeOrderedKeyValueBuffer
> >>>> and therefore the records in the buffer will be replicated to a topic
> in
> >>>> Kafka, but I am not completely sure. Could you elaborate on this in
> the
> >>>> KIP?
> >>>>
> >>>> 3.
> >>>> I agree with Matthias about dropping late records. We use grace
> periods
> >>>> in scenarios where we records are grouped like in windowed
> aggregations
> >>>> and windowed joins. The stream buffer you propose does not really
> group
> >>>> any records. It rather delays records and reorders them. I am not sure
> >>>> if grace period is the right naming/concept to apply here. Instead of
> >>>> dropping records that fall outside of the buffer's time interval the
> >>>> join should skip the buffer and try to join the record immediately. In
> >>>> the end, a stream-table join is a unwindowed join, i.e., no grouping
> is
> >>>> applied to the records.
> >>>> What do you and other folks think about this proposal?
> >>>>
> >>>> 4.
> >>>> How does the proposed buffer, affects processing latency? Could you
> >>>> please add some words about this to the KIP?
> >>>>
> >>>>
> >>>> Best,
> >>>> Bruno
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 31.05.23 01:49, Walker Carlson wrote:
> >>>>> Thanks for all the additional comments. I will either address them
> here
> >>>> or
> >>>>> update the kip accordingly.
> >>>>>
> >>>>>
> >>>>> I mentioned a follow kip to add extra features before and in the
> >>>> responses.
> >>>>> I will try to briefly summarize what options and optimizations I plan
> >> to
> >>>>> include. If a concern is not covered in this list I for sure talk
> about
> >>>> it
> >>>>> below.
> >>>>>
> >>>>> * Allowing non versioned tables to still use the stream buffer
> >>>>> * Automatically materializing tables instead of forcing the user to
> do
> >> it
> >>>>> * Configurable for in memory buffer
> >>>>> * Order the records in offset order or in time order
> >>>>> * Non memory use buffer (offset order, delayed pull from stream.)
> >>>>> * Time synced between stream and table side (maybe)
> >>>>> * Do not drop late records and process them as they come in instead.
> >>>>>
> >>>>>
> >>>>> First, Victoria.
> >>>>>
> >>>>> 1) (One of your nits covers this, but you are correct it doesn't make
> >>>>> sense. so I removed that part of the example.)
> >>>>> For those examples with the "bad" join results I said without
> buffering
> >>>> the
> >>>>> stream it would look like that, but that was incomplete. If the look
> up
> >>>> was
> >>>>> simply looking at the latest version of the table when the stream
> >> records
> >>>>> came in then the results were possible. If we are using the point in
> >> time
> >>>>> lookup that versioned tables let us then you are correct the future
> >>>> results
> >>>>> are not possible.
> >>>>>
> >>>>> 2) I'll get to this later as Matthias brought up something related.
> >>>>>
> >>>>> To your additional thoughts, I agree that we need to call those
> things
> >>>> out
> >>>>> in the documentation. I'm writing up a follow up kip with a lot of
> the
> >>>>> ideas we have discussed so that we can improve this feature beyond
> the
> >>>> base
> >>>>> implementation if it's needed.
> >>>>>
> >>>>> I addressed the nits in the kip. I somehow missed the table stream
> >> table
> >>>>> join processor improvement, it makes your first question make a lot
> >> more
> >>>>> sense.  Table history retention is a much cleaner way to describe it.
> >>>>>
> >>>>> As to your mention of the syncing the time for the table and stream.
> >>>>> Matthias mentioned that as well. I will address both here. I plan to
> >>>> bring
> >>>>> that up in the future, but for now we will leave it out. I suppose it
> >>>> will
> >>>>> be more useful after the table history retention is separable from
> the
> >>>>> table grace period.
> >>>>>
> >>>>>
> >>>>> To address Matthias comments.
> >>>>>
> >>>>> You are correct by saying the in memory store shouldn't cause any
> >>>> semantic
> >>>>> concerns. My concern would be more with if we limited the number of
> >>>> records
> >>>>> on the buffer and what we would do if we hit said limits, (emitting
> >> those
> >>>>> records might be an issue, throwing an error and halting would not).
> I
> >>>>> think we can leave this discussion to the follow up kip along with a
> >> few
> >>>>> other options.
> >>>>>
> >>>>> I will go through your proposals now.
> >>>>>
> >>>>>      - don't support non-versioned KTables
> >>>>>
> >>>>> Sure, we can always expand this later on. Will include as part of the
> >> of
> >>>>> the improvement kip
> >>>>>
> >>>>>      - if grace period is added, users need to explicitly materialize
> >> the
> >>>>> table as version (either directly, or upstream. Upstream only works
> if
> >>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> >>>>>
> >>>>> again, that works for me for now, if we find a use we can always add
> >>>> later.
> >>>>>
> >>>>>      - the table's history retention time must be larger than the
> grace
> >>>>> period (should be easy to check at runtime, when we build the
> topology)
> >>>>>
> >>>>> agreed
> >>>>>
> >>>>>      - because switching from non-versioned to version stores is not
> >>>>> backward compatibly (cf KIP-914), users need to take care of this
> >>>>> themselves, and this also implies that adding grace period is not a
> >>>>> backward compatible change (even only if via indirect means)
> >>>>>
> >>>>> sure, this works
> >>>>>
> >>>>> As to the dropping of late records, I'm not sure. One one hand I like
> >> not
> >>>>> dropping things. But on the other I struggle to see how a user can
> >> filter
> >>>>> out late records that might have incomplete join results. The point
> in
> >>>> time
> >>>>> look up will aggressively expire old data and if new data has been
> >>>> replaced
> >>>>> it will return null if outside of the retention. This seems like it
> >> could
> >>>>> corrupt the integrity of the join output. Seeing that we drop late
> >>>> records
> >>>>> on the table side as well I would think it makes sense to drop late
> >>>> records
> >>>>> on the stream buffer. I could be convinced otherwise I suppose, I
> could
> >>>> see
> >>>>> adding this as an option in a follow up kip. It would be very easy to
> >>>>> implement either way. For now unless no one else objects I'm going to
> >>>> stick
> >>>>> with dropping the records for the sake of getting this kip passed. It
> >> is
> >>>>> functionally a small change to make and we can update later if you
> feel
> >>>>> strongly about it.
> >>>>>
> >>>>> For the ordering. I have to say that it would be more complicated to
> >>>>> implement it to be in offset order, if the goal it to get as many of
> >> the
> >>>>> records validly joined as possible. Because we would process as
> things
> >>>> left
> >>>>> the buffer a sufficiency early enough record could hold up records
> that
> >>>>> would otherwise be valid past the table history retention. To fix
> this
> >> we
> >>>>> could process by timestamp then store in a second queue and emit by
> >>>> offset,
> >>>>> but that would be a lot more complicated. If we didn't care about not
> >>>>> missing some valid joins we could just have no store and pull from
> the
> >>>>> topic at a delay only caring about the timestamp of the next offset.
> >> For
> >>>>> now I want to stick with the timestamp ordering as it makes much more
> >>>> sense
> >>>>> to me, but would propose we add both of the other options I have laid
> >> out
> >>>>> here in the follow up kip.
> >>>>>
> >>>>> Lastly, I think having an empty store with zero grace period would be
> >>>> super
> >>>>> simple and not costly, so we might as well make it even if nothing
> gets
> >>>>> entered.
> >>>>>
> >>>>> I hope that address all your concerns,
> >>>>>
> >>>>> Walker
> >>>>>
> >>>>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org>
> >>>> wrote:
> >>>>>
> >>>>>> Walker,
> >>>>>>
> >>>>>> thanks for the updates. The KIP itself reads fine (of course
> Victoria
> >>>>>> made good comments about some phrases), but there is a couple of
> >> things
> >>>>>> from your latest reply I don't understand, and that I still think
> need
> >>>>>> some more discussions.
> >>>>>>
> >>>>>> Lukas, asked about in-memory option and `WindowStoreSupplier` and
> you
> >>>>>> mention "semantic concerns". There should not be any semantic
> >> difference
> >>>>>> from the underlying buffer implementation, so I am not sure what you
> >>>>>> mean here (also the relationship to suppress() is unclear to me)?
> -- I
> >>>>>> am ok to not make it configurable for now. We can always do it via a
> >>>>>> follow up KIP, and keep interface changes limited for now.
> >>>>>>
> >>>>>> Does it really make sense to allow a grace period if the table is
> >>>>>> non-versioned? You also say: "If table is not materialized it will
> >>>>>> materialize it as versioned." -- What history retention time would
> we
> >>>>>> pick for this case (also asked by Victoria)? Or should we rather not
> >>>>>> support this and force the user to materialize the table explicitly,
> >> and
> >>>>>> thus explicitly picking a history retention time? It's tradeoff
> >> between
> >>>>>> usability and guiding uses that there will be a significant impact
> on
> >>>>>> disk usage. There is also compatibility concerns: If the table is
> not
> >>>>>> explicitly materialized in the old program, we would already need to
> >>>>>> materialize it also in the old program (of course, we would use a
> >>>>>> non-versioned store so far). Thus, if somebody adds a grace period,
> we
> >>>>>> cannot just switch the store type, as it would be a breaking change,
> >>>>>> potentially required an application re-set, or following the upgrade
> >>>>>> path for versioned state stores, and also changing the program to
> >>>>>> explicitly materialize using a versioned store. Also note, that we
> >> might
> >>>>>> not materialize the actual join table, but only an upstream table,
> and
> >>>>>> use `ValueGetter` to access the upstream data.
> >>>>>>
> >>>>>> To this end, as you already mentioned, history retention of the
> table
> >>>>>> should be at least grace period. You proposed to include this in a
> >>>>>> follow up KIP, but I am wondering if it's a fundamental requirement
> >> and
> >>>>>> thus we should put a check in place right away and reject an invalid
> >>>>>> configuration? (It always easier to lift restriction than to
> introduce
> >>>>>> them later.) This would also imply that a non-versioned table cannot
> >> be
> >>>>>> supported, because it does not have a history retention that is
> larger
> >>>>>> than grace period, and maybe also answer the requirement about
> >>>>>> materialization: as we already always materialize something on the
> >>>>>> tablet side as non-versioned store right now, it seems difficult to
> >>>>>> migrate the store to a versioned store. Ie, it might be ok to push
> the
> >>>>>> burden onto the user and say: if you start using grace period, you
> >> also
> >>>>>> need to manually switch from non-versioned to versioned KTables.
> Doing
> >>>>>> stuff automatically under the hood if very complex for this case, we
> >> if
> >>>>>> we push the burden onto the user, it might be ok to not complicate
> >> this
> >>>>>> KIP significantly.
> >>>>>>
> >>>>>> To summarize the last two paragraphs, I would propose to:
> >>>>>>      - don't support non-versioned KTables
> >>>>>>      - if grace period is added, users need to explicitly
> materialize
> >> the
> >>>>>> table as version (either directly, or upstream. Upstream only works
> if
> >>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> >>>>>>      - the table's history retention time must be larger than the
> grace
> >>>>>> period (should be easy to check at runtime, when we build the
> >> topology)
> >>>>>>      - because switching from non-versioned to version stores is not
> >>>>>> backward compatibly (cf KIP-914), users need to take care of this
> >>>>>> themselves, and this also implies that adding grace period is not a
> >>>>>> backward compatible change (even only if via indirect means)
> >>>>>>
> >>>>>> About dropping late records: wondering if we should never drop a
> >>>>>> stream-side record for a left-join, even if it's late? In general,
> one
> >>>>>> thing I observed over the years is, that it's easier to keep stuff
> and
> >>>>>> let users filter explicitly downstream (or make it configurable),
> >>>>>> instead of dropping pro-actively, because users have no good way to
> >>>>>> resurrect record that got already dropped.
> >>>>>>
> >>>>>> For ordering, sounds reasonable to me only start with one
> >>>>>> implementation, and maybe make it configurable as a follow up.
> >> However,
> >>>>>> I am wondering if starting with offset order might be the better
> >> option
> >>>>>> as it seems to align more with what we do so far? So instead of
> >> storing
> >>>>>> record ordered by timestamp, we can just store them ordered by
> offset,
> >>>>>> and still "poll" from the buffer based on the head records
> timestamp.
> >> Or
> >>>>>> would this complicate the implementation significantly?
> >>>>>>
> >>>>>> I also think it's ok to not "sync" stream-time between the table and
> >> the
> >>>>>> stream in this KIP, but we should consider doing this as a follow up
> >>>>>> change (not sure if we would need a KIP or not for a change this
> >> this).
> >>>>>>
> >>>>>> About increasing/decreasing grace period: what you describe make
> sense
> >>>>>> to me. If decreased, the next record would just trigger emitting a
> lot
> >>>>>> of records, and for increase, the buffer would just need to "fill
> up"
> >>>>>> again. For reprocessing getting a different result with a different
> >>>>>> grace period is expected, so that's ok IMHO. -- There seems to be
> one
> >>>>>> special corner case: grace period zero. For this case, we actually
> >> don't
> >>>>>> need any store, and the stream-side could be stateless. I think it
> can
> >>>>>> have the same behavior, but if we want to "add / remove" the store
> >>>>>> dynamically, we need to add specific code for it. For example, even
> if
> >>>>>> we start up with a grace period of zero, we would need to check if
> >> there
> >>>>>> is a local store, and still emit everything in it, before we can
> ditch
> >>>>>> the store (not sure if that's even easily done at all). Or: we would
> >>>>>> need to have a store for _all_ cases, even if grace period is zero
> >> (the
> >>>>>> store would be empty all the time though), to avoid super complex
> >> code?
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
> >>>>>>> Hi Walker,
> >>>>>>>
> >>>>>>> thanks for your responses. That makes sense. I guess there is
> always
> >>>>>>> the option to make the implementation more configurable later on,
> if
> >>>>>>> users request it. Also thanks for the clarifications. From my side,
> >>>>>>> the KIP is good to go.
> >>>>>>>
> >>>>>>> Cheers,
> >>>>>>> Lucas
> >>>>>>>
> >>>>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
> >>>>>>> <vi...@confluent.io.invalid> wrote:
> >>>>>>>>
> >>>>>>>> Thanks for the updates, Walker! Looks great, though I do have a
> >> couple
> >>>>>>>> questions about the latest updates:
> >>>>>>>>
> >>>>>>>>        1. The new example says that without stream-side buffering,
> >> "ex"
> >>>> and
> >>>>>>>>        "fy" are possible join results. How could those join
> results
> >>>>>> happen? The
> >>>>>>>>        example versioned table suggests that table record "x" has
> >>>>>> timestamp 2, and
> >>>>>>>>        table record "y" has timestamp 3. If stream record "e" has
> >>>>>> timestamp 1,
> >>>>>>>>        then it can never be joined against record "x", and
> similarly
> >> for
> >>>>>> stream
> >>>>>>>>        record "f" with timestamp 2 being joined against "y".
> >>>>>>>>        2. I see in your replies above that "If table is not
> >>>> materialized it
> >>>>>>>>        will materialize it as versioned" but I don't see this
> called
> >> out
> >>>>>> in the
> >>>>>>>>        KIP -- seems worth calling out. Also, what will the history
> >>>>>> retention for
> >>>>>>>>        the versioned table be? Will it be the same as the join
> grace
> >>>>>> period, or
> >>>>>>>>        will it be greater?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> And some additional thoughts:
> >>>>>>>>
> >>>>>>>> Sounds like there are a few things users should watch out for when
> >>>>>> enabling
> >>>>>>>> the stream-side buffer:
> >>>>>>>>
> >>>>>>>>        - Records will get "stuck" if there are no newer records to
> >>>> advance
> >>>>>>>>        stream time.
> >>>>>>>>        - If there are large gaps between the timestamps of
> >> stream-side
> >>>>>> records,
> >>>>>>>>        then it's possible that versioned store history retention
> will
> >>>> have
> >>>>>> expired
> >>>>>>>>        by the time a record is evicted from the join buffer,
> leading
> >> to
> >>>> a
> >>>>>> join
> >>>>>>>>        "miss." For example, if the join grace period and table
> >> history
> >>>>>> retention
> >>>>>>>>        are both 10, and records come in the order:
> >>>>>>>>
> >>>>>>>>        table side t0 with ts=0
> >>>>>>>>        stream side s1 with ts=1 <-- enters buffer
> >>>>>>>>        table side t10 with ts=10
> >>>>>>>>        table side t20 with ts=20
> >>>>>>>>        stream side s21 with ts=21 <-- evicts record s1 from
> buffer,
> >> but
> >>>>>>>>        versioned store no longer contains data for ts=1 due to
> >> history
> >>>>>> retention
> >>>>>>>>        having elapsed
> >>>>>>>>
> >>>>>>>>        This will result in the join result (s1, null) even though
> it
> >>>>>> should've
> >>>>>>>>        been (s1, t0), due to t0 having been expired from the
> >> versioned
> >>>>>> store
> >>>>>>>>        already.
> >>>>>>>>        - Out-of-order records from the stream-side will be
> reordered,
> >>>> and
> >>>>>> late
> >>>>>>>>        records will be dropped.
> >>>>>>>>
> >>>>>>>> I don't think any of these are reasons to not go forward with this
> >>>> KIP,
> >>>>>> but
> >>>>>>>> it'd be good to call them out in the eventual documentation to
> >>>> decrease
> >>>>>> the
> >>>>>>>> chance users get tripped up.
> >>>>>>>>
> >>>>>>>>> We could maybe do an improvement later to advance stream time
> from
> >>>>>> table
> >>>>>>>> side as well, but that might be debatable as we might get more
> late
> >>>>>> records.
> >>>>>>>>
> >>>>>>>> Yes, the likelihood of late records increases but also the
> >> likelihood
> >>>> of
> >>>>>>>> "join misses" due to versioned store history retention having
> >> elapsed
> >>>>>>>> decreases, which feels important for certain use cases. Either
> way,
> >>>>>> agreed
> >>>>>>>> that it can be a discussion for the future as incorporating this
> >> would
> >>>>>>>> substantially complicate the implementation.
> >>>>>>>>
> >>>>>>>> Also a couple nits:
> >>>>>>>>
> >>>>>>>>        - The KIP currently says "We recently added versioned
> tables
> >>>> which
> >>>>>> allow
> >>>>>>>>        the table side of the a join [...] but it is not taken
> >> advantage
> >>>> of
> >>>>>> in
> >>>>>>>>        joins," but this doesn't seem true? If the table of a
> >>>> stream-table
> >>>>>> join is
> >>>>>>>>        versioned, then the DSL's stream-table join processor will
> >>>>>> automatically
> >>>>>>>>        perform timestamped lookups into the table, in order to
> take
> >>>>>> advantage of
> >>>>>>>>        the new timestamp-aware store to provide better join
> >> semantics.
> >>>>>>>>        - The KIP mentions "grace period" for versioned stores in a
> >>>> number
> >>>>>> of
> >>>>>>>>        places but I think you actually mean "history retention"?
> The
> >> two
> >>>>>> happen to
> >>>>>>>>        be the same today (it is not an option for users to
> configure
> >> the
> >>>>>> two
> >>>>>>>>        separately) but this need not be true in the future.
> "History
> >>>>>> retention"
> >>>>>>>>        governs how far back in time reads may occur, which is the
> >>>> relevant
> >>>>>>>>        parameter for performing lookups as part of the
> stream-table
> >>>> join.
> >>>>>> "Grace
> >>>>>>>>        period" in the context of versioned stores refers to how
> far
> >> back
> >>>>>> in time
> >>>>>>>>        out-of-order writes may occur, which probably isn't
> directly
> >>>>>> relevant for
> >>>>>>>>        introducing a stream-side buffer, though it's also possible
> >> I've
> >>>>>> overlooked
> >>>>>>>>        something. (As a bonus, switching from "table grace
> period" in
> >>>> the
> >>>>>> KIP to
> >>>>>>>>        "table history retention" also helps to clarify/distinguish
> >> that
> >>>>>> it's a
> >>>>>>>>        different parameter from the "join grace period," which I
> >> could
> >>>> see
> >>>>>> being
> >>>>>>>>        confusing to readers. :) )
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Cheers,
> >>>>>>>> Victoria
> >>>>>>>>
> >>>>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
> >>>>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>>>
> >>>>>>>>> Hey all,
> >>>>>>>>>
> >>>>>>>>> Thanks for the comments, they gave me a lot to think about. I'll
> >> try
> >>>> to
> >>>>>>>>> address them all inorder. I have made some updates to the kip
> >> related
> >>>>>> to
> >>>>>>>>> them, but I mention where below.
> >>>>>>>>>
> >>>>>>>>> Lucas
> >>>>>>>>>
> >>>>>>>>> Good idea about the example. I added a simple one.
> >>>>>>>>>
> >>>>>>>>> 1) I have thought about including options for the underlying
> buffer
> >>>>>>>>> configuration. One of which might be adding an in memory option.
> My
> >>>>>> biggest
> >>>>>>>>> concern is about the semantic guarantees. This isn't like
> suppress
> >> or
> >>>>>> with
> >>>>>>>>> windows where producing incomplete results is repetitively
> >> harmless.
> >>>>>> Here
> >>>>>>>>> we would be possibly producing incorrect results. I also would
> like
> >>>> to
> >>>>>> keep
> >>>>>>>>> the interface changes as simple as I can. Making more than this
> >>>> change
> >>>>>> to
> >>>>>>>>> Joined I feel could make this more complicated than it needs to
> be.
> >>>> If
> >>>>>> we
> >>>>>>>>> really want to I could see adding a grace() option with a
> >>>> BufferConifg
> >>>>>> in
> >>>>>>>>> there or something, but I would rather not.
> >>>>>>>>>
> >>>>>>>>> 2) The buffer will be independent of if the table is versioned or
> >>>> not.
> >>>>>> If
> >>>>>>>>> table is not materialized it will materialize it as versioned. It
> >>>> might
> >>>>>>>>> make sense to do a follow up kip where we force the retention
> >> period
> >>>>>> of
> >>>>>>>>> the versioned to be greater than whatever the max of the stream
> >>>> buffer
> >>>>>> is.
> >>>>>>>>>
> >>>>>>>>> Victoria
> >>>>>>>>>
> >>>>>>>>> 1) Yes, records will exit in timestamp order not in offset order.
> >>>>>>>>> 2) Late records will be dropped (Late as out of the grace
> period).
> >>>>>>    From my
> >>>>>>>>> understanding that is the point of a grace period, no? Doesn't
> the
> >>>> same
> >>>>>>>>> thing happen with versioned stores?
> >>>>>>>>> 3) The segment store already has an observed stream time, we
> >> advance
> >>>>>> based
> >>>>>>>>> on that. That should only advance based on records that enter the
> >>>>>> store. So
> >>>>>>>>> yes, only stream side records. We could maybe do an improvement
> >> later
> >>>>>> to
> >>>>>>>>> advance stream time from table side as well, but that might be
> >>>>>> debatable as
> >>>>>>>>> we might get more late records. Anyways I would rather have that
> >> as a
> >>>>>>>>> separate discussion.
> >>>>>>>>>
> >>>>>>>>> in memory option? We can do that, for the buffer I plan to use
> the
> >>>>>>>>> TimeOrderedKeyValueBuffer interface which already has an in
> memory
> >>>>>>>>> implantation, so it would be simple.
> >>>>>>>>>
> >>>>>>>>> I said more in my answer to Lucas's question. The concern I have
> >> with
> >>>>>>>>> buffer configs or in memory is complicating the interface. Also
> >>>>>> semantic
> >>>>>>>>> guarantees but in memory shouldn't effect that
> >>>>>>>>>
> >>>>>>>>> Matthias
> >>>>>>>>>
> >>>>>>>>> 1) fixed out of order vs late terminology in the kip.
> >>>>>>>>>
> >>>>>>>>> 2) I was referring to having a stream. So after this kip we can
> >> have
> >>>> a
> >>>>>>>>> buffered stream or a normal one. For the table we can use a
> >> versioned
> >>>>>> table
> >>>>>>>>> or a normal table.
> >>>>>>>>>
> >>>>>>>>> 3 Good call out. I clarified this as "If the table side uses a
> >>>>>> materialized
> >>>>>>>>> version store, it can store multiple versions of each record
> within
> >>>> its
> >>>>>>>>> defined grace period." and modified the rest of the paragraph a
> >> bit.
> >>>>>>>>>
> >>>>>>>>> 4) I get the preserving off offset ordering, but if the stream is
> >>>>>> buffered
> >>>>>>>>> to join on timestamp instead of offset doesn't it already seem
> like
> >>>> we
> >>>>>> care
> >>>>>>>>> more about time in this case?
> >>>>>>>>>
> >>>>>>>>> If we end up adding more options it might make sense to do this.
> >>>> Maybe
> >>>>>>>>> offset order processing can be a follow up?
> >>>>>>>>>
> >>>>>>>>> I'll add a section for this in Rejected Alternatives. I think it
> >>>> makes
> >>>>>>>>> sense to do something like this but maybe in a follow up.
> >>>>>>>>>
> >>>>>>>>> 5) I hadn't thought about this. I suppose if they changed this in
> >> an
> >>>>>>>>> upgrade the next record would either evict a lot of records (if
> the
> >>>>>> grace
> >>>>>>>>> period decreased) or there would be a pause until the new grace
> >>>> period
> >>>>>>>>> reached. Increasing is a bit more problematic, especially if the
> >>>> table
> >>>>>>>>> grace period and retention time stays the same. If the data is
> >>>>>> reprocessed
> >>>>>>>>> after a change like that then there would be different results,
> >> but I
> >>>>>> feel
> >>>>>>>>> like that would be expected after such a change.
> >>>>>>>>>
> >>>>>>>>> What do you think should happen?
> >>>>>>>>>
> >>>>>>>>> Hopefully this answers your questions!
> >>>>>>>>>
> >>>>>>>>> Walker
> >>>>>>>>>
> >>>>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <
> mjsax@apache.org>
> >>>>>> wrote:
> >>>>>>>>>
> >>>>>>>>>> Thanks for the KIP! Also some question/comments from my side:
> >>>>>>>>>>
> >>>>>>>>>> 10) Notation: you use the term "late data" but I think you mean
> >>>>>>>>>> out-of-order. We reserve the term "late" to records that arrive
> >>>> after
> >>>>>>>>>> grace period passed, and thus, "late == out-of-order data that
> is
> >>>>>>>>> dropped".
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 20) "There is only one option from the stream side and only
> >> recently
> >>>>>> is
> >>>>>>>>>> there a second option on the table side."
> >>>>>>>>>>
> >>>>>>>>>> What are those options? Victoria already asked about the table
> >> side,
> >>>>>> but
> >>>>>>>>>> I am also not sure what option you mean for the stream side?
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 30) "If the table side uses a materialized version store the
> value
> >>>> is
> >>>>>>>>>> the latest by stream time rather than by offset within its
> defined
> >>>>>> grace
> >>>>>>>>>> period."
> >>>>>>>>>>
> >>>>>>>>>> The phrase "the value is the latest by stream time" is confusing
> >> --
> >>>> in
> >>>>>>>>>> the end, a versioned stores multiple versions, not just one.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 40) I am also wondering about ordering. In general, KS tries to
> >>>>>> preserve
> >>>>>>>>>> offset-order during processing (with some exception, when offset
> >>>> order
> >>>>>>>>>> preservation is not clearly defined). Given that the stream-side
> >>>>>> buffer
> >>>>>>>>>> is really just a "linear buffer", we could easily preserve
> >>>>>> offset-order.
> >>>>>>>>>> But I also see a benefit of re-ordering and emitting
> out-of-order
> >>>> data
> >>>>>>>>>> right away when read (instead of blocking them behind in-order
> >>>> records
> >>>>>>>>>> that are not ready yet). -- It might even be a possibility, to
> let
> >>>>>> users
> >>>>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name
> just
> >> a
> >>>>>>>>>> placeholder).
> >>>>>>>>>>
> >>>>>>>>>> The KIP should explain this in more detail and also discuss
> >>>> different
> >>>>>>>>>> options and mention them in "Rejected alternatives" in case we
> >> don't
> >>>>>>>>>> want to include them.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> 50) What happens when users change the grace period? Especially,
> >>>> when
> >>>>>>>>>> they turn it on/off (but also increasing/decreasing is an
> >>>> interesting
> >>>>>>>>>> point)? I think we should try to support this if possible; the
> >>>>>>>>>> "Compatibility" section needs to cover switching on/off in more
> >>>>>> detail.
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> -Matthias
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
> >>>>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
> >>>>>>>>>>>
> >>>>>>>>>>> A few clarifications:
> >>>>>>>>>>>
> >>>>>>>>>>> 1. Is the order that records exit the buffer in necessarily the
> >>>> same
> >>>>>> as
> >>>>>>>>>> the
> >>>>>>>>>>> order that records enter the buffer in, or no? Based on the
> >>>>>> description
> >>>>>>>>>> in
> >>>>>>>>>>> the KIP, it sounds like the answer is no, i.e., records will
> exit
> >>>> the
> >>>>>>>>>>> buffer in increasing timestamp order, which means that they may
> >> be
> >>>>>>>>>> ordered
> >>>>>>>>>>> (even for the same key) compared to the input order.
> >>>>>>>>>>>
> >>>>>>>>>>> 2. What happens if the join grace period is nonzero, and a
> >>>>>> stream-side
> >>>>>>>>>>> record arrives with a timestamp that is older than the current
> >>>> stream
> >>>>>>>>>> time
> >>>>>>>>>>> minus the grace period? Will this record trigger a join result,
> >> or
> >>>>>> will
> >>>>>>>>>> it
> >>>>>>>>>>> be dropped? Based on the description for what happens when the
> >> join
> >>>>>>>>> grace
> >>>>>>>>>>> period is set to zero, it sounds like the late record will be
> >>>>>> dropped,
> >>>>>>>>>> even
> >>>>>>>>>>> if the join grace period is nonzero. Is that true?
> >>>>>>>>>>>
> >>>>>>>>>>> 3. What could cause stream time to advance, for purposes of
> >>>> removing
> >>>>>>>>>>> records from the join buffer? For example, will new records
> >>>> arriving
> >>>>>> on
> >>>>>>>>>> the
> >>>>>>>>>>> table side of the join cause stream time to advance? From the
> KIP
> >>>> it
> >>>>>>>>>> sounds
> >>>>>>>>>>> like only stream-side records will advance stream time -- does
> >> that
> >>>>>>>>> mean
> >>>>>>>>>>> that the join processor itself will have to track this stream
> >> time?
> >>>>>>>>>>>
> >>>>>>>>>>> Also +1 to Lucas's question about what options will be
> available
> >>>> for
> >>>>>>>>>>> configuring the join buffer. Will users have the option to
> choose
> >>>>>>>>> whether
> >>>>>>>>>>> they want the buffer to be in-memory vs persistent?
> >>>>>>>>>>>
> >>>>>>>>>>> - Victoria
> >>>>>>>>>>>
> >>>>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> >>>>>>>>>>> <lb...@confluent.io.invalid> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>>> HI Walker,
> >>>>>>>>>>>>
> >>>>>>>>>>>> thanks for the KIP! We definitely need this. I have two
> >> questions:
> >>>>>>>>>>>>
> >>>>>>>>>>>>       - Have you considered allowing the customization of the
> >>>>>> underlying
> >>>>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
> >>>>>> customize
> >>>>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it
> make
> >>>>>> sense
> >>>>>>>>>>>> for `Joined` to have this as well? I can imagine one may want
> to
> >>>>>> limit
> >>>>>>>>>>>> the number of records in the buffer, for example. If we hit
> the
> >>>>>>>>>>>> maximum, the only option would be to drop semantic guarantees,
> >> but
> >>>>>>>>>>>> users may still want to do this.
> >>>>>>>>>>>>       - With "second option on the table side" you are
> referring
> >> to
> >>>>>>>>>>>> versioned tables, right? Will the buffer on the stream side
> >> behave
> >>>>>> any
> >>>>>>>>>>>> different whether the table side is versioned or not?
> >>>>>>>>>>>>
> >>>>>>>>>>>> Finally, I think a simple example in the motivation section
> >> could
> >>>>>> help
> >>>>>>>>>>>> non-experts understand the KIP.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Best,
> >>>>>>>>>>>> Lucas
> >>>>>>>>>>>>
> >>>>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> >>>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Hello everybody,
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I have a stream proposal to improve the stream table join by
> >>>>>> adding a
> >>>>>>>>>>>> grace
> >>>>>>>>>>>>> period and buffer to the stream side of the join to allow
> >>>>>> processing
> >>>>>>>>> in
> >>>>>>>>>>>>> timestamp order matching the recent improvements of the
> >> versioned
> >>>>>>>>>> tables.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Please take a look here <
> >>>>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
> >>>>>>>>>>>> and
> >>>>>>>>>>>>> share your thoughts.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> best,
> >>>>>>>>>>>>> Walker
> >>>>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Walker,

Thanks once more for the updates to the KIP!

Do you also plan to expose metrics for the buffer?

Best,
Bruno

On 02.06.23 17:16, Walker Carlson wrote:
> Hello Bruno,
> 
> I think this covers your questions. Let me know what you think
> 
> 2.
> We can use a changelog topic. I think we can treat it like any other store
> and recover in the usual manner. Also implementation is on disk
> 
> 3.
> The description is in the public interfaces description. I will copy it
> into the proposed changes as well.
> 
> This is a bit of an implementation detail that I didn't want to add into
> the kip, but the record will be added to the buffer to keep the stream time
> consistent, it will just be ejected immediately. If of course if this
> causes performance issues we will skip this step and track stream time
> separately. I will update the kip to say that stream time advances when a
> stream record enters the node.
> 
> Also, yes, updated.
> 
> 5.
> No there is no difference right now, everything gets processed as it comes
> in and tries to find a record for its time stamp.
> 
> Walker
> 
> On Fri, Jun 2, 2023 at 6:41 AM Bruno Cadonna <ca...@apache.org> wrote:
> 
>> Hi Walker,
>>
>> Thanks for the updates!
>>
>> 2.
>> It is still not clear to me how a failure is handled. I do not
>> understand what you mean by "recover from an OffsetCheckpoint".
>>
>> My understanding is that the buffer needs to be replicated into its own
>> Kafka topic. The input topic is not enough. The offset of a record is
>> added to the offsets to commit once the record is streamed through the
>> subtopology. That means once the record is added to the buffer its
>> offset is added to the offsets to commit -- independently of whether the
>> record was evicted from the buffer and sent to the join node or not.
>> Now, let's assume the following scenario
>> 1. a record is read from the input topic and added to the buffer, but
>> not evicted to be processed by the join node.
>> 2. When the processing of the subtopology finishes the offset of the
>> record is added to the offsets to commit.
>> 3. A commit happens.
>> 4. A failure happens
>>
>> After the failure the buffer is empty but the record will not be read
>> anymore from the input topic since its offset has been already
>> committed. The record is lost.
>> One solution to avoid the loss is to recreate the buffer from a
>> compacted Kafka topic as we do for suppression buffers. I do not think,
>> we need any offset checkpoint here since, we keep the buffer in memory,
>> right? Or do you plan to back the buffer with a persistent store? Even
>> in that case, a compacted Kafka topic would be needed.
>>
>>
>> 3.
>>   From the KIP it is still not clear to me what happens if a record is
>> outside of the grace period. I guess the record that falls outside of
>> the grace period will not be added to the buffer, but will be send to
>> the join node. Since it is outside of the grace period it will also not
>> increase stream time and it will not trigger an eviction. Also the head
>> of the buffer will not contain a record that needs to be evicted since
>> the the timestamp of the head record will be within the interval stream
>> time minus grace period. Is this correct? Please add such a description
>> to the KIP.
>> Furthermore, I think there is a mistake in the text:
>> "... will dequeue when the record timestamp is greater than stream time
>> plus the grace period". I guess that should be "... will dequeue when
>> the record timestamp is less than (or equal?) stream time minus the
>> grace period"
>>
>>
>> 5.
>> What is the difference between not setting the grace period and setting
>> it to zero? If there is a difference, why is there a difference?
>>
>>
>> Best,
>> Bruno
>>
>>
>> On 01.06.23 23:58, Walker Carlson wrote:
>>> Hey Bruno thanks for the feedback.
>>>
>>> 1)
>>> I will add this to the kip, but stream time only advances as the when the
>>> buffer receives a new record.
>>>
>>> 2)
>>> You are correct, I will add a failure section on to the kip. Since the
>>> records wont change in the buffer from when they are read from the topic
>>> they are replicated already.
>>>
>>> 3)
>>> I see that I'm out voted on the dropping of records thing. We will pass
>>> them on and try to join them if possible. This might cause some null
>>> results, but increasing the table history retention should help that.
>>>
>>> 4)
>>> I can add some on the kip. But its pretty directly adding whatever the
>>> grace period is to the latency. I don't see a way around it.
>>>
>>> Walker
>>>
>>> On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org> wrote:
>>>
>>>> Hi Walker,
>>>>
>>>> thanks for the KIP!
>>>>
>>>> Here my feedback:
>>>>
>>>> 1.
>>>> It is still not clear to me when stream time for the buffer advances.
>>>> What is the event that let the stream time advance? In the discussion, I
>>>> do not understand what you mean by "The segment store already has an
>>>> observed stream time, we advance based on that. That should only advance
>>>> based on records that enter the store." Where does this segment store
>>>> come from? Anyways, I think it would be great to also state how stream
>>>> time advances in the KIP.
>>>>
>>>> 2.
>>>> How does the buffer behave in case of a failure? I think I understand
>>>> that the buffer will use an implementation of TimeOrderedKeyValueBuffer
>>>> and therefore the records in the buffer will be replicated to a topic in
>>>> Kafka, but I am not completely sure. Could you elaborate on this in the
>>>> KIP?
>>>>
>>>> 3.
>>>> I agree with Matthias about dropping late records. We use grace periods
>>>> in scenarios where we records are grouped like in windowed aggregations
>>>> and windowed joins. The stream buffer you propose does not really group
>>>> any records. It rather delays records and reorders them. I am not sure
>>>> if grace period is the right naming/concept to apply here. Instead of
>>>> dropping records that fall outside of the buffer's time interval the
>>>> join should skip the buffer and try to join the record immediately. In
>>>> the end, a stream-table join is a unwindowed join, i.e., no grouping is
>>>> applied to the records.
>>>> What do you and other folks think about this proposal?
>>>>
>>>> 4.
>>>> How does the proposed buffer, affects processing latency? Could you
>>>> please add some words about this to the KIP?
>>>>
>>>>
>>>> Best,
>>>> Bruno
>>>>
>>>>
>>>>
>>>>
>>>> On 31.05.23 01:49, Walker Carlson wrote:
>>>>> Thanks for all the additional comments. I will either address them here
>>>> or
>>>>> update the kip accordingly.
>>>>>
>>>>>
>>>>> I mentioned a follow kip to add extra features before and in the
>>>> responses.
>>>>> I will try to briefly summarize what options and optimizations I plan
>> to
>>>>> include. If a concern is not covered in this list I for sure talk about
>>>> it
>>>>> below.
>>>>>
>>>>> * Allowing non versioned tables to still use the stream buffer
>>>>> * Automatically materializing tables instead of forcing the user to do
>> it
>>>>> * Configurable for in memory buffer
>>>>> * Order the records in offset order or in time order
>>>>> * Non memory use buffer (offset order, delayed pull from stream.)
>>>>> * Time synced between stream and table side (maybe)
>>>>> * Do not drop late records and process them as they come in instead.
>>>>>
>>>>>
>>>>> First, Victoria.
>>>>>
>>>>> 1) (One of your nits covers this, but you are correct it doesn't make
>>>>> sense. so I removed that part of the example.)
>>>>> For those examples with the "bad" join results I said without buffering
>>>> the
>>>>> stream it would look like that, but that was incomplete. If the look up
>>>> was
>>>>> simply looking at the latest version of the table when the stream
>> records
>>>>> came in then the results were possible. If we are using the point in
>> time
>>>>> lookup that versioned tables let us then you are correct the future
>>>> results
>>>>> are not possible.
>>>>>
>>>>> 2) I'll get to this later as Matthias brought up something related.
>>>>>
>>>>> To your additional thoughts, I agree that we need to call those things
>>>> out
>>>>> in the documentation. I'm writing up a follow up kip with a lot of the
>>>>> ideas we have discussed so that we can improve this feature beyond the
>>>> base
>>>>> implementation if it's needed.
>>>>>
>>>>> I addressed the nits in the kip. I somehow missed the table stream
>> table
>>>>> join processor improvement, it makes your first question make a lot
>> more
>>>>> sense.  Table history retention is a much cleaner way to describe it.
>>>>>
>>>>> As to your mention of the syncing the time for the table and stream.
>>>>> Matthias mentioned that as well. I will address both here. I plan to
>>>> bring
>>>>> that up in the future, but for now we will leave it out. I suppose it
>>>> will
>>>>> be more useful after the table history retention is separable from the
>>>>> table grace period.
>>>>>
>>>>>
>>>>> To address Matthias comments.
>>>>>
>>>>> You are correct by saying the in memory store shouldn't cause any
>>>> semantic
>>>>> concerns. My concern would be more with if we limited the number of
>>>> records
>>>>> on the buffer and what we would do if we hit said limits, (emitting
>> those
>>>>> records might be an issue, throwing an error and halting would not). I
>>>>> think we can leave this discussion to the follow up kip along with a
>> few
>>>>> other options.
>>>>>
>>>>> I will go through your proposals now.
>>>>>
>>>>>      - don't support non-versioned KTables
>>>>>
>>>>> Sure, we can always expand this later on. Will include as part of the
>> of
>>>>> the improvement kip
>>>>>
>>>>>      - if grace period is added, users need to explicitly materialize
>> the
>>>>> table as version (either directly, or upstream. Upstream only works if
>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>>>
>>>>> again, that works for me for now, if we find a use we can always add
>>>> later.
>>>>>
>>>>>      - the table's history retention time must be larger than the grace
>>>>> period (should be easy to check at runtime, when we build the topology)
>>>>>
>>>>> agreed
>>>>>
>>>>>      - because switching from non-versioned to version stores is not
>>>>> backward compatibly (cf KIP-914), users need to take care of this
>>>>> themselves, and this also implies that adding grace period is not a
>>>>> backward compatible change (even only if via indirect means)
>>>>>
>>>>> sure, this works
>>>>>
>>>>> As to the dropping of late records, I'm not sure. One one hand I like
>> not
>>>>> dropping things. But on the other I struggle to see how a user can
>> filter
>>>>> out late records that might have incomplete join results. The point in
>>>> time
>>>>> look up will aggressively expire old data and if new data has been
>>>> replaced
>>>>> it will return null if outside of the retention. This seems like it
>> could
>>>>> corrupt the integrity of the join output. Seeing that we drop late
>>>> records
>>>>> on the table side as well I would think it makes sense to drop late
>>>> records
>>>>> on the stream buffer. I could be convinced otherwise I suppose, I could
>>>> see
>>>>> adding this as an option in a follow up kip. It would be very easy to
>>>>> implement either way. For now unless no one else objects I'm going to
>>>> stick
>>>>> with dropping the records for the sake of getting this kip passed. It
>> is
>>>>> functionally a small change to make and we can update later if you feel
>>>>> strongly about it.
>>>>>
>>>>> For the ordering. I have to say that it would be more complicated to
>>>>> implement it to be in offset order, if the goal it to get as many of
>> the
>>>>> records validly joined as possible. Because we would process as things
>>>> left
>>>>> the buffer a sufficiency early enough record could hold up records that
>>>>> would otherwise be valid past the table history retention. To fix this
>> we
>>>>> could process by timestamp then store in a second queue and emit by
>>>> offset,
>>>>> but that would be a lot more complicated. If we didn't care about not
>>>>> missing some valid joins we could just have no store and pull from the
>>>>> topic at a delay only caring about the timestamp of the next offset.
>> For
>>>>> now I want to stick with the timestamp ordering as it makes much more
>>>> sense
>>>>> to me, but would propose we add both of the other options I have laid
>> out
>>>>> here in the follow up kip.
>>>>>
>>>>> Lastly, I think having an empty store with zero grace period would be
>>>> super
>>>>> simple and not costly, so we might as well make it even if nothing gets
>>>>> entered.
>>>>>
>>>>> I hope that address all your concerns,
>>>>>
>>>>> Walker
>>>>>
>>>>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org>
>>>> wrote:
>>>>>
>>>>>> Walker,
>>>>>>
>>>>>> thanks for the updates. The KIP itself reads fine (of course Victoria
>>>>>> made good comments about some phrases), but there is a couple of
>> things
>>>>>> from your latest reply I don't understand, and that I still think need
>>>>>> some more discussions.
>>>>>>
>>>>>> Lukas, asked about in-memory option and `WindowStoreSupplier` and you
>>>>>> mention "semantic concerns". There should not be any semantic
>> difference
>>>>>> from the underlying buffer implementation, so I am not sure what you
>>>>>> mean here (also the relationship to suppress() is unclear to me)? -- I
>>>>>> am ok to not make it configurable for now. We can always do it via a
>>>>>> follow up KIP, and keep interface changes limited for now.
>>>>>>
>>>>>> Does it really make sense to allow a grace period if the table is
>>>>>> non-versioned? You also say: "If table is not materialized it will
>>>>>> materialize it as versioned." -- What history retention time would we
>>>>>> pick for this case (also asked by Victoria)? Or should we rather not
>>>>>> support this and force the user to materialize the table explicitly,
>> and
>>>>>> thus explicitly picking a history retention time? It's tradeoff
>> between
>>>>>> usability and guiding uses that there will be a significant impact on
>>>>>> disk usage. There is also compatibility concerns: If the table is not
>>>>>> explicitly materialized in the old program, we would already need to
>>>>>> materialize it also in the old program (of course, we would use a
>>>>>> non-versioned store so far). Thus, if somebody adds a grace period, we
>>>>>> cannot just switch the store type, as it would be a breaking change,
>>>>>> potentially required an application re-set, or following the upgrade
>>>>>> path for versioned state stores, and also changing the program to
>>>>>> explicitly materialize using a versioned store. Also note, that we
>> might
>>>>>> not materialize the actual join table, but only an upstream table, and
>>>>>> use `ValueGetter` to access the upstream data.
>>>>>>
>>>>>> To this end, as you already mentioned, history retention of the table
>>>>>> should be at least grace period. You proposed to include this in a
>>>>>> follow up KIP, but I am wondering if it's a fundamental requirement
>> and
>>>>>> thus we should put a check in place right away and reject an invalid
>>>>>> configuration? (It always easier to lift restriction than to introduce
>>>>>> them later.) This would also imply that a non-versioned table cannot
>> be
>>>>>> supported, because it does not have a history retention that is larger
>>>>>> than grace period, and maybe also answer the requirement about
>>>>>> materialization: as we already always materialize something on the
>>>>>> tablet side as non-versioned store right now, it seems difficult to
>>>>>> migrate the store to a versioned store. Ie, it might be ok to push the
>>>>>> burden onto the user and say: if you start using grace period, you
>> also
>>>>>> need to manually switch from non-versioned to versioned KTables. Doing
>>>>>> stuff automatically under the hood if very complex for this case, we
>> if
>>>>>> we push the burden onto the user, it might be ok to not complicate
>> this
>>>>>> KIP significantly.
>>>>>>
>>>>>> To summarize the last two paragraphs, I would propose to:
>>>>>>      - don't support non-versioned KTables
>>>>>>      - if grace period is added, users need to explicitly materialize
>> the
>>>>>> table as version (either directly, or upstream. Upstream only works if
>>>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>>>>      - the table's history retention time must be larger than the grace
>>>>>> period (should be easy to check at runtime, when we build the
>> topology)
>>>>>>      - because switching from non-versioned to version stores is not
>>>>>> backward compatibly (cf KIP-914), users need to take care of this
>>>>>> themselves, and this also implies that adding grace period is not a
>>>>>> backward compatible change (even only if via indirect means)
>>>>>>
>>>>>> About dropping late records: wondering if we should never drop a
>>>>>> stream-side record for a left-join, even if it's late? In general, one
>>>>>> thing I observed over the years is, that it's easier to keep stuff and
>>>>>> let users filter explicitly downstream (or make it configurable),
>>>>>> instead of dropping pro-actively, because users have no good way to
>>>>>> resurrect record that got already dropped.
>>>>>>
>>>>>> For ordering, sounds reasonable to me only start with one
>>>>>> implementation, and maybe make it configurable as a follow up.
>> However,
>>>>>> I am wondering if starting with offset order might be the better
>> option
>>>>>> as it seems to align more with what we do so far? So instead of
>> storing
>>>>>> record ordered by timestamp, we can just store them ordered by offset,
>>>>>> and still "poll" from the buffer based on the head records timestamp.
>> Or
>>>>>> would this complicate the implementation significantly?
>>>>>>
>>>>>> I also think it's ok to not "sync" stream-time between the table and
>> the
>>>>>> stream in this KIP, but we should consider doing this as a follow up
>>>>>> change (not sure if we would need a KIP or not for a change this
>> this).
>>>>>>
>>>>>> About increasing/decreasing grace period: what you describe make sense
>>>>>> to me. If decreased, the next record would just trigger emitting a lot
>>>>>> of records, and for increase, the buffer would just need to "fill up"
>>>>>> again. For reprocessing getting a different result with a different
>>>>>> grace period is expected, so that's ok IMHO. -- There seems to be one
>>>>>> special corner case: grace period zero. For this case, we actually
>> don't
>>>>>> need any store, and the stream-side could be stateless. I think it can
>>>>>> have the same behavior, but if we want to "add / remove" the store
>>>>>> dynamically, we need to add specific code for it. For example, even if
>>>>>> we start up with a grace period of zero, we would need to check if
>> there
>>>>>> is a local store, and still emit everything in it, before we can ditch
>>>>>> the store (not sure if that's even easily done at all). Or: we would
>>>>>> need to have a store for _all_ cases, even if grace period is zero
>> (the
>>>>>> store would be empty all the time though), to avoid super complex
>> code?
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
>>>>>>> Hi Walker,
>>>>>>>
>>>>>>> thanks for your responses. That makes sense. I guess there is always
>>>>>>> the option to make the implementation more configurable later on, if
>>>>>>> users request it. Also thanks for the clarifications. From my side,
>>>>>>> the KIP is good to go.
>>>>>>>
>>>>>>> Cheers,
>>>>>>> Lucas
>>>>>>>
>>>>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
>>>>>>> <vi...@confluent.io.invalid> wrote:
>>>>>>>>
>>>>>>>> Thanks for the updates, Walker! Looks great, though I do have a
>> couple
>>>>>>>> questions about the latest updates:
>>>>>>>>
>>>>>>>>        1. The new example says that without stream-side buffering,
>> "ex"
>>>> and
>>>>>>>>        "fy" are possible join results. How could those join results
>>>>>> happen? The
>>>>>>>>        example versioned table suggests that table record "x" has
>>>>>> timestamp 2, and
>>>>>>>>        table record "y" has timestamp 3. If stream record "e" has
>>>>>> timestamp 1,
>>>>>>>>        then it can never be joined against record "x", and similarly
>> for
>>>>>> stream
>>>>>>>>        record "f" with timestamp 2 being joined against "y".
>>>>>>>>        2. I see in your replies above that "If table is not
>>>> materialized it
>>>>>>>>        will materialize it as versioned" but I don't see this called
>> out
>>>>>> in the
>>>>>>>>        KIP -- seems worth calling out. Also, what will the history
>>>>>> retention for
>>>>>>>>        the versioned table be? Will it be the same as the join grace
>>>>>> period, or
>>>>>>>>        will it be greater?
>>>>>>>>
>>>>>>>>
>>>>>>>> And some additional thoughts:
>>>>>>>>
>>>>>>>> Sounds like there are a few things users should watch out for when
>>>>>> enabling
>>>>>>>> the stream-side buffer:
>>>>>>>>
>>>>>>>>        - Records will get "stuck" if there are no newer records to
>>>> advance
>>>>>>>>        stream time.
>>>>>>>>        - If there are large gaps between the timestamps of
>> stream-side
>>>>>> records,
>>>>>>>>        then it's possible that versioned store history retention will
>>>> have
>>>>>> expired
>>>>>>>>        by the time a record is evicted from the join buffer, leading
>> to
>>>> a
>>>>>> join
>>>>>>>>        "miss." For example, if the join grace period and table
>> history
>>>>>> retention
>>>>>>>>        are both 10, and records come in the order:
>>>>>>>>
>>>>>>>>        table side t0 with ts=0
>>>>>>>>        stream side s1 with ts=1 <-- enters buffer
>>>>>>>>        table side t10 with ts=10
>>>>>>>>        table side t20 with ts=20
>>>>>>>>        stream side s21 with ts=21 <-- evicts record s1 from buffer,
>> but
>>>>>>>>        versioned store no longer contains data for ts=1 due to
>> history
>>>>>> retention
>>>>>>>>        having elapsed
>>>>>>>>
>>>>>>>>        This will result in the join result (s1, null) even though it
>>>>>> should've
>>>>>>>>        been (s1, t0), due to t0 having been expired from the
>> versioned
>>>>>> store
>>>>>>>>        already.
>>>>>>>>        - Out-of-order records from the stream-side will be reordered,
>>>> and
>>>>>> late
>>>>>>>>        records will be dropped.
>>>>>>>>
>>>>>>>> I don't think any of these are reasons to not go forward with this
>>>> KIP,
>>>>>> but
>>>>>>>> it'd be good to call them out in the eventual documentation to
>>>> decrease
>>>>>> the
>>>>>>>> chance users get tripped up.
>>>>>>>>
>>>>>>>>> We could maybe do an improvement later to advance stream time from
>>>>>> table
>>>>>>>> side as well, but that might be debatable as we might get more late
>>>>>> records.
>>>>>>>>
>>>>>>>> Yes, the likelihood of late records increases but also the
>> likelihood
>>>> of
>>>>>>>> "join misses" due to versioned store history retention having
>> elapsed
>>>>>>>> decreases, which feels important for certain use cases. Either way,
>>>>>> agreed
>>>>>>>> that it can be a discussion for the future as incorporating this
>> would
>>>>>>>> substantially complicate the implementation.
>>>>>>>>
>>>>>>>> Also a couple nits:
>>>>>>>>
>>>>>>>>        - The KIP currently says "We recently added versioned tables
>>>> which
>>>>>> allow
>>>>>>>>        the table side of the a join [...] but it is not taken
>> advantage
>>>> of
>>>>>> in
>>>>>>>>        joins," but this doesn't seem true? If the table of a
>>>> stream-table
>>>>>> join is
>>>>>>>>        versioned, then the DSL's stream-table join processor will
>>>>>> automatically
>>>>>>>>        perform timestamped lookups into the table, in order to take
>>>>>> advantage of
>>>>>>>>        the new timestamp-aware store to provide better join
>> semantics.
>>>>>>>>        - The KIP mentions "grace period" for versioned stores in a
>>>> number
>>>>>> of
>>>>>>>>        places but I think you actually mean "history retention"? The
>> two
>>>>>> happen to
>>>>>>>>        be the same today (it is not an option for users to configure
>> the
>>>>>> two
>>>>>>>>        separately) but this need not be true in the future. "History
>>>>>> retention"
>>>>>>>>        governs how far back in time reads may occur, which is the
>>>> relevant
>>>>>>>>        parameter for performing lookups as part of the stream-table
>>>> join.
>>>>>> "Grace
>>>>>>>>        period" in the context of versioned stores refers to how far
>> back
>>>>>> in time
>>>>>>>>        out-of-order writes may occur, which probably isn't directly
>>>>>> relevant for
>>>>>>>>        introducing a stream-side buffer, though it's also possible
>> I've
>>>>>> overlooked
>>>>>>>>        something. (As a bonus, switching from "table grace period" in
>>>> the
>>>>>> KIP to
>>>>>>>>        "table history retention" also helps to clarify/distinguish
>> that
>>>>>> it's a
>>>>>>>>        different parameter from the "join grace period," which I
>> could
>>>> see
>>>>>> being
>>>>>>>>        confusing to readers. :) )
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Victoria
>>>>>>>>
>>>>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>
>>>>>>>>> Hey all,
>>>>>>>>>
>>>>>>>>> Thanks for the comments, they gave me a lot to think about. I'll
>> try
>>>> to
>>>>>>>>> address them all inorder. I have made some updates to the kip
>> related
>>>>>> to
>>>>>>>>> them, but I mention where below.
>>>>>>>>>
>>>>>>>>> Lucas
>>>>>>>>>
>>>>>>>>> Good idea about the example. I added a simple one.
>>>>>>>>>
>>>>>>>>> 1) I have thought about including options for the underlying buffer
>>>>>>>>> configuration. One of which might be adding an in memory option. My
>>>>>> biggest
>>>>>>>>> concern is about the semantic guarantees. This isn't like suppress
>> or
>>>>>> with
>>>>>>>>> windows where producing incomplete results is repetitively
>> harmless.
>>>>>> Here
>>>>>>>>> we would be possibly producing incorrect results. I also would like
>>>> to
>>>>>> keep
>>>>>>>>> the interface changes as simple as I can. Making more than this
>>>> change
>>>>>> to
>>>>>>>>> Joined I feel could make this more complicated than it needs to be.
>>>> If
>>>>>> we
>>>>>>>>> really want to I could see adding a grace() option with a
>>>> BufferConifg
>>>>>> in
>>>>>>>>> there or something, but I would rather not.
>>>>>>>>>
>>>>>>>>> 2) The buffer will be independent of if the table is versioned or
>>>> not.
>>>>>> If
>>>>>>>>> table is not materialized it will materialize it as versioned. It
>>>> might
>>>>>>>>> make sense to do a follow up kip where we force the retention
>> period
>>>>>> of
>>>>>>>>> the versioned to be greater than whatever the max of the stream
>>>> buffer
>>>>>> is.
>>>>>>>>>
>>>>>>>>> Victoria
>>>>>>>>>
>>>>>>>>> 1) Yes, records will exit in timestamp order not in offset order.
>>>>>>>>> 2) Late records will be dropped (Late as out of the grace period).
>>>>>>    From my
>>>>>>>>> understanding that is the point of a grace period, no? Doesn't the
>>>> same
>>>>>>>>> thing happen with versioned stores?
>>>>>>>>> 3) The segment store already has an observed stream time, we
>> advance
>>>>>> based
>>>>>>>>> on that. That should only advance based on records that enter the
>>>>>> store. So
>>>>>>>>> yes, only stream side records. We could maybe do an improvement
>> later
>>>>>> to
>>>>>>>>> advance stream time from table side as well, but that might be
>>>>>> debatable as
>>>>>>>>> we might get more late records. Anyways I would rather have that
>> as a
>>>>>>>>> separate discussion.
>>>>>>>>>
>>>>>>>>> in memory option? We can do that, for the buffer I plan to use the
>>>>>>>>> TimeOrderedKeyValueBuffer interface which already has an in memory
>>>>>>>>> implantation, so it would be simple.
>>>>>>>>>
>>>>>>>>> I said more in my answer to Lucas's question. The concern I have
>> with
>>>>>>>>> buffer configs or in memory is complicating the interface. Also
>>>>>> semantic
>>>>>>>>> guarantees but in memory shouldn't effect that
>>>>>>>>>
>>>>>>>>> Matthias
>>>>>>>>>
>>>>>>>>> 1) fixed out of order vs late terminology in the kip.
>>>>>>>>>
>>>>>>>>> 2) I was referring to having a stream. So after this kip we can
>> have
>>>> a
>>>>>>>>> buffered stream or a normal one. For the table we can use a
>> versioned
>>>>>> table
>>>>>>>>> or a normal table.
>>>>>>>>>
>>>>>>>>> 3 Good call out. I clarified this as "If the table side uses a
>>>>>> materialized
>>>>>>>>> version store, it can store multiple versions of each record within
>>>> its
>>>>>>>>> defined grace period." and modified the rest of the paragraph a
>> bit.
>>>>>>>>>
>>>>>>>>> 4) I get the preserving off offset ordering, but if the stream is
>>>>>> buffered
>>>>>>>>> to join on timestamp instead of offset doesn't it already seem like
>>>> we
>>>>>> care
>>>>>>>>> more about time in this case?
>>>>>>>>>
>>>>>>>>> If we end up adding more options it might make sense to do this.
>>>> Maybe
>>>>>>>>> offset order processing can be a follow up?
>>>>>>>>>
>>>>>>>>> I'll add a section for this in Rejected Alternatives. I think it
>>>> makes
>>>>>>>>> sense to do something like this but maybe in a follow up.
>>>>>>>>>
>>>>>>>>> 5) I hadn't thought about this. I suppose if they changed this in
>> an
>>>>>>>>> upgrade the next record would either evict a lot of records (if the
>>>>>> grace
>>>>>>>>> period decreased) or there would be a pause until the new grace
>>>> period
>>>>>>>>> reached. Increasing is a bit more problematic, especially if the
>>>> table
>>>>>>>>> grace period and retention time stays the same. If the data is
>>>>>> reprocessed
>>>>>>>>> after a change like that then there would be different results,
>> but I
>>>>>> feel
>>>>>>>>> like that would be expected after such a change.
>>>>>>>>>
>>>>>>>>> What do you think should happen?
>>>>>>>>>
>>>>>>>>> Hopefully this answers your questions!
>>>>>>>>>
>>>>>>>>> Walker
>>>>>>>>>
>>>>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org>
>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the KIP! Also some question/comments from my side:
>>>>>>>>>>
>>>>>>>>>> 10) Notation: you use the term "late data" but I think you mean
>>>>>>>>>> out-of-order. We reserve the term "late" to records that arrive
>>>> after
>>>>>>>>>> grace period passed, and thus, "late == out-of-order data that is
>>>>>>>>> dropped".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 20) "There is only one option from the stream side and only
>> recently
>>>>>> is
>>>>>>>>>> there a second option on the table side."
>>>>>>>>>>
>>>>>>>>>> What are those options? Victoria already asked about the table
>> side,
>>>>>> but
>>>>>>>>>> I am also not sure what option you mean for the stream side?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 30) "If the table side uses a materialized version store the value
>>>> is
>>>>>>>>>> the latest by stream time rather than by offset within its defined
>>>>>> grace
>>>>>>>>>> period."
>>>>>>>>>>
>>>>>>>>>> The phrase "the value is the latest by stream time" is confusing
>> --
>>>> in
>>>>>>>>>> the end, a versioned stores multiple versions, not just one.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 40) I am also wondering about ordering. In general, KS tries to
>>>>>> preserve
>>>>>>>>>> offset-order during processing (with some exception, when offset
>>>> order
>>>>>>>>>> preservation is not clearly defined). Given that the stream-side
>>>>>> buffer
>>>>>>>>>> is really just a "linear buffer", we could easily preserve
>>>>>> offset-order.
>>>>>>>>>> But I also see a benefit of re-ordering and emitting out-of-order
>>>> data
>>>>>>>>>> right away when read (instead of blocking them behind in-order
>>>> records
>>>>>>>>>> that are not ready yet). -- It might even be a possibility, to let
>>>>>> users
>>>>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just
>> a
>>>>>>>>>> placeholder).
>>>>>>>>>>
>>>>>>>>>> The KIP should explain this in more detail and also discuss
>>>> different
>>>>>>>>>> options and mention them in "Rejected alternatives" in case we
>> don't
>>>>>>>>>> want to include them.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 50) What happens when users change the grace period? Especially,
>>>> when
>>>>>>>>>> they turn it on/off (but also increasing/decreasing is an
>>>> interesting
>>>>>>>>>> point)? I think we should try to support this if possible; the
>>>>>>>>>> "Compatibility" section needs to cover switching on/off in more
>>>>>> detail.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -Matthias
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
>>>>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
>>>>>>>>>>>
>>>>>>>>>>> A few clarifications:
>>>>>>>>>>>
>>>>>>>>>>> 1. Is the order that records exit the buffer in necessarily the
>>>> same
>>>>>> as
>>>>>>>>>> the
>>>>>>>>>>> order that records enter the buffer in, or no? Based on the
>>>>>> description
>>>>>>>>>> in
>>>>>>>>>>> the KIP, it sounds like the answer is no, i.e., records will exit
>>>> the
>>>>>>>>>>> buffer in increasing timestamp order, which means that they may
>> be
>>>>>>>>>> ordered
>>>>>>>>>>> (even for the same key) compared to the input order.
>>>>>>>>>>>
>>>>>>>>>>> 2. What happens if the join grace period is nonzero, and a
>>>>>> stream-side
>>>>>>>>>>> record arrives with a timestamp that is older than the current
>>>> stream
>>>>>>>>>> time
>>>>>>>>>>> minus the grace period? Will this record trigger a join result,
>> or
>>>>>> will
>>>>>>>>>> it
>>>>>>>>>>> be dropped? Based on the description for what happens when the
>> join
>>>>>>>>> grace
>>>>>>>>>>> period is set to zero, it sounds like the late record will be
>>>>>> dropped,
>>>>>>>>>> even
>>>>>>>>>>> if the join grace period is nonzero. Is that true?
>>>>>>>>>>>
>>>>>>>>>>> 3. What could cause stream time to advance, for purposes of
>>>> removing
>>>>>>>>>>> records from the join buffer? For example, will new records
>>>> arriving
>>>>>> on
>>>>>>>>>> the
>>>>>>>>>>> table side of the join cause stream time to advance? From the KIP
>>>> it
>>>>>>>>>> sounds
>>>>>>>>>>> like only stream-side records will advance stream time -- does
>> that
>>>>>>>>> mean
>>>>>>>>>>> that the join processor itself will have to track this stream
>> time?
>>>>>>>>>>>
>>>>>>>>>>> Also +1 to Lucas's question about what options will be available
>>>> for
>>>>>>>>>>> configuring the join buffer. Will users have the option to choose
>>>>>>>>> whether
>>>>>>>>>>> they want the buffer to be in-memory vs persistent?
>>>>>>>>>>>
>>>>>>>>>>> - Victoria
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
>>>>>>>>>>> <lb...@confluent.io.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> HI Walker,
>>>>>>>>>>>>
>>>>>>>>>>>> thanks for the KIP! We definitely need this. I have two
>> questions:
>>>>>>>>>>>>
>>>>>>>>>>>>       - Have you considered allowing the customization of the
>>>>>> underlying
>>>>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
>>>>>> customize
>>>>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it make
>>>>>> sense
>>>>>>>>>>>> for `Joined` to have this as well? I can imagine one may want to
>>>>>> limit
>>>>>>>>>>>> the number of records in the buffer, for example. If we hit the
>>>>>>>>>>>> maximum, the only option would be to drop semantic guarantees,
>> but
>>>>>>>>>>>> users may still want to do this.
>>>>>>>>>>>>       - With "second option on the table side" you are referring
>> to
>>>>>>>>>>>> versioned tables, right? Will the buffer on the stream side
>> behave
>>>>>> any
>>>>>>>>>>>> different whether the table side is versioned or not?
>>>>>>>>>>>>
>>>>>>>>>>>> Finally, I think a simple example in the motivation section
>> could
>>>>>> help
>>>>>>>>>>>> non-experts understand the KIP.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Lucas
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
>>>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hello everybody,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have a stream proposal to improve the stream table join by
>>>>>> adding a
>>>>>>>>>>>> grace
>>>>>>>>>>>>> period and buffer to the stream side of the join to allow
>>>>>> processing
>>>>>>>>> in
>>>>>>>>>>>>> timestamp order matching the recent improvements of the
>> versioned
>>>>>>>>>> tables.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Please take a look here <
>>>>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
>>>>>>>>>>>> and
>>>>>>>>>>>>> share your thoughts.
>>>>>>>>>>>>>
>>>>>>>>>>>>> best,
>>>>>>>>>>>>> Walker
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

Hello Bruno,

I think this covers your questions. Let me know what you think

2.
We can use a changelog topic. I think we can treat it like any other store
and recover in the usual manner. Also implementation is on disk

3.
The description is in the public interfaces description. I will copy it
into the proposed changes as well.

This is a bit of an implementation detail that I didn't want to add into
the kip, but the record will be added to the buffer to keep the stream time
consistent, it will just be ejected immediately. If of course if this
causes performance issues we will skip this step and track stream time
separately. I will update the kip to say that stream time advances when a
stream record enters the node.

Also, yes, updated.

5.
No there is no difference right now, everything gets processed as it comes
in and tries to find a record for its time stamp.

Walker

On Fri, Jun 2, 2023 at 6:41 AM Bruno Cadonna <ca...@apache.org> wrote:

> Hi Walker,
>
> Thanks for the updates!
>
> 2.
> It is still not clear to me how a failure is handled. I do not
> understand what you mean by "recover from an OffsetCheckpoint".
>
> My understanding is that the buffer needs to be replicated into its own
> Kafka topic. The input topic is not enough. The offset of a record is
> added to the offsets to commit once the record is streamed through the
> subtopology. That means once the record is added to the buffer its
> offset is added to the offsets to commit -- independently of whether the
> record was evicted from the buffer and sent to the join node or not.
> Now, let's assume the following scenario
> 1. a record is read from the input topic and added to the buffer, but
> not evicted to be processed by the join node.
> 2. When the processing of the subtopology finishes the offset of the
> record is added to the offsets to commit.
> 3. A commit happens.
> 4. A failure happens
>
> After the failure the buffer is empty but the record will not be read
> anymore from the input topic since its offset has been already
> committed. The record is lost.
> One solution to avoid the loss is to recreate the buffer from a
> compacted Kafka topic as we do for suppression buffers. I do not think,
> we need any offset checkpoint here since, we keep the buffer in memory,
> right? Or do you plan to back the buffer with a persistent store? Even
> in that case, a compacted Kafka topic would be needed.
>
>
> 3.
>  From the KIP it is still not clear to me what happens if a record is
> outside of the grace period. I guess the record that falls outside of
> the grace period will not be added to the buffer, but will be send to
> the join node. Since it is outside of the grace period it will also not
> increase stream time and it will not trigger an eviction. Also the head
> of the buffer will not contain a record that needs to be evicted since
> the the timestamp of the head record will be within the interval stream
> time minus grace period. Is this correct? Please add such a description
> to the KIP.
> Furthermore, I think there is a mistake in the text:
> "... will dequeue when the record timestamp is greater than stream time
> plus the grace period". I guess that should be "... will dequeue when
> the record timestamp is less than (or equal?) stream time minus the
> grace period"
>
>
> 5.
> What is the difference between not setting the grace period and setting
> it to zero? If there is a difference, why is there a difference?
>
>
> Best,
> Bruno
>
>
> On 01.06.23 23:58, Walker Carlson wrote:
> > Hey Bruno thanks for the feedback.
> >
> > 1)
> > I will add this to the kip, but stream time only advances as the when the
> > buffer receives a new record.
> >
> > 2)
> > You are correct, I will add a failure section on to the kip. Since the
> > records wont change in the buffer from when they are read from the topic
> > they are replicated already.
> >
> > 3)
> > I see that I'm out voted on the dropping of records thing. We will pass
> > them on and try to join them if possible. This might cause some null
> > results, but increasing the table history retention should help that.
> >
> > 4)
> > I can add some on the kip. But its pretty directly adding whatever the
> > grace period is to the latency. I don't see a way around it.
> >
> > Walker
> >
> > On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org> wrote:
> >
> >> Hi Walker,
> >>
> >> thanks for the KIP!
> >>
> >> Here my feedback:
> >>
> >> 1.
> >> It is still not clear to me when stream time for the buffer advances.
> >> What is the event that let the stream time advance? In the discussion, I
> >> do not understand what you mean by "The segment store already has an
> >> observed stream time, we advance based on that. That should only advance
> >> based on records that enter the store." Where does this segment store
> >> come from? Anyways, I think it would be great to also state how stream
> >> time advances in the KIP.
> >>
> >> 2.
> >> How does the buffer behave in case of a failure? I think I understand
> >> that the buffer will use an implementation of TimeOrderedKeyValueBuffer
> >> and therefore the records in the buffer will be replicated to a topic in
> >> Kafka, but I am not completely sure. Could you elaborate on this in the
> >> KIP?
> >>
> >> 3.
> >> I agree with Matthias about dropping late records. We use grace periods
> >> in scenarios where we records are grouped like in windowed aggregations
> >> and windowed joins. The stream buffer you propose does not really group
> >> any records. It rather delays records and reorders them. I am not sure
> >> if grace period is the right naming/concept to apply here. Instead of
> >> dropping records that fall outside of the buffer's time interval the
> >> join should skip the buffer and try to join the record immediately. In
> >> the end, a stream-table join is a unwindowed join, i.e., no grouping is
> >> applied to the records.
> >> What do you and other folks think about this proposal?
> >>
> >> 4.
> >> How does the proposed buffer, affects processing latency? Could you
> >> please add some words about this to the KIP?
> >>
> >>
> >> Best,
> >> Bruno
> >>
> >>
> >>
> >>
> >> On 31.05.23 01:49, Walker Carlson wrote:
> >>> Thanks for all the additional comments. I will either address them here
> >> or
> >>> update the kip accordingly.
> >>>
> >>>
> >>> I mentioned a follow kip to add extra features before and in the
> >> responses.
> >>> I will try to briefly summarize what options and optimizations I plan
> to
> >>> include. If a concern is not covered in this list I for sure talk about
> >> it
> >>> below.
> >>>
> >>> * Allowing non versioned tables to still use the stream buffer
> >>> * Automatically materializing tables instead of forcing the user to do
> it
> >>> * Configurable for in memory buffer
> >>> * Order the records in offset order or in time order
> >>> * Non memory use buffer (offset order, delayed pull from stream.)
> >>> * Time synced between stream and table side (maybe)
> >>> * Do not drop late records and process them as they come in instead.
> >>>
> >>>
> >>> First, Victoria.
> >>>
> >>> 1) (One of your nits covers this, but you are correct it doesn't make
> >>> sense. so I removed that part of the example.)
> >>> For those examples with the "bad" join results I said without buffering
> >> the
> >>> stream it would look like that, but that was incomplete. If the look up
> >> was
> >>> simply looking at the latest version of the table when the stream
> records
> >>> came in then the results were possible. If we are using the point in
> time
> >>> lookup that versioned tables let us then you are correct the future
> >> results
> >>> are not possible.
> >>>
> >>> 2) I'll get to this later as Matthias brought up something related.
> >>>
> >>> To your additional thoughts, I agree that we need to call those things
> >> out
> >>> in the documentation. I'm writing up a follow up kip with a lot of the
> >>> ideas we have discussed so that we can improve this feature beyond the
> >> base
> >>> implementation if it's needed.
> >>>
> >>> I addressed the nits in the kip. I somehow missed the table stream
> table
> >>> join processor improvement, it makes your first question make a lot
> more
> >>> sense.  Table history retention is a much cleaner way to describe it.
> >>>
> >>> As to your mention of the syncing the time for the table and stream.
> >>> Matthias mentioned that as well. I will address both here. I plan to
> >> bring
> >>> that up in the future, but for now we will leave it out. I suppose it
> >> will
> >>> be more useful after the table history retention is separable from the
> >>> table grace period.
> >>>
> >>>
> >>> To address Matthias comments.
> >>>
> >>> You are correct by saying the in memory store shouldn't cause any
> >> semantic
> >>> concerns. My concern would be more with if we limited the number of
> >> records
> >>> on the buffer and what we would do if we hit said limits, (emitting
> those
> >>> records might be an issue, throwing an error and halting would not). I
> >>> think we can leave this discussion to the follow up kip along with a
> few
> >>> other options.
> >>>
> >>> I will go through your proposals now.
> >>>
> >>>     - don't support non-versioned KTables
> >>>
> >>> Sure, we can always expand this later on. Will include as part of the
> of
> >>> the improvement kip
> >>>
> >>>     - if grace period is added, users need to explicitly materialize
> the
> >>> table as version (either directly, or upstream. Upstream only works if
> >>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> >>>
> >>> again, that works for me for now, if we find a use we can always add
> >> later.
> >>>
> >>>     - the table's history retention time must be larger than the grace
> >>> period (should be easy to check at runtime, when we build the topology)
> >>>
> >>> agreed
> >>>
> >>>     - because switching from non-versioned to version stores is not
> >>> backward compatibly (cf KIP-914), users need to take care of this
> >>> themselves, and this also implies that adding grace period is not a
> >>> backward compatible change (even only if via indirect means)
> >>>
> >>> sure, this works
> >>>
> >>> As to the dropping of late records, I'm not sure. One one hand I like
> not
> >>> dropping things. But on the other I struggle to see how a user can
> filter
> >>> out late records that might have incomplete join results. The point in
> >> time
> >>> look up will aggressively expire old data and if new data has been
> >> replaced
> >>> it will return null if outside of the retention. This seems like it
> could
> >>> corrupt the integrity of the join output. Seeing that we drop late
> >> records
> >>> on the table side as well I would think it makes sense to drop late
> >> records
> >>> on the stream buffer. I could be convinced otherwise I suppose, I could
> >> see
> >>> adding this as an option in a follow up kip. It would be very easy to
> >>> implement either way. For now unless no one else objects I'm going to
> >> stick
> >>> with dropping the records for the sake of getting this kip passed. It
> is
> >>> functionally a small change to make and we can update later if you feel
> >>> strongly about it.
> >>>
> >>> For the ordering. I have to say that it would be more complicated to
> >>> implement it to be in offset order, if the goal it to get as many of
> the
> >>> records validly joined as possible. Because we would process as things
> >> left
> >>> the buffer a sufficiency early enough record could hold up records that
> >>> would otherwise be valid past the table history retention. To fix this
> we
> >>> could process by timestamp then store in a second queue and emit by
> >> offset,
> >>> but that would be a lot more complicated. If we didn't care about not
> >>> missing some valid joins we could just have no store and pull from the
> >>> topic at a delay only caring about the timestamp of the next offset.
> For
> >>> now I want to stick with the timestamp ordering as it makes much more
> >> sense
> >>> to me, but would propose we add both of the other options I have laid
> out
> >>> here in the follow up kip.
> >>>
> >>> Lastly, I think having an empty store with zero grace period would be
> >> super
> >>> simple and not costly, so we might as well make it even if nothing gets
> >>> entered.
> >>>
> >>> I hope that address all your concerns,
> >>>
> >>> Walker
> >>>
> >>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org>
> >> wrote:
> >>>
> >>>> Walker,
> >>>>
> >>>> thanks for the updates. The KIP itself reads fine (of course Victoria
> >>>> made good comments about some phrases), but there is a couple of
> things
> >>>> from your latest reply I don't understand, and that I still think need
> >>>> some more discussions.
> >>>>
> >>>> Lukas, asked about in-memory option and `WindowStoreSupplier` and you
> >>>> mention "semantic concerns". There should not be any semantic
> difference
> >>>> from the underlying buffer implementation, so I am not sure what you
> >>>> mean here (also the relationship to suppress() is unclear to me)? -- I
> >>>> am ok to not make it configurable for now. We can always do it via a
> >>>> follow up KIP, and keep interface changes limited for now.
> >>>>
> >>>> Does it really make sense to allow a grace period if the table is
> >>>> non-versioned? You also say: "If table is not materialized it will
> >>>> materialize it as versioned." -- What history retention time would we
> >>>> pick for this case (also asked by Victoria)? Or should we rather not
> >>>> support this and force the user to materialize the table explicitly,
> and
> >>>> thus explicitly picking a history retention time? It's tradeoff
> between
> >>>> usability and guiding uses that there will be a significant impact on
> >>>> disk usage. There is also compatibility concerns: If the table is not
> >>>> explicitly materialized in the old program, we would already need to
> >>>> materialize it also in the old program (of course, we would use a
> >>>> non-versioned store so far). Thus, if somebody adds a grace period, we
> >>>> cannot just switch the store type, as it would be a breaking change,
> >>>> potentially required an application re-set, or following the upgrade
> >>>> path for versioned state stores, and also changing the program to
> >>>> explicitly materialize using a versioned store. Also note, that we
> might
> >>>> not materialize the actual join table, but only an upstream table, and
> >>>> use `ValueGetter` to access the upstream data.
> >>>>
> >>>> To this end, as you already mentioned, history retention of the table
> >>>> should be at least grace period. You proposed to include this in a
> >>>> follow up KIP, but I am wondering if it's a fundamental requirement
> and
> >>>> thus we should put a check in place right away and reject an invalid
> >>>> configuration? (It always easier to lift restriction than to introduce
> >>>> them later.) This would also imply that a non-versioned table cannot
> be
> >>>> supported, because it does not have a history retention that is larger
> >>>> than grace period, and maybe also answer the requirement about
> >>>> materialization: as we already always materialize something on the
> >>>> tablet side as non-versioned store right now, it seems difficult to
> >>>> migrate the store to a versioned store. Ie, it might be ok to push the
> >>>> burden onto the user and say: if you start using grace period, you
> also
> >>>> need to manually switch from non-versioned to versioned KTables. Doing
> >>>> stuff automatically under the hood if very complex for this case, we
> if
> >>>> we push the burden onto the user, it might be ok to not complicate
> this
> >>>> KIP significantly.
> >>>>
> >>>> To summarize the last two paragraphs, I would propose to:
> >>>>     - don't support non-versioned KTables
> >>>>     - if grace period is added, users need to explicitly materialize
> the
> >>>> table as version (either directly, or upstream. Upstream only works if
> >>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
> >>>>     - the table's history retention time must be larger than the grace
> >>>> period (should be easy to check at runtime, when we build the
> topology)
> >>>>     - because switching from non-versioned to version stores is not
> >>>> backward compatibly (cf KIP-914), users need to take care of this
> >>>> themselves, and this also implies that adding grace period is not a
> >>>> backward compatible change (even only if via indirect means)
> >>>>
> >>>> About dropping late records: wondering if we should never drop a
> >>>> stream-side record for a left-join, even if it's late? In general, one
> >>>> thing I observed over the years is, that it's easier to keep stuff and
> >>>> let users filter explicitly downstream (or make it configurable),
> >>>> instead of dropping pro-actively, because users have no good way to
> >>>> resurrect record that got already dropped.
> >>>>
> >>>> For ordering, sounds reasonable to me only start with one
> >>>> implementation, and maybe make it configurable as a follow up.
> However,
> >>>> I am wondering if starting with offset order might be the better
> option
> >>>> as it seems to align more with what we do so far? So instead of
> storing
> >>>> record ordered by timestamp, we can just store them ordered by offset,
> >>>> and still "poll" from the buffer based on the head records timestamp.
> Or
> >>>> would this complicate the implementation significantly?
> >>>>
> >>>> I also think it's ok to not "sync" stream-time between the table and
> the
> >>>> stream in this KIP, but we should consider doing this as a follow up
> >>>> change (not sure if we would need a KIP or not for a change this
> this).
> >>>>
> >>>> About increasing/decreasing grace period: what you describe make sense
> >>>> to me. If decreased, the next record would just trigger emitting a lot
> >>>> of records, and for increase, the buffer would just need to "fill up"
> >>>> again. For reprocessing getting a different result with a different
> >>>> grace period is expected, so that's ok IMHO. -- There seems to be one
> >>>> special corner case: grace period zero. For this case, we actually
> don't
> >>>> need any store, and the stream-side could be stateless. I think it can
> >>>> have the same behavior, but if we want to "add / remove" the store
> >>>> dynamically, we need to add specific code for it. For example, even if
> >>>> we start up with a grace period of zero, we would need to check if
> there
> >>>> is a local store, and still emit everything in it, before we can ditch
> >>>> the store (not sure if that's even easily done at all). Or: we would
> >>>> need to have a store for _all_ cases, even if grace period is zero
> (the
> >>>> store would be empty all the time though), to avoid super complex
> code?
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
> >>>>> Hi Walker,
> >>>>>
> >>>>> thanks for your responses. That makes sense. I guess there is always
> >>>>> the option to make the implementation more configurable later on, if
> >>>>> users request it. Also thanks for the clarifications. From my side,
> >>>>> the KIP is good to go.
> >>>>>
> >>>>> Cheers,
> >>>>> Lucas
> >>>>>
> >>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
> >>>>> <vi...@confluent.io.invalid> wrote:
> >>>>>>
> >>>>>> Thanks for the updates, Walker! Looks great, though I do have a
> couple
> >>>>>> questions about the latest updates:
> >>>>>>
> >>>>>>       1. The new example says that without stream-side buffering,
> "ex"
> >> and
> >>>>>>       "fy" are possible join results. How could those join results
> >>>> happen? The
> >>>>>>       example versioned table suggests that table record "x" has
> >>>> timestamp 2, and
> >>>>>>       table record "y" has timestamp 3. If stream record "e" has
> >>>> timestamp 1,
> >>>>>>       then it can never be joined against record "x", and similarly
> for
> >>>> stream
> >>>>>>       record "f" with timestamp 2 being joined against "y".
> >>>>>>       2. I see in your replies above that "If table is not
> >> materialized it
> >>>>>>       will materialize it as versioned" but I don't see this called
> out
> >>>> in the
> >>>>>>       KIP -- seems worth calling out. Also, what will the history
> >>>> retention for
> >>>>>>       the versioned table be? Will it be the same as the join grace
> >>>> period, or
> >>>>>>       will it be greater?
> >>>>>>
> >>>>>>
> >>>>>> And some additional thoughts:
> >>>>>>
> >>>>>> Sounds like there are a few things users should watch out for when
> >>>> enabling
> >>>>>> the stream-side buffer:
> >>>>>>
> >>>>>>       - Records will get "stuck" if there are no newer records to
> >> advance
> >>>>>>       stream time.
> >>>>>>       - If there are large gaps between the timestamps of
> stream-side
> >>>> records,
> >>>>>>       then it's possible that versioned store history retention will
> >> have
> >>>> expired
> >>>>>>       by the time a record is evicted from the join buffer, leading
> to
> >> a
> >>>> join
> >>>>>>       "miss." For example, if the join grace period and table
> history
> >>>> retention
> >>>>>>       are both 10, and records come in the order:
> >>>>>>
> >>>>>>       table side t0 with ts=0
> >>>>>>       stream side s1 with ts=1 <-- enters buffer
> >>>>>>       table side t10 with ts=10
> >>>>>>       table side t20 with ts=20
> >>>>>>       stream side s21 with ts=21 <-- evicts record s1 from buffer,
> but
> >>>>>>       versioned store no longer contains data for ts=1 due to
> history
> >>>> retention
> >>>>>>       having elapsed
> >>>>>>
> >>>>>>       This will result in the join result (s1, null) even though it
> >>>> should've
> >>>>>>       been (s1, t0), due to t0 having been expired from the
> versioned
> >>>> store
> >>>>>>       already.
> >>>>>>       - Out-of-order records from the stream-side will be reordered,
> >> and
> >>>> late
> >>>>>>       records will be dropped.
> >>>>>>
> >>>>>> I don't think any of these are reasons to not go forward with this
> >> KIP,
> >>>> but
> >>>>>> it'd be good to call them out in the eventual documentation to
> >> decrease
> >>>> the
> >>>>>> chance users get tripped up.
> >>>>>>
> >>>>>>> We could maybe do an improvement later to advance stream time from
> >>>> table
> >>>>>> side as well, but that might be debatable as we might get more late
> >>>> records.
> >>>>>>
> >>>>>> Yes, the likelihood of late records increases but also the
> likelihood
> >> of
> >>>>>> "join misses" due to versioned store history retention having
> elapsed
> >>>>>> decreases, which feels important for certain use cases. Either way,
> >>>> agreed
> >>>>>> that it can be a discussion for the future as incorporating this
> would
> >>>>>> substantially complicate the implementation.
> >>>>>>
> >>>>>> Also a couple nits:
> >>>>>>
> >>>>>>       - The KIP currently says "We recently added versioned tables
> >> which
> >>>> allow
> >>>>>>       the table side of the a join [...] but it is not taken
> advantage
> >> of
> >>>> in
> >>>>>>       joins," but this doesn't seem true? If the table of a
> >> stream-table
> >>>> join is
> >>>>>>       versioned, then the DSL's stream-table join processor will
> >>>> automatically
> >>>>>>       perform timestamped lookups into the table, in order to take
> >>>> advantage of
> >>>>>>       the new timestamp-aware store to provide better join
> semantics.
> >>>>>>       - The KIP mentions "grace period" for versioned stores in a
> >> number
> >>>> of
> >>>>>>       places but I think you actually mean "history retention"? The
> two
> >>>> happen to
> >>>>>>       be the same today (it is not an option for users to configure
> the
> >>>> two
> >>>>>>       separately) but this need not be true in the future. "History
> >>>> retention"
> >>>>>>       governs how far back in time reads may occur, which is the
> >> relevant
> >>>>>>       parameter for performing lookups as part of the stream-table
> >> join.
> >>>> "Grace
> >>>>>>       period" in the context of versioned stores refers to how far
> back
> >>>> in time
> >>>>>>       out-of-order writes may occur, which probably isn't directly
> >>>> relevant for
> >>>>>>       introducing a stream-side buffer, though it's also possible
> I've
> >>>> overlooked
> >>>>>>       something. (As a bonus, switching from "table grace period" in
> >> the
> >>>> KIP to
> >>>>>>       "table history retention" also helps to clarify/distinguish
> that
> >>>> it's a
> >>>>>>       different parameter from the "join grace period," which I
> could
> >> see
> >>>> being
> >>>>>>       confusing to readers. :) )
> >>>>>>
> >>>>>>
> >>>>>> Cheers,
> >>>>>> Victoria
> >>>>>>
> >>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
> >>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>
> >>>>>>> Hey all,
> >>>>>>>
> >>>>>>> Thanks for the comments, they gave me a lot to think about. I'll
> try
> >> to
> >>>>>>> address them all inorder. I have made some updates to the kip
> related
> >>>> to
> >>>>>>> them, but I mention where below.
> >>>>>>>
> >>>>>>> Lucas
> >>>>>>>
> >>>>>>> Good idea about the example. I added a simple one.
> >>>>>>>
> >>>>>>> 1) I have thought about including options for the underlying buffer
> >>>>>>> configuration. One of which might be adding an in memory option. My
> >>>> biggest
> >>>>>>> concern is about the semantic guarantees. This isn't like suppress
> or
> >>>> with
> >>>>>>> windows where producing incomplete results is repetitively
> harmless.
> >>>> Here
> >>>>>>> we would be possibly producing incorrect results. I also would like
> >> to
> >>>> keep
> >>>>>>> the interface changes as simple as I can. Making more than this
> >> change
> >>>> to
> >>>>>>> Joined I feel could make this more complicated than it needs to be.
> >> If
> >>>> we
> >>>>>>> really want to I could see adding a grace() option with a
> >> BufferConifg
> >>>> in
> >>>>>>> there or something, but I would rather not.
> >>>>>>>
> >>>>>>> 2) The buffer will be independent of if the table is versioned or
> >> not.
> >>>> If
> >>>>>>> table is not materialized it will materialize it as versioned. It
> >> might
> >>>>>>> make sense to do a follow up kip where we force the retention
> period
> >>>> of
> >>>>>>> the versioned to be greater than whatever the max of the stream
> >> buffer
> >>>> is.
> >>>>>>>
> >>>>>>> Victoria
> >>>>>>>
> >>>>>>> 1) Yes, records will exit in timestamp order not in offset order.
> >>>>>>> 2) Late records will be dropped (Late as out of the grace period).
> >>>>   From my
> >>>>>>> understanding that is the point of a grace period, no? Doesn't the
> >> same
> >>>>>>> thing happen with versioned stores?
> >>>>>>> 3) The segment store already has an observed stream time, we
> advance
> >>>> based
> >>>>>>> on that. That should only advance based on records that enter the
> >>>> store. So
> >>>>>>> yes, only stream side records. We could maybe do an improvement
> later
> >>>> to
> >>>>>>> advance stream time from table side as well, but that might be
> >>>> debatable as
> >>>>>>> we might get more late records. Anyways I would rather have that
> as a
> >>>>>>> separate discussion.
> >>>>>>>
> >>>>>>> in memory option? We can do that, for the buffer I plan to use the
> >>>>>>> TimeOrderedKeyValueBuffer interface which already has an in memory
> >>>>>>> implantation, so it would be simple.
> >>>>>>>
> >>>>>>> I said more in my answer to Lucas's question. The concern I have
> with
> >>>>>>> buffer configs or in memory is complicating the interface. Also
> >>>> semantic
> >>>>>>> guarantees but in memory shouldn't effect that
> >>>>>>>
> >>>>>>> Matthias
> >>>>>>>
> >>>>>>> 1) fixed out of order vs late terminology in the kip.
> >>>>>>>
> >>>>>>> 2) I was referring to having a stream. So after this kip we can
> have
> >> a
> >>>>>>> buffered stream or a normal one. For the table we can use a
> versioned
> >>>> table
> >>>>>>> or a normal table.
> >>>>>>>
> >>>>>>> 3 Good call out. I clarified this as "If the table side uses a
> >>>> materialized
> >>>>>>> version store, it can store multiple versions of each record within
> >> its
> >>>>>>> defined grace period." and modified the rest of the paragraph a
> bit.
> >>>>>>>
> >>>>>>> 4) I get the preserving off offset ordering, but if the stream is
> >>>> buffered
> >>>>>>> to join on timestamp instead of offset doesn't it already seem like
> >> we
> >>>> care
> >>>>>>> more about time in this case?
> >>>>>>>
> >>>>>>> If we end up adding more options it might make sense to do this.
> >> Maybe
> >>>>>>> offset order processing can be a follow up?
> >>>>>>>
> >>>>>>> I'll add a section for this in Rejected Alternatives. I think it
> >> makes
> >>>>>>> sense to do something like this but maybe in a follow up.
> >>>>>>>
> >>>>>>> 5) I hadn't thought about this. I suppose if they changed this in
> an
> >>>>>>> upgrade the next record would either evict a lot of records (if the
> >>>> grace
> >>>>>>> period decreased) or there would be a pause until the new grace
> >> period
> >>>>>>> reached. Increasing is a bit more problematic, especially if the
> >> table
> >>>>>>> grace period and retention time stays the same. If the data is
> >>>> reprocessed
> >>>>>>> after a change like that then there would be different results,
> but I
> >>>> feel
> >>>>>>> like that would be expected after such a change.
> >>>>>>>
> >>>>>>> What do you think should happen?
> >>>>>>>
> >>>>>>> Hopefully this answers your questions!
> >>>>>>>
> >>>>>>> Walker
> >>>>>>>
> >>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org>
> >>>> wrote:
> >>>>>>>
> >>>>>>>> Thanks for the KIP! Also some question/comments from my side:
> >>>>>>>>
> >>>>>>>> 10) Notation: you use the term "late data" but I think you mean
> >>>>>>>> out-of-order. We reserve the term "late" to records that arrive
> >> after
> >>>>>>>> grace period passed, and thus, "late == out-of-order data that is
> >>>>>>> dropped".
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 20) "There is only one option from the stream side and only
> recently
> >>>> is
> >>>>>>>> there a second option on the table side."
> >>>>>>>>
> >>>>>>>> What are those options? Victoria already asked about the table
> side,
> >>>> but
> >>>>>>>> I am also not sure what option you mean for the stream side?
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 30) "If the table side uses a materialized version store the value
> >> is
> >>>>>>>> the latest by stream time rather than by offset within its defined
> >>>> grace
> >>>>>>>> period."
> >>>>>>>>
> >>>>>>>> The phrase "the value is the latest by stream time" is confusing
> --
> >> in
> >>>>>>>> the end, a versioned stores multiple versions, not just one.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 40) I am also wondering about ordering. In general, KS tries to
> >>>> preserve
> >>>>>>>> offset-order during processing (with some exception, when offset
> >> order
> >>>>>>>> preservation is not clearly defined). Given that the stream-side
> >>>> buffer
> >>>>>>>> is really just a "linear buffer", we could easily preserve
> >>>> offset-order.
> >>>>>>>> But I also see a benefit of re-ordering and emitting out-of-order
> >> data
> >>>>>>>> right away when read (instead of blocking them behind in-order
> >> records
> >>>>>>>> that are not ready yet). -- It might even be a possibility, to let
> >>>> users
> >>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just
> a
> >>>>>>>> placeholder).
> >>>>>>>>
> >>>>>>>> The KIP should explain this in more detail and also discuss
> >> different
> >>>>>>>> options and mention them in "Rejected alternatives" in case we
> don't
> >>>>>>>> want to include them.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> 50) What happens when users change the grace period? Especially,
> >> when
> >>>>>>>> they turn it on/off (but also increasing/decreasing is an
> >> interesting
> >>>>>>>> point)? I think we should try to support this if possible; the
> >>>>>>>> "Compatibility" section needs to cover switching on/off in more
> >>>> detail.
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> -Matthias
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
> >>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
> >>>>>>>>>
> >>>>>>>>> A few clarifications:
> >>>>>>>>>
> >>>>>>>>> 1. Is the order that records exit the buffer in necessarily the
> >> same
> >>>> as
> >>>>>>>> the
> >>>>>>>>> order that records enter the buffer in, or no? Based on the
> >>>> description
> >>>>>>>> in
> >>>>>>>>> the KIP, it sounds like the answer is no, i.e., records will exit
> >> the
> >>>>>>>>> buffer in increasing timestamp order, which means that they may
> be
> >>>>>>>> ordered
> >>>>>>>>> (even for the same key) compared to the input order.
> >>>>>>>>>
> >>>>>>>>> 2. What happens if the join grace period is nonzero, and a
> >>>> stream-side
> >>>>>>>>> record arrives with a timestamp that is older than the current
> >> stream
> >>>>>>>> time
> >>>>>>>>> minus the grace period? Will this record trigger a join result,
> or
> >>>> will
> >>>>>>>> it
> >>>>>>>>> be dropped? Based on the description for what happens when the
> join
> >>>>>>> grace
> >>>>>>>>> period is set to zero, it sounds like the late record will be
> >>>> dropped,
> >>>>>>>> even
> >>>>>>>>> if the join grace period is nonzero. Is that true?
> >>>>>>>>>
> >>>>>>>>> 3. What could cause stream time to advance, for purposes of
> >> removing
> >>>>>>>>> records from the join buffer? For example, will new records
> >> arriving
> >>>> on
> >>>>>>>> the
> >>>>>>>>> table side of the join cause stream time to advance? From the KIP
> >> it
> >>>>>>>> sounds
> >>>>>>>>> like only stream-side records will advance stream time -- does
> that
> >>>>>>> mean
> >>>>>>>>> that the join processor itself will have to track this stream
> time?
> >>>>>>>>>
> >>>>>>>>> Also +1 to Lucas's question about what options will be available
> >> for
> >>>>>>>>> configuring the join buffer. Will users have the option to choose
> >>>>>>> whether
> >>>>>>>>> they want the buffer to be in-memory vs persistent?
> >>>>>>>>>
> >>>>>>>>> - Victoria
> >>>>>>>>>
> >>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> >>>>>>>>> <lb...@confluent.io.invalid> wrote:
> >>>>>>>>>
> >>>>>>>>>> HI Walker,
> >>>>>>>>>>
> >>>>>>>>>> thanks for the KIP! We definitely need this. I have two
> questions:
> >>>>>>>>>>
> >>>>>>>>>>      - Have you considered allowing the customization of the
> >>>> underlying
> >>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
> >>>> customize
> >>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it make
> >>>> sense
> >>>>>>>>>> for `Joined` to have this as well? I can imagine one may want to
> >>>> limit
> >>>>>>>>>> the number of records in the buffer, for example. If we hit the
> >>>>>>>>>> maximum, the only option would be to drop semantic guarantees,
> but
> >>>>>>>>>> users may still want to do this.
> >>>>>>>>>>      - With "second option on the table side" you are referring
> to
> >>>>>>>>>> versioned tables, right? Will the buffer on the stream side
> behave
> >>>> any
> >>>>>>>>>> different whether the table side is versioned or not?
> >>>>>>>>>>
> >>>>>>>>>> Finally, I think a simple example in the motivation section
> could
> >>>> help
> >>>>>>>>>> non-experts understand the KIP.
> >>>>>>>>>>
> >>>>>>>>>> Best,
> >>>>>>>>>> Lucas
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> >>>>>>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>>>>>>
> >>>>>>>>>>> Hello everybody,
> >>>>>>>>>>>
> >>>>>>>>>>> I have a stream proposal to improve the stream table join by
> >>>> adding a
> >>>>>>>>>> grace
> >>>>>>>>>>> period and buffer to the stream side of the join to allow
> >>>> processing
> >>>>>>> in
> >>>>>>>>>>> timestamp order matching the recent improvements of the
> versioned
> >>>>>>>> tables.
> >>>>>>>>>>>
> >>>>>>>>>>> Please take a look here <
> >>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
> >>>>>>>>>> and
> >>>>>>>>>>> share your thoughts.
> >>>>>>>>>>>
> >>>>>>>>>>> best,
> >>>>>>>>>>> Walker
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>
> >>>>
> >>>
> >>
> >
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Walker,

Thanks for the updates!

2.
It is still not clear to me how a failure is handled. I do not 
understand what you mean by "recover from an OffsetCheckpoint".

My understanding is that the buffer needs to be replicated into its own 
Kafka topic. The input topic is not enough. The offset of a record is 
added to the offsets to commit once the record is streamed through the 
subtopology. That means once the record is added to the buffer its 
offset is added to the offsets to commit -- independently of whether the 
record was evicted from the buffer and sent to the join node or not.
Now, let's assume the following scenario
1. a record is read from the input topic and added to the buffer, but 
not evicted to be processed by the join node.
2. When the processing of the subtopology finishes the offset of the 
record is added to the offsets to commit.
3. A commit happens.
4. A failure happens

After the failure the buffer is empty but the record will not be read 
anymore from the input topic since its offset has been already 
committed. The record is lost.
One solution to avoid the loss is to recreate the buffer from a 
compacted Kafka topic as we do for suppression buffers. I do not think, 
we need any offset checkpoint here since, we keep the buffer in memory, 
right? Or do you plan to back the buffer with a persistent store? Even 
in that case, a compacted Kafka topic would be needed.


3.
 From the KIP it is still not clear to me what happens if a record is 
outside of the grace period. I guess the record that falls outside of 
the grace period will not be added to the buffer, but will be send to 
the join node. Since it is outside of the grace period it will also not 
increase stream time and it will not trigger an eviction. Also the head 
of the buffer will not contain a record that needs to be evicted since 
the the timestamp of the head record will be within the interval stream 
time minus grace period. Is this correct? Please add such a description 
to the KIP.
Furthermore, I think there is a mistake in the text:
"... will dequeue when the record timestamp is greater than stream time 
plus the grace period". I guess that should be "... will dequeue when 
the record timestamp is less than (or equal?) stream time minus the 
grace period"


5.
What is the difference between not setting the grace period and setting 
it to zero? If there is a difference, why is there a difference?


Best,
Bruno


On 01.06.23 23:58, Walker Carlson wrote:
> Hey Bruno thanks for the feedback.
> 
> 1)
> I will add this to the kip, but stream time only advances as the when the
> buffer receives a new record.
> 
> 2)
> You are correct, I will add a failure section on to the kip. Since the
> records wont change in the buffer from when they are read from the topic
> they are replicated already.
> 
> 3)
> I see that I'm out voted on the dropping of records thing. We will pass
> them on and try to join them if possible. This might cause some null
> results, but increasing the table history retention should help that.
> 
> 4)
> I can add some on the kip. But its pretty directly adding whatever the
> grace period is to the latency. I don't see a way around it.
> 
> Walker
> 
> On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org> wrote:
> 
>> Hi Walker,
>>
>> thanks for the KIP!
>>
>> Here my feedback:
>>
>> 1.
>> It is still not clear to me when stream time for the buffer advances.
>> What is the event that let the stream time advance? In the discussion, I
>> do not understand what you mean by "The segment store already has an
>> observed stream time, we advance based on that. That should only advance
>> based on records that enter the store." Where does this segment store
>> come from? Anyways, I think it would be great to also state how stream
>> time advances in the KIP.
>>
>> 2.
>> How does the buffer behave in case of a failure? I think I understand
>> that the buffer will use an implementation of TimeOrderedKeyValueBuffer
>> and therefore the records in the buffer will be replicated to a topic in
>> Kafka, but I am not completely sure. Could you elaborate on this in the
>> KIP?
>>
>> 3.
>> I agree with Matthias about dropping late records. We use grace periods
>> in scenarios where we records are grouped like in windowed aggregations
>> and windowed joins. The stream buffer you propose does not really group
>> any records. It rather delays records and reorders them. I am not sure
>> if grace period is the right naming/concept to apply here. Instead of
>> dropping records that fall outside of the buffer's time interval the
>> join should skip the buffer and try to join the record immediately. In
>> the end, a stream-table join is a unwindowed join, i.e., no grouping is
>> applied to the records.
>> What do you and other folks think about this proposal?
>>
>> 4.
>> How does the proposed buffer, affects processing latency? Could you
>> please add some words about this to the KIP?
>>
>>
>> Best,
>> Bruno
>>
>>
>>
>>
>> On 31.05.23 01:49, Walker Carlson wrote:
>>> Thanks for all the additional comments. I will either address them here
>> or
>>> update the kip accordingly.
>>>
>>>
>>> I mentioned a follow kip to add extra features before and in the
>> responses.
>>> I will try to briefly summarize what options and optimizations I plan to
>>> include. If a concern is not covered in this list I for sure talk about
>> it
>>> below.
>>>
>>> * Allowing non versioned tables to still use the stream buffer
>>> * Automatically materializing tables instead of forcing the user to do it
>>> * Configurable for in memory buffer
>>> * Order the records in offset order or in time order
>>> * Non memory use buffer (offset order, delayed pull from stream.)
>>> * Time synced between stream and table side (maybe)
>>> * Do not drop late records and process them as they come in instead.
>>>
>>>
>>> First, Victoria.
>>>
>>> 1) (One of your nits covers this, but you are correct it doesn't make
>>> sense. so I removed that part of the example.)
>>> For those examples with the "bad" join results I said without buffering
>> the
>>> stream it would look like that, but that was incomplete. If the look up
>> was
>>> simply looking at the latest version of the table when the stream records
>>> came in then the results were possible. If we are using the point in time
>>> lookup that versioned tables let us then you are correct the future
>> results
>>> are not possible.
>>>
>>> 2) I'll get to this later as Matthias brought up something related.
>>>
>>> To your additional thoughts, I agree that we need to call those things
>> out
>>> in the documentation. I'm writing up a follow up kip with a lot of the
>>> ideas we have discussed so that we can improve this feature beyond the
>> base
>>> implementation if it's needed.
>>>
>>> I addressed the nits in the kip. I somehow missed the table stream table
>>> join processor improvement, it makes your first question make a lot more
>>> sense.  Table history retention is a much cleaner way to describe it.
>>>
>>> As to your mention of the syncing the time for the table and stream.
>>> Matthias mentioned that as well. I will address both here. I plan to
>> bring
>>> that up in the future, but for now we will leave it out. I suppose it
>> will
>>> be more useful after the table history retention is separable from the
>>> table grace period.
>>>
>>>
>>> To address Matthias comments.
>>>
>>> You are correct by saying the in memory store shouldn't cause any
>> semantic
>>> concerns. My concern would be more with if we limited the number of
>> records
>>> on the buffer and what we would do if we hit said limits, (emitting those
>>> records might be an issue, throwing an error and halting would not). I
>>> think we can leave this discussion to the follow up kip along with a few
>>> other options.
>>>
>>> I will go through your proposals now.
>>>
>>>     - don't support non-versioned KTables
>>>
>>> Sure, we can always expand this later on. Will include as part of the of
>>> the improvement kip
>>>
>>>     - if grace period is added, users need to explicitly materialize the
>>> table as version (either directly, or upstream. Upstream only works if
>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>
>>> again, that works for me for now, if we find a use we can always add
>> later.
>>>
>>>     - the table's history retention time must be larger than the grace
>>> period (should be easy to check at runtime, when we build the topology)
>>>
>>> agreed
>>>
>>>     - because switching from non-versioned to version stores is not
>>> backward compatibly (cf KIP-914), users need to take care of this
>>> themselves, and this also implies that adding grace period is not a
>>> backward compatible change (even only if via indirect means)
>>>
>>> sure, this works
>>>
>>> As to the dropping of late records, I'm not sure. One one hand I like not
>>> dropping things. But on the other I struggle to see how a user can filter
>>> out late records that might have incomplete join results. The point in
>> time
>>> look up will aggressively expire old data and if new data has been
>> replaced
>>> it will return null if outside of the retention. This seems like it could
>>> corrupt the integrity of the join output. Seeing that we drop late
>> records
>>> on the table side as well I would think it makes sense to drop late
>> records
>>> on the stream buffer. I could be convinced otherwise I suppose, I could
>> see
>>> adding this as an option in a follow up kip. It would be very easy to
>>> implement either way. For now unless no one else objects I'm going to
>> stick
>>> with dropping the records for the sake of getting this kip passed. It is
>>> functionally a small change to make and we can update later if you feel
>>> strongly about it.
>>>
>>> For the ordering. I have to say that it would be more complicated to
>>> implement it to be in offset order, if the goal it to get as many of the
>>> records validly joined as possible. Because we would process as things
>> left
>>> the buffer a sufficiency early enough record could hold up records that
>>> would otherwise be valid past the table history retention. To fix this we
>>> could process by timestamp then store in a second queue and emit by
>> offset,
>>> but that would be a lot more complicated. If we didn't care about not
>>> missing some valid joins we could just have no store and pull from the
>>> topic at a delay only caring about the timestamp of the next offset. For
>>> now I want to stick with the timestamp ordering as it makes much more
>> sense
>>> to me, but would propose we add both of the other options I have laid out
>>> here in the follow up kip.
>>>
>>> Lastly, I think having an empty store with zero grace period would be
>> super
>>> simple and not costly, so we might as well make it even if nothing gets
>>> entered.
>>>
>>> I hope that address all your concerns,
>>>
>>> Walker
>>>
>>> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org>
>> wrote:
>>>
>>>> Walker,
>>>>
>>>> thanks for the updates. The KIP itself reads fine (of course Victoria
>>>> made good comments about some phrases), but there is a couple of things
>>>> from your latest reply I don't understand, and that I still think need
>>>> some more discussions.
>>>>
>>>> Lukas, asked about in-memory option and `WindowStoreSupplier` and you
>>>> mention "semantic concerns". There should not be any semantic difference
>>>> from the underlying buffer implementation, so I am not sure what you
>>>> mean here (also the relationship to suppress() is unclear to me)? -- I
>>>> am ok to not make it configurable for now. We can always do it via a
>>>> follow up KIP, and keep interface changes limited for now.
>>>>
>>>> Does it really make sense to allow a grace period if the table is
>>>> non-versioned? You also say: "If table is not materialized it will
>>>> materialize it as versioned." -- What history retention time would we
>>>> pick for this case (also asked by Victoria)? Or should we rather not
>>>> support this and force the user to materialize the table explicitly, and
>>>> thus explicitly picking a history retention time? It's tradeoff between
>>>> usability and guiding uses that there will be a significant impact on
>>>> disk usage. There is also compatibility concerns: If the table is not
>>>> explicitly materialized in the old program, we would already need to
>>>> materialize it also in the old program (of course, we would use a
>>>> non-versioned store so far). Thus, if somebody adds a grace period, we
>>>> cannot just switch the store type, as it would be a breaking change,
>>>> potentially required an application re-set, or following the upgrade
>>>> path for versioned state stores, and also changing the program to
>>>> explicitly materialize using a versioned store. Also note, that we might
>>>> not materialize the actual join table, but only an upstream table, and
>>>> use `ValueGetter` to access the upstream data.
>>>>
>>>> To this end, as you already mentioned, history retention of the table
>>>> should be at least grace period. You proposed to include this in a
>>>> follow up KIP, but I am wondering if it's a fundamental requirement and
>>>> thus we should put a check in place right away and reject an invalid
>>>> configuration? (It always easier to lift restriction than to introduce
>>>> them later.) This would also imply that a non-versioned table cannot be
>>>> supported, because it does not have a history retention that is larger
>>>> than grace period, and maybe also answer the requirement about
>>>> materialization: as we already always materialize something on the
>>>> tablet side as non-versioned store right now, it seems difficult to
>>>> migrate the store to a versioned store. Ie, it might be ok to push the
>>>> burden onto the user and say: if you start using grace period, you also
>>>> need to manually switch from non-versioned to versioned KTables. Doing
>>>> stuff automatically under the hood if very complex for this case, we if
>>>> we push the burden onto the user, it might be ok to not complicate this
>>>> KIP significantly.
>>>>
>>>> To summarize the last two paragraphs, I would propose to:
>>>>     - don't support non-versioned KTables
>>>>     - if grace period is added, users need to explicitly materialize the
>>>> table as version (either directly, or upstream. Upstream only works if
>>>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>>>     - the table's history retention time must be larger than the grace
>>>> period (should be easy to check at runtime, when we build the topology)
>>>>     - because switching from non-versioned to version stores is not
>>>> backward compatibly (cf KIP-914), users need to take care of this
>>>> themselves, and this also implies that adding grace period is not a
>>>> backward compatible change (even only if via indirect means)
>>>>
>>>> About dropping late records: wondering if we should never drop a
>>>> stream-side record for a left-join, even if it's late? In general, one
>>>> thing I observed over the years is, that it's easier to keep stuff and
>>>> let users filter explicitly downstream (or make it configurable),
>>>> instead of dropping pro-actively, because users have no good way to
>>>> resurrect record that got already dropped.
>>>>
>>>> For ordering, sounds reasonable to me only start with one
>>>> implementation, and maybe make it configurable as a follow up. However,
>>>> I am wondering if starting with offset order might be the better option
>>>> as it seems to align more with what we do so far? So instead of storing
>>>> record ordered by timestamp, we can just store them ordered by offset,
>>>> and still "poll" from the buffer based on the head records timestamp. Or
>>>> would this complicate the implementation significantly?
>>>>
>>>> I also think it's ok to not "sync" stream-time between the table and the
>>>> stream in this KIP, but we should consider doing this as a follow up
>>>> change (not sure if we would need a KIP or not for a change this this).
>>>>
>>>> About increasing/decreasing grace period: what you describe make sense
>>>> to me. If decreased, the next record would just trigger emitting a lot
>>>> of records, and for increase, the buffer would just need to "fill up"
>>>> again. For reprocessing getting a different result with a different
>>>> grace period is expected, so that's ok IMHO. -- There seems to be one
>>>> special corner case: grace period zero. For this case, we actually don't
>>>> need any store, and the stream-side could be stateless. I think it can
>>>> have the same behavior, but if we want to "add / remove" the store
>>>> dynamically, we need to add specific code for it. For example, even if
>>>> we start up with a grace period of zero, we would need to check if there
>>>> is a local store, and still emit everything in it, before we can ditch
>>>> the store (not sure if that's even easily done at all). Or: we would
>>>> need to have a store for _all_ cases, even if grace period is zero (the
>>>> store would be empty all the time though), to avoid super complex code?
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
>>>>> Hi Walker,
>>>>>
>>>>> thanks for your responses. That makes sense. I guess there is always
>>>>> the option to make the implementation more configurable later on, if
>>>>> users request it. Also thanks for the clarifications. From my side,
>>>>> the KIP is good to go.
>>>>>
>>>>> Cheers,
>>>>> Lucas
>>>>>
>>>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
>>>>> <vi...@confluent.io.invalid> wrote:
>>>>>>
>>>>>> Thanks for the updates, Walker! Looks great, though I do have a couple
>>>>>> questions about the latest updates:
>>>>>>
>>>>>>       1. The new example says that without stream-side buffering, "ex"
>> and
>>>>>>       "fy" are possible join results. How could those join results
>>>> happen? The
>>>>>>       example versioned table suggests that table record "x" has
>>>> timestamp 2, and
>>>>>>       table record "y" has timestamp 3. If stream record "e" has
>>>> timestamp 1,
>>>>>>       then it can never be joined against record "x", and similarly for
>>>> stream
>>>>>>       record "f" with timestamp 2 being joined against "y".
>>>>>>       2. I see in your replies above that "If table is not
>> materialized it
>>>>>>       will materialize it as versioned" but I don't see this called out
>>>> in the
>>>>>>       KIP -- seems worth calling out. Also, what will the history
>>>> retention for
>>>>>>       the versioned table be? Will it be the same as the join grace
>>>> period, or
>>>>>>       will it be greater?
>>>>>>
>>>>>>
>>>>>> And some additional thoughts:
>>>>>>
>>>>>> Sounds like there are a few things users should watch out for when
>>>> enabling
>>>>>> the stream-side buffer:
>>>>>>
>>>>>>       - Records will get "stuck" if there are no newer records to
>> advance
>>>>>>       stream time.
>>>>>>       - If there are large gaps between the timestamps of stream-side
>>>> records,
>>>>>>       then it's possible that versioned store history retention will
>> have
>>>> expired
>>>>>>       by the time a record is evicted from the join buffer, leading to
>> a
>>>> join
>>>>>>       "miss." For example, if the join grace period and table history
>>>> retention
>>>>>>       are both 10, and records come in the order:
>>>>>>
>>>>>>       table side t0 with ts=0
>>>>>>       stream side s1 with ts=1 <-- enters buffer
>>>>>>       table side t10 with ts=10
>>>>>>       table side t20 with ts=20
>>>>>>       stream side s21 with ts=21 <-- evicts record s1 from buffer, but
>>>>>>       versioned store no longer contains data for ts=1 due to history
>>>> retention
>>>>>>       having elapsed
>>>>>>
>>>>>>       This will result in the join result (s1, null) even though it
>>>> should've
>>>>>>       been (s1, t0), due to t0 having been expired from the versioned
>>>> store
>>>>>>       already.
>>>>>>       - Out-of-order records from the stream-side will be reordered,
>> and
>>>> late
>>>>>>       records will be dropped.
>>>>>>
>>>>>> I don't think any of these are reasons to not go forward with this
>> KIP,
>>>> but
>>>>>> it'd be good to call them out in the eventual documentation to
>> decrease
>>>> the
>>>>>> chance users get tripped up.
>>>>>>
>>>>>>> We could maybe do an improvement later to advance stream time from
>>>> table
>>>>>> side as well, but that might be debatable as we might get more late
>>>> records.
>>>>>>
>>>>>> Yes, the likelihood of late records increases but also the likelihood
>> of
>>>>>> "join misses" due to versioned store history retention having elapsed
>>>>>> decreases, which feels important for certain use cases. Either way,
>>>> agreed
>>>>>> that it can be a discussion for the future as incorporating this would
>>>>>> substantially complicate the implementation.
>>>>>>
>>>>>> Also a couple nits:
>>>>>>
>>>>>>       - The KIP currently says "We recently added versioned tables
>> which
>>>> allow
>>>>>>       the table side of the a join [...] but it is not taken advantage
>> of
>>>> in
>>>>>>       joins," but this doesn't seem true? If the table of a
>> stream-table
>>>> join is
>>>>>>       versioned, then the DSL's stream-table join processor will
>>>> automatically
>>>>>>       perform timestamped lookups into the table, in order to take
>>>> advantage of
>>>>>>       the new timestamp-aware store to provide better join semantics.
>>>>>>       - The KIP mentions "grace period" for versioned stores in a
>> number
>>>> of
>>>>>>       places but I think you actually mean "history retention"? The two
>>>> happen to
>>>>>>       be the same today (it is not an option for users to configure the
>>>> two
>>>>>>       separately) but this need not be true in the future. "History
>>>> retention"
>>>>>>       governs how far back in time reads may occur, which is the
>> relevant
>>>>>>       parameter for performing lookups as part of the stream-table
>> join.
>>>> "Grace
>>>>>>       period" in the context of versioned stores refers to how far back
>>>> in time
>>>>>>       out-of-order writes may occur, which probably isn't directly
>>>> relevant for
>>>>>>       introducing a stream-side buffer, though it's also possible I've
>>>> overlooked
>>>>>>       something. (As a bonus, switching from "table grace period" in
>> the
>>>> KIP to
>>>>>>       "table history retention" also helps to clarify/distinguish that
>>>> it's a
>>>>>>       different parameter from the "join grace period," which I could
>> see
>>>> being
>>>>>>       confusing to readers. :) )
>>>>>>
>>>>>>
>>>>>> Cheers,
>>>>>> Victoria
>>>>>>
>>>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>
>>>>>>> Hey all,
>>>>>>>
>>>>>>> Thanks for the comments, they gave me a lot to think about. I'll try
>> to
>>>>>>> address them all inorder. I have made some updates to the kip related
>>>> to
>>>>>>> them, but I mention where below.
>>>>>>>
>>>>>>> Lucas
>>>>>>>
>>>>>>> Good idea about the example. I added a simple one.
>>>>>>>
>>>>>>> 1) I have thought about including options for the underlying buffer
>>>>>>> configuration. One of which might be adding an in memory option. My
>>>> biggest
>>>>>>> concern is about the semantic guarantees. This isn't like suppress or
>>>> with
>>>>>>> windows where producing incomplete results is repetitively harmless.
>>>> Here
>>>>>>> we would be possibly producing incorrect results. I also would like
>> to
>>>> keep
>>>>>>> the interface changes as simple as I can. Making more than this
>> change
>>>> to
>>>>>>> Joined I feel could make this more complicated than it needs to be.
>> If
>>>> we
>>>>>>> really want to I could see adding a grace() option with a
>> BufferConifg
>>>> in
>>>>>>> there or something, but I would rather not.
>>>>>>>
>>>>>>> 2) The buffer will be independent of if the table is versioned or
>> not.
>>>> If
>>>>>>> table is not materialized it will materialize it as versioned. It
>> might
>>>>>>> make sense to do a follow up kip where we force the retention period
>>>> of
>>>>>>> the versioned to be greater than whatever the max of the stream
>> buffer
>>>> is.
>>>>>>>
>>>>>>> Victoria
>>>>>>>
>>>>>>> 1) Yes, records will exit in timestamp order not in offset order.
>>>>>>> 2) Late records will be dropped (Late as out of the grace period).
>>>>   From my
>>>>>>> understanding that is the point of a grace period, no? Doesn't the
>> same
>>>>>>> thing happen with versioned stores?
>>>>>>> 3) The segment store already has an observed stream time, we advance
>>>> based
>>>>>>> on that. That should only advance based on records that enter the
>>>> store. So
>>>>>>> yes, only stream side records. We could maybe do an improvement later
>>>> to
>>>>>>> advance stream time from table side as well, but that might be
>>>> debatable as
>>>>>>> we might get more late records. Anyways I would rather have that as a
>>>>>>> separate discussion.
>>>>>>>
>>>>>>> in memory option? We can do that, for the buffer I plan to use the
>>>>>>> TimeOrderedKeyValueBuffer interface which already has an in memory
>>>>>>> implantation, so it would be simple.
>>>>>>>
>>>>>>> I said more in my answer to Lucas's question. The concern I have with
>>>>>>> buffer configs or in memory is complicating the interface. Also
>>>> semantic
>>>>>>> guarantees but in memory shouldn't effect that
>>>>>>>
>>>>>>> Matthias
>>>>>>>
>>>>>>> 1) fixed out of order vs late terminology in the kip.
>>>>>>>
>>>>>>> 2) I was referring to having a stream. So after this kip we can have
>> a
>>>>>>> buffered stream or a normal one. For the table we can use a versioned
>>>> table
>>>>>>> or a normal table.
>>>>>>>
>>>>>>> 3 Good call out. I clarified this as "If the table side uses a
>>>> materialized
>>>>>>> version store, it can store multiple versions of each record within
>> its
>>>>>>> defined grace period." and modified the rest of the paragraph a bit.
>>>>>>>
>>>>>>> 4) I get the preserving off offset ordering, but if the stream is
>>>> buffered
>>>>>>> to join on timestamp instead of offset doesn't it already seem like
>> we
>>>> care
>>>>>>> more about time in this case?
>>>>>>>
>>>>>>> If we end up adding more options it might make sense to do this.
>> Maybe
>>>>>>> offset order processing can be a follow up?
>>>>>>>
>>>>>>> I'll add a section for this in Rejected Alternatives. I think it
>> makes
>>>>>>> sense to do something like this but maybe in a follow up.
>>>>>>>
>>>>>>> 5) I hadn't thought about this. I suppose if they changed this in an
>>>>>>> upgrade the next record would either evict a lot of records (if the
>>>> grace
>>>>>>> period decreased) or there would be a pause until the new grace
>> period
>>>>>>> reached. Increasing is a bit more problematic, especially if the
>> table
>>>>>>> grace period and retention time stays the same. If the data is
>>>> reprocessed
>>>>>>> after a change like that then there would be different results, but I
>>>> feel
>>>>>>> like that would be expected after such a change.
>>>>>>>
>>>>>>> What do you think should happen?
>>>>>>>
>>>>>>> Hopefully this answers your questions!
>>>>>>>
>>>>>>> Walker
>>>>>>>
>>>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org>
>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the KIP! Also some question/comments from my side:
>>>>>>>>
>>>>>>>> 10) Notation: you use the term "late data" but I think you mean
>>>>>>>> out-of-order. We reserve the term "late" to records that arrive
>> after
>>>>>>>> grace period passed, and thus, "late == out-of-order data that is
>>>>>>> dropped".
>>>>>>>>
>>>>>>>>
>>>>>>>> 20) "There is only one option from the stream side and only recently
>>>> is
>>>>>>>> there a second option on the table side."
>>>>>>>>
>>>>>>>> What are those options? Victoria already asked about the table side,
>>>> but
>>>>>>>> I am also not sure what option you mean for the stream side?
>>>>>>>>
>>>>>>>>
>>>>>>>> 30) "If the table side uses a materialized version store the value
>> is
>>>>>>>> the latest by stream time rather than by offset within its defined
>>>> grace
>>>>>>>> period."
>>>>>>>>
>>>>>>>> The phrase "the value is the latest by stream time" is confusing --
>> in
>>>>>>>> the end, a versioned stores multiple versions, not just one.
>>>>>>>>
>>>>>>>>
>>>>>>>> 40) I am also wondering about ordering. In general, KS tries to
>>>> preserve
>>>>>>>> offset-order during processing (with some exception, when offset
>> order
>>>>>>>> preservation is not clearly defined). Given that the stream-side
>>>> buffer
>>>>>>>> is really just a "linear buffer", we could easily preserve
>>>> offset-order.
>>>>>>>> But I also see a benefit of re-ordering and emitting out-of-order
>> data
>>>>>>>> right away when read (instead of blocking them behind in-order
>> records
>>>>>>>> that are not ready yet). -- It might even be a possibility, to let
>>>> users
>>>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
>>>>>>>> placeholder).
>>>>>>>>
>>>>>>>> The KIP should explain this in more detail and also discuss
>> different
>>>>>>>> options and mention them in "Rejected alternatives" in case we don't
>>>>>>>> want to include them.
>>>>>>>>
>>>>>>>>
>>>>>>>> 50) What happens when users change the grace period? Especially,
>> when
>>>>>>>> they turn it on/off (but also increasing/decreasing is an
>> interesting
>>>>>>>> point)? I think we should try to support this if possible; the
>>>>>>>> "Compatibility" section needs to cover switching on/off in more
>>>> detail.
>>>>>>>>
>>>>>>>>
>>>>>>>> -Matthias
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
>>>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
>>>>>>>>>
>>>>>>>>> A few clarifications:
>>>>>>>>>
>>>>>>>>> 1. Is the order that records exit the buffer in necessarily the
>> same
>>>> as
>>>>>>>> the
>>>>>>>>> order that records enter the buffer in, or no? Based on the
>>>> description
>>>>>>>> in
>>>>>>>>> the KIP, it sounds like the answer is no, i.e., records will exit
>> the
>>>>>>>>> buffer in increasing timestamp order, which means that they may be
>>>>>>>> ordered
>>>>>>>>> (even for the same key) compared to the input order.
>>>>>>>>>
>>>>>>>>> 2. What happens if the join grace period is nonzero, and a
>>>> stream-side
>>>>>>>>> record arrives with a timestamp that is older than the current
>> stream
>>>>>>>> time
>>>>>>>>> minus the grace period? Will this record trigger a join result, or
>>>> will
>>>>>>>> it
>>>>>>>>> be dropped? Based on the description for what happens when the join
>>>>>>> grace
>>>>>>>>> period is set to zero, it sounds like the late record will be
>>>> dropped,
>>>>>>>> even
>>>>>>>>> if the join grace period is nonzero. Is that true?
>>>>>>>>>
>>>>>>>>> 3. What could cause stream time to advance, for purposes of
>> removing
>>>>>>>>> records from the join buffer? For example, will new records
>> arriving
>>>> on
>>>>>>>> the
>>>>>>>>> table side of the join cause stream time to advance? From the KIP
>> it
>>>>>>>> sounds
>>>>>>>>> like only stream-side records will advance stream time -- does that
>>>>>>> mean
>>>>>>>>> that the join processor itself will have to track this stream time?
>>>>>>>>>
>>>>>>>>> Also +1 to Lucas's question about what options will be available
>> for
>>>>>>>>> configuring the join buffer. Will users have the option to choose
>>>>>>> whether
>>>>>>>>> they want the buffer to be in-memory vs persistent?
>>>>>>>>>
>>>>>>>>> - Victoria
>>>>>>>>>
>>>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
>>>>>>>>> <lb...@confluent.io.invalid> wrote:
>>>>>>>>>
>>>>>>>>>> HI Walker,
>>>>>>>>>>
>>>>>>>>>> thanks for the KIP! We definitely need this. I have two questions:
>>>>>>>>>>
>>>>>>>>>>      - Have you considered allowing the customization of the
>>>> underlying
>>>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
>>>> customize
>>>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it make
>>>> sense
>>>>>>>>>> for `Joined` to have this as well? I can imagine one may want to
>>>> limit
>>>>>>>>>> the number of records in the buffer, for example. If we hit the
>>>>>>>>>> maximum, the only option would be to drop semantic guarantees, but
>>>>>>>>>> users may still want to do this.
>>>>>>>>>>      - With "second option on the table side" you are referring to
>>>>>>>>>> versioned tables, right? Will the buffer on the stream side behave
>>>> any
>>>>>>>>>> different whether the table side is versioned or not?
>>>>>>>>>>
>>>>>>>>>> Finally, I think a simple example in the motivation section could
>>>> help
>>>>>>>>>> non-experts understand the KIP.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Lucas
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
>>>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hello everybody,
>>>>>>>>>>>
>>>>>>>>>>> I have a stream proposal to improve the stream table join by
>>>> adding a
>>>>>>>>>> grace
>>>>>>>>>>> period and buffer to the stream side of the join to allow
>>>> processing
>>>>>>> in
>>>>>>>>>>> timestamp order matching the recent improvements of the versioned
>>>>>>>> tables.
>>>>>>>>>>>
>>>>>>>>>>> Please take a look here <
>>>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
>>>>>>>>>> and
>>>>>>>>>>> share your thoughts.
>>>>>>>>>>>
>>>>>>>>>>> best,
>>>>>>>>>>> Walker
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>
>>>
>>
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

Hey Bruno thanks for the feedback.

1)
I will add this to the kip, but stream time only advances as the when the
buffer receives a new record.

2)
You are correct, I will add a failure section on to the kip. Since the
records wont change in the buffer from when they are read from the topic
they are replicated already.

3)
I see that I'm out voted on the dropping of records thing. We will pass
them on and try to join them if possible. This might cause some null
results, but increasing the table history retention should help that.

4)
I can add some on the kip. But its pretty directly adding whatever the
grace period is to the latency. I don't see a way around it.

Walker

On Thu, Jun 1, 2023 at 5:23 AM Bruno Cadonna <ca...@apache.org> wrote:

> Hi Walker,
>
> thanks for the KIP!
>
> Here my feedback:
>
> 1.
> It is still not clear to me when stream time for the buffer advances.
> What is the event that let the stream time advance? In the discussion, I
> do not understand what you mean by "The segment store already has an
> observed stream time, we advance based on that. That should only advance
> based on records that enter the store." Where does this segment store
> come from? Anyways, I think it would be great to also state how stream
> time advances in the KIP.
>
> 2.
> How does the buffer behave in case of a failure? I think I understand
> that the buffer will use an implementation of TimeOrderedKeyValueBuffer
> and therefore the records in the buffer will be replicated to a topic in
> Kafka, but I am not completely sure. Could you elaborate on this in the
> KIP?
>
> 3.
> I agree with Matthias about dropping late records. We use grace periods
> in scenarios where we records are grouped like in windowed aggregations
> and windowed joins. The stream buffer you propose does not really group
> any records. It rather delays records and reorders them. I am not sure
> if grace period is the right naming/concept to apply here. Instead of
> dropping records that fall outside of the buffer's time interval the
> join should skip the buffer and try to join the record immediately. In
> the end, a stream-table join is a unwindowed join, i.e., no grouping is
> applied to the records.
> What do you and other folks think about this proposal?
>
> 4.
> How does the proposed buffer, affects processing latency? Could you
> please add some words about this to the KIP?
>
>
> Best,
> Bruno
>
>
>
>
> On 31.05.23 01:49, Walker Carlson wrote:
> > Thanks for all the additional comments. I will either address them here
> or
> > update the kip accordingly.
> >
> >
> > I mentioned a follow kip to add extra features before and in the
> responses.
> > I will try to briefly summarize what options and optimizations I plan to
> > include. If a concern is not covered in this list I for sure talk about
> it
> > below.
> >
> > * Allowing non versioned tables to still use the stream buffer
> > * Automatically materializing tables instead of forcing the user to do it
> > * Configurable for in memory buffer
> > * Order the records in offset order or in time order
> > * Non memory use buffer (offset order, delayed pull from stream.)
> > * Time synced between stream and table side (maybe)
> > * Do not drop late records and process them as they come in instead.
> >
> >
> > First, Victoria.
> >
> > 1) (One of your nits covers this, but you are correct it doesn't make
> > sense. so I removed that part of the example.)
> > For those examples with the "bad" join results I said without buffering
> the
> > stream it would look like that, but that was incomplete. If the look up
> was
> > simply looking at the latest version of the table when the stream records
> > came in then the results were possible. If we are using the point in time
> > lookup that versioned tables let us then you are correct the future
> results
> > are not possible.
> >
> > 2) I'll get to this later as Matthias brought up something related.
> >
> > To your additional thoughts, I agree that we need to call those things
> out
> > in the documentation. I'm writing up a follow up kip with a lot of the
> > ideas we have discussed so that we can improve this feature beyond the
> base
> > implementation if it's needed.
> >
> > I addressed the nits in the kip. I somehow missed the table stream table
> > join processor improvement, it makes your first question make a lot more
> > sense.  Table history retention is a much cleaner way to describe it.
> >
> > As to your mention of the syncing the time for the table and stream.
> > Matthias mentioned that as well. I will address both here. I plan to
> bring
> > that up in the future, but for now we will leave it out. I suppose it
> will
> > be more useful after the table history retention is separable from the
> > table grace period.
> >
> >
> > To address Matthias comments.
> >
> > You are correct by saying the in memory store shouldn't cause any
> semantic
> > concerns. My concern would be more with if we limited the number of
> records
> > on the buffer and what we would do if we hit said limits, (emitting those
> > records might be an issue, throwing an error and halting would not). I
> > think we can leave this discussion to the follow up kip along with a few
> > other options.
> >
> > I will go through your proposals now.
> >
> >    - don't support non-versioned KTables
> >
> > Sure, we can always expand this later on. Will include as part of the of
> > the improvement kip
> >
> >    - if grace period is added, users need to explicitly materialize the
> > table as version (either directly, or upstream. Upstream only works if
> > downstream tables "inherit" versioned semantics -- cf KIP-914)
> >
> > again, that works for me for now, if we find a use we can always add
> later.
> >
> >    - the table's history retention time must be larger than the grace
> > period (should be easy to check at runtime, when we build the topology)
> >
> > agreed
> >
> >    - because switching from non-versioned to version stores is not
> > backward compatibly (cf KIP-914), users need to take care of this
> > themselves, and this also implies that adding grace period is not a
> > backward compatible change (even only if via indirect means)
> >
> > sure, this works
> >
> > As to the dropping of late records, I'm not sure. One one hand I like not
> > dropping things. But on the other I struggle to see how a user can filter
> > out late records that might have incomplete join results. The point in
> time
> > look up will aggressively expire old data and if new data has been
> replaced
> > it will return null if outside of the retention. This seems like it could
> > corrupt the integrity of the join output. Seeing that we drop late
> records
> > on the table side as well I would think it makes sense to drop late
> records
> > on the stream buffer. I could be convinced otherwise I suppose, I could
> see
> > adding this as an option in a follow up kip. It would be very easy to
> > implement either way. For now unless no one else objects I'm going to
> stick
> > with dropping the records for the sake of getting this kip passed. It is
> > functionally a small change to make and we can update later if you feel
> > strongly about it.
> >
> > For the ordering. I have to say that it would be more complicated to
> > implement it to be in offset order, if the goal it to get as many of the
> > records validly joined as possible. Because we would process as things
> left
> > the buffer a sufficiency early enough record could hold up records that
> > would otherwise be valid past the table history retention. To fix this we
> > could process by timestamp then store in a second queue and emit by
> offset,
> > but that would be a lot more complicated. If we didn't care about not
> > missing some valid joins we could just have no store and pull from the
> > topic at a delay only caring about the timestamp of the next offset. For
> > now I want to stick with the timestamp ordering as it makes much more
> sense
> > to me, but would propose we add both of the other options I have laid out
> > here in the follow up kip.
> >
> > Lastly, I think having an empty store with zero grace period would be
> super
> > simple and not costly, so we might as well make it even if nothing gets
> > entered.
> >
> > I hope that address all your concerns,
> >
> > Walker
> >
> > On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org>
> wrote:
> >
> >> Walker,
> >>
> >> thanks for the updates. The KIP itself reads fine (of course Victoria
> >> made good comments about some phrases), but there is a couple of things
> >> from your latest reply I don't understand, and that I still think need
> >> some more discussions.
> >>
> >> Lukas, asked about in-memory option and `WindowStoreSupplier` and you
> >> mention "semantic concerns". There should not be any semantic difference
> >> from the underlying buffer implementation, so I am not sure what you
> >> mean here (also the relationship to suppress() is unclear to me)? -- I
> >> am ok to not make it configurable for now. We can always do it via a
> >> follow up KIP, and keep interface changes limited for now.
> >>
> >> Does it really make sense to allow a grace period if the table is
> >> non-versioned? You also say: "If table is not materialized it will
> >> materialize it as versioned." -- What history retention time would we
> >> pick for this case (also asked by Victoria)? Or should we rather not
> >> support this and force the user to materialize the table explicitly, and
> >> thus explicitly picking a history retention time? It's tradeoff between
> >> usability and guiding uses that there will be a significant impact on
> >> disk usage. There is also compatibility concerns: If the table is not
> >> explicitly materialized in the old program, we would already need to
> >> materialize it also in the old program (of course, we would use a
> >> non-versioned store so far). Thus, if somebody adds a grace period, we
> >> cannot just switch the store type, as it would be a breaking change,
> >> potentially required an application re-set, or following the upgrade
> >> path for versioned state stores, and also changing the program to
> >> explicitly materialize using a versioned store. Also note, that we might
> >> not materialize the actual join table, but only an upstream table, and
> >> use `ValueGetter` to access the upstream data.
> >>
> >> To this end, as you already mentioned, history retention of the table
> >> should be at least grace period. You proposed to include this in a
> >> follow up KIP, but I am wondering if it's a fundamental requirement and
> >> thus we should put a check in place right away and reject an invalid
> >> configuration? (It always easier to lift restriction than to introduce
> >> them later.) This would also imply that a non-versioned table cannot be
> >> supported, because it does not have a history retention that is larger
> >> than grace period, and maybe also answer the requirement about
> >> materialization: as we already always materialize something on the
> >> tablet side as non-versioned store right now, it seems difficult to
> >> migrate the store to a versioned store. Ie, it might be ok to push the
> >> burden onto the user and say: if you start using grace period, you also
> >> need to manually switch from non-versioned to versioned KTables. Doing
> >> stuff automatically under the hood if very complex for this case, we if
> >> we push the burden onto the user, it might be ok to not complicate this
> >> KIP significantly.
> >>
> >> To summarize the last two paragraphs, I would propose to:
> >>    - don't support non-versioned KTables
> >>    - if grace period is added, users need to explicitly materialize the
> >> table as version (either directly, or upstream. Upstream only works if
> >> downstream tables "inherit" versioned semantics -- cf KIP-914)
> >>    - the table's history retention time must be larger than the grace
> >> period (should be easy to check at runtime, when we build the topology)
> >>    - because switching from non-versioned to version stores is not
> >> backward compatibly (cf KIP-914), users need to take care of this
> >> themselves, and this also implies that adding grace period is not a
> >> backward compatible change (even only if via indirect means)
> >>
> >> About dropping late records: wondering if we should never drop a
> >> stream-side record for a left-join, even if it's late? In general, one
> >> thing I observed over the years is, that it's easier to keep stuff and
> >> let users filter explicitly downstream (or make it configurable),
> >> instead of dropping pro-actively, because users have no good way to
> >> resurrect record that got already dropped.
> >>
> >> For ordering, sounds reasonable to me only start with one
> >> implementation, and maybe make it configurable as a follow up. However,
> >> I am wondering if starting with offset order might be the better option
> >> as it seems to align more with what we do so far? So instead of storing
> >> record ordered by timestamp, we can just store them ordered by offset,
> >> and still "poll" from the buffer based on the head records timestamp. Or
> >> would this complicate the implementation significantly?
> >>
> >> I also think it's ok to not "sync" stream-time between the table and the
> >> stream in this KIP, but we should consider doing this as a follow up
> >> change (not sure if we would need a KIP or not for a change this this).
> >>
> >> About increasing/decreasing grace period: what you describe make sense
> >> to me. If decreased, the next record would just trigger emitting a lot
> >> of records, and for increase, the buffer would just need to "fill up"
> >> again. For reprocessing getting a different result with a different
> >> grace period is expected, so that's ok IMHO. -- There seems to be one
> >> special corner case: grace period zero. For this case, we actually don't
> >> need any store, and the stream-side could be stateless. I think it can
> >> have the same behavior, but if we want to "add / remove" the store
> >> dynamically, we need to add specific code for it. For example, even if
> >> we start up with a grace period of zero, we would need to check if there
> >> is a local store, and still emit everything in it, before we can ditch
> >> the store (not sure if that's even easily done at all). Or: we would
> >> need to have a store for _all_ cases, even if grace period is zero (the
> >> store would be empty all the time though), to avoid super complex code?
> >>
> >>
> >> -Matthias
> >>
> >>
> >> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
> >>> Hi Walker,
> >>>
> >>> thanks for your responses. That makes sense. I guess there is always
> >>> the option to make the implementation more configurable later on, if
> >>> users request it. Also thanks for the clarifications. From my side,
> >>> the KIP is good to go.
> >>>
> >>> Cheers,
> >>> Lucas
> >>>
> >>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
> >>> <vi...@confluent.io.invalid> wrote:
> >>>>
> >>>> Thanks for the updates, Walker! Looks great, though I do have a couple
> >>>> questions about the latest updates:
> >>>>
> >>>>      1. The new example says that without stream-side buffering, "ex"
> and
> >>>>      "fy" are possible join results. How could those join results
> >> happen? The
> >>>>      example versioned table suggests that table record "x" has
> >> timestamp 2, and
> >>>>      table record "y" has timestamp 3. If stream record "e" has
> >> timestamp 1,
> >>>>      then it can never be joined against record "x", and similarly for
> >> stream
> >>>>      record "f" with timestamp 2 being joined against "y".
> >>>>      2. I see in your replies above that "If table is not
> materialized it
> >>>>      will materialize it as versioned" but I don't see this called out
> >> in the
> >>>>      KIP -- seems worth calling out. Also, what will the history
> >> retention for
> >>>>      the versioned table be? Will it be the same as the join grace
> >> period, or
> >>>>      will it be greater?
> >>>>
> >>>>
> >>>> And some additional thoughts:
> >>>>
> >>>> Sounds like there are a few things users should watch out for when
> >> enabling
> >>>> the stream-side buffer:
> >>>>
> >>>>      - Records will get "stuck" if there are no newer records to
> advance
> >>>>      stream time.
> >>>>      - If there are large gaps between the timestamps of stream-side
> >> records,
> >>>>      then it's possible that versioned store history retention will
> have
> >> expired
> >>>>      by the time a record is evicted from the join buffer, leading to
> a
> >> join
> >>>>      "miss." For example, if the join grace period and table history
> >> retention
> >>>>      are both 10, and records come in the order:
> >>>>
> >>>>      table side t0 with ts=0
> >>>>      stream side s1 with ts=1 <-- enters buffer
> >>>>      table side t10 with ts=10
> >>>>      table side t20 with ts=20
> >>>>      stream side s21 with ts=21 <-- evicts record s1 from buffer, but
> >>>>      versioned store no longer contains data for ts=1 due to history
> >> retention
> >>>>      having elapsed
> >>>>
> >>>>      This will result in the join result (s1, null) even though it
> >> should've
> >>>>      been (s1, t0), due to t0 having been expired from the versioned
> >> store
> >>>>      already.
> >>>>      - Out-of-order records from the stream-side will be reordered,
> and
> >> late
> >>>>      records will be dropped.
> >>>>
> >>>> I don't think any of these are reasons to not go forward with this
> KIP,
> >> but
> >>>> it'd be good to call them out in the eventual documentation to
> decrease
> >> the
> >>>> chance users get tripped up.
> >>>>
> >>>>> We could maybe do an improvement later to advance stream time from
> >> table
> >>>> side as well, but that might be debatable as we might get more late
> >> records.
> >>>>
> >>>> Yes, the likelihood of late records increases but also the likelihood
> of
> >>>> "join misses" due to versioned store history retention having elapsed
> >>>> decreases, which feels important for certain use cases. Either way,
> >> agreed
> >>>> that it can be a discussion for the future as incorporating this would
> >>>> substantially complicate the implementation.
> >>>>
> >>>> Also a couple nits:
> >>>>
> >>>>      - The KIP currently says "We recently added versioned tables
> which
> >> allow
> >>>>      the table side of the a join [...] but it is not taken advantage
> of
> >> in
> >>>>      joins," but this doesn't seem true? If the table of a
> stream-table
> >> join is
> >>>>      versioned, then the DSL's stream-table join processor will
> >> automatically
> >>>>      perform timestamped lookups into the table, in order to take
> >> advantage of
> >>>>      the new timestamp-aware store to provide better join semantics.
> >>>>      - The KIP mentions "grace period" for versioned stores in a
> number
> >> of
> >>>>      places but I think you actually mean "history retention"? The two
> >> happen to
> >>>>      be the same today (it is not an option for users to configure the
> >> two
> >>>>      separately) but this need not be true in the future. "History
> >> retention"
> >>>>      governs how far back in time reads may occur, which is the
> relevant
> >>>>      parameter for performing lookups as part of the stream-table
> join.
> >> "Grace
> >>>>      period" in the context of versioned stores refers to how far back
> >> in time
> >>>>      out-of-order writes may occur, which probably isn't directly
> >> relevant for
> >>>>      introducing a stream-side buffer, though it's also possible I've
> >> overlooked
> >>>>      something. (As a bonus, switching from "table grace period" in
> the
> >> KIP to
> >>>>      "table history retention" also helps to clarify/distinguish that
> >> it's a
> >>>>      different parameter from the "join grace period," which I could
> see
> >> being
> >>>>      confusing to readers. :) )
> >>>>
> >>>>
> >>>> Cheers,
> >>>> Victoria
> >>>>
> >>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
> >>>> <wc...@confluent.io.invalid> wrote:
> >>>>
> >>>>> Hey all,
> >>>>>
> >>>>> Thanks for the comments, they gave me a lot to think about. I'll try
> to
> >>>>> address them all inorder. I have made some updates to the kip related
> >> to
> >>>>> them, but I mention where below.
> >>>>>
> >>>>> Lucas
> >>>>>
> >>>>> Good idea about the example. I added a simple one.
> >>>>>
> >>>>> 1) I have thought about including options for the underlying buffer
> >>>>> configuration. One of which might be adding an in memory option. My
> >> biggest
> >>>>> concern is about the semantic guarantees. This isn't like suppress or
> >> with
> >>>>> windows where producing incomplete results is repetitively harmless.
> >> Here
> >>>>> we would be possibly producing incorrect results. I also would like
> to
> >> keep
> >>>>> the interface changes as simple as I can. Making more than this
> change
> >> to
> >>>>> Joined I feel could make this more complicated than it needs to be.
> If
> >> we
> >>>>> really want to I could see adding a grace() option with a
> BufferConifg
> >> in
> >>>>> there or something, but I would rather not.
> >>>>>
> >>>>> 2) The buffer will be independent of if the table is versioned or
> not.
> >> If
> >>>>> table is not materialized it will materialize it as versioned. It
> might
> >>>>> make sense to do a follow up kip where we force the retention period
> >> of
> >>>>> the versioned to be greater than whatever the max of the stream
> buffer
> >> is.
> >>>>>
> >>>>> Victoria
> >>>>>
> >>>>> 1) Yes, records will exit in timestamp order not in offset order.
> >>>>> 2) Late records will be dropped (Late as out of the grace period).
> >>  From my
> >>>>> understanding that is the point of a grace period, no? Doesn't the
> same
> >>>>> thing happen with versioned stores?
> >>>>> 3) The segment store already has an observed stream time, we advance
> >> based
> >>>>> on that. That should only advance based on records that enter the
> >> store. So
> >>>>> yes, only stream side records. We could maybe do an improvement later
> >> to
> >>>>> advance stream time from table side as well, but that might be
> >> debatable as
> >>>>> we might get more late records. Anyways I would rather have that as a
> >>>>> separate discussion.
> >>>>>
> >>>>> in memory option? We can do that, for the buffer I plan to use the
> >>>>> TimeOrderedKeyValueBuffer interface which already has an in memory
> >>>>> implantation, so it would be simple.
> >>>>>
> >>>>> I said more in my answer to Lucas's question. The concern I have with
> >>>>> buffer configs or in memory is complicating the interface. Also
> >> semantic
> >>>>> guarantees but in memory shouldn't effect that
> >>>>>
> >>>>> Matthias
> >>>>>
> >>>>> 1) fixed out of order vs late terminology in the kip.
> >>>>>
> >>>>> 2) I was referring to having a stream. So after this kip we can have
> a
> >>>>> buffered stream or a normal one. For the table we can use a versioned
> >> table
> >>>>> or a normal table.
> >>>>>
> >>>>> 3 Good call out. I clarified this as "If the table side uses a
> >> materialized
> >>>>> version store, it can store multiple versions of each record within
> its
> >>>>> defined grace period." and modified the rest of the paragraph a bit.
> >>>>>
> >>>>> 4) I get the preserving off offset ordering, but if the stream is
> >> buffered
> >>>>> to join on timestamp instead of offset doesn't it already seem like
> we
> >> care
> >>>>> more about time in this case?
> >>>>>
> >>>>> If we end up adding more options it might make sense to do this.
> Maybe
> >>>>> offset order processing can be a follow up?
> >>>>>
> >>>>> I'll add a section for this in Rejected Alternatives. I think it
> makes
> >>>>> sense to do something like this but maybe in a follow up.
> >>>>>
> >>>>> 5) I hadn't thought about this. I suppose if they changed this in an
> >>>>> upgrade the next record would either evict a lot of records (if the
> >> grace
> >>>>> period decreased) or there would be a pause until the new grace
> period
> >>>>> reached. Increasing is a bit more problematic, especially if the
> table
> >>>>> grace period and retention time stays the same. If the data is
> >> reprocessed
> >>>>> after a change like that then there would be different results, but I
> >> feel
> >>>>> like that would be expected after such a change.
> >>>>>
> >>>>> What do you think should happen?
> >>>>>
> >>>>> Hopefully this answers your questions!
> >>>>>
> >>>>> Walker
> >>>>>
> >>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org>
> >> wrote:
> >>>>>
> >>>>>> Thanks for the KIP! Also some question/comments from my side:
> >>>>>>
> >>>>>> 10) Notation: you use the term "late data" but I think you mean
> >>>>>> out-of-order. We reserve the term "late" to records that arrive
> after
> >>>>>> grace period passed, and thus, "late == out-of-order data that is
> >>>>> dropped".
> >>>>>>
> >>>>>>
> >>>>>> 20) "There is only one option from the stream side and only recently
> >> is
> >>>>>> there a second option on the table side."
> >>>>>>
> >>>>>> What are those options? Victoria already asked about the table side,
> >> but
> >>>>>> I am also not sure what option you mean for the stream side?
> >>>>>>
> >>>>>>
> >>>>>> 30) "If the table side uses a materialized version store the value
> is
> >>>>>> the latest by stream time rather than by offset within its defined
> >> grace
> >>>>>> period."
> >>>>>>
> >>>>>> The phrase "the value is the latest by stream time" is confusing --
> in
> >>>>>> the end, a versioned stores multiple versions, not just one.
> >>>>>>
> >>>>>>
> >>>>>> 40) I am also wondering about ordering. In general, KS tries to
> >> preserve
> >>>>>> offset-order during processing (with some exception, when offset
> order
> >>>>>> preservation is not clearly defined). Given that the stream-side
> >> buffer
> >>>>>> is really just a "linear buffer", we could easily preserve
> >> offset-order.
> >>>>>> But I also see a benefit of re-ordering and emitting out-of-order
> data
> >>>>>> right away when read (instead of blocking them behind in-order
> records
> >>>>>> that are not ready yet). -- It might even be a possibility, to let
> >> users
> >>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
> >>>>>> placeholder).
> >>>>>>
> >>>>>> The KIP should explain this in more detail and also discuss
> different
> >>>>>> options and mention them in "Rejected alternatives" in case we don't
> >>>>>> want to include them.
> >>>>>>
> >>>>>>
> >>>>>> 50) What happens when users change the grace period? Especially,
> when
> >>>>>> they turn it on/off (but also increasing/decreasing is an
> interesting
> >>>>>> point)? I think we should try to support this if possible; the
> >>>>>> "Compatibility" section needs to cover switching on/off in more
> >> detail.
> >>>>>>
> >>>>>>
> >>>>>> -Matthias
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
> >>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
> >>>>>>>
> >>>>>>> A few clarifications:
> >>>>>>>
> >>>>>>> 1. Is the order that records exit the buffer in necessarily the
> same
> >> as
> >>>>>> the
> >>>>>>> order that records enter the buffer in, or no? Based on the
> >> description
> >>>>>> in
> >>>>>>> the KIP, it sounds like the answer is no, i.e., records will exit
> the
> >>>>>>> buffer in increasing timestamp order, which means that they may be
> >>>>>> ordered
> >>>>>>> (even for the same key) compared to the input order.
> >>>>>>>
> >>>>>>> 2. What happens if the join grace period is nonzero, and a
> >> stream-side
> >>>>>>> record arrives with a timestamp that is older than the current
> stream
> >>>>>> time
> >>>>>>> minus the grace period? Will this record trigger a join result, or
> >> will
> >>>>>> it
> >>>>>>> be dropped? Based on the description for what happens when the join
> >>>>> grace
> >>>>>>> period is set to zero, it sounds like the late record will be
> >> dropped,
> >>>>>> even
> >>>>>>> if the join grace period is nonzero. Is that true?
> >>>>>>>
> >>>>>>> 3. What could cause stream time to advance, for purposes of
> removing
> >>>>>>> records from the join buffer? For example, will new records
> arriving
> >> on
> >>>>>> the
> >>>>>>> table side of the join cause stream time to advance? From the KIP
> it
> >>>>>> sounds
> >>>>>>> like only stream-side records will advance stream time -- does that
> >>>>> mean
> >>>>>>> that the join processor itself will have to track this stream time?
> >>>>>>>
> >>>>>>> Also +1 to Lucas's question about what options will be available
> for
> >>>>>>> configuring the join buffer. Will users have the option to choose
> >>>>> whether
> >>>>>>> they want the buffer to be in-memory vs persistent?
> >>>>>>>
> >>>>>>> - Victoria
> >>>>>>>
> >>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> >>>>>>> <lb...@confluent.io.invalid> wrote:
> >>>>>>>
> >>>>>>>> HI Walker,
> >>>>>>>>
> >>>>>>>> thanks for the KIP! We definitely need this. I have two questions:
> >>>>>>>>
> >>>>>>>>     - Have you considered allowing the customization of the
> >> underlying
> >>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
> >> customize
> >>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it make
> >> sense
> >>>>>>>> for `Joined` to have this as well? I can imagine one may want to
> >> limit
> >>>>>>>> the number of records in the buffer, for example. If we hit the
> >>>>>>>> maximum, the only option would be to drop semantic guarantees, but
> >>>>>>>> users may still want to do this.
> >>>>>>>>     - With "second option on the table side" you are referring to
> >>>>>>>> versioned tables, right? Will the buffer on the stream side behave
> >> any
> >>>>>>>> different whether the table side is versioned or not?
> >>>>>>>>
> >>>>>>>> Finally, I think a simple example in the motivation section could
> >> help
> >>>>>>>> non-experts understand the KIP.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Lucas
> >>>>>>>>
> >>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> >>>>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>>>>
> >>>>>>>>> Hello everybody,
> >>>>>>>>>
> >>>>>>>>> I have a stream proposal to improve the stream table join by
> >> adding a
> >>>>>>>> grace
> >>>>>>>>> period and buffer to the stream side of the join to allow
> >> processing
> >>>>> in
> >>>>>>>>> timestamp order matching the recent improvements of the versioned
> >>>>>> tables.
> >>>>>>>>>
> >>>>>>>>> Please take a look here <
> >>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
> >>>>>>>> and
> >>>>>>>>> share your thoughts.
> >>>>>>>>>
> >>>>>>>>> best,
> >>>>>>>>> Walker
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>
> >>
> >
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Bruno Cadonna <ca...@apache.org>.

Hi Walker,

thanks for the KIP!

Here my feedback:

1.
It is still not clear to me when stream time for the buffer advances. 
What is the event that let the stream time advance? In the discussion, I 
do not understand what you mean by "The segment store already has an 
observed stream time, we advance based on that. That should only advance 
based on records that enter the store." Where does this segment store 
come from? Anyways, I think it would be great to also state how stream 
time advances in the KIP.

2.
How does the buffer behave in case of a failure? I think I understand 
that the buffer will use an implementation of TimeOrderedKeyValueBuffer 
and therefore the records in the buffer will be replicated to a topic in 
Kafka, but I am not completely sure. Could you elaborate on this in the KIP?

3.
I agree with Matthias about dropping late records. We use grace periods 
in scenarios where we records are grouped like in windowed aggregations 
and windowed joins. The stream buffer you propose does not really group 
any records. It rather delays records and reorders them. I am not sure 
if grace period is the right naming/concept to apply here. Instead of 
dropping records that fall outside of the buffer's time interval the 
join should skip the buffer and try to join the record immediately. In 
the end, a stream-table join is a unwindowed join, i.e., no grouping is 
applied to the records.
What do you and other folks think about this proposal?

4.
How does the proposed buffer, affects processing latency? Could you 
please add some words about this to the KIP?


Best,
Bruno




On 31.05.23 01:49, Walker Carlson wrote:
> Thanks for all the additional comments. I will either address them here or
> update the kip accordingly.
> 
> 
> I mentioned a follow kip to add extra features before and in the responses.
> I will try to briefly summarize what options and optimizations I plan to
> include. If a concern is not covered in this list I for sure talk about it
> below.
> 
> * Allowing non versioned tables to still use the stream buffer
> * Automatically materializing tables instead of forcing the user to do it
> * Configurable for in memory buffer
> * Order the records in offset order or in time order
> * Non memory use buffer (offset order, delayed pull from stream.)
> * Time synced between stream and table side (maybe)
> * Do not drop late records and process them as they come in instead.
> 
> 
> First, Victoria.
> 
> 1) (One of your nits covers this, but you are correct it doesn't make
> sense. so I removed that part of the example.)
> For those examples with the "bad" join results I said without buffering the
> stream it would look like that, but that was incomplete. If the look up was
> simply looking at the latest version of the table when the stream records
> came in then the results were possible. If we are using the point in time
> lookup that versioned tables let us then you are correct the future results
> are not possible.
> 
> 2) I'll get to this later as Matthias brought up something related.
> 
> To your additional thoughts, I agree that we need to call those things out
> in the documentation. I'm writing up a follow up kip with a lot of the
> ideas we have discussed so that we can improve this feature beyond the base
> implementation if it's needed.
> 
> I addressed the nits in the kip. I somehow missed the table stream table
> join processor improvement, it makes your first question make a lot more
> sense.  Table history retention is a much cleaner way to describe it.
> 
> As to your mention of the syncing the time for the table and stream.
> Matthias mentioned that as well. I will address both here. I plan to bring
> that up in the future, but for now we will leave it out. I suppose it will
> be more useful after the table history retention is separable from the
> table grace period.
> 
> 
> To address Matthias comments.
> 
> You are correct by saying the in memory store shouldn't cause any semantic
> concerns. My concern would be more with if we limited the number of records
> on the buffer and what we would do if we hit said limits, (emitting those
> records might be an issue, throwing an error and halting would not). I
> think we can leave this discussion to the follow up kip along with a few
> other options.
> 
> I will go through your proposals now.
> 
>    - don't support non-versioned KTables
> 
> Sure, we can always expand this later on. Will include as part of the of
> the improvement kip
> 
>    - if grace period is added, users need to explicitly materialize the
> table as version (either directly, or upstream. Upstream only works if
> downstream tables "inherit" versioned semantics -- cf KIP-914)
> 
> again, that works for me for now, if we find a use we can always add later.
> 
>    - the table's history retention time must be larger than the grace
> period (should be easy to check at runtime, when we build the topology)
> 
> agreed
> 
>    - because switching from non-versioned to version stores is not
> backward compatibly (cf KIP-914), users need to take care of this
> themselves, and this also implies that adding grace period is not a
> backward compatible change (even only if via indirect means)
> 
> sure, this works
> 
> As to the dropping of late records, I'm not sure. One one hand I like not
> dropping things. But on the other I struggle to see how a user can filter
> out late records that might have incomplete join results. The point in time
> look up will aggressively expire old data and if new data has been replaced
> it will return null if outside of the retention. This seems like it could
> corrupt the integrity of the join output. Seeing that we drop late records
> on the table side as well I would think it makes sense to drop late records
> on the stream buffer. I could be convinced otherwise I suppose, I could see
> adding this as an option in a follow up kip. It would be very easy to
> implement either way. For now unless no one else objects I'm going to stick
> with dropping the records for the sake of getting this kip passed. It is
> functionally a small change to make and we can update later if you feel
> strongly about it.
> 
> For the ordering. I have to say that it would be more complicated to
> implement it to be in offset order, if the goal it to get as many of the
> records validly joined as possible. Because we would process as things left
> the buffer a sufficiency early enough record could hold up records that
> would otherwise be valid past the table history retention. To fix this we
> could process by timestamp then store in a second queue and emit by offset,
> but that would be a lot more complicated. If we didn't care about not
> missing some valid joins we could just have no store and pull from the
> topic at a delay only caring about the timestamp of the next offset. For
> now I want to stick with the timestamp ordering as it makes much more sense
> to me, but would propose we add both of the other options I have laid out
> here in the follow up kip.
> 
> Lastly, I think having an empty store with zero grace period would be super
> simple and not costly, so we might as well make it even if nothing gets
> entered.
> 
> I hope that address all your concerns,
> 
> Walker
> 
> On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org> wrote:
> 
>> Walker,
>>
>> thanks for the updates. The KIP itself reads fine (of course Victoria
>> made good comments about some phrases), but there is a couple of things
>> from your latest reply I don't understand, and that I still think need
>> some more discussions.
>>
>> Lukas, asked about in-memory option and `WindowStoreSupplier` and you
>> mention "semantic concerns". There should not be any semantic difference
>> from the underlying buffer implementation, so I am not sure what you
>> mean here (also the relationship to suppress() is unclear to me)? -- I
>> am ok to not make it configurable for now. We can always do it via a
>> follow up KIP, and keep interface changes limited for now.
>>
>> Does it really make sense to allow a grace period if the table is
>> non-versioned? You also say: "If table is not materialized it will
>> materialize it as versioned." -- What history retention time would we
>> pick for this case (also asked by Victoria)? Or should we rather not
>> support this and force the user to materialize the table explicitly, and
>> thus explicitly picking a history retention time? It's tradeoff between
>> usability and guiding uses that there will be a significant impact on
>> disk usage. There is also compatibility concerns: If the table is not
>> explicitly materialized in the old program, we would already need to
>> materialize it also in the old program (of course, we would use a
>> non-versioned store so far). Thus, if somebody adds a grace period, we
>> cannot just switch the store type, as it would be a breaking change,
>> potentially required an application re-set, or following the upgrade
>> path for versioned state stores, and also changing the program to
>> explicitly materialize using a versioned store. Also note, that we might
>> not materialize the actual join table, but only an upstream table, and
>> use `ValueGetter` to access the upstream data.
>>
>> To this end, as you already mentioned, history retention of the table
>> should be at least grace period. You proposed to include this in a
>> follow up KIP, but I am wondering if it's a fundamental requirement and
>> thus we should put a check in place right away and reject an invalid
>> configuration? (It always easier to lift restriction than to introduce
>> them later.) This would also imply that a non-versioned table cannot be
>> supported, because it does not have a history retention that is larger
>> than grace period, and maybe also answer the requirement about
>> materialization: as we already always materialize something on the
>> tablet side as non-versioned store right now, it seems difficult to
>> migrate the store to a versioned store. Ie, it might be ok to push the
>> burden onto the user and say: if you start using grace period, you also
>> need to manually switch from non-versioned to versioned KTables. Doing
>> stuff automatically under the hood if very complex for this case, we if
>> we push the burden onto the user, it might be ok to not complicate this
>> KIP significantly.
>>
>> To summarize the last two paragraphs, I would propose to:
>>    - don't support non-versioned KTables
>>    - if grace period is added, users need to explicitly materialize the
>> table as version (either directly, or upstream. Upstream only works if
>> downstream tables "inherit" versioned semantics -- cf KIP-914)
>>    - the table's history retention time must be larger than the grace
>> period (should be easy to check at runtime, when we build the topology)
>>    - because switching from non-versioned to version stores is not
>> backward compatibly (cf KIP-914), users need to take care of this
>> themselves, and this also implies that adding grace period is not a
>> backward compatible change (even only if via indirect means)
>>
>> About dropping late records: wondering if we should never drop a
>> stream-side record for a left-join, even if it's late? In general, one
>> thing I observed over the years is, that it's easier to keep stuff and
>> let users filter explicitly downstream (or make it configurable),
>> instead of dropping pro-actively, because users have no good way to
>> resurrect record that got already dropped.
>>
>> For ordering, sounds reasonable to me only start with one
>> implementation, and maybe make it configurable as a follow up. However,
>> I am wondering if starting with offset order might be the better option
>> as it seems to align more with what we do so far? So instead of storing
>> record ordered by timestamp, we can just store them ordered by offset,
>> and still "poll" from the buffer based on the head records timestamp. Or
>> would this complicate the implementation significantly?
>>
>> I also think it's ok to not "sync" stream-time between the table and the
>> stream in this KIP, but we should consider doing this as a follow up
>> change (not sure if we would need a KIP or not for a change this this).
>>
>> About increasing/decreasing grace period: what you describe make sense
>> to me. If decreased, the next record would just trigger emitting a lot
>> of records, and for increase, the buffer would just need to "fill up"
>> again. For reprocessing getting a different result with a different
>> grace period is expected, so that's ok IMHO. -- There seems to be one
>> special corner case: grace period zero. For this case, we actually don't
>> need any store, and the stream-side could be stateless. I think it can
>> have the same behavior, but if we want to "add / remove" the store
>> dynamically, we need to add specific code for it. For example, even if
>> we start up with a grace period of zero, we would need to check if there
>> is a local store, and still emit everything in it, before we can ditch
>> the store (not sure if that's even easily done at all). Or: we would
>> need to have a store for _all_ cases, even if grace period is zero (the
>> store would be empty all the time though), to avoid super complex code?
>>
>>
>> -Matthias
>>
>>
>> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
>>> Hi Walker,
>>>
>>> thanks for your responses. That makes sense. I guess there is always
>>> the option to make the implementation more configurable later on, if
>>> users request it. Also thanks for the clarifications. From my side,
>>> the KIP is good to go.
>>>
>>> Cheers,
>>> Lucas
>>>
>>> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
>>> <vi...@confluent.io.invalid> wrote:
>>>>
>>>> Thanks for the updates, Walker! Looks great, though I do have a couple
>>>> questions about the latest updates:
>>>>
>>>>      1. The new example says that without stream-side buffering, "ex" and
>>>>      "fy" are possible join results. How could those join results
>> happen? The
>>>>      example versioned table suggests that table record "x" has
>> timestamp 2, and
>>>>      table record "y" has timestamp 3. If stream record "e" has
>> timestamp 1,
>>>>      then it can never be joined against record "x", and similarly for
>> stream
>>>>      record "f" with timestamp 2 being joined against "y".
>>>>      2. I see in your replies above that "If table is not materialized it
>>>>      will materialize it as versioned" but I don't see this called out
>> in the
>>>>      KIP -- seems worth calling out. Also, what will the history
>> retention for
>>>>      the versioned table be? Will it be the same as the join grace
>> period, or
>>>>      will it be greater?
>>>>
>>>>
>>>> And some additional thoughts:
>>>>
>>>> Sounds like there are a few things users should watch out for when
>> enabling
>>>> the stream-side buffer:
>>>>
>>>>      - Records will get "stuck" if there are no newer records to advance
>>>>      stream time.
>>>>      - If there are large gaps between the timestamps of stream-side
>> records,
>>>>      then it's possible that versioned store history retention will have
>> expired
>>>>      by the time a record is evicted from the join buffer, leading to a
>> join
>>>>      "miss." For example, if the join grace period and table history
>> retention
>>>>      are both 10, and records come in the order:
>>>>
>>>>      table side t0 with ts=0
>>>>      stream side s1 with ts=1 <-- enters buffer
>>>>      table side t10 with ts=10
>>>>      table side t20 with ts=20
>>>>      stream side s21 with ts=21 <-- evicts record s1 from buffer, but
>>>>      versioned store no longer contains data for ts=1 due to history
>> retention
>>>>      having elapsed
>>>>
>>>>      This will result in the join result (s1, null) even though it
>> should've
>>>>      been (s1, t0), due to t0 having been expired from the versioned
>> store
>>>>      already.
>>>>      - Out-of-order records from the stream-side will be reordered, and
>> late
>>>>      records will be dropped.
>>>>
>>>> I don't think any of these are reasons to not go forward with this KIP,
>> but
>>>> it'd be good to call them out in the eventual documentation to decrease
>> the
>>>> chance users get tripped up.
>>>>
>>>>> We could maybe do an improvement later to advance stream time from
>> table
>>>> side as well, but that might be debatable as we might get more late
>> records.
>>>>
>>>> Yes, the likelihood of late records increases but also the likelihood of
>>>> "join misses" due to versioned store history retention having elapsed
>>>> decreases, which feels important for certain use cases. Either way,
>> agreed
>>>> that it can be a discussion for the future as incorporating this would
>>>> substantially complicate the implementation.
>>>>
>>>> Also a couple nits:
>>>>
>>>>      - The KIP currently says "We recently added versioned tables which
>> allow
>>>>      the table side of the a join [...] but it is not taken advantage of
>> in
>>>>      joins," but this doesn't seem true? If the table of a stream-table
>> join is
>>>>      versioned, then the DSL's stream-table join processor will
>> automatically
>>>>      perform timestamped lookups into the table, in order to take
>> advantage of
>>>>      the new timestamp-aware store to provide better join semantics.
>>>>      - The KIP mentions "grace period" for versioned stores in a number
>> of
>>>>      places but I think you actually mean "history retention"? The two
>> happen to
>>>>      be the same today (it is not an option for users to configure the
>> two
>>>>      separately) but this need not be true in the future. "History
>> retention"
>>>>      governs how far back in time reads may occur, which is the relevant
>>>>      parameter for performing lookups as part of the stream-table join.
>> "Grace
>>>>      period" in the context of versioned stores refers to how far back
>> in time
>>>>      out-of-order writes may occur, which probably isn't directly
>> relevant for
>>>>      introducing a stream-side buffer, though it's also possible I've
>> overlooked
>>>>      something. (As a bonus, switching from "table grace period" in the
>> KIP to
>>>>      "table history retention" also helps to clarify/distinguish that
>> it's a
>>>>      different parameter from the "join grace period," which I could see
>> being
>>>>      confusing to readers. :) )
>>>>
>>>>
>>>> Cheers,
>>>> Victoria
>>>>
>>>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
>>>> <wc...@confluent.io.invalid> wrote:
>>>>
>>>>> Hey all,
>>>>>
>>>>> Thanks for the comments, they gave me a lot to think about. I'll try to
>>>>> address them all inorder. I have made some updates to the kip related
>> to
>>>>> them, but I mention where below.
>>>>>
>>>>> Lucas
>>>>>
>>>>> Good idea about the example. I added a simple one.
>>>>>
>>>>> 1) I have thought about including options for the underlying buffer
>>>>> configuration. One of which might be adding an in memory option. My
>> biggest
>>>>> concern is about the semantic guarantees. This isn't like suppress or
>> with
>>>>> windows where producing incomplete results is repetitively harmless.
>> Here
>>>>> we would be possibly producing incorrect results. I also would like to
>> keep
>>>>> the interface changes as simple as I can. Making more than this change
>> to
>>>>> Joined I feel could make this more complicated than it needs to be. If
>> we
>>>>> really want to I could see adding a grace() option with a BufferConifg
>> in
>>>>> there or something, but I would rather not.
>>>>>
>>>>> 2) The buffer will be independent of if the table is versioned or not.
>> If
>>>>> table is not materialized it will materialize it as versioned. It might
>>>>> make sense to do a follow up kip where we force the retention period
>> of
>>>>> the versioned to be greater than whatever the max of the stream buffer
>> is.
>>>>>
>>>>> Victoria
>>>>>
>>>>> 1) Yes, records will exit in timestamp order not in offset order.
>>>>> 2) Late records will be dropped (Late as out of the grace period).
>>  From my
>>>>> understanding that is the point of a grace period, no? Doesn't the same
>>>>> thing happen with versioned stores?
>>>>> 3) The segment store already has an observed stream time, we advance
>> based
>>>>> on that. That should only advance based on records that enter the
>> store. So
>>>>> yes, only stream side records. We could maybe do an improvement later
>> to
>>>>> advance stream time from table side as well, but that might be
>> debatable as
>>>>> we might get more late records. Anyways I would rather have that as a
>>>>> separate discussion.
>>>>>
>>>>> in memory option? We can do that, for the buffer I plan to use the
>>>>> TimeOrderedKeyValueBuffer interface which already has an in memory
>>>>> implantation, so it would be simple.
>>>>>
>>>>> I said more in my answer to Lucas's question. The concern I have with
>>>>> buffer configs or in memory is complicating the interface. Also
>> semantic
>>>>> guarantees but in memory shouldn't effect that
>>>>>
>>>>> Matthias
>>>>>
>>>>> 1) fixed out of order vs late terminology in the kip.
>>>>>
>>>>> 2) I was referring to having a stream. So after this kip we can have a
>>>>> buffered stream or a normal one. For the table we can use a versioned
>> table
>>>>> or a normal table.
>>>>>
>>>>> 3 Good call out. I clarified this as "If the table side uses a
>> materialized
>>>>> version store, it can store multiple versions of each record within its
>>>>> defined grace period." and modified the rest of the paragraph a bit.
>>>>>
>>>>> 4) I get the preserving off offset ordering, but if the stream is
>> buffered
>>>>> to join on timestamp instead of offset doesn't it already seem like we
>> care
>>>>> more about time in this case?
>>>>>
>>>>> If we end up adding more options it might make sense to do this. Maybe
>>>>> offset order processing can be a follow up?
>>>>>
>>>>> I'll add a section for this in Rejected Alternatives. I think it makes
>>>>> sense to do something like this but maybe in a follow up.
>>>>>
>>>>> 5) I hadn't thought about this. I suppose if they changed this in an
>>>>> upgrade the next record would either evict a lot of records (if the
>> grace
>>>>> period decreased) or there would be a pause until the new grace period
>>>>> reached. Increasing is a bit more problematic, especially if the table
>>>>> grace period and retention time stays the same. If the data is
>> reprocessed
>>>>> after a change like that then there would be different results, but I
>> feel
>>>>> like that would be expected after such a change.
>>>>>
>>>>> What do you think should happen?
>>>>>
>>>>> Hopefully this answers your questions!
>>>>>
>>>>> Walker
>>>>>
>>>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org>
>> wrote:
>>>>>
>>>>>> Thanks for the KIP! Also some question/comments from my side:
>>>>>>
>>>>>> 10) Notation: you use the term "late data" but I think you mean
>>>>>> out-of-order. We reserve the term "late" to records that arrive after
>>>>>> grace period passed, and thus, "late == out-of-order data that is
>>>>> dropped".
>>>>>>
>>>>>>
>>>>>> 20) "There is only one option from the stream side and only recently
>> is
>>>>>> there a second option on the table side."
>>>>>>
>>>>>> What are those options? Victoria already asked about the table side,
>> but
>>>>>> I am also not sure what option you mean for the stream side?
>>>>>>
>>>>>>
>>>>>> 30) "If the table side uses a materialized version store the value is
>>>>>> the latest by stream time rather than by offset within its defined
>> grace
>>>>>> period."
>>>>>>
>>>>>> The phrase "the value is the latest by stream time" is confusing -- in
>>>>>> the end, a versioned stores multiple versions, not just one.
>>>>>>
>>>>>>
>>>>>> 40) I am also wondering about ordering. In general, KS tries to
>> preserve
>>>>>> offset-order during processing (with some exception, when offset order
>>>>>> preservation is not clearly defined). Given that the stream-side
>> buffer
>>>>>> is really just a "linear buffer", we could easily preserve
>> offset-order.
>>>>>> But I also see a benefit of re-ordering and emitting out-of-order data
>>>>>> right away when read (instead of blocking them behind in-order records
>>>>>> that are not ready yet). -- It might even be a possibility, to let
>> users
>>>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
>>>>>> placeholder).
>>>>>>
>>>>>> The KIP should explain this in more detail and also discuss different
>>>>>> options and mention them in "Rejected alternatives" in case we don't
>>>>>> want to include them.
>>>>>>
>>>>>>
>>>>>> 50) What happens when users change the grace period? Especially, when
>>>>>> they turn it on/off (but also increasing/decreasing is an interesting
>>>>>> point)? I think we should try to support this if possible; the
>>>>>> "Compatibility" section needs to cover switching on/off in more
>> detail.
>>>>>>
>>>>>>
>>>>>> -Matthias
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
>>>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
>>>>>>>
>>>>>>> A few clarifications:
>>>>>>>
>>>>>>> 1. Is the order that records exit the buffer in necessarily the same
>> as
>>>>>> the
>>>>>>> order that records enter the buffer in, or no? Based on the
>> description
>>>>>> in
>>>>>>> the KIP, it sounds like the answer is no, i.e., records will exit the
>>>>>>> buffer in increasing timestamp order, which means that they may be
>>>>>> ordered
>>>>>>> (even for the same key) compared to the input order.
>>>>>>>
>>>>>>> 2. What happens if the join grace period is nonzero, and a
>> stream-side
>>>>>>> record arrives with a timestamp that is older than the current stream
>>>>>> time
>>>>>>> minus the grace period? Will this record trigger a join result, or
>> will
>>>>>> it
>>>>>>> be dropped? Based on the description for what happens when the join
>>>>> grace
>>>>>>> period is set to zero, it sounds like the late record will be
>> dropped,
>>>>>> even
>>>>>>> if the join grace period is nonzero. Is that true?
>>>>>>>
>>>>>>> 3. What could cause stream time to advance, for purposes of removing
>>>>>>> records from the join buffer? For example, will new records arriving
>> on
>>>>>> the
>>>>>>> table side of the join cause stream time to advance? From the KIP it
>>>>>> sounds
>>>>>>> like only stream-side records will advance stream time -- does that
>>>>> mean
>>>>>>> that the join processor itself will have to track this stream time?
>>>>>>>
>>>>>>> Also +1 to Lucas's question about what options will be available for
>>>>>>> configuring the join buffer. Will users have the option to choose
>>>>> whether
>>>>>>> they want the buffer to be in-memory vs persistent?
>>>>>>>
>>>>>>> - Victoria
>>>>>>>
>>>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
>>>>>>> <lb...@confluent.io.invalid> wrote:
>>>>>>>
>>>>>>>> HI Walker,
>>>>>>>>
>>>>>>>> thanks for the KIP! We definitely need this. I have two questions:
>>>>>>>>
>>>>>>>>     - Have you considered allowing the customization of the
>> underlying
>>>>>>>> buffer implementation? As I can see, `StreamJoined` lets you
>> customize
>>>>>>>> the underlying store via a `WindowStoreSupplier`. Would it make
>> sense
>>>>>>>> for `Joined` to have this as well? I can imagine one may want to
>> limit
>>>>>>>> the number of records in the buffer, for example. If we hit the
>>>>>>>> maximum, the only option would be to drop semantic guarantees, but
>>>>>>>> users may still want to do this.
>>>>>>>>     - With "second option on the table side" you are referring to
>>>>>>>> versioned tables, right? Will the buffer on the stream side behave
>> any
>>>>>>>> different whether the table side is versioned or not?
>>>>>>>>
>>>>>>>> Finally, I think a simple example in the motivation section could
>> help
>>>>>>>> non-experts understand the KIP.
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Lucas
>>>>>>>>
>>>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
>>>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>>>
>>>>>>>>> Hello everybody,
>>>>>>>>>
>>>>>>>>> I have a stream proposal to improve the stream table join by
>> adding a
>>>>>>>> grace
>>>>>>>>> period and buffer to the stream side of the join to allow
>> processing
>>>>> in
>>>>>>>>> timestamp order matching the recent improvements of the versioned
>>>>>> tables.
>>>>>>>>>
>>>>>>>>> Please take a look here <
>>>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
>>>>>>>> and
>>>>>>>>> share your thoughts.
>>>>>>>>>
>>>>>>>>> best,
>>>>>>>>> Walker
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

Thanks for all the additional comments. I will either address them here or
update the kip accordingly.


I mentioned a follow kip to add extra features before and in the responses.
I will try to briefly summarize what options and optimizations I plan to
include. If a concern is not covered in this list I for sure talk about it
below.

* Allowing non versioned tables to still use the stream buffer
* Automatically materializing tables instead of forcing the user to do it
* Configurable for in memory buffer
* Order the records in offset order or in time order
* Non memory use buffer (offset order, delayed pull from stream.)
* Time synced between stream and table side (maybe)
* Do not drop late records and process them as they come in instead.


First, Victoria.

1) (One of your nits covers this, but you are correct it doesn't make
sense. so I removed that part of the example.)
For those examples with the "bad" join results I said without buffering the
stream it would look like that, but that was incomplete. If the look up was
simply looking at the latest version of the table when the stream records
came in then the results were possible. If we are using the point in time
lookup that versioned tables let us then you are correct the future results
are not possible.

2) I'll get to this later as Matthias brought up something related.

To your additional thoughts, I agree that we need to call those things out
in the documentation. I'm writing up a follow up kip with a lot of the
ideas we have discussed so that we can improve this feature beyond the base
implementation if it's needed.

I addressed the nits in the kip. I somehow missed the table stream table
join processor improvement, it makes your first question make a lot more
sense.  Table history retention is a much cleaner way to describe it.

As to your mention of the syncing the time for the table and stream.
Matthias mentioned that as well. I will address both here. I plan to bring
that up in the future, but for now we will leave it out. I suppose it will
be more useful after the table history retention is separable from the
table grace period.


To address Matthias comments.

You are correct by saying the in memory store shouldn't cause any semantic
concerns. My concern would be more with if we limited the number of records
on the buffer and what we would do if we hit said limits, (emitting those
records might be an issue, throwing an error and halting would not). I
think we can leave this discussion to the follow up kip along with a few
other options.

I will go through your proposals now.

  - don't support non-versioned KTables

Sure, we can always expand this later on. Will include as part of the of
the improvement kip

  - if grace period is added, users need to explicitly materialize the
table as version (either directly, or upstream. Upstream only works if
downstream tables "inherit" versioned semantics -- cf KIP-914)

again, that works for me for now, if we find a use we can always add later.

  - the table's history retention time must be larger than the grace
period (should be easy to check at runtime, when we build the topology)

agreed

  - because switching from non-versioned to version stores is not
backward compatibly (cf KIP-914), users need to take care of this
themselves, and this also implies that adding grace period is not a
backward compatible change (even only if via indirect means)

sure, this works

As to the dropping of late records, I'm not sure. One one hand I like not
dropping things. But on the other I struggle to see how a user can filter
out late records that might have incomplete join results. The point in time
look up will aggressively expire old data and if new data has been replaced
it will return null if outside of the retention. This seems like it could
corrupt the integrity of the join output. Seeing that we drop late records
on the table side as well I would think it makes sense to drop late records
on the stream buffer. I could be convinced otherwise I suppose, I could see
adding this as an option in a follow up kip. It would be very easy to
implement either way. For now unless no one else objects I'm going to stick
with dropping the records for the sake of getting this kip passed. It is
functionally a small change to make and we can update later if you feel
strongly about it.

For the ordering. I have to say that it would be more complicated to
implement it to be in offset order, if the goal it to get as many of the
records validly joined as possible. Because we would process as things left
the buffer a sufficiency early enough record could hold up records that
would otherwise be valid past the table history retention. To fix this we
could process by timestamp then store in a second queue and emit by offset,
but that would be a lot more complicated. If we didn't care about not
missing some valid joins we could just have no store and pull from the
topic at a delay only caring about the timestamp of the next offset. For
now I want to stick with the timestamp ordering as it makes much more sense
to me, but would propose we add both of the other options I have laid out
here in the follow up kip.

Lastly, I think having an empty store with zero grace period would be super
simple and not costly, so we might as well make it even if nothing gets
entered.

I hope that address all your concerns,

Walker

On Thu, May 25, 2023 at 9:50 AM Matthias J. Sax <mj...@apache.org> wrote:

> Walker,
>
> thanks for the updates. The KIP itself reads fine (of course Victoria
> made good comments about some phrases), but there is a couple of things
> from your latest reply I don't understand, and that I still think need
> some more discussions.
>
> Lukas, asked about in-memory option and `WindowStoreSupplier` and you
> mention "semantic concerns". There should not be any semantic difference
> from the underlying buffer implementation, so I am not sure what you
> mean here (also the relationship to suppress() is unclear to me)? -- I
> am ok to not make it configurable for now. We can always do it via a
> follow up KIP, and keep interface changes limited for now.
>
> Does it really make sense to allow a grace period if the table is
> non-versioned? You also say: "If table is not materialized it will
> materialize it as versioned." -- What history retention time would we
> pick for this case (also asked by Victoria)? Or should we rather not
> support this and force the user to materialize the table explicitly, and
> thus explicitly picking a history retention time? It's tradeoff between
> usability and guiding uses that there will be a significant impact on
> disk usage. There is also compatibility concerns: If the table is not
> explicitly materialized in the old program, we would already need to
> materialize it also in the old program (of course, we would use a
> non-versioned store so far). Thus, if somebody adds a grace period, we
> cannot just switch the store type, as it would be a breaking change,
> potentially required an application re-set, or following the upgrade
> path for versioned state stores, and also changing the program to
> explicitly materialize using a versioned store. Also note, that we might
> not materialize the actual join table, but only an upstream table, and
> use `ValueGetter` to access the upstream data.
>
> To this end, as you already mentioned, history retention of the table
> should be at least grace period. You proposed to include this in a
> follow up KIP, but I am wondering if it's a fundamental requirement and
> thus we should put a check in place right away and reject an invalid
> configuration? (It always easier to lift restriction than to introduce
> them later.) This would also imply that a non-versioned table cannot be
> supported, because it does not have a history retention that is larger
> than grace period, and maybe also answer the requirement about
> materialization: as we already always materialize something on the
> tablet side as non-versioned store right now, it seems difficult to
> migrate the store to a versioned store. Ie, it might be ok to push the
> burden onto the user and say: if you start using grace period, you also
> need to manually switch from non-versioned to versioned KTables. Doing
> stuff automatically under the hood if very complex for this case, we if
> we push the burden onto the user, it might be ok to not complicate this
> KIP significantly.
>
> To summarize the last two paragraphs, I would propose to:
>   - don't support non-versioned KTables
>   - if grace period is added, users need to explicitly materialize the
> table as version (either directly, or upstream. Upstream only works if
> downstream tables "inherit" versioned semantics -- cf KIP-914)
>   - the table's history retention time must be larger than the grace
> period (should be easy to check at runtime, when we build the topology)
>   - because switching from non-versioned to version stores is not
> backward compatibly (cf KIP-914), users need to take care of this
> themselves, and this also implies that adding grace period is not a
> backward compatible change (even only if via indirect means)
>
> About dropping late records: wondering if we should never drop a
> stream-side record for a left-join, even if it's late? In general, one
> thing I observed over the years is, that it's easier to keep stuff and
> let users filter explicitly downstream (or make it configurable),
> instead of dropping pro-actively, because users have no good way to
> resurrect record that got already dropped.
>
> For ordering, sounds reasonable to me only start with one
> implementation, and maybe make it configurable as a follow up. However,
> I am wondering if starting with offset order might be the better option
> as it seems to align more with what we do so far? So instead of storing
> record ordered by timestamp, we can just store them ordered by offset,
> and still "poll" from the buffer based on the head records timestamp. Or
> would this complicate the implementation significantly?
>
> I also think it's ok to not "sync" stream-time between the table and the
> stream in this KIP, but we should consider doing this as a follow up
> change (not sure if we would need a KIP or not for a change this this).
>
> About increasing/decreasing grace period: what you describe make sense
> to me. If decreased, the next record would just trigger emitting a lot
> of records, and for increase, the buffer would just need to "fill up"
> again. For reprocessing getting a different result with a different
> grace period is expected, so that's ok IMHO. -- There seems to be one
> special corner case: grace period zero. For this case, we actually don't
> need any store, and the stream-side could be stateless. I think it can
> have the same behavior, but if we want to "add / remove" the store
> dynamically, we need to add specific code for it. For example, even if
> we start up with a grace period of zero, we would need to check if there
> is a local store, and still emit everything in it, before we can ditch
> the store (not sure if that's even easily done at all). Or: we would
> need to have a store for _all_ cases, even if grace period is zero (the
> store would be empty all the time though), to avoid super complex code?
>
>
> -Matthias
>
>
> On 5/25/23 10:53 AM, Lucas Brutschy wrote:
> > Hi Walker,
> >
> > thanks for your responses. That makes sense. I guess there is always
> > the option to make the implementation more configurable later on, if
> > users request it. Also thanks for the clarifications. From my side,
> > the KIP is good to go.
> >
> > Cheers,
> > Lucas
> >
> > On Wed, May 24, 2023 at 11:54 PM Victoria Xia
> > <vi...@confluent.io.invalid> wrote:
> >>
> >> Thanks for the updates, Walker! Looks great, though I do have a couple
> >> questions about the latest updates:
> >>
> >>     1. The new example says that without stream-side buffering, "ex" and
> >>     "fy" are possible join results. How could those join results
> happen? The
> >>     example versioned table suggests that table record "x" has
> timestamp 2, and
> >>     table record "y" has timestamp 3. If stream record "e" has
> timestamp 1,
> >>     then it can never be joined against record "x", and similarly for
> stream
> >>     record "f" with timestamp 2 being joined against "y".
> >>     2. I see in your replies above that "If table is not materialized it
> >>     will materialize it as versioned" but I don't see this called out
> in the
> >>     KIP -- seems worth calling out. Also, what will the history
> retention for
> >>     the versioned table be? Will it be the same as the join grace
> period, or
> >>     will it be greater?
> >>
> >>
> >> And some additional thoughts:
> >>
> >> Sounds like there are a few things users should watch out for when
> enabling
> >> the stream-side buffer:
> >>
> >>     - Records will get "stuck" if there are no newer records to advance
> >>     stream time.
> >>     - If there are large gaps between the timestamps of stream-side
> records,
> >>     then it's possible that versioned store history retention will have
> expired
> >>     by the time a record is evicted from the join buffer, leading to a
> join
> >>     "miss." For example, if the join grace period and table history
> retention
> >>     are both 10, and records come in the order:
> >>
> >>     table side t0 with ts=0
> >>     stream side s1 with ts=1 <-- enters buffer
> >>     table side t10 with ts=10
> >>     table side t20 with ts=20
> >>     stream side s21 with ts=21 <-- evicts record s1 from buffer, but
> >>     versioned store no longer contains data for ts=1 due to history
> retention
> >>     having elapsed
> >>
> >>     This will result in the join result (s1, null) even though it
> should've
> >>     been (s1, t0), due to t0 having been expired from the versioned
> store
> >>     already.
> >>     - Out-of-order records from the stream-side will be reordered, and
> late
> >>     records will be dropped.
> >>
> >> I don't think any of these are reasons to not go forward with this KIP,
> but
> >> it'd be good to call them out in the eventual documentation to decrease
> the
> >> chance users get tripped up.
> >>
> >>> We could maybe do an improvement later to advance stream time from
> table
> >> side as well, but that might be debatable as we might get more late
> records.
> >>
> >> Yes, the likelihood of late records increases but also the likelihood of
> >> "join misses" due to versioned store history retention having elapsed
> >> decreases, which feels important for certain use cases. Either way,
> agreed
> >> that it can be a discussion for the future as incorporating this would
> >> substantially complicate the implementation.
> >>
> >> Also a couple nits:
> >>
> >>     - The KIP currently says "We recently added versioned tables which
> allow
> >>     the table side of the a join [...] but it is not taken advantage of
> in
> >>     joins," but this doesn't seem true? If the table of a stream-table
> join is
> >>     versioned, then the DSL's stream-table join processor will
> automatically
> >>     perform timestamped lookups into the table, in order to take
> advantage of
> >>     the new timestamp-aware store to provide better join semantics.
> >>     - The KIP mentions "grace period" for versioned stores in a number
> of
> >>     places but I think you actually mean "history retention"? The two
> happen to
> >>     be the same today (it is not an option for users to configure the
> two
> >>     separately) but this need not be true in the future. "History
> retention"
> >>     governs how far back in time reads may occur, which is the relevant
> >>     parameter for performing lookups as part of the stream-table join.
> "Grace
> >>     period" in the context of versioned stores refers to how far back
> in time
> >>     out-of-order writes may occur, which probably isn't directly
> relevant for
> >>     introducing a stream-side buffer, though it's also possible I've
> overlooked
> >>     something. (As a bonus, switching from "table grace period" in the
> KIP to
> >>     "table history retention" also helps to clarify/distinguish that
> it's a
> >>     different parameter from the "join grace period," which I could see
> being
> >>     confusing to readers. :) )
> >>
> >>
> >> Cheers,
> >> Victoria
> >>
> >> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
> >> <wc...@confluent.io.invalid> wrote:
> >>
> >>> Hey all,
> >>>
> >>> Thanks for the comments, they gave me a lot to think about. I'll try to
> >>> address them all inorder. I have made some updates to the kip related
> to
> >>> them, but I mention where below.
> >>>
> >>> Lucas
> >>>
> >>> Good idea about the example. I added a simple one.
> >>>
> >>> 1) I have thought about including options for the underlying buffer
> >>> configuration. One of which might be adding an in memory option. My
> biggest
> >>> concern is about the semantic guarantees. This isn't like suppress or
> with
> >>> windows where producing incomplete results is repetitively harmless.
> Here
> >>> we would be possibly producing incorrect results. I also would like to
> keep
> >>> the interface changes as simple as I can. Making more than this change
> to
> >>> Joined I feel could make this more complicated than it needs to be. If
> we
> >>> really want to I could see adding a grace() option with a BufferConifg
> in
> >>> there or something, but I would rather not.
> >>>
> >>> 2) The buffer will be independent of if the table is versioned or not.
> If
> >>> table is not materialized it will materialize it as versioned. It might
> >>> make sense to do a follow up kip where we force the retention period
> of
> >>> the versioned to be greater than whatever the max of the stream buffer
> is.
> >>>
> >>> Victoria
> >>>
> >>> 1) Yes, records will exit in timestamp order not in offset order.
> >>> 2) Late records will be dropped (Late as out of the grace period).
> From my
> >>> understanding that is the point of a grace period, no? Doesn't the same
> >>> thing happen with versioned stores?
> >>> 3) The segment store already has an observed stream time, we advance
> based
> >>> on that. That should only advance based on records that enter the
> store. So
> >>> yes, only stream side records. We could maybe do an improvement later
> to
> >>> advance stream time from table side as well, but that might be
> debatable as
> >>> we might get more late records. Anyways I would rather have that as a
> >>> separate discussion.
> >>>
> >>> in memory option? We can do that, for the buffer I plan to use the
> >>> TimeOrderedKeyValueBuffer interface which already has an in memory
> >>> implantation, so it would be simple.
> >>>
> >>> I said more in my answer to Lucas's question. The concern I have with
> >>> buffer configs or in memory is complicating the interface. Also
> semantic
> >>> guarantees but in memory shouldn't effect that
> >>>
> >>> Matthias
> >>>
> >>> 1) fixed out of order vs late terminology in the kip.
> >>>
> >>> 2) I was referring to having a stream. So after this kip we can have a
> >>> buffered stream or a normal one. For the table we can use a versioned
> table
> >>> or a normal table.
> >>>
> >>> 3 Good call out. I clarified this as "If the table side uses a
> materialized
> >>> version store, it can store multiple versions of each record within its
> >>> defined grace period." and modified the rest of the paragraph a bit.
> >>>
> >>> 4) I get the preserving off offset ordering, but if the stream is
> buffered
> >>> to join on timestamp instead of offset doesn't it already seem like we
> care
> >>> more about time in this case?
> >>>
> >>> If we end up adding more options it might make sense to do this. Maybe
> >>> offset order processing can be a follow up?
> >>>
> >>> I'll add a section for this in Rejected Alternatives. I think it makes
> >>> sense to do something like this but maybe in a follow up.
> >>>
> >>> 5) I hadn't thought about this. I suppose if they changed this in an
> >>> upgrade the next record would either evict a lot of records (if the
> grace
> >>> period decreased) or there would be a pause until the new grace period
> >>> reached. Increasing is a bit more problematic, especially if the table
> >>> grace period and retention time stays the same. If the data is
> reprocessed
> >>> after a change like that then there would be different results, but I
> feel
> >>> like that would be expected after such a change.
> >>>
> >>> What do you think should happen?
> >>>
> >>> Hopefully this answers your questions!
> >>>
> >>> Walker
> >>>
> >>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org>
> wrote:
> >>>
> >>>> Thanks for the KIP! Also some question/comments from my side:
> >>>>
> >>>> 10) Notation: you use the term "late data" but I think you mean
> >>>> out-of-order. We reserve the term "late" to records that arrive after
> >>>> grace period passed, and thus, "late == out-of-order data that is
> >>> dropped".
> >>>>
> >>>>
> >>>> 20) "There is only one option from the stream side and only recently
> is
> >>>> there a second option on the table side."
> >>>>
> >>>> What are those options? Victoria already asked about the table side,
> but
> >>>> I am also not sure what option you mean for the stream side?
> >>>>
> >>>>
> >>>> 30) "If the table side uses a materialized version store the value is
> >>>> the latest by stream time rather than by offset within its defined
> grace
> >>>> period."
> >>>>
> >>>> The phrase "the value is the latest by stream time" is confusing -- in
> >>>> the end, a versioned stores multiple versions, not just one.
> >>>>
> >>>>
> >>>> 40) I am also wondering about ordering. In general, KS tries to
> preserve
> >>>> offset-order during processing (with some exception, when offset order
> >>>> preservation is not clearly defined). Given that the stream-side
> buffer
> >>>> is really just a "linear buffer", we could easily preserve
> offset-order.
> >>>> But I also see a benefit of re-ordering and emitting out-of-order data
> >>>> right away when read (instead of blocking them behind in-order records
> >>>> that are not ready yet). -- It might even be a possibility, to let
> users
> >>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
> >>>> placeholder).
> >>>>
> >>>> The KIP should explain this in more detail and also discuss different
> >>>> options and mention them in "Rejected alternatives" in case we don't
> >>>> want to include them.
> >>>>
> >>>>
> >>>> 50) What happens when users change the grace period? Especially, when
> >>>> they turn it on/off (but also increasing/decreasing is an interesting
> >>>> point)? I think we should try to support this if possible; the
> >>>> "Compatibility" section needs to cover switching on/off in more
> detail.
> >>>>
> >>>>
> >>>> -Matthias
> >>>>
> >>>>
> >>>>
> >>>>
> >>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
> >>>>> Cool KIP, Walker! Thanks for sharing this proposal.
> >>>>>
> >>>>> A few clarifications:
> >>>>>
> >>>>> 1. Is the order that records exit the buffer in necessarily the same
> as
> >>>> the
> >>>>> order that records enter the buffer in, or no? Based on the
> description
> >>>> in
> >>>>> the KIP, it sounds like the answer is no, i.e., records will exit the
> >>>>> buffer in increasing timestamp order, which means that they may be
> >>>> ordered
> >>>>> (even for the same key) compared to the input order.
> >>>>>
> >>>>> 2. What happens if the join grace period is nonzero, and a
> stream-side
> >>>>> record arrives with a timestamp that is older than the current stream
> >>>> time
> >>>>> minus the grace period? Will this record trigger a join result, or
> will
> >>>> it
> >>>>> be dropped? Based on the description for what happens when the join
> >>> grace
> >>>>> period is set to zero, it sounds like the late record will be
> dropped,
> >>>> even
> >>>>> if the join grace period is nonzero. Is that true?
> >>>>>
> >>>>> 3. What could cause stream time to advance, for purposes of removing
> >>>>> records from the join buffer? For example, will new records arriving
> on
> >>>> the
> >>>>> table side of the join cause stream time to advance? From the KIP it
> >>>> sounds
> >>>>> like only stream-side records will advance stream time -- does that
> >>> mean
> >>>>> that the join processor itself will have to track this stream time?
> >>>>>
> >>>>> Also +1 to Lucas's question about what options will be available for
> >>>>> configuring the join buffer. Will users have the option to choose
> >>> whether
> >>>>> they want the buffer to be in-memory vs persistent?
> >>>>>
> >>>>> - Victoria
> >>>>>
> >>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> >>>>> <lb...@confluent.io.invalid> wrote:
> >>>>>
> >>>>>> HI Walker,
> >>>>>>
> >>>>>> thanks for the KIP! We definitely need this. I have two questions:
> >>>>>>
> >>>>>>    - Have you considered allowing the customization of the
> underlying
> >>>>>> buffer implementation? As I can see, `StreamJoined` lets you
> customize
> >>>>>> the underlying store via a `WindowStoreSupplier`. Would it make
> sense
> >>>>>> for `Joined` to have this as well? I can imagine one may want to
> limit
> >>>>>> the number of records in the buffer, for example. If we hit the
> >>>>>> maximum, the only option would be to drop semantic guarantees, but
> >>>>>> users may still want to do this.
> >>>>>>    - With "second option on the table side" you are referring to
> >>>>>> versioned tables, right? Will the buffer on the stream side behave
> any
> >>>>>> different whether the table side is versioned or not?
> >>>>>>
> >>>>>> Finally, I think a simple example in the motivation section could
> help
> >>>>>> non-experts understand the KIP.
> >>>>>>
> >>>>>> Best,
> >>>>>> Lucas
> >>>>>>
> >>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> >>>>>> <wc...@confluent.io.invalid> wrote:
> >>>>>>>
> >>>>>>> Hello everybody,
> >>>>>>>
> >>>>>>> I have a stream proposal to improve the stream table join by
> adding a
> >>>>>> grace
> >>>>>>> period and buffer to the stream side of the join to allow
> processing
> >>> in
> >>>>>>> timestamp order matching the recent improvements of the versioned
> >>>> tables.
> >>>>>>>
> >>>>>>> Please take a look here <
> >>> https://cwiki.apache.org/confluence/x/lAs0Dw>
> >>>>>> and
> >>>>>>> share your thoughts.
> >>>>>>>
> >>>>>>> best,
> >>>>>>> Walker
> >>>>>>
> >>>>>
> >>>>
> >>>
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by "Matthias J. Sax" <mj...@apache.org>.

Walker,

thanks for the updates. The KIP itself reads fine (of course Victoria 
made good comments about some phrases), but there is a couple of things 
from your latest reply I don't understand, and that I still think need 
some more discussions.

Lukas, asked about in-memory option and `WindowStoreSupplier` and you 
mention "semantic concerns". There should not be any semantic difference 
from the underlying buffer implementation, so I am not sure what you 
mean here (also the relationship to suppress() is unclear to me)? -- I 
am ok to not make it configurable for now. We can always do it via a 
follow up KIP, and keep interface changes limited for now.

Does it really make sense to allow a grace period if the table is 
non-versioned? You also say: "If table is not materialized it will 
materialize it as versioned." -- What history retention time would we 
pick for this case (also asked by Victoria)? Or should we rather not 
support this and force the user to materialize the table explicitly, and 
thus explicitly picking a history retention time? It's tradeoff between 
usability and guiding uses that there will be a significant impact on 
disk usage. There is also compatibility concerns: If the table is not 
explicitly materialized in the old program, we would already need to 
materialize it also in the old program (of course, we would use a 
non-versioned store so far). Thus, if somebody adds a grace period, we 
cannot just switch the store type, as it would be a breaking change, 
potentially required an application re-set, or following the upgrade 
path for versioned state stores, and also changing the program to 
explicitly materialize using a versioned store. Also note, that we might 
not materialize the actual join table, but only an upstream table, and 
use `ValueGetter` to access the upstream data.

To this end, as you already mentioned, history retention of the table 
should be at least grace period. You proposed to include this in a 
follow up KIP, but I am wondering if it's a fundamental requirement and 
thus we should put a check in place right away and reject an invalid 
configuration? (It always easier to lift restriction than to introduce 
them later.) This would also imply that a non-versioned table cannot be 
supported, because it does not have a history retention that is larger 
than grace period, and maybe also answer the requirement about 
materialization: as we already always materialize something on the 
tablet side as non-versioned store right now, it seems difficult to 
migrate the store to a versioned store. Ie, it might be ok to push the 
burden onto the user and say: if you start using grace period, you also 
need to manually switch from non-versioned to versioned KTables. Doing 
stuff automatically under the hood if very complex for this case, we if 
we push the burden onto the user, it might be ok to not complicate this 
KIP significantly.

To summarize the last two paragraphs, I would propose to:
  - don't support non-versioned KTables
  - if grace period is added, users need to explicitly materialize the 
table as version (either directly, or upstream. Upstream only works if 
downstream tables "inherit" versioned semantics -- cf KIP-914)
  - the table's history retention time must be larger than the grace 
period (should be easy to check at runtime, when we build the topology)
  - because switching from non-versioned to version stores is not 
backward compatibly (cf KIP-914), users need to take care of this 
themselves, and this also implies that adding grace period is not a 
backward compatible change (even only if via indirect means)

About dropping late records: wondering if we should never drop a 
stream-side record for a left-join, even if it's late? In general, one 
thing I observed over the years is, that it's easier to keep stuff and 
let users filter explicitly downstream (or make it configurable), 
instead of dropping pro-actively, because users have no good way to 
resurrect record that got already dropped.

For ordering, sounds reasonable to me only start with one 
implementation, and maybe make it configurable as a follow up. However, 
I am wondering if starting with offset order might be the better option 
as it seems to align more with what we do so far? So instead of storing 
record ordered by timestamp, we can just store them ordered by offset, 
and still "poll" from the buffer based on the head records timestamp. Or 
would this complicate the implementation significantly?

I also think it's ok to not "sync" stream-time between the table and the 
stream in this KIP, but we should consider doing this as a follow up 
change (not sure if we would need a KIP or not for a change this this).

About increasing/decreasing grace period: what you describe make sense 
to me. If decreased, the next record would just trigger emitting a lot 
of records, and for increase, the buffer would just need to "fill up" 
again. For reprocessing getting a different result with a different 
grace period is expected, so that's ok IMHO. -- There seems to be one 
special corner case: grace period zero. For this case, we actually don't 
need any store, and the stream-side could be stateless. I think it can 
have the same behavior, but if we want to "add / remove" the store 
dynamically, we need to add specific code for it. For example, even if 
we start up with a grace period of zero, we would need to check if there 
is a local store, and still emit everything in it, before we can ditch 
the store (not sure if that's even easily done at all). Or: we would 
need to have a store for _all_ cases, even if grace period is zero (the 
store would be empty all the time though), to avoid super complex code?


-Matthias


On 5/25/23 10:53 AM, Lucas Brutschy wrote:
> Hi Walker,
> 
> thanks for your responses. That makes sense. I guess there is always
> the option to make the implementation more configurable later on, if
> users request it. Also thanks for the clarifications. From my side,
> the KIP is good to go.
> 
> Cheers,
> Lucas
> 
> On Wed, May 24, 2023 at 11:54 PM Victoria Xia
> <vi...@confluent.io.invalid> wrote:
>>
>> Thanks for the updates, Walker! Looks great, though I do have a couple
>> questions about the latest updates:
>>
>>     1. The new example says that without stream-side buffering, "ex" and
>>     "fy" are possible join results. How could those join results happen? The
>>     example versioned table suggests that table record "x" has timestamp 2, and
>>     table record "y" has timestamp 3. If stream record "e" has timestamp 1,
>>     then it can never be joined against record "x", and similarly for stream
>>     record "f" with timestamp 2 being joined against "y".
>>     2. I see in your replies above that "If table is not materialized it
>>     will materialize it as versioned" but I don't see this called out in the
>>     KIP -- seems worth calling out. Also, what will the history retention for
>>     the versioned table be? Will it be the same as the join grace period, or
>>     will it be greater?
>>
>>
>> And some additional thoughts:
>>
>> Sounds like there are a few things users should watch out for when enabling
>> the stream-side buffer:
>>
>>     - Records will get "stuck" if there are no newer records to advance
>>     stream time.
>>     - If there are large gaps between the timestamps of stream-side records,
>>     then it's possible that versioned store history retention will have expired
>>     by the time a record is evicted from the join buffer, leading to a join
>>     "miss." For example, if the join grace period and table history retention
>>     are both 10, and records come in the order:
>>
>>     table side t0 with ts=0
>>     stream side s1 with ts=1 <-- enters buffer
>>     table side t10 with ts=10
>>     table side t20 with ts=20
>>     stream side s21 with ts=21 <-- evicts record s1 from buffer, but
>>     versioned store no longer contains data for ts=1 due to history retention
>>     having elapsed
>>
>>     This will result in the join result (s1, null) even though it should've
>>     been (s1, t0), due to t0 having been expired from the versioned store
>>     already.
>>     - Out-of-order records from the stream-side will be reordered, and late
>>     records will be dropped.
>>
>> I don't think any of these are reasons to not go forward with this KIP, but
>> it'd be good to call them out in the eventual documentation to decrease the
>> chance users get tripped up.
>>
>>> We could maybe do an improvement later to advance stream time from table
>> side as well, but that might be debatable as we might get more late records.
>>
>> Yes, the likelihood of late records increases but also the likelihood of
>> "join misses" due to versioned store history retention having elapsed
>> decreases, which feels important for certain use cases. Either way, agreed
>> that it can be a discussion for the future as incorporating this would
>> substantially complicate the implementation.
>>
>> Also a couple nits:
>>
>>     - The KIP currently says "We recently added versioned tables which allow
>>     the table side of the a join [...] but it is not taken advantage of in
>>     joins," but this doesn't seem true? If the table of a stream-table join is
>>     versioned, then the DSL's stream-table join processor will automatically
>>     perform timestamped lookups into the table, in order to take advantage of
>>     the new timestamp-aware store to provide better join semantics.
>>     - The KIP mentions "grace period" for versioned stores in a number of
>>     places but I think you actually mean "history retention"? The two happen to
>>     be the same today (it is not an option for users to configure the two
>>     separately) but this need not be true in the future. "History retention"
>>     governs how far back in time reads may occur, which is the relevant
>>     parameter for performing lookups as part of the stream-table join. "Grace
>>     period" in the context of versioned stores refers to how far back in time
>>     out-of-order writes may occur, which probably isn't directly relevant for
>>     introducing a stream-side buffer, though it's also possible I've overlooked
>>     something. (As a bonus, switching from "table grace period" in the KIP to
>>     "table history retention" also helps to clarify/distinguish that it's a
>>     different parameter from the "join grace period," which I could see being
>>     confusing to readers. :) )
>>
>>
>> Cheers,
>> Victoria
>>
>> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
>> <wc...@confluent.io.invalid> wrote:
>>
>>> Hey all,
>>>
>>> Thanks for the comments, they gave me a lot to think about. I'll try to
>>> address them all inorder. I have made some updates to the kip related to
>>> them, but I mention where below.
>>>
>>> Lucas
>>>
>>> Good idea about the example. I added a simple one.
>>>
>>> 1) I have thought about including options for the underlying buffer
>>> configuration. One of which might be adding an in memory option. My biggest
>>> concern is about the semantic guarantees. This isn't like suppress or with
>>> windows where producing incomplete results is repetitively harmless. Here
>>> we would be possibly producing incorrect results. I also would like to keep
>>> the interface changes as simple as I can. Making more than this change to
>>> Joined I feel could make this more complicated than it needs to be. If we
>>> really want to I could see adding a grace() option with a BufferConifg in
>>> there or something, but I would rather not.
>>>
>>> 2) The buffer will be independent of if the table is versioned or not. If
>>> table is not materialized it will materialize it as versioned. It might
>>> make sense to do a follow up kip where we force the retention period  of
>>> the versioned to be greater than whatever the max of the stream buffer is.
>>>
>>> Victoria
>>>
>>> 1) Yes, records will exit in timestamp order not in offset order.
>>> 2) Late records will be dropped (Late as out of the grace period). From my
>>> understanding that is the point of a grace period, no? Doesn't the same
>>> thing happen with versioned stores?
>>> 3) The segment store already has an observed stream time, we advance based
>>> on that. That should only advance based on records that enter the store. So
>>> yes, only stream side records. We could maybe do an improvement later to
>>> advance stream time from table side as well, but that might be debatable as
>>> we might get more late records. Anyways I would rather have that as a
>>> separate discussion.
>>>
>>> in memory option? We can do that, for the buffer I plan to use the
>>> TimeOrderedKeyValueBuffer interface which already has an in memory
>>> implantation, so it would be simple.
>>>
>>> I said more in my answer to Lucas's question. The concern I have with
>>> buffer configs or in memory is complicating the interface. Also semantic
>>> guarantees but in memory shouldn't effect that
>>>
>>> Matthias
>>>
>>> 1) fixed out of order vs late terminology in the kip.
>>>
>>> 2) I was referring to having a stream. So after this kip we can have a
>>> buffered stream or a normal one. For the table we can use a versioned table
>>> or a normal table.
>>>
>>> 3 Good call out. I clarified this as "If the table side uses a materialized
>>> version store, it can store multiple versions of each record within its
>>> defined grace period." and modified the rest of the paragraph a bit.
>>>
>>> 4) I get the preserving off offset ordering, but if the stream is buffered
>>> to join on timestamp instead of offset doesn't it already seem like we care
>>> more about time in this case?
>>>
>>> If we end up adding more options it might make sense to do this. Maybe
>>> offset order processing can be a follow up?
>>>
>>> I'll add a section for this in Rejected Alternatives. I think it makes
>>> sense to do something like this but maybe in a follow up.
>>>
>>> 5) I hadn't thought about this. I suppose if they changed this in an
>>> upgrade the next record would either evict a lot of records (if the grace
>>> period decreased) or there would be a pause until the new grace period
>>> reached. Increasing is a bit more problematic, especially if the table
>>> grace period and retention time stays the same. If the data is reprocessed
>>> after a change like that then there would be different results, but I feel
>>> like that would be expected after such a change.
>>>
>>> What do you think should happen?
>>>
>>> Hopefully this answers your questions!
>>>
>>> Walker
>>>
>>> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org> wrote:
>>>
>>>> Thanks for the KIP! Also some question/comments from my side:
>>>>
>>>> 10) Notation: you use the term "late data" but I think you mean
>>>> out-of-order. We reserve the term "late" to records that arrive after
>>>> grace period passed, and thus, "late == out-of-order data that is
>>> dropped".
>>>>
>>>>
>>>> 20) "There is only one option from the stream side and only recently is
>>>> there a second option on the table side."
>>>>
>>>> What are those options? Victoria already asked about the table side, but
>>>> I am also not sure what option you mean for the stream side?
>>>>
>>>>
>>>> 30) "If the table side uses a materialized version store the value is
>>>> the latest by stream time rather than by offset within its defined grace
>>>> period."
>>>>
>>>> The phrase "the value is the latest by stream time" is confusing -- in
>>>> the end, a versioned stores multiple versions, not just one.
>>>>
>>>>
>>>> 40) I am also wondering about ordering. In general, KS tries to preserve
>>>> offset-order during processing (with some exception, when offset order
>>>> preservation is not clearly defined). Given that the stream-side buffer
>>>> is really just a "linear buffer", we could easily preserve offset-order.
>>>> But I also see a benefit of re-ordering and emitting out-of-order data
>>>> right away when read (instead of blocking them behind in-order records
>>>> that are not ready yet). -- It might even be a possibility, to let users
>>>> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
>>>> placeholder).
>>>>
>>>> The KIP should explain this in more detail and also discuss different
>>>> options and mention them in "Rejected alternatives" in case we don't
>>>> want to include them.
>>>>
>>>>
>>>> 50) What happens when users change the grace period? Especially, when
>>>> they turn it on/off (but also increasing/decreasing is an interesting
>>>> point)? I think we should try to support this if possible; the
>>>> "Compatibility" section needs to cover switching on/off in more detail.
>>>>
>>>>
>>>> -Matthias
>>>>
>>>>
>>>>
>>>>
>>>> On 5/2/23 2:06 PM, Victoria Xia wrote:
>>>>> Cool KIP, Walker! Thanks for sharing this proposal.
>>>>>
>>>>> A few clarifications:
>>>>>
>>>>> 1. Is the order that records exit the buffer in necessarily the same as
>>>> the
>>>>> order that records enter the buffer in, or no? Based on the description
>>>> in
>>>>> the KIP, it sounds like the answer is no, i.e., records will exit the
>>>>> buffer in increasing timestamp order, which means that they may be
>>>> ordered
>>>>> (even for the same key) compared to the input order.
>>>>>
>>>>> 2. What happens if the join grace period is nonzero, and a stream-side
>>>>> record arrives with a timestamp that is older than the current stream
>>>> time
>>>>> minus the grace period? Will this record trigger a join result, or will
>>>> it
>>>>> be dropped? Based on the description for what happens when the join
>>> grace
>>>>> period is set to zero, it sounds like the late record will be dropped,
>>>> even
>>>>> if the join grace period is nonzero. Is that true?
>>>>>
>>>>> 3. What could cause stream time to advance, for purposes of removing
>>>>> records from the join buffer? For example, will new records arriving on
>>>> the
>>>>> table side of the join cause stream time to advance? From the KIP it
>>>> sounds
>>>>> like only stream-side records will advance stream time -- does that
>>> mean
>>>>> that the join processor itself will have to track this stream time?
>>>>>
>>>>> Also +1 to Lucas's question about what options will be available for
>>>>> configuring the join buffer. Will users have the option to choose
>>> whether
>>>>> they want the buffer to be in-memory vs persistent?
>>>>>
>>>>> - Victoria
>>>>>
>>>>> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
>>>>> <lb...@confluent.io.invalid> wrote:
>>>>>
>>>>>> HI Walker,
>>>>>>
>>>>>> thanks for the KIP! We definitely need this. I have two questions:
>>>>>>
>>>>>>    - Have you considered allowing the customization of the underlying
>>>>>> buffer implementation? As I can see, `StreamJoined` lets you customize
>>>>>> the underlying store via a `WindowStoreSupplier`. Would it make sense
>>>>>> for `Joined` to have this as well? I can imagine one may want to limit
>>>>>> the number of records in the buffer, for example. If we hit the
>>>>>> maximum, the only option would be to drop semantic guarantees, but
>>>>>> users may still want to do this.
>>>>>>    - With "second option on the table side" you are referring to
>>>>>> versioned tables, right? Will the buffer on the stream side behave any
>>>>>> different whether the table side is versioned or not?
>>>>>>
>>>>>> Finally, I think a simple example in the motivation section could help
>>>>>> non-experts understand the KIP.
>>>>>>
>>>>>> Best,
>>>>>> Lucas
>>>>>>
>>>>>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
>>>>>> <wc...@confluent.io.invalid> wrote:
>>>>>>>
>>>>>>> Hello everybody,
>>>>>>>
>>>>>>> I have a stream proposal to improve the stream table join by adding a
>>>>>> grace
>>>>>>> period and buffer to the stream side of the join to allow processing
>>> in
>>>>>>> timestamp order matching the recent improvements of the versioned
>>>> tables.
>>>>>>>
>>>>>>> Please take a look here <
>>> https://cwiki.apache.org/confluence/x/lAs0Dw>
>>>>>> and
>>>>>>> share your thoughts.
>>>>>>>
>>>>>>> best,
>>>>>>> Walker
>>>>>>
>>>>>
>>>>
>>>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Lucas Brutschy <lb...@confluent.io.INVALID>.

Hi Walker,

thanks for your responses. That makes sense. I guess there is always
the option to make the implementation more configurable later on, if
users request it. Also thanks for the clarifications. From my side,
the KIP is good to go.

Cheers,
Lucas

On Wed, May 24, 2023 at 11:54 PM Victoria Xia
<vi...@confluent.io.invalid> wrote:
>
> Thanks for the updates, Walker! Looks great, though I do have a couple
> questions about the latest updates:
>
>    1. The new example says that without stream-side buffering, "ex" and
>    "fy" are possible join results. How could those join results happen? The
>    example versioned table suggests that table record "x" has timestamp 2, and
>    table record "y" has timestamp 3. If stream record "e" has timestamp 1,
>    then it can never be joined against record "x", and similarly for stream
>    record "f" with timestamp 2 being joined against "y".
>    2. I see in your replies above that "If table is not materialized it
>    will materialize it as versioned" but I don't see this called out in the
>    KIP -- seems worth calling out. Also, what will the history retention for
>    the versioned table be? Will it be the same as the join grace period, or
>    will it be greater?
>
>
> And some additional thoughts:
>
> Sounds like there are a few things users should watch out for when enabling
> the stream-side buffer:
>
>    - Records will get "stuck" if there are no newer records to advance
>    stream time.
>    - If there are large gaps between the timestamps of stream-side records,
>    then it's possible that versioned store history retention will have expired
>    by the time a record is evicted from the join buffer, leading to a join
>    "miss." For example, if the join grace period and table history retention
>    are both 10, and records come in the order:
>
>    table side t0 with ts=0
>    stream side s1 with ts=1 <-- enters buffer
>    table side t10 with ts=10
>    table side t20 with ts=20
>    stream side s21 with ts=21 <-- evicts record s1 from buffer, but
>    versioned store no longer contains data for ts=1 due to history retention
>    having elapsed
>
>    This will result in the join result (s1, null) even though it should've
>    been (s1, t0), due to t0 having been expired from the versioned store
>    already.
>    - Out-of-order records from the stream-side will be reordered, and late
>    records will be dropped.
>
> I don't think any of these are reasons to not go forward with this KIP, but
> it'd be good to call them out in the eventual documentation to decrease the
> chance users get tripped up.
>
> > We could maybe do an improvement later to advance stream time from table
> side as well, but that might be debatable as we might get more late records.
>
> Yes, the likelihood of late records increases but also the likelihood of
> "join misses" due to versioned store history retention having elapsed
> decreases, which feels important for certain use cases. Either way, agreed
> that it can be a discussion for the future as incorporating this would
> substantially complicate the implementation.
>
> Also a couple nits:
>
>    - The KIP currently says "We recently added versioned tables which allow
>    the table side of the a join [...] but it is not taken advantage of in
>    joins," but this doesn't seem true? If the table of a stream-table join is
>    versioned, then the DSL's stream-table join processor will automatically
>    perform timestamped lookups into the table, in order to take advantage of
>    the new timestamp-aware store to provide better join semantics.
>    - The KIP mentions "grace period" for versioned stores in a number of
>    places but I think you actually mean "history retention"? The two happen to
>    be the same today (it is not an option for users to configure the two
>    separately) but this need not be true in the future. "History retention"
>    governs how far back in time reads may occur, which is the relevant
>    parameter for performing lookups as part of the stream-table join. "Grace
>    period" in the context of versioned stores refers to how far back in time
>    out-of-order writes may occur, which probably isn't directly relevant for
>    introducing a stream-side buffer, though it's also possible I've overlooked
>    something. (As a bonus, switching from "table grace period" in the KIP to
>    "table history retention" also helps to clarify/distinguish that it's a
>    different parameter from the "join grace period," which I could see being
>    confusing to readers. :) )
>
>
> Cheers,
> Victoria
>
> On Thu, May 18, 2023 at 1:43 PM Walker Carlson
> <wc...@confluent.io.invalid> wrote:
>
> > Hey all,
> >
> > Thanks for the comments, they gave me a lot to think about. I'll try to
> > address them all inorder. I have made some updates to the kip related to
> > them, but I mention where below.
> >
> > Lucas
> >
> > Good idea about the example. I added a simple one.
> >
> > 1) I have thought about including options for the underlying buffer
> > configuration. One of which might be adding an in memory option. My biggest
> > concern is about the semantic guarantees. This isn't like suppress or with
> > windows where producing incomplete results is repetitively harmless. Here
> > we would be possibly producing incorrect results. I also would like to keep
> > the interface changes as simple as I can. Making more than this change to
> > Joined I feel could make this more complicated than it needs to be. If we
> > really want to I could see adding a grace() option with a BufferConifg in
> > there or something, but I would rather not.
> >
> > 2) The buffer will be independent of if the table is versioned or not. If
> > table is not materialized it will materialize it as versioned. It might
> > make sense to do a follow up kip where we force the retention period  of
> > the versioned to be greater than whatever the max of the stream buffer is.
> >
> > Victoria
> >
> > 1) Yes, records will exit in timestamp order not in offset order.
> > 2) Late records will be dropped (Late as out of the grace period). From my
> > understanding that is the point of a grace period, no? Doesn't the same
> > thing happen with versioned stores?
> > 3) The segment store already has an observed stream time, we advance based
> > on that. That should only advance based on records that enter the store. So
> > yes, only stream side records. We could maybe do an improvement later to
> > advance stream time from table side as well, but that might be debatable as
> > we might get more late records. Anyways I would rather have that as a
> > separate discussion.
> >
> > in memory option? We can do that, for the buffer I plan to use the
> > TimeOrderedKeyValueBuffer interface which already has an in memory
> > implantation, so it would be simple.
> >
> > I said more in my answer to Lucas's question. The concern I have with
> > buffer configs or in memory is complicating the interface. Also semantic
> > guarantees but in memory shouldn't effect that
> >
> > Matthias
> >
> > 1) fixed out of order vs late terminology in the kip.
> >
> > 2) I was referring to having a stream. So after this kip we can have a
> > buffered stream or a normal one. For the table we can use a versioned table
> > or a normal table.
> >
> > 3 Good call out. I clarified this as "If the table side uses a materialized
> > version store, it can store multiple versions of each record within its
> > defined grace period." and modified the rest of the paragraph a bit.
> >
> > 4) I get the preserving off offset ordering, but if the stream is buffered
> > to join on timestamp instead of offset doesn't it already seem like we care
> > more about time in this case?
> >
> > If we end up adding more options it might make sense to do this. Maybe
> > offset order processing can be a follow up?
> >
> > I'll add a section for this in Rejected Alternatives. I think it makes
> > sense to do something like this but maybe in a follow up.
> >
> > 5) I hadn't thought about this. I suppose if they changed this in an
> > upgrade the next record would either evict a lot of records (if the grace
> > period decreased) or there would be a pause until the new grace period
> > reached. Increasing is a bit more problematic, especially if the table
> > grace period and retention time stays the same. If the data is reprocessed
> > after a change like that then there would be different results, but I feel
> > like that would be expected after such a change.
> >
> > What do you think should happen?
> >
> > Hopefully this answers your questions!
> >
> > Walker
> >
> > On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org> wrote:
> >
> > > Thanks for the KIP! Also some question/comments from my side:
> > >
> > > 10) Notation: you use the term "late data" but I think you mean
> > > out-of-order. We reserve the term "late" to records that arrive after
> > > grace period passed, and thus, "late == out-of-order data that is
> > dropped".
> > >
> > >
> > > 20) "There is only one option from the stream side and only recently is
> > > there a second option on the table side."
> > >
> > > What are those options? Victoria already asked about the table side, but
> > > I am also not sure what option you mean for the stream side?
> > >
> > >
> > > 30) "If the table side uses a materialized version store the value is
> > > the latest by stream time rather than by offset within its defined grace
> > > period."
> > >
> > > The phrase "the value is the latest by stream time" is confusing -- in
> > > the end, a versioned stores multiple versions, not just one.
> > >
> > >
> > > 40) I am also wondering about ordering. In general, KS tries to preserve
> > > offset-order during processing (with some exception, when offset order
> > > preservation is not clearly defined). Given that the stream-side buffer
> > > is really just a "linear buffer", we could easily preserve offset-order.
> > > But I also see a benefit of re-ordering and emitting out-of-order data
> > > right away when read (instead of blocking them behind in-order records
> > > that are not ready yet). -- It might even be a possibility, to let users
> > > pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
> > > placeholder).
> > >
> > > The KIP should explain this in more detail and also discuss different
> > > options and mention them in "Rejected alternatives" in case we don't
> > > want to include them.
> > >
> > >
> > > 50) What happens when users change the grace period? Especially, when
> > > they turn it on/off (but also increasing/decreasing is an interesting
> > > point)? I think we should try to support this if possible; the
> > > "Compatibility" section needs to cover switching on/off in more detail.
> > >
> > >
> > > -Matthias
> > >
> > >
> > >
> > >
> > > On 5/2/23 2:06 PM, Victoria Xia wrote:
> > > > Cool KIP, Walker! Thanks for sharing this proposal.
> > > >
> > > > A few clarifications:
> > > >
> > > > 1. Is the order that records exit the buffer in necessarily the same as
> > > the
> > > > order that records enter the buffer in, or no? Based on the description
> > > in
> > > > the KIP, it sounds like the answer is no, i.e., records will exit the
> > > > buffer in increasing timestamp order, which means that they may be
> > > ordered
> > > > (even for the same key) compared to the input order.
> > > >
> > > > 2. What happens if the join grace period is nonzero, and a stream-side
> > > > record arrives with a timestamp that is older than the current stream
> > > time
> > > > minus the grace period? Will this record trigger a join result, or will
> > > it
> > > > be dropped? Based on the description for what happens when the join
> > grace
> > > > period is set to zero, it sounds like the late record will be dropped,
> > > even
> > > > if the join grace period is nonzero. Is that true?
> > > >
> > > > 3. What could cause stream time to advance, for purposes of removing
> > > > records from the join buffer? For example, will new records arriving on
> > > the
> > > > table side of the join cause stream time to advance? From the KIP it
> > > sounds
> > > > like only stream-side records will advance stream time -- does that
> > mean
> > > > that the join processor itself will have to track this stream time?
> > > >
> > > > Also +1 to Lucas's question about what options will be available for
> > > > configuring the join buffer. Will users have the option to choose
> > whether
> > > > they want the buffer to be in-memory vs persistent?
> > > >
> > > > - Victoria
> > > >
> > > > On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> > > > <lb...@confluent.io.invalid> wrote:
> > > >
> > > >> HI Walker,
> > > >>
> > > >> thanks for the KIP! We definitely need this. I have two questions:
> > > >>
> > > >>   - Have you considered allowing the customization of the underlying
> > > >> buffer implementation? As I can see, `StreamJoined` lets you customize
> > > >> the underlying store via a `WindowStoreSupplier`. Would it make sense
> > > >> for `Joined` to have this as well? I can imagine one may want to limit
> > > >> the number of records in the buffer, for example. If we hit the
> > > >> maximum, the only option would be to drop semantic guarantees, but
> > > >> users may still want to do this.
> > > >>   - With "second option on the table side" you are referring to
> > > >> versioned tables, right? Will the buffer on the stream side behave any
> > > >> different whether the table side is versioned or not?
> > > >>
> > > >> Finally, I think a simple example in the motivation section could help
> > > >> non-experts understand the KIP.
> > > >>
> > > >> Best,
> > > >> Lucas
> > > >>
> > > >> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> > > >> <wc...@confluent.io.invalid> wrote:
> > > >>>
> > > >>> Hello everybody,
> > > >>>
> > > >>> I have a stream proposal to improve the stream table join by adding a
> > > >> grace
> > > >>> period and buffer to the stream side of the join to allow processing
> > in
> > > >>> timestamp order matching the recent improvements of the versioned
> > > tables.
> > > >>>
> > > >>> Please take a look here <
> > https://cwiki.apache.org/confluence/x/lAs0Dw>
> > > >> and
> > > >>> share your thoughts.
> > > >>>
> > > >>> best,
> > > >>> Walker
> > > >>
> > > >
> > >
> >

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Victoria Xia <vi...@confluent.io.INVALID>.

Thanks for the updates, Walker! Looks great, though I do have a couple
questions about the latest updates:

   1. The new example says that without stream-side buffering, "ex" and
   "fy" are possible join results. How could those join results happen? The
   example versioned table suggests that table record "x" has timestamp 2, and
   table record "y" has timestamp 3. If stream record "e" has timestamp 1,
   then it can never be joined against record "x", and similarly for stream
   record "f" with timestamp 2 being joined against "y".
   2. I see in your replies above that "If table is not materialized it
   will materialize it as versioned" but I don't see this called out in the
   KIP -- seems worth calling out. Also, what will the history retention for
   the versioned table be? Will it be the same as the join grace period, or
   will it be greater?


And some additional thoughts:

Sounds like there are a few things users should watch out for when enabling
the stream-side buffer:

   - Records will get "stuck" if there are no newer records to advance
   stream time.
   - If there are large gaps between the timestamps of stream-side records,
   then it's possible that versioned store history retention will have expired
   by the time a record is evicted from the join buffer, leading to a join
   "miss." For example, if the join grace period and table history retention
   are both 10, and records come in the order:

   table side t0 with ts=0
   stream side s1 with ts=1 <-- enters buffer
   table side t10 with ts=10
   table side t20 with ts=20
   stream side s21 with ts=21 <-- evicts record s1 from buffer, but
   versioned store no longer contains data for ts=1 due to history retention
   having elapsed

   This will result in the join result (s1, null) even though it should've
   been (s1, t0), due to t0 having been expired from the versioned store
   already.
   - Out-of-order records from the stream-side will be reordered, and late
   records will be dropped.

I don't think any of these are reasons to not go forward with this KIP, but
it'd be good to call them out in the eventual documentation to decrease the
chance users get tripped up.

> We could maybe do an improvement later to advance stream time from table
side as well, but that might be debatable as we might get more late records.

Yes, the likelihood of late records increases but also the likelihood of
"join misses" due to versioned store history retention having elapsed
decreases, which feels important for certain use cases. Either way, agreed
that it can be a discussion for the future as incorporating this would
substantially complicate the implementation.

Also a couple nits:

   - The KIP currently says "We recently added versioned tables which allow
   the table side of the a join [...] but it is not taken advantage of in
   joins," but this doesn't seem true? If the table of a stream-table join is
   versioned, then the DSL's stream-table join processor will automatically
   perform timestamped lookups into the table, in order to take advantage of
   the new timestamp-aware store to provide better join semantics.
   - The KIP mentions "grace period" for versioned stores in a number of
   places but I think you actually mean "history retention"? The two happen to
   be the same today (it is not an option for users to configure the two
   separately) but this need not be true in the future. "History retention"
   governs how far back in time reads may occur, which is the relevant
   parameter for performing lookups as part of the stream-table join. "Grace
   period" in the context of versioned stores refers to how far back in time
   out-of-order writes may occur, which probably isn't directly relevant for
   introducing a stream-side buffer, though it's also possible I've overlooked
   something. (As a bonus, switching from "table grace period" in the KIP to
   "table history retention" also helps to clarify/distinguish that it's a
   different parameter from the "join grace period," which I could see being
   confusing to readers. :) )


Cheers,
Victoria

On Thu, May 18, 2023 at 1:43 PM Walker Carlson
<wc...@confluent.io.invalid> wrote:

> Hey all,
>
> Thanks for the comments, they gave me a lot to think about. I'll try to
> address them all inorder. I have made some updates to the kip related to
> them, but I mention where below.
>
> Lucas
>
> Good idea about the example. I added a simple one.
>
> 1) I have thought about including options for the underlying buffer
> configuration. One of which might be adding an in memory option. My biggest
> concern is about the semantic guarantees. This isn't like suppress or with
> windows where producing incomplete results is repetitively harmless. Here
> we would be possibly producing incorrect results. I also would like to keep
> the interface changes as simple as I can. Making more than this change to
> Joined I feel could make this more complicated than it needs to be. If we
> really want to I could see adding a grace() option with a BufferConifg in
> there or something, but I would rather not.
>
> 2) The buffer will be independent of if the table is versioned or not. If
> table is not materialized it will materialize it as versioned. It might
> make sense to do a follow up kip where we force the retention period  of
> the versioned to be greater than whatever the max of the stream buffer is.
>
> Victoria
>
> 1) Yes, records will exit in timestamp order not in offset order.
> 2) Late records will be dropped (Late as out of the grace period). From my
> understanding that is the point of a grace period, no? Doesn't the same
> thing happen with versioned stores?
> 3) The segment store already has an observed stream time, we advance based
> on that. That should only advance based on records that enter the store. So
> yes, only stream side records. We could maybe do an improvement later to
> advance stream time from table side as well, but that might be debatable as
> we might get more late records. Anyways I would rather have that as a
> separate discussion.
>
> in memory option? We can do that, for the buffer I plan to use the
> TimeOrderedKeyValueBuffer interface which already has an in memory
> implantation, so it would be simple.
>
> I said more in my answer to Lucas's question. The concern I have with
> buffer configs or in memory is complicating the interface. Also semantic
> guarantees but in memory shouldn't effect that
>
> Matthias
>
> 1) fixed out of order vs late terminology in the kip.
>
> 2) I was referring to having a stream. So after this kip we can have a
> buffered stream or a normal one. For the table we can use a versioned table
> or a normal table.
>
> 3 Good call out. I clarified this as "If the table side uses a materialized
> version store, it can store multiple versions of each record within its
> defined grace period." and modified the rest of the paragraph a bit.
>
> 4) I get the preserving off offset ordering, but if the stream is buffered
> to join on timestamp instead of offset doesn't it already seem like we care
> more about time in this case?
>
> If we end up adding more options it might make sense to do this. Maybe
> offset order processing can be a follow up?
>
> I'll add a section for this in Rejected Alternatives. I think it makes
> sense to do something like this but maybe in a follow up.
>
> 5) I hadn't thought about this. I suppose if they changed this in an
> upgrade the next record would either evict a lot of records (if the grace
> period decreased) or there would be a pause until the new grace period
> reached. Increasing is a bit more problematic, especially if the table
> grace period and retention time stays the same. If the data is reprocessed
> after a change like that then there would be different results, but I feel
> like that would be expected after such a change.
>
> What do you think should happen?
>
> Hopefully this answers your questions!
>
> Walker
>
> On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org> wrote:
>
> > Thanks for the KIP! Also some question/comments from my side:
> >
> > 10) Notation: you use the term "late data" but I think you mean
> > out-of-order. We reserve the term "late" to records that arrive after
> > grace period passed, and thus, "late == out-of-order data that is
> dropped".
> >
> >
> > 20) "There is only one option from the stream side and only recently is
> > there a second option on the table side."
> >
> > What are those options? Victoria already asked about the table side, but
> > I am also not sure what option you mean for the stream side?
> >
> >
> > 30) "If the table side uses a materialized version store the value is
> > the latest by stream time rather than by offset within its defined grace
> > period."
> >
> > The phrase "the value is the latest by stream time" is confusing -- in
> > the end, a versioned stores multiple versions, not just one.
> >
> >
> > 40) I am also wondering about ordering. In general, KS tries to preserve
> > offset-order during processing (with some exception, when offset order
> > preservation is not clearly defined). Given that the stream-side buffer
> > is really just a "linear buffer", we could easily preserve offset-order.
> > But I also see a benefit of re-ordering and emitting out-of-order data
> > right away when read (instead of blocking them behind in-order records
> > that are not ready yet). -- It might even be a possibility, to let users
> > pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
> > placeholder).
> >
> > The KIP should explain this in more detail and also discuss different
> > options and mention them in "Rejected alternatives" in case we don't
> > want to include them.
> >
> >
> > 50) What happens when users change the grace period? Especially, when
> > they turn it on/off (but also increasing/decreasing is an interesting
> > point)? I think we should try to support this if possible; the
> > "Compatibility" section needs to cover switching on/off in more detail.
> >
> >
> > -Matthias
> >
> >
> >
> >
> > On 5/2/23 2:06 PM, Victoria Xia wrote:
> > > Cool KIP, Walker! Thanks for sharing this proposal.
> > >
> > > A few clarifications:
> > >
> > > 1. Is the order that records exit the buffer in necessarily the same as
> > the
> > > order that records enter the buffer in, or no? Based on the description
> > in
> > > the KIP, it sounds like the answer is no, i.e., records will exit the
> > > buffer in increasing timestamp order, which means that they may be
> > ordered
> > > (even for the same key) compared to the input order.
> > >
> > > 2. What happens if the join grace period is nonzero, and a stream-side
> > > record arrives with a timestamp that is older than the current stream
> > time
> > > minus the grace period? Will this record trigger a join result, or will
> > it
> > > be dropped? Based on the description for what happens when the join
> grace
> > > period is set to zero, it sounds like the late record will be dropped,
> > even
> > > if the join grace period is nonzero. Is that true?
> > >
> > > 3. What could cause stream time to advance, for purposes of removing
> > > records from the join buffer? For example, will new records arriving on
> > the
> > > table side of the join cause stream time to advance? From the KIP it
> > sounds
> > > like only stream-side records will advance stream time -- does that
> mean
> > > that the join processor itself will have to track this stream time?
> > >
> > > Also +1 to Lucas's question about what options will be available for
> > > configuring the join buffer. Will users have the option to choose
> whether
> > > they want the buffer to be in-memory vs persistent?
> > >
> > > - Victoria
> > >
> > > On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> > > <lb...@confluent.io.invalid> wrote:
> > >
> > >> HI Walker,
> > >>
> > >> thanks for the KIP! We definitely need this. I have two questions:
> > >>
> > >>   - Have you considered allowing the customization of the underlying
> > >> buffer implementation? As I can see, `StreamJoined` lets you customize
> > >> the underlying store via a `WindowStoreSupplier`. Would it make sense
> > >> for `Joined` to have this as well? I can imagine one may want to limit
> > >> the number of records in the buffer, for example. If we hit the
> > >> maximum, the only option would be to drop semantic guarantees, but
> > >> users may still want to do this.
> > >>   - With "second option on the table side" you are referring to
> > >> versioned tables, right? Will the buffer on the stream side behave any
> > >> different whether the table side is versioned or not?
> > >>
> > >> Finally, I think a simple example in the motivation section could help
> > >> non-experts understand the KIP.
> > >>
> > >> Best,
> > >> Lucas
> > >>
> > >> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> > >> <wc...@confluent.io.invalid> wrote:
> > >>>
> > >>> Hello everybody,
> > >>>
> > >>> I have a stream proposal to improve the stream table join by adding a
> > >> grace
> > >>> period and buffer to the stream side of the join to allow processing
> in
> > >>> timestamp order matching the recent improvements of the versioned
> > tables.
> > >>>
> > >>> Please take a look here <
> https://cwiki.apache.org/confluence/x/lAs0Dw>
> > >> and
> > >>> share your thoughts.
> > >>>
> > >>> best,
> > >>> Walker
> > >>
> > >
> >
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by Walker Carlson <wc...@confluent.io.INVALID>.

Hey all,

Thanks for the comments, they gave me a lot to think about. I'll try to
address them all inorder. I have made some updates to the kip related to
them, but I mention where below.

Lucas

Good idea about the example. I added a simple one.

1) I have thought about including options for the underlying buffer
configuration. One of which might be adding an in memory option. My biggest
concern is about the semantic guarantees. This isn't like suppress or with
windows where producing incomplete results is repetitively harmless. Here
we would be possibly producing incorrect results. I also would like to keep
the interface changes as simple as I can. Making more than this change to
Joined I feel could make this more complicated than it needs to be. If we
really want to I could see adding a grace() option with a BufferConifg in
there or something, but I would rather not.

2) The buffer will be independent of if the table is versioned or not. If
table is not materialized it will materialize it as versioned. It might
make sense to do a follow up kip where we force the retention period  of
the versioned to be greater than whatever the max of the stream buffer is.

Victoria

1) Yes, records will exit in timestamp order not in offset order.
2) Late records will be dropped (Late as out of the grace period). From my
understanding that is the point of a grace period, no? Doesn't the same
thing happen with versioned stores?
3) The segment store already has an observed stream time, we advance based
on that. That should only advance based on records that enter the store. So
yes, only stream side records. We could maybe do an improvement later to
advance stream time from table side as well, but that might be debatable as
we might get more late records. Anyways I would rather have that as a
separate discussion.

in memory option? We can do that, for the buffer I plan to use the
TimeOrderedKeyValueBuffer interface which already has an in memory
implantation, so it would be simple.

I said more in my answer to Lucas's question. The concern I have with
buffer configs or in memory is complicating the interface. Also semantic
guarantees but in memory shouldn't effect that

Matthias

1) fixed out of order vs late terminology in the kip.

2) I was referring to having a stream. So after this kip we can have a
buffered stream or a normal one. For the table we can use a versioned table
or a normal table.

3 Good call out. I clarified this as "If the table side uses a materialized
version store, it can store multiple versions of each record within its
defined grace period." and modified the rest of the paragraph a bit.

4) I get the preserving off offset ordering, but if the stream is buffered
to join on timestamp instead of offset doesn't it already seem like we care
more about time in this case?

If we end up adding more options it might make sense to do this. Maybe
offset order processing can be a follow up?

I'll add a section for this in Rejected Alternatives. I think it makes
sense to do something like this but maybe in a follow up.

5) I hadn't thought about this. I suppose if they changed this in an
upgrade the next record would either evict a lot of records (if the grace
period decreased) or there would be a pause until the new grace period
reached. Increasing is a bit more problematic, especially if the table
grace period and retention time stays the same. If the data is reprocessed
after a change like that then there would be different results, but I feel
like that would be expected after such a change.

What do you think should happen?

Hopefully this answers your questions!

Walker

On Mon, May 8, 2023 at 11:32 AM Matthias J. Sax <mj...@apache.org> wrote:

> Thanks for the KIP! Also some question/comments from my side:
>
> 10) Notation: you use the term "late data" but I think you mean
> out-of-order. We reserve the term "late" to records that arrive after
> grace period passed, and thus, "late == out-of-order data that is dropped".
>
>
> 20) "There is only one option from the stream side and only recently is
> there a second option on the table side."
>
> What are those options? Victoria already asked about the table side, but
> I am also not sure what option you mean for the stream side?
>
>
> 30) "If the table side uses a materialized version store the value is
> the latest by stream time rather than by offset within its defined grace
> period."
>
> The phrase "the value is the latest by stream time" is confusing -- in
> the end, a versioned stores multiple versions, not just one.
>
>
> 40) I am also wondering about ordering. In general, KS tries to preserve
> offset-order during processing (with some exception, when offset order
> preservation is not clearly defined). Given that the stream-side buffer
> is really just a "linear buffer", we could easily preserve offset-order.
> But I also see a benefit of re-ordering and emitting out-of-order data
> right away when read (instead of blocking them behind in-order records
> that are not ready yet). -- It might even be a possibility, to let users
> pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a
> placeholder).
>
> The KIP should explain this in more detail and also discuss different
> options and mention them in "Rejected alternatives" in case we don't
> want to include them.
>
>
> 50) What happens when users change the grace period? Especially, when
> they turn it on/off (but also increasing/decreasing is an interesting
> point)? I think we should try to support this if possible; the
> "Compatibility" section needs to cover switching on/off in more detail.
>
>
> -Matthias
>
>
>
>
> On 5/2/23 2:06 PM, Victoria Xia wrote:
> > Cool KIP, Walker! Thanks for sharing this proposal.
> >
> > A few clarifications:
> >
> > 1. Is the order that records exit the buffer in necessarily the same as
> the
> > order that records enter the buffer in, or no? Based on the description
> in
> > the KIP, it sounds like the answer is no, i.e., records will exit the
> > buffer in increasing timestamp order, which means that they may be
> ordered
> > (even for the same key) compared to the input order.
> >
> > 2. What happens if the join grace period is nonzero, and a stream-side
> > record arrives with a timestamp that is older than the current stream
> time
> > minus the grace period? Will this record trigger a join result, or will
> it
> > be dropped? Based on the description for what happens when the join grace
> > period is set to zero, it sounds like the late record will be dropped,
> even
> > if the join grace period is nonzero. Is that true?
> >
> > 3. What could cause stream time to advance, for purposes of removing
> > records from the join buffer? For example, will new records arriving on
> the
> > table side of the join cause stream time to advance? From the KIP it
> sounds
> > like only stream-side records will advance stream time -- does that mean
> > that the join processor itself will have to track this stream time?
> >
> > Also +1 to Lucas's question about what options will be available for
> > configuring the join buffer. Will users have the option to choose whether
> > they want the buffer to be in-memory vs persistent?
> >
> > - Victoria
> >
> > On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> > <lb...@confluent.io.invalid> wrote:
> >
> >> HI Walker,
> >>
> >> thanks for the KIP! We definitely need this. I have two questions:
> >>
> >>   - Have you considered allowing the customization of the underlying
> >> buffer implementation? As I can see, `StreamJoined` lets you customize
> >> the underlying store via a `WindowStoreSupplier`. Would it make sense
> >> for `Joined` to have this as well? I can imagine one may want to limit
> >> the number of records in the buffer, for example. If we hit the
> >> maximum, the only option would be to drop semantic guarantees, but
> >> users may still want to do this.
> >>   - With "second option on the table side" you are referring to
> >> versioned tables, right? Will the buffer on the stream side behave any
> >> different whether the table side is versioned or not?
> >>
> >> Finally, I think a simple example in the motivation section could help
> >> non-experts understand the KIP.
> >>
> >> Best,
> >> Lucas
> >>
> >> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
> >> <wc...@confluent.io.invalid> wrote:
> >>>
> >>> Hello everybody,
> >>>
> >>> I have a stream proposal to improve the stream table join by adding a
> >> grace
> >>> period and buffer to the stream side of the join to allow processing in
> >>> timestamp order matching the recent improvements of the versioned
> tables.
> >>>
> >>> Please take a look here <https://cwiki.apache.org/confluence/x/lAs0Dw>
> >> and
> >>> share your thoughts.
> >>>
> >>> best,
> >>> Walker
> >>
> >
>

Re: [DISCUSS] KIP-923: Add A Grace Period to Stream Table Join

Posted by "Matthias J. Sax" <mj...@apache.org>.

Thanks for the KIP! Also some question/comments from my side:

10) Notation: you use the term "late data" but I think you mean 
out-of-order. We reserve the term "late" to records that arrive after 
grace period passed, and thus, "late == out-of-order data that is dropped".

20) "There is only one option from the stream side and only recently is 
there a second option on the table side."

What are those options? Victoria already asked about the table side, but 
I am also not sure what option you mean for the stream side?

30) "If the table side uses a materialized version store the value is 
the latest by stream time rather than by offset within its defined grace 
period."

The phrase "the value is the latest by stream time" is confusing -- in 
the end, a versioned stores multiple versions, not just one.

40) I am also wondering about ordering. In general, KS tries to preserve 
offset-order during processing (with some exception, when offset order 
preservation is not clearly defined). Given that the stream-side buffer 
is really just a "linear buffer", we could easily preserve offset-order. 
But I also see a benefit of re-ordering and emitting out-of-order data 
right away when read (instead of blocking them behind in-order records 
that are not ready yet). -- It might even be a possibility, to let users 
pick a emit strategy eg "EmitStrategy.preserveOffsets" (name just a 
placeholder).

The KIP should explain this in more detail and also discuss different 
options and mention them in "Rejected alternatives" in case we don't 
want to include them.

50) What happens when users change the grace period? Especially, when 
they turn it on/off (but also increasing/decreasing is an interesting 
point)? I think we should try to support this if possible; the 
"Compatibility" section needs to cover switching on/off in more detail.

-Matthias

On 5/2/23 2:06 PM, Victoria Xia wrote:
> Cool KIP, Walker! Thanks for sharing this proposal.
> 
> A few clarifications:
> 
> 1. Is the order that records exit the buffer in necessarily the same as the
> order that records enter the buffer in, or no? Based on the description in
> the KIP, it sounds like the answer is no, i.e., records will exit the
> buffer in increasing timestamp order, which means that they may be ordered
> (even for the same key) compared to the input order.
> 
> 2. What happens if the join grace period is nonzero, and a stream-side
> record arrives with a timestamp that is older than the current stream time
> minus the grace period? Will this record trigger a join result, or will it
> be dropped? Based on the description for what happens when the join grace
> period is set to zero, it sounds like the late record will be dropped, even
> if the join grace period is nonzero. Is that true?
> 
> 3. What could cause stream time to advance, for purposes of removing
> records from the join buffer? For example, will new records arriving on the
> table side of the join cause stream time to advance? From the KIP it sounds
> like only stream-side records will advance stream time -- does that mean
> that the join processor itself will have to track this stream time?
> 
> Also +1 to Lucas's question about what options will be available for
> configuring the join buffer. Will users have the option to choose whether
> they want the buffer to be in-memory vs persistent?
> 
> - Victoria
> 
> On Fri, Apr 28, 2023 at 11:54 AM Lucas Brutschy
> <lb...@confluent.io.invalid> wrote:
> 
>> HI Walker,
>>
>> thanks for the KIP! We definitely need this. I have two questions:
>>
>>   - Have you considered allowing the customization of the underlying
>> buffer implementation? As I can see, `StreamJoined` lets you customize
>> the underlying store via a `WindowStoreSupplier`. Would it make sense
>> for `Joined` to have this as well? I can imagine one may want to limit
>> the number of records in the buffer, for example. If we hit the
>> maximum, the only option would be to drop semantic guarantees, but
>> users may still want to do this.
>>   - With "second option on the table side" you are referring to
>> versioned tables, right? Will the buffer on the stream side behave any
>> different whether the table side is versioned or not?
>>
>> Finally, I think a simple example in the motivation section could help
>> non-experts understand the KIP.
>>
>> Best,
>> Lucas
>>
>> On Tue, Apr 25, 2023 at 9:13 PM Walker Carlson
>> <wc...@confluent.io.invalid> wrote:
>>>
>>> Hello everybody,
>>>
>>> I have a stream proposal to improve the stream table join by adding a
>> grace
>>> period and buffer to the stream side of the join to allow processing in
>>> timestamp order matching the recent improvements of the versioned tables.
>>>
>>> Please take a look here <https://cwiki.apache.org/confluence/x/lAs0Dw>
>> and
>>> share your thoughts.
>>>
>>> best,
>>> Walker
>>
>