
Posted to users@kafka.apache.org by Jon Yeargers <jo...@cedexis.com> on 2017/05/03 17:51:48 UTC

joining two windowed aggregations

I want to collect data in two windowed groups: 4 hours with a one-hour
overlap, and 5 minutes / 1 minute. I want to compare the values in the
_oldest_ window for each group.

Seems like this would be a standard join operation, but I'm not clear on how
to limit which window the join operates on. I could keep a timestamp in
each aggregate and ignore the join if it isn't what I want (i.e., < 4 hours
old), but this seems very inefficient.

Likely I'm missing the big picture here again with regard to KStreams. I keep
running into situations where it seems like Kafka Streams would be a great
tool but it just doesn't quite fit. Kind of like having a drawer with mixed
metric/standard wrenches.

Re: joining two windowed aggregations

Posted by "Matthias J. Sax" <ma...@confluent.io>.

> Seems like this would be a standard join operation 

I'm not sure I would consider this a "standard" join... Your windows
have different sizes and thus "move" at different speeds.

Kafka Streams joins provide sliding join semantics, conceptually similar
to a SQL query like this:

> SELECT * FROM stream1, stream2
>   WHERE
>     stream1.key = stream2.key
>     AND
>     stream1.ts - windowSize <= stream2.ts AND stream2.ts <= stream1.ts + windowSize

I.e., it joins all records that are close to each other in time, which is
a very natural way to express a stream join (and what I would consider a
"standard" join). (Btw: this applies _one_ window over both streams.)
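To make the predicate above concrete, here is a minimal plain-Java sketch with no Kafka dependency. The class `SlidingJoinSketch`, the `Rec` type, and the `join` method are made up for illustration; this only mirrors the conceptual query and is not how Kafka Streams implements joins internally.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the sliding join predicate; not Kafka Streams code.
final class SlidingJoinSketch {

    // A record with a key, a timestamp in milliseconds, and a value.
    static final class Rec {
        final String key;
        final long ts;
        final String value;

        Rec(String key, long ts, String value) {
            this.key = key;
            this.ts = ts;
            this.value = value;
        }
    }

    // Emits "left+right" for every pair with equal keys whose timestamps
    // differ by at most windowSizeMs (bounds inclusive, as in the query).
    static List<String> join(List<Rec> left, List<Rec> right, long windowSizeMs) {
        List<String> out = new ArrayList<>();
        for (Rec l : left) {
            for (Rec r : right) {
                boolean sameKey = l.key.equals(r.key);
                boolean closeInTime =
                    l.ts - windowSizeMs <= r.ts && r.ts <= l.ts + windowSizeMs;
                if (sameKey && closeInTime) {
                    out.add(l.value + "+" + r.value);
                }
            }
        }
        return out;
    }
}
```

Note that the single `windowSizeMs` is applied relative to each record's own timestamp, which is why there is no fixed window boundary to pick an "oldest" window from.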

The hopping window semantics you describe are also possible, with a
little extra work:

You would first use an aggregation that "collects" your windows (i.e.,
your aggregation function builds up a list of input records as the
aggregation result). You can apply this to both of your streams with the
corresponding TimeWindows.of().advanceBy() settings.
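As a rough illustration of what such a "collecting" hopping-window aggregation does, here is a plain-Java sketch without any Kafka dependency. The names `HoppingWindowSketch`, `windowStartsFor`, and `collect` are hypothetical, and the 5 minute / 1 minute values in the usage note assume "size / advance" is what was meant in the question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of a list-collecting hopping-window aggregation.
final class HoppingWindowSketch {

    // Start times of all hopping windows [start, start + sizeMs) that contain ts.
    // Window starts are multiples of advanceMs and never negative.
    static List<Long> windowStartsFor(long ts, long sizeMs, long advanceMs) {
        List<Long> starts = new ArrayList<>();
        long start = Math.max(0L, ts - sizeMs + advanceMs) / advanceMs * advanceMs;
        for (; start <= ts; start += advanceMs) {
            starts.add(start);
        }
        return starts;
    }

    // "Collecting" aggregation: appends the value to the list of every
    // window that contains ts, keyed by window start time.
    static void collect(Map<Long, List<String>> windows, long ts, String value,
                        long sizeMs, long advanceMs) {
        for (long start : windowStartsFor(ts, sizeMs, advanceMs)) {
            windows.computeIfAbsent(start, k -> new ArrayList<>()).add(value);
        }
    }
}
```

With a size of 300000 ms and an advance of 60000 ms, a record at 330000 ms lands in five windows, starting at 60000, 120000, 180000, 240000 and 300000 ms. Backing the map with a `TreeMap` keeps the windows sorted by start time, so the oldest one is the first entry.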

For the join itself, you would merge both streams via
`KStreamBuilder#merge` and apply a stateful (Value)Transformer
downstream. In your `Transformer#transform` you can write custom logic to
compare your windows.
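The custom comparison logic could look roughly like the following plain-Java sketch, which stands in for the stateful step after the merge and deliberately uses no Kafka classes. Everything here is an assumption for illustration: the class `OldestWindowComparatorSketch`, the `Source` tag, the use of a single double per window as the aggregate, and "difference of the oldest retained windows" as the comparison.

```java
import java.util.TreeMap;

// Hypothetical stand-in for the stateful comparison step; not Kafka Streams code.
final class OldestWindowComparatorSketch {

    // Tags each merged record with the windowed aggregation it came from.
    enum Source { LONG_WINDOWS, SHORT_WINDOWS }

    // State: latest aggregate per window start, sorted so the oldest is first.
    private final TreeMap<Long, Double> longWindows = new TreeMap<>();
    private final TreeMap<Long, Double> shortWindows = new TreeMap<>();

    // Called once per record of the merged stream. Returns the oldest long
    // window's aggregate minus the oldest short window's aggregate, or null
    // while one side has no data yet.
    Double transform(Source source, long windowStartMs, double aggregate) {
        TreeMap<Long, Double> target =
            source == Source.LONG_WINDOWS ? longWindows : shortWindows;
        target.put(windowStartMs, aggregate);
        if (longWindows.isEmpty() || shortWindows.isEmpty()) {
            return null;
        }
        return longWindows.firstEntry().getValue() - shortWindows.firstEntry().getValue();
    }

    // Drops windows starting before cutoffMs so the state stays bounded.
    void expireBefore(long cutoffMs) {
        longWindows.headMap(cutoffMs).clear();
        shortWindows.headMap(cutoffMs).clear();
    }
}
```

In a real topology the two maps would live in a state store rather than in fields, and expiry would be driven by whatever retention you choose; both are left out here to keep the sketch short.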

Does this help?

-Matthias
