You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by Leif Wickland <lw...@rubiconproject.com> on 2017/07/21 23:00:46 UTC

Joining two topics and emitting each key only once within a sliding window.

Howdy,

I'm trying to make a Mafka application which will join two topics with
different semantics than the KStreams join has natively.

Specifically I'd like to:
- Immediately emit matches to a topic.
- Track when matches are found so subsequent matches (which within the join
window would be considered duplicates) aren't reemitted.
- Emit unmatched items to a different topic when their time is a
configurable amount beyond the low water mark.
- Only emit a given key once within a window, either to the matched or
unmatched topic.

I'm still working through how (or if) this can be accomplished. Has someone
done something similar to this in the past?

One approach I considered was doing an outer join followed by a stateful
transformer which would implement the logic above, using punctuate() to
trigger flushes of unmatched items at the appropriate time past the
watermark. The biggest challenge I see to that approach is that the
StateStore interface doesn't provide a way to iterate over all keys in a
time window. The underlying RocksDB store has that capacity, but bridging
that gap looks painful.

The other approach I was considering trying to implement all that logic in
a single processor in order to avoid having the overhead of storing the
data first in the join implementation's store and then again in the
transform's store. I don't understand what challenges I'd have around
reading transactionally from two topics. I suspect that may not be
insignificantly difficult.

I'd be grateful for any insight or suggestions you could offer.

Re: Joining two topics and emitting each key only once within a sliding window.

Posted by "Matthias J. Sax" <ma...@confluent.io>.
I am not sure exactly what semantics you want to have. Note, that Kafka
Streams provides a sliding window join between two stream. Thus, I am
not sure what you mean by

>> Track when matches are found so subsequent matches (which within the join
>> window would be considered duplicates) aren't reemitted.

Also note, that result records are timestamped with current "stream
time". If you have late arriving records, the timestamp would not fall
in the actual time-window and thus subsequent filtering would not work
as you cannot know the correct time window a result record originates from.

Thus, I think it would be best to implement the join as a custom
operator using a stateful transform.


-Matthias


On 7/22/17 1:00 AM, Leif Wickland wrote:
> Howdy,
> 
> I'm trying to make a Mafka application which will join two topics with
> different semantics than the KStreams join has natively.
> 
> Specifically I'd like to:
> - Immediately emit matches to a topic.
> - Track when matches are found so subsequent matches (which within the join
> window would be considered duplicates) aren't reemitted.
> - Emit unmatched items to a different topic when their time is a
> configurable amount beyond the low water mark.
> - Only emit a given key once within a window, either to the matched or
> unmatched topic.
> 
> I'm still working through how (or if) this can be accomplished. Has someone
> done something similar to this in the past?
> 
> One approach I considered was doing an outer join followed by a stateful
> transformer which would implement the logic above, using punctuate() to
> trigger flushes of unmatched items at the appropriate time past the
> watermark. The biggest challenge I see to that approach is that the
> StateStore interface doesn't provide a way to iterate over all keys in a
> time window. The underlying RocksDB store has that capacity, but bridging
> that gap looks painful.
> 
> The other approach I was considering trying to implement all that logic in
> a single processor in order to avoid having the overhead of storing the
> data first in the join implementation's store and then again in the
> transform's store. I don't understand what challenges I'd have around
> reading transactionally from two topics. I suspect that may not be
> insignificantly difficult.
> 
> I'd be grateful for any insight or suggestions you could offer.
>