Posted to users@kafka.apache.org by Guillermo Lammers Corral <gu...@tecsisa.com> on 2016/09/05 16:25:03 UTC

Kafka Streams: joins without windowing (KStream) and without being KTables

Hi,

I've been trying to work out how to implement one of my business
processes with Kafka Streams, so far without success. I hope someone can
help me.

I am reading events like the following from two topics (I'll simplify the
problem at this point):

ObjectX
Key: String
Value: String

ObjectY
Key: String
Value: String

I want to do some kind of "join" across all events, without windowing but
also without using KTables...

Example:

==============================

ObjectX("0001", "a") -> TopicA

Expected output TopicResult:

nothing

==============================

ObjectX("0001", "b") -> TopicA

Expected output TopicResult:

nothing

==============================

ObjectY("0001", "d") -> TopicB

Expected output TopicResult:

ObjectZ("0001", ("a", "d"))
ObjectZ("0001", ("b", "d"))

==============================

ObjectY("0001", "e") -> TopicB

Expected output TopicResult:

ObjectZ("0001", ("a", "e"))
ObjectZ("0001", ("b", "e"))

==============================

TopicResult at the end:

ObjectZ("0001", ("a", "d"))
ObjectZ("0001", ("b", "d"))
ObjectZ("0001", ("a", "e"))
ObjectZ("0001", ("b", "e"))

==============================

I think I can't use a KTable-KTable join because I want to match all the
events from the beginning of time. Nor can I use a KStream-KStream join,
because it forces me to use windowing. The same goes for a KStream-KTable
join...
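
To make the KTable limitation concrete, here is a minimal plain-Java sketch.
It is not Kafka Streams code: the class LatestValueJoinSketch and its methods
are hypothetical, and it assumes table semantics in which each side keeps
only the latest value per key and an update on either side emits one joined
record when the other side already has a value.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a table-table inner join (not the Kafka Streams API).
class LatestValueJoinSketch {
    private final Map<String, String> latestX = new HashMap<>();
    private final Map<String, String> latestY = new HashMap<>();
    final List<String> output = new ArrayList<>();

    void onX(String key, String value) {
        latestX.put(key, value);   // overwrites any earlier X value for this key
        emitIfBothPresent(key);
    }

    void onY(String key, String value) {
        latestY.put(key, value);   // overwrites any earlier Y value for this key
        emitIfBothPresent(key);
    }

    private void emitIfBothPresent(String key) {
        String x = latestX.get(key);
        String y = latestY.get(key);
        if (x != null && y != null) {
            output.add(String.format("ObjectZ(\"%s\", (\"%s\", \"%s\"))", key, x, y));
        }
    }

    public static void main(String[] args) {
        LatestValueJoinSketch join = new LatestValueJoinSketch();
        join.onX("0001", "a");
        join.onX("0001", "b");
        join.onY("0001", "d");
        join.onY("0001", "e");
        join.output.forEach(System.out::println);
    }
}
```

With the four events from the example, this sketch only ever pairs "b" with
"d" and "e". The value "a" has already been overwritten by the time the first
ObjectY arrives, so the ("a", ...) results are lost.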

Could any Kafka Streams expert help me with some tips?

Thanks in advance.

Re: Kafka Streams: joins without windowing (KStream) and without being KTables

Posted by "Matthias J. Sax" <ma...@confluent.io>.
Hey,

are you sure you want to join everything? This will result in a huge
memory footprint for your application. You are right that you cannot use
KTable; however, windowed KStream joins would work. You only need to
specify a huge window (i.e., use Long.MAX_VALUE; this will effectively be
"infinitely large") so that all data falls into a single window.
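
As a rough illustration of why a huge window behaves like "no window", the
plain-Java sketch below models a window condition as "timestamps differ by at
most windowMs". It is not the Kafka Streams API: JoinWindowSketch and
withinWindow are hypothetical names, and timestamps are assumed to be
non-negative milliseconds.

```java
// Hypothetical model of a windowed stream-stream join condition.
class JoinWindowSketch {
    // Two records with the same key are joined only if this returns true.
    static boolean withinWindow(long tsLeft, long tsRight, long windowMs) {
        // Timestamps are assumed non-negative, so the subtraction cannot overflow.
        return Math.abs(tsLeft - tsRight) <= windowMs;
    }

    public static void main(String[] args) {
        long oneHourMs = 60L * 60L * 1000L;
        long x = 0L;                       // record from TopicA at t = 0
        long y = 365L * 24L * oneHourMs;   // record from TopicB one year later

        // One-hour window: the two records are too far apart to join.
        System.out.println(withinWindow(x, y, oneHourMs));
        // Long.MAX_VALUE window: any two records fall into the same window.
        System.out.println(withinWindow(x, y, Long.MAX_VALUE));
    }
}
```

The first call prints false and the second prints true: with a window of
Long.MAX_VALUE, no pair of records is ever too far apart to be joined.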

The issue is that all data will be buffered in memory, so if your
application runs for a very long time, it will eventually fail (I would
assume). So, again, my initial question: are you sure you want to join
everything? (It's stream processing, not batch processing...)

If the answer is still yes, and you hit a memory issue, you will need to
fall back to the Processor API instead of the DSL so that you can spill
data to disk once it no longer fits into memory (i.e., you will need to
implement your own version of a symmetric hash join that spills to disk).
Of course, the disk usage will also be huge. Eventually, your disk might
also become too small...
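
For reference, the core of a symmetric hash join can be sketched in plain
Java as below. This is only an in-memory illustration under stated
assumptions: SymmetricHashJoinSketch and its methods are hypothetical names,
it does not use the Processor API, and the spill-to-disk part is omitted
entirely (the two maps would have to live in some disk-backed store).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory symmetric hash join: every record is stored on its
// own side and immediately probed against everything seen on the other side.
class SymmetricHashJoinSketch {
    private final Map<String, List<String>> seenX = new HashMap<>();
    private final Map<String, List<String>> seenY = new HashMap<>();
    final List<String> output = new ArrayList<>();

    void onX(String key, String x) {
        seenX.computeIfAbsent(key, k -> new ArrayList<>()).add(x);
        for (String y : seenY.getOrDefault(key, Collections.emptyList())) {
            emit(key, x, y);
        }
    }

    void onY(String key, String y) {
        seenY.computeIfAbsent(key, k -> new ArrayList<>()).add(y);
        for (String x : seenX.getOrDefault(key, Collections.emptyList())) {
            emit(key, x, y);
        }
    }

    private void emit(String key, String x, String y) {
        output.add(String.format("ObjectZ(\"%s\", (\"%s\", \"%s\"))", key, x, y));
    }

    public static void main(String[] args) {
        SymmetricHashJoinSketch join = new SymmetricHashJoinSketch();
        join.onX("0001", "a");   // nothing emitted: no ObjectY seen yet
        join.onX("0001", "b");   // nothing emitted
        join.onY("0001", "d");   // emits (a, d) and (b, d)
        join.onY("0001", "e");   // emits (a, e) and (b, e)
        join.output.forEach(System.out::println);
    }
}
```

Fed with the four events from the original example, the sketch produces the
four ObjectZ records listed under "TopicResult at the end". Note that both
maps only ever grow, which is exactly the memory (or disk) problem described
above.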

Can you clarify why you want to join everything? This does not sound
like a good idea. Very large windows can be handled, but "infinite"
windows are very problematic in stream processing.


-Matthias

