You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@kafka.apache.org by Dumitru-Nicolae Marasoui <Ni...@kaluza.com> on 2020/07/13 19:30:35 UTC

3 kafka-streams or a single kafka streams

Hello kafka community,
I imagine the following behavior that I could code in 3 kafka-streams
pipelines and wondering if it can be done in fewer kafka streams with the
same guarantees:

I have 3 compacted topics, t1, t2 and t3, where t2 is the link (many-many)
between t1 & t3.
The same about t4-t6.

I need to replicate t1-3 in t4-6 with a slightly different domain and move
a column (avro attribute) from t1 to t6 to adapt the domains.

In a multi-kafka-streams-pipelines paradigm I could do the following:

I would create 2 or 3 intermediary topics:
- a topic keyed in t1 key which gets events copies from both t1 & t2, both
message types keyed in t1 key (kafka-streams-1 which merges 2 streams based
on t1 & t2 with slight re-keying)
- a topic keyed in t3 key which gets events copies from both t3 & t2, both
message types keyed in t3 key (kafka-streams-2 which merges 2 streams based
on t3 & t2 with slight re-keying)

Until now I have created 2 joins, and I know joins exist in KafkaStreams,
but these joins I can understand with my mind why they would be
linearizable, since the messages on topics t1 & t2 will be sent to a common
pipe with a total order between messages keyed in any particular t1 key, so
the state processing would be linearizable at the level of each t1 entity
(including t1 messages & t2 messages).

Now the output topics are having the same key: t2 key (which is t1+t3 keys
combined).

Now I can join these topics, again sending them both to a 3rd topic with
this join via merge strategy which i understand will not lose combinations
because it cannot allow concurrency, so that all the messages keyed in a
specific t2 key (a specific t1+t3 key combination) are read in order and
applied single thread fashion, linearizable (kafka-streams-3).

So from these 2+1 pipelines I can have an output back to t6 where I rewrite
t6 records with records that contain a new value that is taken from t1.

Does this make sense?
Would you think that it is doable in less pipelines?
Would using joins instead of these merges allow any such guarantees of
single threaded processing across topics? I think not?
Thank you,

-- 

Dumitru-Nicolae Marasoui

Software Engineer



w kaluza.com <https://www.kaluza.com/>

LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
<https://twitter.com/Kaluza_tech>

Kaluza Ltd. registered in England and Wales No. 08785057

VAT No. 100119879

Help save paper - do you need to print this email?

Re: 3 kafka-streams or a single kafka streams

Posted by Dumitru-Nicolae Marasoui <Ni...@kaluza.com>.
Hello kafka community,
Instead of the consumer, I can also have kafka-streams for stateful
transformation (aggregation / fold) of both types of events & detection
when both sides are present to emit a pair.
In fact I thought a bit more and in my particular case, if we will agree
that I do not need to rewrite t6 which is concurrency dangerous, but we
could create a t7 topic just with the association info i need to emit, and
downstreams consumers could join (and hopefully in a concurrency-safe way &
eventually consistent way), then i can do all with 2 kafka stream: a merge
and an aggregation, potentially skipping the topic in the middle - but i am
not sure about the ordering, i think i will have the topic in between kafka
streams instead of just one kafka streams, to make sure about the ordering
and single threaded processing at the right granularity,
Thank you,

On Mon, 13 Jul 2020 at 20:38, Dumitru-Nicolae Marasoui <
Nicolae.Marasoiu@kaluza.com> wrote:

> Hello kafka community,
> Sorry, besides the 3 kafka streams that just merge topics into another
> common topic, there must be consumers from those merged topics, that keep
> local state, and that, once both t1 and t2 values for a particular t1 key
> exist, emit the pair, and will keep emitting pairs multiple times for each
> t1 key as long as it has new values in either t1 or t2, with the condition
> that both t1 and t2 records are recorded in the local state for a
> particular t1 key.
>
> So these consumers would do the join itself with the help of a local
> database.
>
> I know it sounds a lot like RocksDb from kafka-streams but this is what I
> can understand how it would work and prevent unsafe concurrency (e.g. race
> conditions).
>
> So each consumer consumes from a single topic with 2 event/message types
> and as soon as it has a message of type t1 to which a local state t2 pair
> exists (or the other way around), it emits (sends) a message to its output
> (join) topic.
>
> Can this system, of 3 merges (kafka streams) + 3 joins (consumers), which
> takes care for things to be single threaded at the right shards/partitions,
> and not to lose pairs or combinations due to incorrect concurrency, can
> this be done in a simpler way?
>
> Thanks,
> Nicolae
>
> On Mon, 13 Jul 2020 at 20:30, Dumitru-Nicolae Marasoui <
> Nicolae.Marasoiu@kaluza.com> wrote:
>
>>
>> Hello kafka community,
>> I imagine the following behavior that I could code in 3 kafka-streams
>> pipelines and wondering if it can be done in fewer kafka streams with the
>> same guarantees:
>>
>> I have 3 compacted topics, t1, t2 and t3, where t2 is the link
>> (many-many) between t1 & t3.
>> The same about t4-t6.
>>
>> I need to replicate t1-3 in t4-6 with a slightly different domain and
>> move a column (avro attribute) from t1 to t6 to adapt the domains.
>>
>> In a multi-kafka-streams-pipelines paradigm I could do the following:
>>
>> I would create 2 or 3 intermediary topics:
>> - a topic keyed in t1 key which gets events copies from both t1 & t2,
>> both message types keyed in t1 key (kafka-streams-1 which merges 2 streams
>> based on t1 & t2 with slight re-keying)
>> - a topic keyed in t3 key which gets events copies from both t3 & t2,
>> both message types keyed in t3 key (kafka-streams-2 which merges 2 streams
>> based on t3 & t2 with slight re-keying)
>>
>> Until now I have created 2 joins, and I know joins exist in KafkaStreams,
>> but these joins I can understand with my mind why they would be
>> linearizable, since the messages on topics t1 & t2 will be sent to a common
>> pipe with a total order between messages keyed in any particular t1 key, so
>> the state processing would be linearizable at the level of each t1 entity
>> (including t1 messages & t2 messages).
>>
>> Now the output topics are having the same key: t2 key (which is t1+t3
>> keys combined).
>>
>> Now I can join these topics, again sending them both to a 3rd topic with
>> this join via merge strategy which i understand will not lose combinations
>> because it cannot allow concurrency, so that all the messages keyed in a
>> specific t2 key (a specific t1+t3 key combination) are read in order and
>> applied single thread fashion, linearizable (kafka-streams-3).
>>
>> So from these 2+1 pipelines I can have an output back to t6 where I
>> rewrite t6 records with records that contain a new value that is taken from
>> t1.
>>
>> Does this make sense?
>> Would you think that it is doable in less pipelines?
>> Would using joins instead of these merges allow any such guarantees of
>> single threaded processing across topics? I think not?
>> Thank you,
>>
>> --
>>
>> Dumitru-Nicolae Marasoui
>>
>> Software Engineer
>>
>>
>>
>> w kaluza.com <https://www.kaluza.com/>
>>
>> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
>> <https://twitter.com/Kaluza_tech>
>>
>> Kaluza Ltd. registered in England and Wales No. 08785057
>>
>> VAT No. 100119879
>>
>> Help save paper - do you need to print this email?
>>
>
>
> --
>
> Dumitru-Nicolae Marasoui
>
> Software Engineer
>
>
>
> w kaluza.com <https://www.kaluza.com/>
>
> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
> <https://twitter.com/Kaluza_tech>
>
> Kaluza Ltd. registered in England and Wales No. 08785057
>
> VAT No. 100119879
>
> Help save paper - do you need to print this email?
>


-- 

Dumitru-Nicolae Marasoui

Software Engineer



w kaluza.com <https://www.kaluza.com/>

LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
<https://twitter.com/Kaluza_tech>

Kaluza Ltd. registered in England and Wales No. 08785057

VAT No. 100119879

Help save paper - do you need to print this email?

Re: 3 kafka-streams or a single kafka streams

Posted by Dumitru-Nicolae Marasoui <Ni...@kaluza.com>.
Hello kafka community,
Sorry, besides the 3 kafka streams that just merge topics into another
common topic, there must be consumers from those merged topics, that keep
local state, and that, once both t1 and t2 values for a particular t1 key
exist, emit the pair, and will keep emitting pairs multiple times for each
t1 key as long as it has new values in either t1 or t2, with the condition
that both t1 and t2 records are recorded in the local state for a
particular t1 key.

So these consumers would do the join itself with the help of a local
database.

I know it sounds a lot like RocksDb from kafka-streams but this is what I
can understand how it would work and prevent unsafe concurrency (e.g. race
conditions).

So each consumer consumes from a single topic with 2 event/message types
and as soon as it has a message of type t1 to which a local state t2 pair
exists (or the other way around), it emits (sends) a message to its output
(join) topic.

Can this system, of 3 merges (kafka streams) + 3 joins (consumers), which
takes care for things to be single threaded at the right shards/partitions,
and not to lose pairs or combinations due to incorrect concurrency, can
this be done in a simpler way?

Thanks,
Nicolae

On Mon, 13 Jul 2020 at 20:30, Dumitru-Nicolae Marasoui <
Nicolae.Marasoiu@kaluza.com> wrote:

>
> Hello kafka community,
> I imagine the following behavior that I could code in 3 kafka-streams
> pipelines and wondering if it can be done in fewer kafka streams with the
> same guarantees:
>
> I have 3 compacted topics, t1, t2 and t3, where t2 is the link (many-many)
> between t1 & t3.
> The same about t4-t6.
>
> I need to replicate t1-3 in t4-6 with a slightly different domain and move
> a column (avro attribute) from t1 to t6 to adapt the domains.
>
> In a multi-kafka-streams-pipelines paradigm I could do the following:
>
> I would create 2 or 3 intermediary topics:
> - a topic keyed in t1 key which gets events copies from both t1 & t2, both
> message types keyed in t1 key (kafka-streams-1 which merges 2 streams based
> on t1 & t2 with slight re-keying)
> - a topic keyed in t3 key which gets events copies from both t3 & t2, both
> message types keyed in t3 key (kafka-streams-2 which merges 2 streams based
> on t3 & t2 with slight re-keying)
>
> Until now I have created 2 joins, and I know joins exist in KafkaStreams,
> but these joins I can understand with my mind why they would be
> linearizable, since the messages on topics t1 & t2 will be sent to a common
> pipe with a total order between messages keyed in any particular t1 key, so
> the state processing would be linearizable at the level of each t1 entity
> (including t1 messages & t2 messages).
>
> Now the output topics are having the same key: t2 key (which is t1+t3 keys
> combined).
>
> Now I can join these topics, again sending them both to a 3rd topic with
> this join via merge strategy which i understand will not lose combinations
> because it cannot allow concurrency, so that all the messages keyed in a
> specific t2 key (a specific t1+t3 key combination) are read in order and
> applied single thread fashion, linearizable (kafka-streams-3).
>
> So from these 2+1 pipelines I can have an output back to t6 where I
> rewrite t6 records with records that contain a new value that is taken from
> t1.
>
> Does this make sense?
> Would you think that it is doable in less pipelines?
> Would using joins instead of these merges allow any such guarantees of
> single threaded processing across topics? I think not?
> Thank you,
>
> --
>
> Dumitru-Nicolae Marasoui
>
> Software Engineer
>
>
>
> w kaluza.com <https://www.kaluza.com/>
>
> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
> <https://twitter.com/Kaluza_tech>
>
> Kaluza Ltd. registered in England and Wales No. 08785057
>
> VAT No. 100119879
>
> Help save paper - do you need to print this email?
>


-- 

Dumitru-Nicolae Marasoui

Software Engineer



w kaluza.com <https://www.kaluza.com/>

LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
<https://twitter.com/Kaluza_tech>

Kaluza Ltd. registered in England and Wales No. 08785057

VAT No. 100119879

Help save paper - do you need to print this email?