You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@kafka.apache.org by Dumitru-Nicolae Marasoui <Ni...@kaluza.com> on 2020/07/14 11:43:31 UTC

ktable join & data loss / indeterminism risk prevention

Hello kafka community,
As I understand it, a kafka-streams join that involves a kTable: “the
KTable lookup is done on the current KTable state, and thus, out-of-order
records can yield non-deterministic result” [1]

Does the solution below involving an intermediate topic sound right to you?
1. kafka streams 1:

   - map topic1 in: key: the key as the join key between topic1 and topic2;
   value: topic1Or2UnionPayload,
   - map topic2 in key: the key as the join key between topic1 and topic2;
   value: topic1Or2UnionPayload,
   - merge mapped topics above into a single stream; the keys are identical
   for elements that need to be joined on both mapped topics/streams
   - write result to a new topic
   -

2. transformation part 2: (kafka consumer/processor or kafka streams if a
suitable stateful transformation can be applied):

   - input the topic result from above with both mapped messages coming on
   a single pipe, linearised by join key
   - a state is need, likely a database (in case kafka-streams is
   applicable, great, it already embeds one)
   - when any two sides for the same join key are in the state, a pair can
   be emitted downstream
   -

Does this make sense to you? Do you have any other experiences of possible
approaches that you would like to share? Thank you

[1]
https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics


Dumitru-Nicolae Marasoui

Software Engineer



w kaluza.com <https://www.kaluza.com/>

LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
<https://twitter.com/Kaluza_tech>

Kaluza Ltd. registered in England and Wales No. 08785057

VAT No. 100119879

Help save paper - do you need to print this email?

Re: ktable join & data loss / indeterminism risk prevention

Posted by "Matthias J. Sax" <mj...@apache.org>.

I am not sure if I fully understand what your question is?

Are you talking about stream-table or table-table join?

For (1), why do you `merge()`? The merge operator is defined on
KStreams, not KTable and a merge is also not a join?


-Matthias


On 7/15/20 3:27 AM, Dumitru-Nicolae Marasoui wrote:
> Hello kafka community,
> Writing black on white to be more visible,
> This is a thought on making join more clear to me and less prone to
> concurrency issues that would be risky, not knowing the underlying
> implementation of join:
> Waiting your feedback,
> Thanks,
> 
> 1. kafka streams 1:
> map topic1 in: key: the key as the join key between topic1 and topic2;
> value: topic1Or2UnionPayload,
> map topic2 in key: the key as the join key between topic1 and topic2;
> value: topic1Or2UnionPayload,
> merge mapped topics above into a single stream; the keys are identical for
> elements that need to be joined on both mapped topics/streams
> write result to a new topic
> 
> 2. kafka streams 2 (with join processor)
> input the topic result from above with both mapped messages coming on a
> single pipe, linearised by join key
> groupBy join key
> pipe each outcoming stream into a Processor that will store data for both
> sides of the join and as soon as it has both will start emitting joined
> records
> write joined records to the result topic (the safe join result)
> 
> On Wed, 15 Jul 2020 at 11:25, Dumitru-Nicolae Marasoui <
> Nicolae.Marasoiu@kaluza.com> wrote:
> 
>> Hello kafka community,
>> Refining the step 2 and some questions:
>> - is the indeterminism of the ktable join a real problem?
>> - how is the ktable join implemented?
>> - do you think the solution outlined is a step in the right direction?
>> - does the ktable join implement such a strategy in a future version upon
>> configuration/demand?
>> Thank you,
>> 1. kafka streams 1:
>>
>>    - map topic1 in: key: the key as the join key between topic1 and
>>    topic2; value: topic1Or2UnionPayload,
>>    - map topic2 in key: the key as the join key between topic1 and
>>    topic2; value: topic1Or2UnionPayload,
>>    - merge mapped topics above into a single stream; the keys are
>>    identical for elements that need to be joined on both mapped topics/streams
>>    - write result to a new topic
>>    -
>>
>> 2. transformation part 2: (kafka kafka streams+processor):
>>
>>    - input the topic result from above with both mapped messages coming
>>    on a single pipe, linearised by join key
>>    - groupBy join key
>>    - pipe each outcoming stream into a Processor that will store data for
>>    both sides of the join and as soon as it has both will start emitting
>>    joined records
>>    - write joined records to the result topic (the safe join result)
>>    -
>>
>>
>> On Tue, 14 Jul 2020 at 12:43, Dumitru-Nicolae Marasoui <
>> Nicolae.Marasoiu@kaluza.com> wrote:
>>
>>> Hello kafka community,
>>> As I understand it, a kafka-streams join that involves a kTable: “the
>>> KTable lookup is done on the current KTable state, and thus, out-of-order
>>> records can yield non-deterministic result” [1]
>>>
>>> Does the solution below involving an intermediate topic sound right
>>> to you?
>>> 1. kafka streams 1:
>>>
>>>    - map topic1 in: key: the key as the join key between topic1 and
>>>    topic2; value: topic1Or2UnionPayload,
>>>    - map topic2 in key: the key as the join key between topic1 and
>>>    topic2; value: topic1Or2UnionPayload,
>>>    - merge mapped topics above into a single stream; the keys are
>>>    identical for elements that need to be joined on both mapped topics/streams
>>>    - write result to a new topic
>>>    -
>>>
>>> 2. transformation part 2: (kafka consumer/processor or kafka streams if a
>>> suitable stateful transformation can be applied):
>>>
>>>    - input the topic result from above with both mapped messages coming
>>>    on a single pipe, linearised by join key
>>>    - a state is need, likely a database (in case kafka-streams is
>>>    applicable, great, it already embeds one)
>>>    - when any two sides for the same join key are in the state, a pair
>>>    can be emitted downstream
>>>    -
>>>
>>> Does this make sense to you? Do you have any other experiences of
>>> possible approaches that you would like to share? Thank you
>>>
>>> [1]
>>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
>>>
>>>
>>> Dumitru-Nicolae Marasoui
>>>
>>> Software Engineer
>>>
>>>
>>>
>>> w kaluza.com <https://www.kaluza.com/>
>>>
>>> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
>>> <https://twitter.com/Kaluza_tech>
>>>
>>> Kaluza Ltd. registered in England and Wales No. 08785057
>>>
>>> VAT No. 100119879
>>>
>>> Help save paper - do you need to print this email?
>>>
>>
>>
>> --
>>
>> Dumitru-Nicolae Marasoui
>>
>> Software Engineer
>>
>>
>>
>> w kaluza.com <https://www.kaluza.com/>
>>
>> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
>> <https://twitter.com/Kaluza_tech>
>>
>> Kaluza Ltd. registered in England and Wales No. 08785057
>>
>> VAT No. 100119879
>>
>> Help save paper - do you need to print this email?
>>
> 
>

Re: ktable join & data loss / indeterminism risk prevention

Posted by Dumitru-Nicolae Marasoui <Ni...@kaluza.com>.

Hello kafka community,
Writing black on white to be more visible,
This is a thought on making join more clear to me and less prone to
concurrency issues that would be risky, not knowing the underlying
implementation of join:
Waiting your feedback,
Thanks,

1. kafka streams 1:
map topic1 in: key: the key as the join key between topic1 and topic2;
value: topic1Or2UnionPayload,
map topic2 in key: the key as the join key between topic1 and topic2;
value: topic1Or2UnionPayload,
merge mapped topics above into a single stream; the keys are identical for
elements that need to be joined on both mapped topics/streams
write result to a new topic

2. kafka streams 2 (with join processor)
input the topic result from above with both mapped messages coming on a
single pipe, linearised by join key
groupBy join key
pipe each outcoming stream into a Processor that will store data for both
sides of the join and as soon as it has both will start emitting joined
records
write joined records to the result topic (the safe join result)

On Wed, 15 Jul 2020 at 11:25, Dumitru-Nicolae Marasoui <
Nicolae.Marasoiu@kaluza.com> wrote:

> Hello kafka community,
> Refining the step 2 and some questions:
> - is the indeterminism of the ktable join a real problem?
> - how is the ktable join implemented?
> - do you think the solution outlined is a step in the right direction?
> - does the ktable join implement such a strategy in a future version upon
> configuration/demand?
> Thank you,
> 1. kafka streams 1:
>
>    - map topic1 in: key: the key as the join key between topic1 and
>    topic2; value: topic1Or2UnionPayload,
>    - map topic2 in key: the key as the join key between topic1 and
>    topic2; value: topic1Or2UnionPayload,
>    - merge mapped topics above into a single stream; the keys are
>    identical for elements that need to be joined on both mapped topics/streams
>    - write result to a new topic
>    -
>
> 2. transformation part 2: (kafka kafka streams+processor):
>
>    - input the topic result from above with both mapped messages coming
>    on a single pipe, linearised by join key
>    - groupBy join key
>    - pipe each outcoming stream into a Processor that will store data for
>    both sides of the join and as soon as it has both will start emitting
>    joined records
>    - write joined records to the result topic (the safe join result)
>    -
>
>
> On Tue, 14 Jul 2020 at 12:43, Dumitru-Nicolae Marasoui <
> Nicolae.Marasoiu@kaluza.com> wrote:
>
>> Hello kafka community,
>> As I understand it, a kafka-streams join that involves a kTable: “the
>> KTable lookup is done on the current KTable state, and thus, out-of-order
>> records can yield non-deterministic result” [1]
>>
>> Does the solution below involving an intermediate topic sound right
>> to you?
>> 1. kafka streams 1:
>>
>>    - map topic1 in: key: the key as the join key between topic1 and
>>    topic2; value: topic1Or2UnionPayload,
>>    - map topic2 in key: the key as the join key between topic1 and
>>    topic2; value: topic1Or2UnionPayload,
>>    - merge mapped topics above into a single stream; the keys are
>>    identical for elements that need to be joined on both mapped topics/streams
>>    - write result to a new topic
>>    -
>>
>> 2. transformation part 2: (kafka consumer/processor or kafka streams if a
>> suitable stateful transformation can be applied):
>>
>>    - input the topic result from above with both mapped messages coming
>>    on a single pipe, linearised by join key
>>    - a state is need, likely a database (in case kafka-streams is
>>    applicable, great, it already embeds one)
>>    - when any two sides for the same join key are in the state, a pair
>>    can be emitted downstream
>>    -
>>
>> Does this make sense to you? Do you have any other experiences of
>> possible approaches that you would like to share? Thank you
>>
>> [1]
>> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
>>
>>
>> Dumitru-Nicolae Marasoui
>>
>> Software Engineer
>>
>>
>>
>> w kaluza.com <https://www.kaluza.com/>
>>
>> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
>> <https://twitter.com/Kaluza_tech>
>>
>> Kaluza Ltd. registered in England and Wales No. 08785057
>>
>> VAT No. 100119879
>>
>> Help save paper - do you need to print this email?
>>
>
>
> --
>
> Dumitru-Nicolae Marasoui
>
> Software Engineer
>
>
>
> w kaluza.com <https://www.kaluza.com/>
>
> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
> <https://twitter.com/Kaluza_tech>
>
> Kaluza Ltd. registered in England and Wales No. 08785057
>
> VAT No. 100119879
>
> Help save paper - do you need to print this email?
>


-- 

Dumitru-Nicolae Marasoui

Software Engineer



w kaluza.com <https://www.kaluza.com/>

LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
<https://twitter.com/Kaluza_tech>

Kaluza Ltd. registered in England and Wales No. 08785057

VAT No. 100119879

Help save paper - do you need to print this email?

Re: ktable join & data loss / indeterminism risk prevention

Posted by Dumitru-Nicolae Marasoui <Ni...@kaluza.com>.

Hello kafka community,
Refining the step 2 and some questions:
- is the indeterminism of the ktable join a real problem?
- how is the ktable join implemented?
- do you think the solution outlined is a step in the right direction?
- does the ktable join implement such a strategy in a future version upon
configuration/demand?
Thank you,
1. kafka streams 1:

   - map topic1 in: key: the key as the join key between topic1 and topic2;
   value: topic1Or2UnionPayload,
   - map topic2 in key: the key as the join key between topic1 and topic2;
   value: topic1Or2UnionPayload,
   - merge mapped topics above into a single stream; the keys are identical
   for elements that need to be joined on both mapped topics/streams
   - write result to a new topic
   -

2. transformation part 2: (kafka kafka streams+processor):

   - input the topic result from above with both mapped messages coming on
   a single pipe, linearised by join key
   - groupBy join key
   - pipe each outcoming stream into a Processor that will store data for
   both sides of the join and as soon as it has both will start emitting
   joined records
   - write joined records to the result topic (the safe join result)
   -


On Tue, 14 Jul 2020 at 12:43, Dumitru-Nicolae Marasoui <
Nicolae.Marasoiu@kaluza.com> wrote:

> Hello kafka community,
> As I understand it, a kafka-streams join that involves a kTable: “the
> KTable lookup is done on the current KTable state, and thus, out-of-order
> records can yield non-deterministic result” [1]
>
> Does the solution below involving an intermediate topic sound right to you?
> 1. kafka streams 1:
>
>    - map topic1 in: key: the key as the join key between topic1 and
>    topic2; value: topic1Or2UnionPayload,
>    - map topic2 in key: the key as the join key between topic1 and
>    topic2; value: topic1Or2UnionPayload,
>    - merge mapped topics above into a single stream; the keys are
>    identical for elements that need to be joined on both mapped topics/streams
>    - write result to a new topic
>    -
>
> 2. transformation part 2: (kafka consumer/processor or kafka streams if a
> suitable stateful transformation can be applied):
>
>    - input the topic result from above with both mapped messages coming
>    on a single pipe, linearised by join key
>    - a state is need, likely a database (in case kafka-streams is
>    applicable, great, it already embeds one)
>    - when any two sides for the same join key are in the state, a pair
>    can be emitted downstream
>    -
>
> Does this make sense to you? Do you have any other experiences of possible
> approaches that you would like to share? Thank you
>
> [1]
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Streams+Join+Semantics
>
>
> Dumitru-Nicolae Marasoui
>
> Software Engineer
>
>
>
> w kaluza.com <https://www.kaluza.com/>
>
> LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
> <https://twitter.com/Kaluza_tech>
>
> Kaluza Ltd. registered in England and Wales No. 08785057
>
> VAT No. 100119879
>
> Help save paper - do you need to print this email?
>


-- 

Dumitru-Nicolae Marasoui

Software Engineer



w kaluza.com <https://www.kaluza.com/>

LinkedIn <https://www.linkedin.com/company/kaluza> | Twitter
<https://twitter.com/Kaluza_tech>

Kaluza Ltd. registered in England and Wales No. 08785057

VAT No. 100119879

Help save paper - do you need to print this email?