Posted to dev@kafka.apache.org by Nag Y <an...@gmail.com> on 2020/07/22 15:22:18 UTC

Confluent Platform- KTable clarification

I understand that a KStream is an abstraction of a record stream and a KTable is
an abstraction of a changelog stream (updates or inserts), along with the
semantics around each.

However, this is where some confusion arises. From the Confluent documentation
<https://docs.confluent.io/current/streams/concepts.html>

To illustrate, let’s imagine the following two data records are being sent
to the stream:

("alice", 1) --> ("alice", 3)

*If your stream processing application were to sum the values per user*, it
would return 3 for alice. Why? Because the second data record would be
considered an update of the previous record. Compare this behavior of
KTable with the illustration for KStream above, which would return 4 for
alice.

Coming to the highlighted part: *if we were to sum the values*, the result should
be 4, right? However, *if we were to look at the "updated" view of the
log*, then yes, it is 3, as a KTable maintains either updates or inserts. Did I
get it right?
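
To make the two readings concrete, here is a minimal sketch in plain Python. It does not use the Kafka Streams API; the function names `sum_as_stream` and `latest_as_table` are made up for illustration, and it only models the semantics described in the quoted documentation.

```python
from collections import defaultdict

# The two records from the documentation example, in arrival order.
records = [("alice", 1), ("alice", 3)]


def sum_as_stream(records):
    """Record-stream reading: every record is an independent event,
    so summing per key adds up all values seen for that key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def latest_as_table(records):
    """Changelog reading: every record is an upsert, so a later value
    for a key replaces the earlier one instead of being added to it."""
    table = {}
    for key, value in records:
        table[key] = value
    return table


stream_view = sum_as_stream(records)    # {"alice": 4}
table_view = latest_as_table(records)   # {"alice": 3}
```

Under these assumptions the stream reading yields 4 for alice and the table reading yields 3, which matches the two numbers in the documentation.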

Re: Confluent Platform- KTable clarification

Posted by "Matthias J. Sax" <mj...@apache.org>.
The idea is "sum value by key" in this example, which is maybe not the
perfect example.

However, if you have a KTable, you can do a
`groupBy(...).aggregate(...)` and the same update logic applies:

k1 : a,1
k2 : a,2
k3 : b,3
k4 : b,4

If you groupBy the first attribute in the value and sum, you get the
result table:

a : 3
b : 7

Now, if k1 is updated to, for example, "b,5", the old "a,1" is
removed/subtracted from the result table and the new "b,5" is added to
the result table, giving you

a : 2
b : 12

(If you apply the same example and let groupBy return the original key,
you get what the docs describe; the example is not ideal but correct.)
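
The update logic in this example can be sketched as a small plain-Python simulation. This is not Kafka Streams code: the class `GroupedSum` and its `upsert` method are hypothetical names, and the sketch only mimics the subtract-old/add-new behavior that a table `groupBy(...).aggregate(...)` with an adder and a subtractor is described as having above.

```python
from collections import defaultdict


class GroupedSum:
    """Simulates a table that is re-grouped by an attribute and summed."""

    def __init__(self):
        self.source = {}                 # upstream table: key -> (group, amount)
        self.result = defaultdict(int)   # result table: group -> sum

    def upsert(self, key, group, amount):
        old = self.source.get(key)
        if old is not None:
            # "Subtractor" step: retract the old row from its old group.
            old_group, old_amount = old
            self.result[old_group] -= old_amount
        # "Adder" step: add the new row to its (possibly different) group.
        self.source[key] = (group, amount)
        self.result[group] += amount


table = GroupedSum()
for key, group, amount in [("k1", "a", 1), ("k2", "a", 2),
                           ("k3", "b", 3), ("k4", "b", 4)]:
    table.upsert(key, group, amount)
before = dict(table.result)   # {"a": 3, "b": 7}

table.upsert("k1", "b", 5)    # k1 changes from "a,1" to "b,5"
after = dict(table.result)    # {"a": 2, "b": 12}

# Grouping by the original key: each group holds exactly one row,
# so the "sum" is simply the latest value, as in the docs example.
by_key = GroupedSum()
by_key.upsert("alice", "alice", 1)
by_key.upsert("alice", "alice", 3)
docs_case = dict(by_key.result)   # {"alice": 3}
```

The last three lines reproduce the documentation case: with the original key as the grouping key, the old value 1 is subtracted before 3 is added, so the result for alice is 3.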



-Matthias


Re: Confluent Platform- KTable clarification

Posted by John Roesler <vv...@apache.org>.
Hello Nag,

Yes, your conclusion sounds right.

“Sum the values per key” is a statement that doesn’t really make sense in a KTable context, since there is always just one value per key (the latest update).

I think the docs are just trying to drive the point home that in a KTable, there is just one value per key, whereas in a KStream, each key has a sequence of values. 

Thanks,
John
