You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "ghassan Yammine (Jira)" <ji...@apache.org> on 2020/02/21 03:46:00 UTC
[jira] [Commented] (KAFKA-8770) Either switch to or add an option for emit-on-change

    [ https://issues.apache.org/jira/browse/KAFKA-8770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041511#comment-17041511 ] 

ghassan Yammine commented on KAFKA-8770:
----------------------------------------

[~vvcephei] et al.

Emit-on-change would be very beneficial to us. Our data model is foreign-key heavy with some relationships on the order 400,000:1 (we rely heavily on KIP-213, the FK-join). Thus a single change - even if it does not modify 
the output - will generate a tremendous amount of unnecessary traffic for the downstream apps. This is due to the embedded #flatMap operation in the FK-join. Considering the size of some the associations, a single, idempotent, update can generate a large amount of duplicate records.

Finally, we actually do observe many idempotent/superfluous updates due to the lack of emit-on-change behavior.  We could do this ourselves by injecting a deduplicating Transformer() in the Stream but we would have to do it for every KStream app that we have and it would not be as efficient as a "native" implementation.  We are currently at 12 apps and planning for ultimately around 100-200.

In summary, this KIP would save us from computing, storing, and transmitting anywhere from millions to billions of idempotent updates a day.

> Either switch to or add an option for emit-on-change
> ----------------------------------------------------
>
>                 Key: KAFKA-8770
>                 URL: https://issues.apache.org/jira/browse/KAFKA-8770
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: John Roesler
>            Priority: Major
>              Labels: needs-kip
>
> Currently, Streams offers two emission models:
> * emit-on-window-close: (using Suppression)
> * emit-on-update: (i.e., emit a new result whenever a new record is processed, regardless of whether the result has changed)
> There is also an option to drop some intermediate results, either using caching or suppression.
> However, there is no support for emit-on-change, in which results would be forwarded only if the result has changed. This has been reported to be extremely valuable as a performance optimizations for some high-traffic applications, and it reduces the computational burden both internally for downstream Streams operations, as well as for external systems that consume the results, and currently have to deal with a lot of "no-op" changes.
> It would be pretty straightforward to implement this, by loading the prior results before a stateful operation and comparing with the new result before persisting or forwarding. In many cases, we load the prior result anyway, so it may not be a significant performance impact either.
> One design challenge is what to do with timestamps. If we get one record at time 1 that produces a result, and then another at time 2 that produces a no-op, what should be the timestamp of the result, 1 or 2? emit-on-change would require us to say 1.
> Clearly, we'd need to do some serious benchmarks to evaluate any potential implementation of emit-on-change.
> Another design challenge is to decide if we should just automatically provide emit-on-change for stateful operators, or if it should be configurable. Configuration increases complexity, so unless the performance impact is high, we may just want to change the emission model without a configuration.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)