Posted to jira@kafka.apache.org by "Vicky Papavasileiou (Jira)" <ji...@apache.org> on 2022/09/21 16:04:00 UTC

[jira] [Commented] (KAFKA-14251) Improve CPU usage of self-joins by sacrificing order

    [ https://issues.apache.org/jira/browse/KAFKA-14251?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607839#comment-17607839 ] 

Vicky Papavasileiou commented on KAFKA-14251:
---------------------------------------------

From a private discussion with [~guozhang]:

Reasons in favor of this optimization:
 * since {{thisJoinWindowBefore == otherJoinWindowAfter}} and {{thisJoinWindowAfter == otherJoinWindowBefore}}, in most cases, where {{before == after}}, we would just emit all records twice.
 * for self-joins there’s usually a filter along with the join (as in SQL {{CREATE stream2 as SELECT .. FROM stream1 A, stream1 B WHERE A.field <> B.field}}, which would be translated as a {{stream.join().filter()}} topology) that filters on some value condition, because otherwise users are just squaring the stream events.
 * even when the join conditions are not translated into a filter right after the join, users may not want both joined records {{joined(value1, value2)}} and {{joined(value2, value1)}} in their final streams. They may want to de-duplicate and keep just one, and hence they’d have to add such a de-duping operator after the join anyway.

Reasons against this optimization:
 * Users who upgrade and enable this optimization will see a different ordering in their results.
 * The semantics of the operator are currently predetermined and easy to reason about. If we make the change, the semantics would depend on the input, which makes the operator harder to test and debug.

> Improve CPU usage of self-joins by sacrificing order
> ----------------------------------------------------
>
>                 Key: KAFKA-14251
>                 URL: https://issues.apache.org/jira/browse/KAFKA-14251
>             Project: Kafka
>          Issue Type: Improvement
>            Reporter: Vicky Papavasileiou
>            Priority: Minor
>
> The current self-join operator implementation ensures that records in the output follow the same order as if the join was implemented using an inner-join. To achieve this, the self-join operator needs to use two loops, each doing a window store fetch, to simulate the left-hand side of the join probing the join and the right-hand side probing the join. 
> As an optimization, if we don't care about the ordering of the join results, we can avoid doing two loops and instead do one, where the window store fetch will use the union of the left-side and right-side windows.
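
The difference between the two approaches described in the issue can be illustrated with a small in-memory simulation. This is only a sketch under stated assumptions, not the Kafka Streams implementation: the class {{SelfJoinSketch}}, the {{Rec}} type, the list standing in for one key's window store, and the string "joiner" are all hypothetical names introduced for illustration. It assumes the left side probes the window {{[ts - before, ts + after]}} and the right side probes {{[ts - after, ts + before]}}.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory model of a windowed self-join for a single key.
class SelfJoinSketch {
    static final class Rec {
        final long ts;
        final String value;
        Rec(long ts, String value) { this.ts = ts; this.value = value; }
    }

    // True if otherTs lies in [ts - before, ts + after].
    static boolean inWindow(long otherTs, long ts, long before, long after) {
        return otherTs >= ts - before && otherTs <= ts + after;
    }

    // Two passes over the store per record: first the record acts as the
    // left-hand side, then as the right-hand side (inner-join-like order).
    static List<String> twoLoops(List<Rec> input, long before, long after) {
        List<Rec> store = new ArrayList<>();
        List<String> out = new ArrayList<>();
        for (Rec r : input) {
            store.add(r);
            for (Rec o : store) { // left side probing
                if (inWindow(o.ts, r.ts, before, after)) out.add(r.value + "+" + o.value);
            }
            for (Rec o : store) { // right side probing; self pair already emitted
                if (o != r && inWindow(o.ts, r.ts, after, before)) out.add(o.value + "+" + r.value);
            }
        }
        return out;
    }

    // One pass over the union of both windows; emits the same pairs,
    // but interleaved, so the output order differs.
    static List<String> oneLoop(List<Rec> input, long before, long after) {
        List<Rec> store = new ArrayList<>();
        List<String> out = new ArrayList<>();
        long span = Math.max(before, after); // union of the two windows
        for (Rec r : input) {
            store.add(r);
            for (Rec o : store) {
                if (!inWindow(o.ts, r.ts, span, span)) continue;
                if (inWindow(o.ts, r.ts, before, after)) out.add(r.value + "+" + o.value);
                if (o != r && inWindow(o.ts, r.ts, after, before)) out.add(o.value + "+" + r.value);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> input = Arrays.asList(new Rec(0, "A"), new Rec(5, "B"), new Rec(10, "C"));
        System.out.println(twoLoops(input, 10, 10));
        System.out.println(oneLoop(input, 10, 10));
    }
}
```

In this model both methods emit the same set of joined pairs; only the order of emission changes, which is the trade-off discussed in the comment above.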


