Posted to jira@kafka.apache.org by "Sophie Blee-Goldman (Jira)" <ji...@apache.org> on 2020/04/27 19:01:00 UTC

[jira] [Comment Edited] (KAFKA-9923) Join window store duplicates can be compacted in changelog

    [ https://issues.apache.org/jira/browse/KAFKA-9923?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17093842#comment-17093842 ] 

Sophie Blee-Goldman edited comment on KAFKA-9923 at 4/27/20, 7:00 PM:
----------------------------------------------------------------------

The root cause is the same but the resulting problems are different: caching with duplicates doesn't really seem to make sense (should we even allow that combination?), whereas changelogging with duplicates definitely does. But changelogging appears to be broken in a way that has correctness implications, as we may lose records during compaction.


was (Author: ableegoldman):
The root cause is the same but the implications are different: caching with duplicates doesn't seem to really make sense, however changelogging + duplicates definitely does but this seems to have correctness implications as we may be losing records during compaction

> Join window store duplicates can be compacted in changelog 
> -----------------------------------------------------------
>
>                 Key: KAFKA-9923
>                 URL: https://issues.apache.org/jira/browse/KAFKA-9923
>             Project: Kafka
>          Issue Type: Bug
>          Components: streams
>            Reporter: Sophie Blee-Goldman
>            Priority: Critical
>
> Stream-stream joins use the regular `WindowStore` implementation but with `retainDuplicates` set to true. To allow for duplicates while reusing the same unique-key underlying stores, we wrap each key with an incrementing sequence number before inserting it.
> This wrapping occurs at the innermost layer of the store hierarchy, so duplicates must first pass through the changelogging layer while their keys are still identical. As a result, we end up sending the records to the changelog without distinct keys and may therefore lose the older of the duplicates during compaction.
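To make the mechanism described above concrete, here is a minimal sketch (the class and method names are hypothetical, not the actual Kafka Streams internals) of how a duplicate-retaining store can disambiguate identical keys by appending an incrementing sequence number to each serialized key. Because this wrapping happens only in the innermost layer, a changelogging layer sitting above it would still observe the raw, identical keys:

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of sequence-number key wrapping for duplicate retention.
public class DuplicateKeySketch {
    private final AtomicInteger seqnum = new AtomicInteger(0);

    // Append a 4-byte incrementing sequence number to the serialized key,
    // so two inserts of the same logical key yield distinct store keys.
    public byte[] wrapForDuplicates(byte[] serializedKey) {
        return ByteBuffer.allocate(serializedKey.length + Integer.BYTES)
                .put(serializedKey)
                .putInt(seqnum.incrementAndGet())
                .array();
    }

    public static void main(String[] args) {
        DuplicateKeySketch store = new DuplicateKeySketch();
        byte[] key = "A".getBytes();
        byte[] first = store.wrapForDuplicates(key);
        byte[] second = store.wrapForDuplicates(key);
        // The wrapped keys differ even though the logical key is identical.
        System.out.println(java.util.Arrays.equals(first, second)); // prints false
    }
}
```

The bug described in this ticket follows directly: since the changelog record is produced before this wrapping is applied, a compacted changelog topic sees repeated identical keys and may retain only the latest duplicate.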



--
This message was sent by Atlassian Jira
(v8.3.4#803005)