Posted to jira@kafka.apache.org by "Lucas Brutschy (Jira)" <ji...@apache.org> on 2023/05/03 14:18:00 UTC

[jira] [Assigned] (KAFKA-12693) Consecutive rebalances with zombie instances may cause corrupted changelogs

     [ https://issues.apache.org/jira/browse/KAFKA-12693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lucas Brutschy reassigned KAFKA-12693:
--------------------------------------

    Assignee: Lucas Brutschy

> Consecutive rebalances with zombie instances may cause corrupted changelogs
> ---------------------------------------------------------------------------
>
>                 Key: KAFKA-12693
>                 URL: https://issues.apache.org/jira/browse/KAFKA-12693
>             Project: Kafka
>          Issue Type: Bug
>            Reporter: Guozhang Wang
>            Assignee: Lucas Brutschy
>            Priority: Major
>              Labels: new-streams-runtime-should-fix, streams
>
> When an instance (or a thread within an instance) of Kafka Streams has a soft failure and the group coordinator triggers a rebalance, that instance temporarily becomes a "zombie writer". That is, the instance does not know that a rebalance has already happened and that its partitions have been migrated away until it tries to commit, gets the illegal-generation error, and realizes that it is the zombie. During the period up to that commit, the zombie may still be writing to the changelogs of the migrated tasks, while the new owner has already taken over and is also writing to the same changelogs.
> When EOS is enabled, this is not a problem: when the zombie tries to commit and is notified that it has been fenced, its zombie appends are aborted. With EOS disabled, though, such shared writes are interleaved on the changelogs, and a zombie append may arrive after the new writer's append, effectively overwriting it.
> Note that such interleaved writes do not necessarily cause corrupted data: as long as the new producer keeps appending after the old zombie stops, and all the corrupted keys are overwritten again with the new values, it is fine. However, with consecutive rebalances, if the task gets migrated again right after the changelogs are corrupted by zombie writers and before the new writer can overwrite the affected keys, and therefore needs to be restored from the changelogs, the old values are restored instead of the new values, effectively causing data loss.
> Although this should be a rare event, we should still fix it as soon as possible. One idea is to have producers get a PID even under ALOS: that is, we set the transactional id in the producer config but do not call any txn APIs; zombie producers would then be fenced immediately on append, and hence there would be no interleaved appends. I think this may still require a KIP, since today one has to call initTransactions() in order to register and get the PID.
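
The failure sequence described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: it does not use the Kafka client, and the class ZombieChangelogSketch, the Changelog type, and its register/append/restore methods are hypothetical names invented for illustration. The changelog is modeled as an append-only list restored with last-write-wins semantics, and "fencing" is modeled as rejecting appends from any writer whose epoch is older than the latest registered one, which is roughly the behavior the proposal would like to get under ALOS.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ZombieChangelogSketch {

    /** Toy changelog partition (hypothetical, not a Kafka API). */
    static final class Changelog {
        private final List<String[]> log = new ArrayList<>(); // each entry is {key, value}
        private final boolean fencing;
        private int latestEpoch = 0;

        Changelog(boolean fencing) {
            this.fencing = fencing;
        }

        /** A writer taking ownership of the task registers and receives a new epoch. */
        int register() {
            return ++latestEpoch;
        }

        /** Appends a record; with fencing enabled, writers with a stale epoch are rejected. */
        boolean append(int epoch, String key, String value) {
            if (fencing && epoch < latestEpoch) {
                return false; // zombie append is fenced
            }
            log.add(new String[] {key, value});
            return true;
        }

        /** Restoration replays the log in offset order, so the last append per key wins. */
        Map<String, String> restore() {
            Map<String, String> state = new HashMap<>();
            for (String[] record : log) {
                state.put(record[0], record[1]);
            }
            return state;
        }
    }

    /** Replays the scenario from the description and returns the restored state. */
    static Map<String, String> run(boolean fencing) {
        Changelog changelog = new Changelog(fencing);

        int oldEpoch = changelog.register();      // original owner of the task
        changelog.append(oldEpoch, "k", "v1");

        int newEpoch = changelog.register();      // first rebalance: new owner takes over
        changelog.append(newEpoch, "k", "v2");

        // The old owner has not yet tried to commit, so it is still writing as a zombie.
        changelog.append(oldEpoch, "k", "v1-stale");

        // Second rebalance: the task migrates again before the new owner rewrites "k",
        // so the next owner restores its state from the changelog as it stands now.
        return changelog.restore();
    }

    public static void main(String[] args) {
        System.out.println("without fencing: " + run(false)); // prints {k=v1-stale}
        System.out.println("with fencing:    " + run(true));  // prints {k=v2}
    }
}
```

Without fencing, the restored value for "k" is the zombie's stale value rather than the new owner's; with fencing, the zombie's late append is rejected and the restored value is the new owner's.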



--
This message was sent by Atlassian Jira
(v8.20.10#820010)