You are viewing a plain text version of this content. The canonical link for it is here.
Posted to jira@kafka.apache.org by "Matthias J. Sax (JIRA)" <ji...@apache.org> on 2018/04/04 05:50:00 UTC

[jira] [Updated] (KAFKA-6730) Simplify state store recovery

     [ https://issues.apache.org/jira/browse/KAFKA-6730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Matthias J. Sax updated KAFKA-6730:
-----------------------------------
    Description: 
In the current code base, we restore state stores in the main thread (in contrast to older code that did restore state stored in the rebalance call back). This has multiple advantages and allows us the further simplify restore code.

In the original code base, during a long restore phase, it was possible that a instance misses a rebalance and drops out of the consumer group. To detect this case, we apply a check during the restore phase, that the end-offset of the changelog topic does not change. A changed offset implies a missed rebalance as another thread started to write into the changelog topic (ie, the current thread does not own the task/store/changelog-topic anymore).

With the new code, that restores in the main-loop, it's ensured that `poll()` is called regularly and thus, a rebalance will be detected automatically. This make the check about an changing changelog-end-offset unnecessary.

We can simplify the restore logic, to just consuming until `poll()` does not return any data. For this case, we fetch the end-offset to see if we did fully restore. If yes, we resume processing, if not, we continue the restore.

  was:
In the current code base, we restore state stores in the main thread (in contrast to older code that did restore state stored in the rebalance call back). This has multiple advantages and allows us the further simplify restore code.

In the original code base, during a long restore phase, it was possible that a instance misses a rebalance and drops out of the consumer group. To detect this case, we apply a check during the restore phase, that the end-offset of the changelog topic does not change. A changed offset implies a missed rebalance as another thread started to write into the changelog topic (ie, the current thread does not own the task/store/changelog-topic anymore).

With the new code, that restores in the main-loop, it's ensured that poll() is called regularly and thus, a rebalance will be detected automatically. This make the check about an changing changleog-end-offset unnecessary.

We can simplify the restore logic, to just consuming until `pol()` does not return any data. For this case, we fetch the end-offset to see if we did fully restore. If yes, we resume processing, if not, we continue the restore.


> Simplify state store recovery
> -----------------------------
>
>                 Key: KAFKA-6730
>                 URL: https://issues.apache.org/jira/browse/KAFKA-6730
>             Project: Kafka
>          Issue Type: Improvement
>          Components: streams
>            Reporter: Matthias J. Sax
>            Priority: Major
>             Fix For: 1.2.0
>
>
> In the current code base, we restore state stores in the main thread (in contrast to older code that did restore state stored in the rebalance call back). This has multiple advantages and allows us the further simplify restore code.
> In the original code base, during a long restore phase, it was possible that a instance misses a rebalance and drops out of the consumer group. To detect this case, we apply a check during the restore phase, that the end-offset of the changelog topic does not change. A changed offset implies a missed rebalance as another thread started to write into the changelog topic (ie, the current thread does not own the task/store/changelog-topic anymore).
> With the new code, that restores in the main-loop, it's ensured that `poll()` is called regularly and thus, a rebalance will be detected automatically. This make the check about an changing changelog-end-offset unnecessary.
> We can simplify the restore logic, to just consuming until `poll()` does not return any data. For this case, we fetch the end-offset to see if we did fully restore. If yes, we resume processing, if not, we continue the restore.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)