Posted to issues@ignite.apache.org by "Anton Vinogradov (Jira)" <ji...@apache.org> on 2022/09/30 17:47:00 UTC

[jira] [Updated] (IGNITE-17793) Historical rebalance must use HWM instead of LWM to seek the proper checkpoint

     [ https://issues.apache.org/jira/browse/IGNITE-17793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Vinogradov updated IGNITE-17793:
--------------------------------------
    Labels: iep-31 ise  (was: )

> Historical rebalance must use HWM instead of LWM to seek the proper checkpoint
> ------------------------------------------------------------------------------
>
>                 Key: IGNITE-17793
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17793
>             Project: Ignite
>          Issue Type: Sub-task
>            Reporter: Anton Vinogradov
>            Priority: Major
>              Labels: iep-31, ise
>
> Currently, historical rebalance at {{CheckpointHistory#searchEarliestWalPointer}} seeks the newest checkpoint whose counter is less than the lowest entry that has to be rebalanced.
> Unfortunately:
> 1) We may have more than one checkpoint with the same counter, and it's impossible to use the newest one as a rebalance start point.
> For example, we have a partition with LWM=100, some gaps and HWM=200.
> The checkpoint will have counter == 100.
> Then we may close some gaps, excluding 101 (to keep LWM == 100).
> And again, the checkpoint will have counter == 100.
> The newest checkpoint marked with counter 100 will not contain all committed entries with counter > 100.
> And after the rebalance finishes, we'll see a warning "Some partition entries were missed during historical rebalance" and an inconsistent cluster state.
> 2) After a cluster restart, we may face a situation where we have checkpoints before some counter, but none of them can be used for rebalancing.
> For example, we again have a partition with LWM=100, some gaps and HWM=200.
> We restart the cluster, and the first checkpoint is marked at counter == 100.
> But this single checkpoint does not contain some committed entries with counter > 100.
> A possible solution is to use HWM instead of LWM during the search.
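
The first scenario in the description can be reproduced with a small toy model. The sketch below is not Ignite code: the class RebalanceSearchSketch, its Checkpoint type and all method names are hypothetical, the WAL is reduced to a list of applied update counters, and the search condition (checkpoint counter <= the demander's counter) is an assumption made for illustration. It only shows why a search keyed on LWM can pick a checkpoint that is too new, while a search keyed on HWM picks one from which a replay covers every committed entry.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Toy model only; none of these names exist in Apache Ignite.
class RebalanceSearchSketch {
    /** Hypothetical checkpoint record: WAL position plus both counters at checkpoint time. */
    static final class Checkpoint {
        final int walPos; final long lwm; final long hwm;
        Checkpoint(int walPos, long lwm, long hwm) { this.walPos = walPos; this.lwm = lwm; this.hwm = hwm; }
    }

    final List<Long> wal = new ArrayList<>();            // applied update counters, in WAL order
    final List<Checkpoint> history = new ArrayList<>();  // checkpoints, oldest first
    final TreeSet<Long> applied = new TreeSet<>();       // all counters applied so far

    void apply(long from, long to) {
        for (long c = from; c <= to; c++) { wal.add(c); applied.add(c); }
    }

    /** LWM: the highest counter that has no gaps below it. */
    long lwm() {
        long c = 0;
        while (applied.contains(c + 1)) c++;
        return c;
    }

    /** HWM: the highest counter applied so far. */
    long hwm() { return applied.isEmpty() ? 0 : applied.last(); }

    void checkpoint() { history.add(new Checkpoint(wal.size(), lwm(), hwm())); }

    /** Newest checkpoint whose chosen counter does not exceed the demander's counter. */
    Checkpoint search(long demanderCntr, boolean useHwm) {
        Checkpoint found = null;
        for (Checkpoint cp : history) {
            long cntr = useHwm ? cp.hwm : cp.lwm;
            if (cntr <= demanderCntr) found = cp;
        }
        return found;
    }

    /** Committed counters above demanderCntr that a WAL replay starting at cp would not deliver. */
    TreeSet<Long> missed(Checkpoint cp, long demanderCntr) {
        TreeSet<Long> missing = new TreeSet<>(applied.tailSet(demanderCntr, false));
        missing.removeAll(wal.subList(cp.walPos, wal.size()));
        return missing;
    }

    /** Scenario 1 from the description: LWM stays at 100 while gaps above it are closed. */
    static RebalanceSearchSketch scenario() {
        RebalanceSearchSketch s = new RebalanceSearchSketch();
        s.apply(1, 100);   s.checkpoint(); // LWM=100, HWM=100
        s.apply(150, 200); s.checkpoint(); // LWM=100, HWM=200, gap 101..149
        s.apply(102, 149); s.checkpoint(); // LWM=100, HWM=200, only 101 is still missing
        return s;
    }

    public static void main(String[] args) {
        RebalanceSearchSketch s = scenario();
        System.out.println("missed via LWM search: " + s.missed(s.search(100, false), 100).size());
        System.out.println("missed via HWM search: " + s.missed(s.search(100, true), 100).size());
    }
}
```

In this model all three checkpoints carry LWM == 100, so the LWM-based search returns the newest one and the replay misses every entry in 102..200. The HWM-based search returns the first checkpoint, taken before any entry above 100 was applied, and nothing is missed.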



--
This message was sent by Atlassian Jira
(v8.20.10#820010)