Posted to issues@ignite.apache.org by "Aleksey Plekhanov (Jira)" <ji...@apache.org> on 2023/04/24 15:12:00 UTC
[jira] [Updated] (IGNITE-17793) Historical rebalance must use HWM instead of LWM to seek the proper checkpoint to avoid the data loss
[ https://issues.apache.org/jira/browse/IGNITE-17793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Aleksey Plekhanov updated IGNITE-17793:
---------------------------------------
Release Note: Fixed potential data loss on historical rebalance
> Historical rebalance must use HWM instead of LWM to seek the proper checkpoint to avoid the data loss
> -----------------------------------------------------------------------------------------------------
>
> Key: IGNITE-17793
> URL: https://issues.apache.org/jira/browse/IGNITE-17793
> Project: Ignite
> Issue Type: Sub-task
> Reporter: Anton Vinogradov
> Assignee: Vladimir Steshin
> Priority: Major
> Labels: iep-31, ise
> Fix For: 2.15
>
> Attachments: HistoricalRebalanceCheckpointTest.java
>
>
> Currently, historical rebalance at {{CheckpointHistory#searchEarliestWalPointer}} seeks the newest checkpoint whose counter is less than that of the lowest entry that has to be rebalanced.
> Unfortunately, we may have more than one checkpoint with the same counter, and in that case it's impossible to use the newest one as a rebalance start point.
> For example, we have a partition with LWM=100, some gaps and HWM=200.
> A checkpoint will have the counter == 100.
> Then we may close some gaps, excluding 101 (to keep LWM == 100).
> And again, the checkpoint will have counter == 100.
> The newest checkpoint (marked with counter 100) will not contain all committed entries with counter > 100.
> Then let's close the rest of the gaps to make historical rebalance possible.
> And after the rebalance finishes, we'll see a warning "Some partition entries were missed during historical rebalance" and an inconsistent cluster state.
> See reproducer at [^HistoricalRebalanceCheckpointTest.java]
> A possible solution is to use HWM instead of LWM during the search.
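The scenario in the description can be modelled with a minimal sketch. This is not Ignite's actual implementation: the Checkpoint class, the searchByLwm/searchByHwm helpers and the earlier checkpoint "cp0" are hypothetical names and values introduced only for illustration, and the comparison "counter < first counter to rebalance" is an assumption based on the wording of the description. Only the LWM=100 / HWM=200 checkpoints come from the issue text.

```java
import java.util.ArrayList;
import java.util.List;

public class CheckpointSearchSketch {
    /** Hypothetical model: partition counters captured at the moment a checkpoint was taken. */
    static final class Checkpoint {
        final String name;
        final long lwm; // low watermark: every update up to this counter is applied
        final long hwm; // high watermark: highest counter applied so far (gaps may lie below it)

        Checkpoint(String name, long lwm, long hwm) {
            this.name = name;
            this.lwm = lwm;
            this.hwm = hwm;
        }
    }

    /** Newest checkpoint whose LWM is below the first counter to rebalance (the behaviour described as current). */
    static Checkpoint searchByLwm(List<Checkpoint> history, long firstCntr) {
        Checkpoint res = null;

        for (Checkpoint cp : history) { // history is ordered oldest to newest
            if (cp.lwm < firstCntr)
                res = cp;
        }

        return res;
    }

    /** Newest checkpoint whose HWM is below the first counter to rebalance (the proposed behaviour). */
    static Checkpoint searchByHwm(List<Checkpoint> history, long firstCntr) {
        Checkpoint res = null;

        for (Checkpoint cp : history) {
            if (cp.hwm < firstCntr)
                res = cp;
        }

        return res;
    }

    /** Checkpoint history from the description, plus one assumed earlier checkpoint without gaps. */
    static List<Checkpoint> scenario() {
        List<Checkpoint> history = new ArrayList<>();

        history.add(new Checkpoint("cp0", 50, 50));   // assumption: earlier checkpoint, no gaps
        history.add(new Checkpoint("cp1", 100, 200)); // LWM=100, some gaps, HWM=200
        history.add(new Checkpoint("cp2", 100, 200)); // some gaps closed, 101 still missing

        return history;
    }

    public static void main(String[] args) {
        List<Checkpoint> history = scenario();
        long firstCntr = 101; // lowest entry that has to be rebalanced

        System.out.println("LWM search: " + searchByLwm(history, firstCntr).name); // prints "LWM search: cp2"
        System.out.println("HWM search: " + searchByHwm(history, firstCntr).name); // prints "HWM search: cp0"
    }
}
```

In this model the LWM-based search picks cp2, although entries with counters above 100 were already committed before cp1 and between cp1 and cp2, so replaying from cp2 would skip them. The HWM-based search falls back to cp0, after which every entry with a counter above 100 is still ahead of the start point.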
--
This message was sent by Atlassian Jira
(v8.20.10#820010)