Posted to issues@ignite.apache.org by "Vladislav Pyatkov (Jira)" <ji...@apache.org> on 2020/07/02 12:26:00 UTC

[jira] [Commented] (IGNITE-13193) Implement fallback to full partition rebalancing in case historical supplier failed to read all necessary data updates from WAL

    [ https://issues.apache.org/jira/browse/IGNITE-13193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150228#comment-17150228 ] 

Vladislav Pyatkov commented on IGNITE-13193:
--------------------------------------------

[~slava.koptilin] I left three comments in the PR.

Please take a look at them.

> Implement fallback to full partition rebalancing in case historical supplier failed to read all necessary data updates from WAL
> -------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13193
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13193
>             Project: Ignite
>          Issue Type: Improvement
>    Affects Versions: 2.8.1
>            Reporter: Vyacheslav Koptilin
>            Assignee: Vyacheslav Koptilin
>            Priority: Major
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Historical rebalance may fail for several reasons:
> 1) The WAL on the supplier node is corrupted. In the current implementation, the supplier triggers a failure handler.
> 2) After iterating over the WAL, the demander node has not received all the updates needed to bring a MOVING partition up to date (the resulting update counter does not converge with the expected update counter of the OWNING partition). In the current implementation, the demander silently ignores the missing updates.
> Such behavior negatively affects cluster stability: an inappropriate state of the historical WAL is not a reason to fail a supplier node.
> A more appropriate way to handle this scenario is to:
>  - either try to rebalance the partition historically from another supplier,
>  - or use full partition rebalancing for the problem partition.
> Once the supplier fails to provide data from part of the WAL, its corresponding sequence of checkpoints should be marked as inapplicable for historical rebalance in order to prevent further errors.
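
The fallback described in the quoted issue can be sketched roughly as follows. This is only an illustration of the decision logic, not code from Ignite or from the PR: the class RebalanceFallbackSketch and all of its members are hypothetical names. It also makes two simplifying assumptions: "inapplicable" history is tracked per partition and supplier rather than per sequence of checkpoints, and "converged" is taken to mean that the two update counters are equal.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical sketch of the fallback decision; none of these names are Ignite APIs. */
class RebalanceFallbackSketch {
    enum Mode { HISTORICAL, FULL }

    /** Decision for one partition: the rebalance mode and the chosen supplier (null for FULL). */
    static final class Decision {
        final Mode mode;
        final String supplier;

        Decision(Mode mode, String supplier) {
            this.mode = mode;
            this.supplier = supplier;
        }
    }

    /** (partition, supplier) pairs whose WAL history must no longer be used (assumed granularity). */
    private final Set<String> inapplicable = new HashSet<>();

    private static String key(int partId, String supplier) {
        return partId + "@" + supplier;
    }

    /** Records that the supplier failed to provide the WAL updates for the partition. */
    void markInapplicable(int partId, String supplier) {
        inapplicable.add(key(partId, supplier));
    }

    /** Assumption: the demander has converged when its counter equals the expected one. */
    static boolean converged(long resultingCntr, long expectedCntr) {
        return resultingCntr == expectedCntr;
    }

    /** Picks the first historical supplier that is still applicable, otherwise falls back to FULL. */
    Decision next(int partId, List<String> historicalSuppliers) {
        for (String supplier : historicalSuppliers) {
            if (!inapplicable.contains(key(partId, supplier)))
                return new Decision(Mode.HISTORICAL, supplier);
        }

        return new Decision(Mode.FULL, null);
    }
}
```

In this sketch a demander would call markInapplicable(...) when a supplier reports a WAL read failure or when converged(...) returns false after the WAL iteration, and then call next(...) again to either retry with another supplier or switch to full rebalancing.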



--
This message was sent by Atlassian Jira
(v8.3.4#803005)