Posted to issues@ignite.apache.org by "Alexey Scherbakov (Jira)" <ji...@apache.org> on 2020/07/17 10:21:00 UTC
[jira] [Comment Edited] (IGNITE-13171) Proper handling of a rebalancing with disabled WAL

    [ https://issues.apache.org/jira/browse/IGNITE-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159270#comment-17159270 ] 

Alexey Scherbakov edited comment on IGNITE-13171 at 7/17/20, 10:20 AM:
-----------------------------------------------------------------------

It turns out the change has become more complicated than I expected.

The list of changes:

# Fixed a race between delayed partition owning (caused by group durability being disabled during rebalancing) and a new topology event, which caused partitions to be owned while they were being cleared (a source of partition inconsistency).
# Improved cache version calculation. The calculation has been optimized, and the order is now correctly synced between primary and backups.
# Fixed rebalance future compatibility issues for delayed partition owning. The rebalance future is now not completed until the checkpoint is done, and it is correctly checked for compatibility with newer topology versions.
# Removed start and end versions for transactions.
# A checkpoint is triggered for each group that has WAL disabled during rebalancing, so durability for a group is enabled as soon as the group is rebalanced.
# The cache group sync future is completed only if rebalancing finished without cancellation, meaning that either the data was loaded or an unrecoverable error occurred.
# Fixed the partition state transition from RENTING to OWNING/LOST when the last supplier has left.
# Removed synchronous partition destroying to avoid deadlocks. All partitions are now destroyed by the partition eviction manager.
# Removed the delay equal to DFLT_PRELOAD_RESEND_TIMEOUT before the owning state is acked for a group after rebalancing. As a side effect, ideal distribution is achieved faster in unit tests.
# Fixed multiple flaky tests running on unstable topology, whose behavior changed after introducing items 8 and 9.
# Fixed an attempt to rebalance an OWNING partition when rebalancing is restarted after a cancellation.
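
To illustrate the delayed-owning behavior the list describes (the rebalance future completes only after the checkpoint, and owning is re-checked against newer topology versions), here is a minimal sketch in plain Java. It is not Ignite code: the class DelayedOwningSketch, the Group type, its methods, and the "assignment changed" flag used to decide compatibility are all hypothetical names and simplifications chosen for this example.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical model of delayed partition owning; not taken from the Ignite code base. */
public class DelayedOwningSketch {
    /** Simplified partition states (the real state machine has more of them). */
    enum State { MOVING, OWNING }

    /** A cache group reduced to a single partition; all names are illustrative. */
    static final class Group {
        private State state = State.MOVING;

        /** Latest topology version at which the partition assignment changed (assumed compatibility rule). */
        private long lastAssignmentChangeVer;

        Group(long topVer) {
            lastAssignmentChangeVer = topVer;
        }

        synchronized State state() {
            return state;
        }

        /** Records a new topology event; only events that change the assignment break compatibility. */
        synchronized void onTopologyChange(long newTopVer, boolean assignmentChanged) {
            if (assignmentChanged)
                lastAssignmentChangeVer = newTopVer;
        }

        /**
         * Called when data for a rebalance started on rebalanceTopVer has been loaded with WAL disabled.
         * The returned future stays incomplete until the checkpoint finishes; its value tells
         * whether the partition was owned.
         */
        CompletableFuture<Boolean> onRebalanceLoaded(long rebalanceTopVer, CompletableFuture<Void> checkpoint) {
            return checkpoint.thenApply(ignored -> tryOwn(rebalanceTopVer));
        }

        /** Owns the partition only if no incompatible topology change happened since the rebalance started. */
        private synchronized boolean tryOwn(long rebalanceTopVer) {
            boolean compatible = lastAssignmentChangeVer <= rebalanceTopVer;

            if (!compatible || state != State.MOVING)
                return false; // A new rebalance will handle the partition; do not own stale data.

            state = State.OWNING;

            return true;
        }
    }
}
```

In this sketch a topology change that keeps the assignment intact still lets the partition become OWNING once the checkpoint completes, while a change that alters the assignment leaves it MOVING for the next rebalance. The compatibility check and the state change happen under one lock, which is the property that prevents owning a partition that is concurrently being reset.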


was (Author: ascherbakov):
It turns out the change has become more complicated than I expected.

The list of changes:

# Fixed a race between delayed partition owning (caused by group durability being disabled during rebalancing) and a new topology event, which caused partitions to be owned while they were being cleared (a source of partition inconsistency).
# Improved cache version calculation. The calculation has been optimized, and the order is now correctly synced between primary and backups.
# Fixed rebalance future compatibility issues for delayed partition owning. The rebalance future is now not completed until the checkpoint is done, and it is correctly checked for compatibility with newer topology versions.
# Removed start and end versions for transactions.
# A checkpoint is triggered for each group that has WAL disabled during rebalancing, so durability for a group is enabled as soon as the group is rebalanced.
# The cache group sync future is completed only if rebalancing finished without cancellation, meaning that either the data was loaded or an unrecoverable error occurred.
# Fixed the partition state transition from RENTING to OWNING/LOST when the last supplier has left.
# Removed synchronous partition destroying to avoid deadlocks. All partitions are now destroyed by the partition eviction manager.
# Removed the delay equal to DFLT_PRELOAD_RESEND_TIMEOUT before the owning state is acked for a group after rebalancing. As a side effect, ideal distribution is achieved faster in unit tests.
# Fixed multiple flaky tests running on unstable topology, whose behavior changed after introducing items 8 and 9.

> Proper handling of a rebalancing with disabled WAL
> --------------------------------------------------
>
>                 Key: IGNITE-13171
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13171
>             Project: Ignite
>          Issue Type: Bug
>    Affects Versions: 2.8
>            Reporter: Alexey Scherbakov
>            Assignee: Alexey Scherbakov
>            Priority: Major
>             Fix For: 2.10
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> The current implementation of optimized rebalancing with disabled WAL, used when all local partitions are MOVING and persistence is enabled, has multiple flaws:
>  # There are races between a concurrent topology change and partition owning after a checkpoint, causing consistency issues.
>  # Partitions will not be owned when the topology has changed and the new topology version is compatible with the previous one.
> This is the reason for the flaky tests [1] [2]
> [1] [https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=1140020093875959306&branch=%3Cdefault%3E&tab=testDetails]
> [2] https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=7421637930905964922&branch=%3Cdefault%3E&tab=testDetails



--
This message was sent by Atlassian Jira
(v8.3.4#803005)