Posted to issues@ignite.apache.org by "Alexey Scherbakov (Jira)" <ji...@apache.org> on 2020/07/17 10:21:00 UTC
[jira] [Comment Edited] (IGNITE-13171) Proper handling of a rebalancing with disabled WAL
[ https://issues.apache.org/jira/browse/IGNITE-13171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17159270#comment-17159270 ]
Alexey Scherbakov edited comment on IGNITE-13171 at 7/17/20, 10:20 AM:
-----------------------------------------------------------------------
It turns out the change has become more complicated than I expected.
The list of changes:
# Fixed a race between delayed partition owning (caused by group durability being disabled during rebalancing) and a new topology event, which caused partitions to be owned while still being cleared (a source of partition inconsistency).
# Improved cache version calculation: the calculation has been optimized and the order is now correctly synced between primary and backups.
# Fixed rebalance future compatibility issues for delayed partition owning. The rebalance future is now not completed until the checkpoint is done, and it is correctly checked for compatibility with newer topology versions.
# Removed start and end versions for transactions.
# A checkpoint is triggered for each group that has WAL disabled during rebalancing, so durability for a group is enabled as soon as the group is rebalanced.
# The cache group sync future is completed only if rebalancing finished without cancellation, meaning either the data was loaded or an unrecoverable error occurred.
# Fixed the partition state transition from RENTING to OWNING/LOST when the last supplier has left.
# Removed synchronous partition destroying to avoid deadlocks. All partitions are now destroyed by the partition eviction manager.
# Removed the delay equal to DFLT_PRELOAD_RESEND_TIMEOUT before the owning state is acked for a group after rebalancing. As a side effect, ideal distribution is achieved faster in unit tests.
# Fixed multiple flaky tests running on unstable topology, whose behavior changed after introducing items 8 and 9.
# Fixed an attempt to rebalance an OWNING partition when rebalancing is restarted after a cancellation.
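
The checkpoint-gated owning described in the list above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Ignite code: CacheGroup, RebalanceFuture and all method names are hypothetical, Python is used purely for brevity, and "compatible" is assumed to mean that the node's partition assignment did not change between topology versions.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    MOVING = "MOVING"
    OWNING = "OWNING"


@dataclass
class RebalanceFuture:
    """Hypothetical stand-in for a rebalance future; not Ignite's class."""
    top_ver: int
    assignment: frozenset            # partitions this node must load
    loaded: set = field(default_factory=set)
    done: bool = False
    cancelled: bool = False

    def compatible_with(self, assignment: frozenset) -> bool:
        # Assumption: compatible means the node's assignment is unchanged.
        return self.assignment == assignment


class CacheGroup:
    def __init__(self, top_ver: int, assignment):
        self.states = {p: State.MOVING for p in assignment}
        self.wal_enabled = False     # WAL is off while everything is MOVING
        self.fut = RebalanceFuture(top_ver, frozenset(assignment))

    def on_partition_loaded(self, part: int) -> None:
        self.fut.loaded.add(part)

    def on_topology_change(self, top_ver: int, assignment) -> None:
        assignment = frozenset(assignment)
        if self.fut.compatible_with(assignment):
            self.fut.top_ver = top_ver   # keep the in-flight future
            return
        self.fut.cancelled = True        # incompatible: rebalancing restarts
        for p in assignment:
            self.states.setdefault(p, State.MOVING)
        self.fut = RebalanceFuture(top_ver, assignment)

    def on_checkpoint_finished(self) -> list:
        fut = self.fut
        if fut.done or fut.loaded != fut.assignment:
            return []                    # nothing to own yet
        owned = sorted(fut.assignment)
        for p in owned:
            self.states[p] = State.OWNING
        self.wal_enabled = True          # durability is back for the group
        fut.done = True                  # completed only after the checkpoint
        return owned


group = CacheGroup(top_ver=1, assignment={0, 1})
group.on_partition_loaded(0)
group.on_partition_loaded(1)
group.on_topology_change(top_ver=2, assignment={0, 1})  # compatible change
print(group.on_checkpoint_finished())  # [0, 1]
```

In this model a compatible topology change does not discard the pending owning, while an incompatible one cancels the future, so the next checkpoint owns nothing and WAL stays disabled until the restarted rebalancing has loaded its partitions.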
> Proper handling of a rebalancing with disabled WAL
> --------------------------------------------------
>
> Key: IGNITE-13171
> URL: https://issues.apache.org/jira/browse/IGNITE-13171
> Project: Ignite
> Issue Type: Bug
> Affects Versions: 2.8
> Reporter: Alexey Scherbakov
> Assignee: Alexey Scherbakov
> Priority: Major
> Fix For: 2.10
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> The current implementation of optimized rebalancing with disabled WAL, used when all local partitions are MOVING and persistence is enabled, has multiple flaws:
> # There are races between a concurrent topology change and partition owning after a checkpoint, causing consistency issues.
> # Partitions will not be owned if the topology has changed and the new topology version is compatible with the previous one.
> This is the reason for the flaky tests [1] [2].
> [1] https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=1140020093875959306&branch=%3Cdefault%3E&tab=testDetails
> [2] https://ci.ignite.apache.org/project.html?projectId=IgniteTests24Java8&testNameId=7421637930905964922&branch=%3Cdefault%3E&tab=testDetails
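
The first flaw in the description, owning racing with a topology change, can be reproduced deterministically in a toy model. This is a hedged sketch, not Ignite code: Partition, own_after_checkpoint and the clearing flag are hypothetical names, and the interleaving is replayed sequentially instead of with real threads.

```python
from enum import Enum


class State(Enum):
    MOVING = "MOVING"
    OWNING = "OWNING"


class Partition:
    def __init__(self, part_id: int):
        self.id = part_id
        self.state = State.MOVING
        self.clearing = False   # set when a topology event schedules clearing


def own_after_checkpoint(partitions, guarded: bool = True) -> list:
    """Apply delayed owning; return ids of partitions that became OWNING."""
    owned = []
    for part in partitions:
        if guarded and part.clearing:
            continue            # still being cleared: leave it MOVING
        part.state = State.OWNING
        owned.append(part.id)
    return owned


def inconsistent(partitions) -> list:
    """Ids of partitions that are OWNING while still being cleared."""
    return [p.id for p in partitions if p.state is State.OWNING and p.clearing]


def run(guarded: bool) -> list:
    parts = [Partition(0), Partition(1)]
    parts[1].clearing = True    # topology event arrives before the checkpoint
    own_after_checkpoint(parts, guarded)
    return inconsistent(parts)


print(run(guarded=False))  # [1]
print(run(guarded=True))   # []
```

Without the guard, partition 1 ends up OWNING while its data is being cleared, which is the kind of inconsistency the issue describes. With the guard it stays MOVING and has to be rebalanced again.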