Posted to issues@ignite.apache.org by "Maxim Muzafarov (Jira)" <ji...@apache.org> on 2019/10/08 11:13:00 UTC

[jira] [Commented] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

    [ https://issues.apache.org/jira/browse/IGNITE-8391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16946721#comment-16946721 ] 

Maxim Muzafarov commented on IGNITE-8391:
-----------------------------------------

Moved to 2.9 due to inactivity. Please feel free to move it back if you are able to complete the ticket by the AI 2.8 code freeze date, December 2, 2019.

> Removing some WAL history segments leads to WAL rebalance hanging
> -----------------------------------------------------------------
>
>                 Key: IGNITE-8391
>                 URL: https://issues.apache.org/jira/browse/IGNITE-8391
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.4
>            Reporter: Pavel Kovalenko
>            Assignee: Vladislav Pyatkov
>            Priority: Major
>             Fix For: 2.8
>
>
> Problem:
> 1) Start 2 nodes and load some data into them.
> 2) Stop node 2, then load more data into the cache.
> 3) Remove an archived WAL segment that doesn't contain the checkpoint record needed to find the start point for WAL rebalance, but does contain data necessary for rebalancing.
> 4) Start node 2. This node will start rebalancing data from node 1 using the WAL (a reproduction sketch follows the stack trace below).
> Rebalance hangs with the following assertion:
> {noformat}
> java.lang.AssertionError: Partitions after rebalance should be either done or missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
> 	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
> 	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
> 	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
> 	at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> {noformat}
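>  
> For reference, a minimal reproduction sketch of the steps above, assuming a two-node setup with persistence enabled. The class and wiring are illustrative assumptions, not the actual test for this ticket; in practice the IGNITE_PDS_WAL_REBALANCE_THRESHOLD system property may also need lowering so that historical (WAL) rebalance is chosen over full rebalance in the first place.
> {noformat}
> import org.apache.ignite.Ignite;
> import org.apache.ignite.IgniteCache;
> import org.apache.ignite.Ignition;
> import org.apache.ignite.configuration.DataRegionConfiguration;
> import org.apache.ignite.configuration.DataStorageConfiguration;
> import org.apache.ignite.configuration.IgniteConfiguration;
> 
> public class WalRebalanceHangRepro {
>     public static void main(String[] args) throws Exception {
>         // 1) Start 2 nodes with persistence and load some data.
>         Ignite node1 = Ignition.start(cfg("node1"));
>         Ignite node2 = Ignition.start(cfg("node2"));
>         node1.cluster().active(true);
> 
>         IgniteCache<Integer, Integer> cache = node1.getOrCreateCache("cache");
>         for (int i = 0; i < 10_000; i++)
>             cache.put(i, i);
> 
>         // 2) Stop node 2, then load more data so that node 2 falls behind.
>         node2.close();
>         for (int i = 10_000; i < 20_000; i++)
>             cache.put(i, i);
> 
>         // 3) Outside this process, delete an archived segment from node 1's
>         // WAL archive directory -- one holding rebalance data but not the
>         // checkpoint record that marks the rebalance start point.
> 
>         // 4) Restart node 2: it demands historical (WAL) rebalance from
>         // node 1, which hangs with the AssertionError shown above.
>         Ignition.start(cfg("node2"));
>     }
> 
>     // Persistence must be enabled for WAL history to exist at all.
>     private static IgniteConfiguration cfg(String name) {
>         return new IgniteConfiguration()
>             .setIgniteInstanceName(name)
>             .setDataStorageConfiguration(new DataStorageConfiguration()
>                 .setDefaultDataRegionConfiguration(
>                     new DataRegionConfiguration().setPersistenceEnabled(true)));
>     }
> }
> {noformat}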
>  
> This happens because we never reach the necessary data and partition update counters contained in the removed WAL segment.
> To resolve such problems we should introduce a fallback strategy for when rebalance by WAL fails. An example of such a strategy: re-run a full rebalance for the partitions that could not be properly rebalanced using the WAL, as sketched below.
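>  
> A minimal sketch of what such a fallback could look like; all names here are hypothetical illustrations, not Ignite's actual supplier/demander code:
> {noformat}
> import java.util.HashSet;
> import java.util.Set;
> import java.util.function.IntPredicate;
> 
> // Hedged sketch of the proposed fallback: after the historical (WAL)
> // supply phase, collect the partitions that ended up neither done nor
> // missing -- the state the assertion above complains about -- and hand
> // them back for a full rebalance instead of hanging.
> public class WalRebalanceFallbackSketch {
>     static Set<Integer> partitionsForFullRebalance(
>         Set<Integer> demanded, IntPredicate rebalancedFromWal) {
>         Set<Integer> failed = new HashSet<>();
> 
>         for (int p : demanded) {
>             if (!rebalancedFromWal.test(p))
>                 failed.add(p); // WAL supply never reached this partition
>         }
> 
>         return failed; // caller re-demands these via full rebalance
>     }
> }
> {noformat}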



--
This message was sent by Atlassian Jira
(v8.3.4#803005)