Posted to issues@ignite.apache.org by "Pavel Kovalenko (JIRA)" <ji...@apache.org> on 2018/04/25 18:15:00 UTC
[jira] [Created] (IGNITE-8391) Removing some WAL history segments leads to WAL rebalance hanging

Pavel Kovalenko created IGNITE-8391:
---------------------------------------

             Summary: Removing some WAL history segments leads to WAL rebalance hanging
                 Key: IGNITE-8391
                 URL: https://issues.apache.org/jira/browse/IGNITE-8391
             Project: Ignite
          Issue Type: Bug
          Components: cache
    Affects Versions: 2.4
            Reporter: Pavel Kovalenko
             Fix For: 2.6


Problem:
1) Start 2 nodes and load some data into a cache.
2) Stop node 2, then load more data into the cache.
3) Remove an archived WAL segment that does not contain the checkpoint record needed to find the start point for WAL rebalance, but does contain data necessary for rebalancing.
4) Start node 2. This node will start rebalancing data from node 1 using WAL.

Rebalance hangs with the following assertion:

{noformat}
java.lang.AssertionError: Partitions after rebalance should be either done or missing: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31]
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionSupplier.handleDemandMessage(GridDhtPartitionSupplier.java:417)
	at org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPreloader.handleDemandMessage(GridDhtPreloader.java:364)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:379)
	at org.apache.ignite.internal.processors.cache.GridCachePartitionExchangeManager$5.apply(GridCachePartitionExchangeManager.java:364)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1054)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:579)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$700(GridCacheIoManager.java:99)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$OrderedMessageListener.onMessage(GridCacheIoManager.java:1603)
	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1556)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4100(GridIoManager.java:125)
	at org.apache.ignite.internal.managers.communication.GridIoManager$GridCommunicationMessageSet.unwind(GridIoManager.java:2752)
	at org.apache.ignite.internal.managers.communication.GridIoManager.unwindMessageSet(GridIoManager.java:1516)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4400(GridIoManager.java:125)
	at org.apache.ignite.internal.managers.communication.GridIoManager$10.run(GridIoManager.java:1485)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
{noformat}
 
This happens because the necessary data and updateCounters were contained in the removed WAL segment, so they are never reached.

To resolve such problems, we should introduce a fallback strategy for the case when WAL rebalance fails. One example of such a strategy is to re-run full rebalance for the partitions that could not be properly rebalanced using WAL.
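
The fallback described above could be sketched roughly as follows. This is only an illustration and not Ignite's actual code: the class WalRebalanceFallback, the method partitionsNeedingFullRebalance, and both maps are hypothetical names. It assumes that for each partition we know the update counter WAL rebalance was expected to reach, and the counter it actually reached before the WAL iterator ran out of records.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper, not part of Ignite: chooses partitions for fallback full rebalance. */
final class WalRebalanceFallback {
    private WalRebalanceFallback() {
        // Static helper only.
    }

    /**
     * @param targetCounters Partition id -> update counter that WAL rebalance had to reach (assumed input).
     * @param reachedCounters Partition id -> update counter actually reached while iterating WAL (assumed input).
     * @return Sorted ids of partitions that were not fully rebalanced using WAL.
     */
    static Set<Integer> partitionsNeedingFullRebalance(
        Map<Integer, Long> targetCounters,
        Map<Integer, Long> reachedCounters
    ) {
        Set<Integer> incomplete = new TreeSet<>();

        for (Map.Entry<Integer, Long> e : targetCounters.entrySet()) {
            Long reached = reachedCounters.get(e.getKey());

            // No entry at all, or a counter below the target, means WAL history
            // ended early for this partition (for example, a segment was removed).
            if (reached == null || reached < e.getValue())
                incomplete.add(e.getKey());
        }

        return incomplete;
    }
}
```

With such a check, the supplier could, instead of failing the assertion, report the returned partitions as missing so that the demander re-requests them with full rebalance. How this would be wired into GridDhtPartitionSupplier is left open here.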



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)