Posted to notifications@ignite.apache.org by GitBox <gi...@apache.org> on 2022/11/23 15:05:09 UTC

[GitHub] [ignite] Vladsz83 opened a new pull request, #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Vladsz83 opened a new pull request, #10396:
URL: https://github.com/apache/ignite/pull/10396

   Thank you for submitting a pull request to Apache Ignite.
   
   In order to streamline the review of the contribution 
   we ask you to ensure the following steps have been taken:
   
   ### The Contribution Checklist
   - [ ] There is a single JIRA ticket related to the pull request. 
   - [ ] The web-link to the pull request is attached to the JIRA ticket.
   - [ ] The JIRA ticket has the _Patch Available_ state.
   - [ ] The pull request body describes changes that have been made. 
   The description explains _WHAT_ was changed and _WHY_, rather than _HOW_.
   - [ ] The pull request title is treated as the final commit message. 
   The following pattern must be used: `IGNITE-XXXX Change summary` where `XXXX` - number of JIRA issue.
   - [ ] A reviewer has been mentioned through the JIRA comments 
   (see [the Maintainers list](https://cwiki.apache.org/confluence/display/IGNITE/How+to+Contribute#HowtoContribute-ReviewProcessandMaintainers)) 
   - [ ] The pull request has been checked by the Teamcity Bot and 
   the `green visa` attached to the JIRA ticket (see [TC.Bot: Check PR](https://mtcga.gridgain.com/prs.html))
   
   ### Notes
   - [How to Contribute](https://cwiki.apache.org/confluence/display/IGNITE/How+to+Contribute)
   - [Coding abbreviation rules](https://cwiki.apache.org/confluence/display/IGNITE/Abbreviation+Rules)
   - [Coding Guidelines](https://cwiki.apache.org/confluence/display/IGNITE/Coding+Guidelines)
   - [Apache Ignite Teamcity Bot](https://cwiki.apache.org/confluence/display/IGNITE/Apache+Ignite+Teamcity+Bot)
   
   If you need any help, please email dev@ignite.apache.org or ask for advice in the _#ignite_ channel on http://asf.slack.com.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscribe@ignite.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1041003344


##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.

Review Comment:
   Fixed





[GitHub] [ignite] anton-vinogradov merged pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
anton-vinogradov merged PR #10396:
URL: https://github.com/apache/ignite/pull/10396




[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1041004553


##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)

Review Comment:
   Fixed





[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1041012914


##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)
+            cachePutAsync.accept(++updateCnt);
+
+        blockLatch.get().await();
+
+        // Storing highest counters on backup.
+        forceCheckpoint();
+
+        String backName = backup.name();
+
+        backup.close();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages((n, m) -> m instanceof GridDhtPartitionDemandMessage ||
+            m instanceof GridDhtPartitionSupplyMessage
+        );
+
+        startGrid(backName);
+
+        awaitPartitionMapExchange();
+
+        // Primary commits transactions on node left. Ensure no rebalance occurs.
+        TestRecordingCommunicationSpi.spi(prim).waitForBlocked(1, 5_000);
+        assertFalse(TestRecordingCommunicationSpi.spi(prim).hasBlockedMessages());
+
+        IdleVerifyResultV2 checkRes = idleVerify(prim, DEFAULT_CACHE_NAME);
+        assertFalse(checkRes.hasConflicts());
+    }
+
+    /** */
+    private int prepareCluster(int nodes, int loadCnt) throws Exception {
+        assert nodes > 1;
+
+        int backupNodes = nodes - 1;
+
+        IgniteEx ignite = startGrids(nodes);
+
+        ignite.cluster().state(ClusterState.ACTIVE);
+
+        IgniteCache<Object, Object> cache = ignite.createCache(new CacheConfiguration<>()
+            .setAffinity(new RendezvousAffinityFunction(false, 1))
+            .setBackups(backupNodes)
+            .setName(DEFAULT_CACHE_NAME)
+            .setAtomicityMode(TRANSACTIONAL)
+            .setWriteSynchronizationMode(FULL_SYNC) // Allows to be sure that all messages are sent when put succeed.
+            .setReadFromBackup(true)); // Allows checking values on backups.
+
+        // Initial preloading enough to have historical rebalance.
+        for (int i = 0; i < loadCnt; i++)  //
+            cache.put(i, i);
+
+        // To have historical rebalance on cluster recovery. Decreases percent of updates in comparison to cache size.
+        stopAllGrids();
+        startGrids(nodes);
+
+        return backupNodes;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups.
+     *
+     * @param nodes Nodes number. The backups number is {@code nodes} - 1.
+     * @param putAfterGaps If {@code true}, does more puts to the cache after the simulated gaps.
+     */
+    private void doTestDelayedToBackupsRequests(int nodes, boolean putAfterGaps) throws Exception {
+        int updateCnt = 2_000;
+
+        int backupNodes = prepareCluster(nodes, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        List<Ignite> backups = backupNodes(0L, DEFAULT_CACHE_NAME);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+        AtomicBoolean finishBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if ((msg instanceof GridDhtTxPrepareRequest && prepareBlock.get()) ||
+                    (msg instanceof GridDhtTxFinishRequest && finishBlock.get())) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    assertTrue(latch.getCount() > 0);
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        try {
+            // Blocked at primary and backups.
+            prepareBlock.set(true);
+
+            blockLatch.set(new CountDownLatch(backupNodes * 20));
+
+            for (int i = 0; i < 20; i++)
+                cachePutAsync.accept(++updateCnt);
+
+            blockLatch.get().await();
+        }
+        finally {
+            prepareBlock.set(false);
+        }
+
+        if (backupNodes > 1) {
+            try {
+                // Blocked at backups only.
+                finishBlock.set(true);
+
+                blockLatch.set(new CountDownLatch(backupNodes * 30));
+
+                for (int i = 0; i < 30; i++)
+                    cachePutAsync.accept(++updateCnt);
+
+                blockLatch.get().await();
+            }
+            finally {
+                finishBlock.set(false);
+            }
+        }
+
+        if (putAfterGaps) {
+            for (int i = 0; i < 50; i++)
+                prim.cache(DEFAULT_CACHE_NAME).put(++updateCnt, updateCnt);
+        }
+
+        // Storing counters on primary.
+        forceCheckpoint();
+
+        // Emulating power off, OOM or disk overflow. Keeping data as is, with missed counters updates.
+        backups.forEach(node -> ((BlockableFileIOFactory)node.configuration().getDataStorageConfiguration()
+            .getFileIOFactory()).blocked = true);
+
+        List<String> backNames = backups.stream().map(Ignite::name).collect(Collectors.toList());
+
+        CountDownLatch rebalanceFinished = new CountDownLatch(1);

Review Comment:
   Fixed





[GitHub] [ignite] timoninmaxim commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
timoninmaxim commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1039330396


##########
modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointWorkflow.java:
##########
@@ -396,10 +396,13 @@ private void fillCacheGroupState(CheckpointRecord cpRec) throws IgniteCheckedExc
                     if (partState == LOST)
                         partState = OWNING;
 
+                    assert part.highestAppliedCounter() >= part.updateCounter() :

Review Comment:
   I'm not sure we need this assert:
   1. The PartitionCounter implementation should guarantee this by default; otherwise it's a bug.
   2. `part.highestAppliedCounter()` is a synchronized method, and you invoke it twice here.
   3. I have concerns about this expression: are there guarantees that there are no concurrent updates of the partition counter between the invocation of `highestAppliedCounter` and the invocation of `updateCounter`?
   
   WDYT?
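
   To illustrate the concern in point 3, here is a minimal sketch. It is not Ignite's implementation: the class `CounterSnapshotSketch` and its methods `apply`, `lowWaterMark`, `highWaterMark` and `snapshot` are hypothetical names invented for this example. It assumes a counter that tracks a low-water mark (all updates up to it are applied) and a high-water mark (the highest applied update, possibly with gaps below it). If the two marks are read in separate synchronized calls, a concurrent update may land between the reads, so a check like `hwm >= lwm` could compare values from different moments. Reading both under one lock avoids that.

```java
import java.util.TreeSet;

/** Hypothetical, simplified counter. NOT Ignite's partition update counter. */
class CounterSnapshotSketch {
    /** Low-water mark: every update numbered <= lwm has been applied. */
    private long lwm;

    /** Updates applied out of order, all above the low-water mark. */
    private final TreeSet<Long> outOfOrder = new TreeSet<>();

    /** Marks update {@code cntr} as applied and closes any gap it fills. */
    public synchronized void apply(long cntr) {
        if (cntr <= lwm)
            return;

        outOfOrder.add(cntr);

        // Advance the low-water mark while the next expected update is present.
        while (!outOfOrder.isEmpty() && outOfOrder.first() == lwm + 1)
            lwm = outOfOrder.pollFirst();
    }

    /** Separate read #1: may be stale by the time read #2 happens. */
    public synchronized long lowWaterMark() {
        return lwm;
    }

    /** Separate read #2: highest applied update, gaps below it allowed. */
    public synchronized long highWaterMark() {
        return outOfOrder.isEmpty() ? lwm : outOfOrder.last();
    }

    /** Reads both marks under a single lock, so the pair is consistent. */
    public synchronized long[] snapshot() {
        return new long[] {lowWaterMark(), highWaterMark()};
    }

    public static void main(String[] args) {
        CounterSnapshotSketch c = new CounterSnapshotSketch();

        // Updates 3 and 4 are delayed, so a gap remains below update 5.
        c.apply(1);
        c.apply(2);
        c.apply(5);

        long[] s = c.snapshot();
        System.out.println("lwm=" + s[0] + ", hwm=" + s[1]);

        // The delayed updates arrive and the gap closes.
        c.apply(3);
        c.apply(4);

        s = c.snapshot();
        System.out.println("lwm=" + s[0] + ", hwm=" + s[1]);
    }
}
```

   With separate calls, a thread could read the high-water mark as 5, then other threads could apply updates 3, 4 and 6, and the subsequent low-water mark read would return 6. The assertion would fail even though the counter was never in an invalid state.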





[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1041005518


##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)
+            cachePutAsync.accept(++updateCnt);
+
+        blockLatch.get().await();
+
+        // Storing highest counters on backup.
+        forceCheckpoint();
+
+        String backName = backup.name();
+
+        backup.close();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages((n, m) -> m instanceof GridDhtPartitionDemandMessage ||
+            m instanceof GridDhtPartitionSupplyMessage
+        );
+
+        startGrid(backName);
+
+        awaitPartitionMapExchange();
+
+        // Primary commits transactions on node left. Ensure no rebalance occurs.
+        TestRecordingCommunicationSpi.spi(prim).waitForBlocked(1, 5_000);
+        assertFalse(TestRecordingCommunicationSpi.spi(prim).hasBlockedMessages());
+
+        IdleVerifyResultV2 checkRes = idleVerify(prim, DEFAULT_CACHE_NAME);
+        assertFalse(checkRes.hasConflicts());
+    }
+
+    /** */
+    private int prepareCluster(int nodes, int loadCnt) throws Exception {
+        assert nodes > 1;
+
+        int backupNodes = nodes - 1;
+
+        IgniteEx ignite = startGrids(nodes);
+
+        ignite.cluster().state(ClusterState.ACTIVE);
+
+        IgniteCache<Object, Object> cache = ignite.createCache(new CacheConfiguration<>()
+            .setAffinity(new RendezvousAffinityFunction(false, 1))
+            .setBackups(backupNodes)
+            .setName(DEFAULT_CACHE_NAME)
+            .setAtomicityMode(TRANSACTIONAL)
+            .setWriteSynchronizationMode(FULL_SYNC) // Allows to be sure that all messages are sent when put succeed.
+            .setReadFromBackup(true)); // Allows checking values on backups.
+
+        // Initial preloading enough to have historical rebalance.
+        for (int i = 0; i < loadCnt; i++)  //
+            cache.put(i, i);
+
+        // To have historical rebalance on cluster recovery. Decreases percent of updates in comparison to cache size.
+        stopAllGrids();
+        startGrids(nodes);
+
+        return backupNodes;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups.
+     *
+     * @param nodes Nodes number. The backups number is {@code nodes} - 1.
+     * @param putAfterGaps If {@code true}, does more puts to the cache after the simulated gaps.
+     */
+    private void doTestDelayedToBackupsRequests(int nodes, boolean putAfterGaps) throws Exception {
+        int updateCnt = 2_000;
+
+        int backupNodes = prepareCluster(nodes, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        List<Ignite> backups = backupNodes(0L, DEFAULT_CACHE_NAME);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+        AtomicBoolean finishBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if ((msg instanceof GridDhtTxPrepareRequest && prepareBlock.get()) ||
+                    (msg instanceof GridDhtTxFinishRequest && finishBlock.get())) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    assertTrue(latch.getCount() > 0);
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        try {
+            // Blocked at primary and backups.
+            prepareBlock.set(true);
+
+            blockLatch.set(new CountDownLatch(backupNodes * 20));
+
+            for (int i = 0; i < 20; i++)
+                cachePutAsync.accept(++updateCnt);
+
+            blockLatch.get().await();
+        }
+        finally {
+            prepareBlock.set(false);
+        }
+
+        if (backupNodes > 1) {
+            try {
+                // Blocked at backups only.
+                finishBlock.set(true);
+
+                blockLatch.set(new CountDownLatch(backupNodes * 30));
+
+                for (int i = 0; i < 30; i++)
+                    cachePutAsync.accept(++updateCnt);
+
+                blockLatch.get().await();
+            }
+            finally {
+                finishBlock.set(false);
+            }
+        }
+
+        if (putAfterGaps) {
+            for (int i = 0; i < 50; i++)
+                prim.cache(DEFAULT_CACHE_NAME).put(++updateCnt, updateCnt);
+        }
+
+        // Storing counters on primary.
+        forceCheckpoint();
+
+        // Emulating power off, OOM or disk overflow. Keeping data as is, with missed counters updates.

Review Comment:
   Fixed





[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1039397824


##########
modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointWorkflow.java:
##########
@@ -396,10 +396,13 @@ private void fillCacheGroupState(CheckpointRecord cpRec) throws IgniteCheckedExc
                     if (partState == LOST)
                         partState = OWNING;
 
+                    assert part.highestAppliedCounter() >= part.updateCounter() :

Review Comment:
   There is no assert any more. But isn't that a bug in the counter? Or what if we just write the highest applied counter at checkpoint time, with respect to the data updated before?





[GitHub] [ignite] timoninmaxim commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
timoninmaxim commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1038436438


##########
modules/core/src/main/java/org/apache/ignite/internal/processors/cache/IgniteCacheOffheapManager.java:
##########
@@ -670,7 +670,12 @@ interface CacheDataStore {
         long updateCounter();
 
         /**
-         * @return Reserved counter (HWM).
+         * @return Highest applied update counter (HWM).

Review Comment:
   Let's discuss the terms again. As I understand it, `highestAppliedCounter != HWM` but `reservedCounter == HWM`. My understanding is based on the Ignite design doc about data consistency [1].
   
   In some cases we can say that a backup is aware only of `highestAppliedCounter` and can therefore treat it as HWM. But backups don't send it to other nodes, so it is just useless information. On the other hand, primary nodes know `reservedCounter` as the highest known counter value and send it to other nodes.
   
   [1] https://cwiki.apache.org/confluence/display/IGNITE/Data+consistency
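
To make the terms concrete, here is a minimal sketch under the reading given in the comment above: LWM is the counter below which there are no gaps, the highest applied counter may sit above gaps, and the reserved counter (HWM) is the highest value handed out on the primary. `ToyPartitionCounter` and all of its members are hypothetical and only loosely modeled on a gap-tracking counter; this is not Ignite's implementation.

```java
/** Hypothetical toy, not Ignite code: tracks LWM, highest applied and reserved counters. */
final class ToyPartitionCounter {
    /** Every update with a counter <= lwm has been applied (no gaps below). */
    private long lwm;

    /** End of the highest applied range; gaps may exist between lwm and this value. */
    private long highestApplied;

    /** Highest counter handed out so far (HWM in the terminology of the comment). */
    private long reserved;

    /** Applied ranges that are still separated from lwm by a gap: start -> delta. */
    private final java.util.TreeMap<Long, Long> outOfOrder = new java.util.TreeMap<>();

    /** Primary side: reserves {@code delta} counters and returns the start of the range. */
    synchronized long reserve(long delta) {
        long start = reserved;
        reserved += delta;
        return start;
    }

    /** Marks the range (start, start + delta] as applied; ranges may arrive out of order. */
    synchronized void update(long start, long delta) {
        if (start + delta <= lwm)
            return; // Already covered by LWM.

        highestApplied = Math.max(highestApplied, start + delta);
        outOfOrder.merge(start, delta, Long::max);

        // Close gaps: absorb every range that starts at or below the current LWM.
        java.util.Map.Entry<Long, Long> first;
        while ((first = outOfOrder.firstEntry()) != null && first.getKey() <= lwm) {
            lwm = Math.max(lwm, first.getKey() + first.getValue());
            outOfOrder.pollFirstEntry();
        }
    }

    synchronized long lwm() {
        return lwm;
    }

    synchronized long highestApplied() {
        return highestApplied;
    }

    synchronized long reserved() {
        return reserved;
    }

    public static void main(String[] args) {
        ToyPartitionCounter c = new ToyPartitionCounter();

        long s0 = c.reserve(1);
        long s1 = c.reserve(1);
        long s2 = c.reserve(1);
        c.reserve(1); // Reserved but never applied in this sketch.

        c.update(s0, 1);
        c.update(s2, 1); // Applied out of order: leaves a gap at s1.
        System.out.println(c.lwm() + " <= " + c.highestApplied() + " <= " + c.reserved());

        c.update(s1, 1); // Closes the gap, so LWM catches up with the highest applied counter.
        System.out.println(c.lwm() + " <= " + c.highestApplied() + " <= " + c.reserved());
    }
}
```

While the gap is open, the three values differ, which is the distinction the comment draws between the highest applied counter and the reserved counter. Once the gap closes, LWM catches up with the highest applied counter, but the reserved counter can still be ahead.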





[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1039286568


##########
modules/core/src/main/java/org/apache/ignite/internal/processors/cache/IgniteCacheOffheapManager.java:
##########
@@ -670,7 +670,12 @@ interface CacheDataStore {
         long updateCounter();
 
         /**
-         * @return Reserved counter (HWM).
+         * @return Highest applied update counter (HWM).

Review Comment:
   OK. I'll add only the method, without renaming HWM in the docs.








[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1040784143


##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.

Review Comment:
   There is 'and one-phase commit.'





[GitHub] [ignite] Vladsz83 commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
Vladsz83 commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1039338481


##########
modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointWorkflow.java:
##########
@@ -396,10 +396,13 @@ private void fillCacheGroupState(CheckpointRecord cpRec) throws IgniteCheckedExc
                     if (partState == LOST)
                         partState = OWNING;
 
+                    assert part.highestAppliedCounter() >= part.updateCounter() :

Review Comment:
   1. Removed.
   2. Yes. But we do it at checkpoint time. I don't think it is a performance issue.
   3. Same. This is checkpoint time. There should be no updates.





[GitHub] [ignite] timoninmaxim commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
timoninmaxim commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1039388562


##########
modules/core/src/main/java/org/apache/ignite/internal/processors/cache/persistence/checkpoint/CheckpointWorkflow.java:
##########
@@ -396,10 +396,13 @@ private void fillCacheGroupState(CheckpointRecord cpRec) throws IgniteCheckedExc
                     if (partState == LOST)
                         partState = OWNING;
 
+                    assert part.highestAppliedCounter() >= part.updateCounter() :

Review Comment:
   > Same. This is checkpoint time. Should not be updates.
   
   I checked `PartitionUpdateCounterTrackingImpl#update(long start, long delta)`, and it looks like there are some places where it's invoked without acquiring the checkpoint lock, in particular when a transaction is rolled back. If I'm correct, such cases don't change the data in the partition, but they do update the counter concurrently. In that case this assert might fail.





[GitHub] [ignite] anton-vinogradov commented on a diff in pull request #10396: Ignite-17793 : Historical rebalance must use HWM instead of LWM, v2: store pending counter instead

Posted by GitBox <gi...@apache.org>.
anton-vinogradov commented on code in PR #10396:
URL: https://github.com/apache/ignite/pull/10396#discussion_r1039661275


##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)
+            cachePutAsync.accept(++updateCnt);
+
+        blockLatch.get().await();
+
+        // Storing highest counters on backup.
+        forceCheckpoint();
+
+        String backName = backup.name();
+
+        backup.close();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages((n, m) -> m instanceof GridDhtPartitionDemandMessage ||
+            m instanceof GridDhtPartitionSupplyMessage
+        );
+
+        startGrid(backName);
+
+        awaitPartitionMapExchange();
+
+        // Primary commits transactions on node left. Ensure no rebalance occurs.
+        TestRecordingCommunicationSpi.spi(prim).waitForBlocked(1, 5_000);
+        assertFalse(TestRecordingCommunicationSpi.spi(prim).hasBlockedMessages());

Review Comment:
   It looks like you could simply assert that the first line returns false.
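
   A minimal sketch of that simplification, assuming `waitForBlocked(int, long)` reports through its return value whether the requested number of messages was blocked before the timeout (as the suggestion implies). `BlockingSpiSketch` is a hypothetical stand-in written for illustration, not the Ignite `TestRecordingCommunicationSpi`:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a message-blocking SPI; not the Ignite class. */
class BlockingSpiSketch {
    /** Messages held back so far. */
    private final List<Object> blocked = new ArrayList<>();

    /** Records a blocked message and wakes up any waiter. */
    synchronized void block(Object msg) {
        blocked.add(msg);

        notifyAll();
    }

    /**
     * Waits until at least {@code size} messages are blocked or the timeout elapses.
     *
     * @return {@code true} if enough messages were blocked in time, {@code false} otherwise.
     */
    synchronized boolean waitForBlocked(int size, long timeoutMs) {
        long deadline = System.currentTimeMillis() + timeoutMs;

        while (blocked.size() < size) {
            long left = deadline - System.currentTimeMillis();

            if (left <= 0)
                return false;

            try {
                wait(left);
            }
            catch (InterruptedException e) {
                // Restore the interrupt flag and report that the wait did not succeed.
                Thread.currentThread().interrupt();

                return false;
            }
        }

        return true;
    }
}
```

   With a method of this shape, the wait and the follow-up check in the test could collapse into a single `assertFalse(spi.waitForBlocked(1, 5_000))`.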



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)

Review Comment:
   Please measure the counters after this blocked put.
   You should see that the backup counters are higher than the primary's before the restart.
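
   A toy model of the expected observation, under two labeled assumptions: with one-phase commit the backup applies an update before it replies, and the primary applies it only once the reply arrives. `UpdateCounterSketch` and `BlockedRepliesSketch` are hypothetical names for illustration and do not use Ignite's real counter API:

```java
import java.util.concurrent.atomic.AtomicLong;

/** Toy per-partition update counter; a hypothetical stand-in, not an Ignite class. */
class UpdateCounterSketch {
    /** Number of updates applied so far. */
    private final AtomicLong applied;

    UpdateCounterSketch(long initial) {
        applied = new AtomicLong(initial);
    }

    void applyUpdate() {
        applied.incrementAndGet();
    }

    long get() {
        return applied.get();
    }
}

/** Simulates one-phase-commit puts whose backup replies may be blocked. */
class BlockedRepliesSketch {
    /**
     * @param preloaded Updates already applied on both nodes.
     * @param puts Number of additional puts.
     * @param blockReplies Whether the backup replies are held back.
     * @return Array of {primaryCounter, backupCounter}.
     */
    static long[] run(long preloaded, int puts, boolean blockReplies) {
        UpdateCounterSketch prim = new UpdateCounterSketch(preloaded);
        UpdateCounterSketch backup = new UpdateCounterSketch(preloaded);

        for (int i = 0; i < puts; i++) {
            // Assumption: the backup applies the update before sending its reply.
            backup.applyUpdate();

            // Assumption: the primary applies the update only after the reply arrives.
            if (!blockReplies)
                prim.applyUpdate();
        }

        return new long[] {prim.get(), backup.get()};
    }
}
```

   In this model, `BlockedRepliesSketch.run(2_000, 20, true)` leaves the primary at 2000 and the backup at 2020, which is the kind of gap the test could assert on before restarting the backup.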



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.

Review Comment:
   An explicit mention of 2PC is required here.



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.

Review Comment:
   An explicit mention of 1PC is required here.



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)
+            cachePutAsync.accept(++updateCnt);
+
+        blockLatch.get().await();
+
+        // Storing highest counters on backup.

Review Comment:
   'the' is missing (hint from IDEA)



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)
+            cachePutAsync.accept(++updateCnt);
+
+        blockLatch.get().await();
+
+        // Storing highest counters on backup.
+        forceCheckpoint();
+
+        String backName = backup.name();
+
+        backup.close();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages((n, m) -> m instanceof GridDhtPartitionDemandMessage ||
+            m instanceof GridDhtPartitionSupplyMessage
+        );
+
+        startGrid(backName);
+
+        awaitPartitionMapExchange();
+
+        // Primary commits transactions on node left. Ensure no rebalance occurs.
+        TestRecordingCommunicationSpi.spi(prim).waitForBlocked(1, 5_000);
+        assertFalse(TestRecordingCommunicationSpi.spi(prim).hasBlockedMessages());
+
+        IdleVerifyResultV2 checkRes = idleVerify(prim, DEFAULT_CACHE_NAME);
+        assertFalse(checkRes.hasConflicts());
+    }
+
+    /** */
+    private int prepareCluster(int nodes, int loadCnt) throws Exception {
+        assert nodes > 1;
+
+        int backupNodes = nodes - 1;
+
+        IgniteEx ignite = startGrids(nodes);
+
+        ignite.cluster().state(ClusterState.ACTIVE);
+
+        IgniteCache<Object, Object> cache = ignite.createCache(new CacheConfiguration<>()
+            .setAffinity(new RendezvousAffinityFunction(false, 1))
+            .setBackups(backupNodes)
+            .setName(DEFAULT_CACHE_NAME)
+            .setAtomicityMode(TRANSACTIONAL)
+            .setWriteSynchronizationMode(FULL_SYNC) // Allows to be sure that all messages are sent when put succeed.
+            .setReadFromBackup(true)); // Allows checking values on backups.
+
+        // Initial preloading enough to have historical rebalance.
+        for (int i = 0; i < loadCnt; i++)  //
+            cache.put(i, i);
+
+        // To have historical rebalance on cluster recovery. Decreases percent of updates in comparison to cache size.
+        stopAllGrids();
+        startGrids(nodes);
+
+        return backupNodes;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups.
+     *
+     * @param nodes Nodes number. The backups number is {@code nodes} - 1.
+     * @param putAfterGaps If {@code true}, does more puts to the cache after the simulated gaps.
+     */
+    private void doTestDelayedToBackupsRequests(int nodes, boolean putAfterGaps) throws Exception {
+        int updateCnt = 2_000;
+
+        int backupNodes = prepareCluster(nodes, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        List<Ignite> backups = backupNodes(0L, DEFAULT_CACHE_NAME);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+        AtomicBoolean finishBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if ((msg instanceof GridDhtTxPrepareRequest && prepareBlock.get()) ||
+                    (msg instanceof GridDhtTxFinishRequest && finishBlock.get())) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    assertTrue(latch.getCount() > 0);
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        try {
+            // Blocked at primary and backups.
+            prepareBlock.set(true);
+
+            blockLatch.set(new CountDownLatch(backupNodes * 20));
+
+            for (int i = 0; i < 20; i++)
+                cachePutAsync.accept(++updateCnt);
+
+            blockLatch.get().await();
+        }
+        finally {
+            prepareBlock.set(false);
+        }
+
+        if (backupNodes > 1) {
+            try {
+                // Blocked at backups only.
+                finishBlock.set(true);
+
+                blockLatch.set(new CountDownLatch(backupNodes * 30));
+
+                for (int i = 0; i < 30; i++)
+                    cachePutAsync.accept(++updateCnt);
+
+                blockLatch.get().await();
+            }
+            finally {
+                finishBlock.set(false);
+            }
+        }
+
+        if (putAfterGaps) {
+            for (int i = 0; i < 50; i++)
+                prim.cache(DEFAULT_CACHE_NAME).put(++updateCnt, updateCnt);
+        }
+
+        // Storing counters on primary.
+        forceCheckpoint();
+
+        // Emulating power off, OOM or disk overflow. Keeping data as is, with missed counters updates.

Review Comment:
   Please measure the counters after these blocked puts.
   You should see different states for 1PC and 2PC.
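
For context on the counters mentioned above: a partition update counter is usually described by a low-water mark (LWM, below which every update has been applied) and a high-water mark (HWM, the highest counter reserved so far), with out-of-order updates leaving gaps between the two. The sketch below is a toy model under those assumptions only. It is not Ignite's counter implementation, and `ToyUpdateCounter` and its methods are hypothetical names chosen for illustration.

```java
import java.util.TreeMap;

/** Toy model of a partition update counter. Hypothetical; not Ignite's implementation. */
final class ToyUpdateCounter {
    /** Low-water mark: every counter below this value has been applied. */
    private long lwm;

    /** High-water mark: upper bound of all counters reserved or applied so far. */
    private long hwm;

    /** Ranges applied out of order, above the LWM: start -> length. */
    private final TreeMap<Long, Long> pending = new TreeMap<>();

    /** Reserves {@code delta} counters and returns the start of the reserved range. */
    long reserve(long delta) {
        long start = hwm;

        hwm += delta;

        return start;
    }

    /** Marks [start, start + delta) as applied and advances the LWM over any closed gaps. */
    void apply(long start, long delta) {
        if (start < lwm)
            return; // Already covered by the LWM; ignored in this toy model.

        hwm = Math.max(hwm, start + delta);

        pending.put(start, delta);

        while (!pending.isEmpty() && pending.firstKey() == lwm)
            lwm += pending.pollFirstEntry().getValue();
    }

    /** @return Low-water mark. */
    long lwm() {
        return lwm;
    }

    /** @return High-water mark. */
    long hwm() {
        return hwm;
    }

    /** @return {@code true} if some range was applied while a lower one is still missing. */
    boolean hasGaps() {
        return !pending.isEmpty();
    }
}
```

In the test itself, the analogous step would be to read each node's counter state for partition 0 right after the blocked puts and compare the one-backup and two-backup runs.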



##########
modules/core/src/test/java/org/apache/ignite/internal/processors/cache/distributed/dht/preloader/HistoricalRebalanceCheckpointTest.java:
##########
@@ -0,0 +1,386 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *      http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.ignite.internal.processors.cache.distributed.dht.preloader;
+
+import java.io.File;
+import java.io.IOException;
+import java.nio.ByteBuffer;
+import java.nio.file.OpenOption;
+import java.util.List;
+import java.util.concurrent.CountDownLatch;
+import java.util.concurrent.atomic.AtomicBoolean;
+import java.util.concurrent.atomic.AtomicReference;
+import java.util.function.Consumer;
+import java.util.stream.Collectors;
+import org.apache.ignite.Ignite;
+import org.apache.ignite.IgniteCache;
+import org.apache.ignite.cache.affinity.rendezvous.RendezvousAffinityFunction;
+import org.apache.ignite.cluster.ClusterNode;
+import org.apache.ignite.cluster.ClusterState;
+import org.apache.ignite.configuration.CacheConfiguration;
+import org.apache.ignite.configuration.DataRegionConfiguration;
+import org.apache.ignite.configuration.DataStorageConfiguration;
+import org.apache.ignite.configuration.IgniteConfiguration;
+import org.apache.ignite.configuration.WALMode;
+import org.apache.ignite.events.EventType;
+import org.apache.ignite.failure.StopNodeFailureHandler;
+import org.apache.ignite.internal.IgniteEx;
+import org.apache.ignite.internal.TestRecordingCommunicationSpi;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxFinishRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareRequest;
+import org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxPrepareResponse;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIO;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIODecorator;
+import org.apache.ignite.internal.processors.cache.persistence.file.FileIOFactory;
+import org.apache.ignite.internal.processors.cache.verify.IdleVerifyResultV2;
+import org.apache.ignite.lang.IgniteBiPredicate;
+import org.apache.ignite.plugin.extensions.communication.Message;
+import org.apache.ignite.testframework.GridTestUtils;
+import org.apache.ignite.testframework.junits.common.GridCommonAbstractTest;
+import org.junit.Test;
+
+import static org.apache.ignite.cache.CacheAtomicityMode.TRANSACTIONAL;
+import static org.apache.ignite.cache.CacheWriteSynchronizationMode.FULL_SYNC;
+
+/**
+ *
+ */
+public class HistoricalRebalanceCheckpointTest extends GridCommonAbstractTest {
+    /** {@inheritDoc} */
+    @Override protected void beforeTest() throws Exception {
+        super.beforeTest();
+
+        cleanPersistenceDir();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected void afterTest() throws Exception {
+        super.afterTest();
+
+        stopAllGrids();
+    }
+
+    /** {@inheritDoc} */
+    @Override protected IgniteConfiguration getConfiguration(String igniteInstanceName) throws Exception {
+        IgniteConfiguration cfg = super.getConfiguration(igniteInstanceName);
+
+        cfg.setCommunicationSpi(new TestRecordingCommunicationSpi());
+
+        DataStorageConfiguration dsCfg = new DataStorageConfiguration()
+            .setWalMode(WALMode.LOG_ONLY)
+            .setDefaultDataRegionConfiguration(
+                new DataRegionConfiguration().setMaxSize(50L * 1024 * 1024).setPersistenceEnabled(true)
+            );
+
+        cfg.setDataStorageConfiguration(dsCfg);
+
+        cfg.getDataStorageConfiguration().setFileIOFactory(
+            new BlockableFileIOFactory(cfg.getDataStorageConfiguration().getFileIOFactory()));
+
+        cfg.getDataStorageConfiguration().setWalMode(WALMode.FSYNC); // Allows to use special IO at WAL as well.
+
+        cfg.setFailureHandler(new StopNodeFailureHandler()); // Helps to kill nodes on stop with disabled IO.
+
+        cfg.setIncludeEventTypes(EventType.EVT_CACHE_REBALANCE_STOPPED);
+
+        return cfg;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2Backups() throws Exception {
+        doTestDelayedToBackupsRequests(3, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 2 backups and puts after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests2BackupsMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(3, true);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1Backup() throws Exception {
+        doTestDelayedToBackupsRequests(2, false);
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups with 1 backup and one-phase commit using puts
+     * after the gaps.
+     */
+    @Test
+    public void testDelayedToBackupsRequests1BackupMorePuts() throws Exception {
+        doTestDelayedToBackupsRequests(2, true);
+    }
+
+    /**
+     * Test one-phase commit with lost backup responses.
+     */
+    @Test
+    public void testDelayed1PhaseCommitResponses() throws Exception {
+        int updateCnt = 2_000;
+
+        prepareCluster(2, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        Ignite backup = backupNodes(0L, DEFAULT_CACHE_NAME).get(0);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(backup).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if (msg instanceof GridDhtTxPrepareResponse && prepareBlock.get()) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        prepareBlock.set(true);
+
+        blockLatch.set(new CountDownLatch(20));
+
+        for (int i = 0; i < 20; i++)
+            cachePutAsync.accept(++updateCnt);
+
+        blockLatch.get().await();
+
+        // Storing highest counters on backup.
+        forceCheckpoint();
+
+        String backName = backup.name();
+
+        backup.close();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages((n, m) -> m instanceof GridDhtPartitionDemandMessage ||
+            m instanceof GridDhtPartitionSupplyMessage
+        );
+
+        startGrid(backName);
+
+        awaitPartitionMapExchange();
+
+        // Primary commits transactions on node left. Ensure no rebalance occurs.
+        TestRecordingCommunicationSpi.spi(prim).waitForBlocked(1, 5_000);
+        assertFalse(TestRecordingCommunicationSpi.spi(prim).hasBlockedMessages());
+
+        IdleVerifyResultV2 checkRes = idleVerify(prim, DEFAULT_CACHE_NAME);
+        assertFalse(checkRes.hasConflicts());
+    }
+
+    /** */
+    private int prepareCluster(int nodes, int loadCnt) throws Exception {
+        assert nodes > 1;
+
+        int backupNodes = nodes - 1;
+
+        IgniteEx ignite = startGrids(nodes);
+
+        ignite.cluster().state(ClusterState.ACTIVE);
+
+        IgniteCache<Object, Object> cache = ignite.createCache(new CacheConfiguration<>()
+            .setAffinity(new RendezvousAffinityFunction(false, 1))
+            .setBackups(backupNodes)
+            .setName(DEFAULT_CACHE_NAME)
+            .setAtomicityMode(TRANSACTIONAL)
+            .setWriteSynchronizationMode(FULL_SYNC) // Allows to be sure that all messages are sent when put succeed.
+            .setReadFromBackup(true)); // Allows checking values on backups.
+
+        // Initial preloading enough to have historical rebalance.
+        for (int i = 0; i < loadCnt; i++)  //
+            cache.put(i, i);
+
+        // To have historical rebalance on cluster recovery. Decreases percent of updates in comparison to cache size.
+        stopAllGrids();
+        startGrids(nodes);
+
+        return backupNodes;
+    }
+
+    /**
+     * Tests delayed prepare/finish transaction requests to the backups.
+     *
+     * @param nodes Nodes number. The backups number is {@code nodes} - 1.
+     * @param putAfterGaps If {@code true}, does more puts to the cache after the simulated gaps.
+     */
+    private void doTestDelayedToBackupsRequests(int nodes, boolean putAfterGaps) throws Exception {
+        int updateCnt = 2_000;
+
+        int backupNodes = prepareCluster(nodes, updateCnt);
+
+        Ignite prim = primaryNode(0L, DEFAULT_CACHE_NAME);
+
+        List<Ignite> backups = backupNodes(0L, DEFAULT_CACHE_NAME);
+
+        AtomicBoolean prepareBlock = new AtomicBoolean();
+        AtomicBoolean finishBlock = new AtomicBoolean();
+
+        AtomicReference<CountDownLatch> blockLatch = new AtomicReference<>();
+
+        TestRecordingCommunicationSpi.spi(prim).blockMessages(new IgniteBiPredicate<ClusterNode, Message>() {
+            @Override public boolean apply(ClusterNode node, Message msg) {
+                if ((msg instanceof GridDhtTxPrepareRequest && prepareBlock.get()) ||
+                    (msg instanceof GridDhtTxFinishRequest && finishBlock.get())) {
+                    CountDownLatch latch = blockLatch.get();
+
+                    assertTrue(latch.getCount() > 0);
+
+                    latch.countDown();
+
+                    return true;
+                }
+                else
+                    return false;
+            }
+        });
+
+        IgniteCache<Integer, Integer> primCache = prim.cache(DEFAULT_CACHE_NAME);
+
+        Consumer<Integer> cachePutAsync = (key) -> GridTestUtils.runAsync(() -> primCache.put(key, key));
+
+        try {
+            // Blocked at primary and backups.
+            prepareBlock.set(true);
+
+            blockLatch.set(new CountDownLatch(backupNodes * 20));
+
+            for (int i = 0; i < 20; i++)
+                cachePutAsync.accept(++updateCnt);
+
+            blockLatch.get().await();
+        }
+        finally {
+            prepareBlock.set(false);
+        }
+
+        if (backupNodes > 1) {
+            try {
+                // Blocked at backups only.
+                finishBlock.set(true);
+
+                blockLatch.set(new CountDownLatch(backupNodes * 30));
+
+                for (int i = 0; i < 30; i++)
+                    cachePutAsync.accept(++updateCnt);
+
+                blockLatch.get().await();
+            }
+            finally {
+                finishBlock.set(false);
+            }
+        }
+
+        if (putAfterGaps) {
+            for (int i = 0; i < 50; i++)
+                prim.cache(DEFAULT_CACHE_NAME).put(++updateCnt, updateCnt);
+        }
+
+        // Storing counters on primary.
+        forceCheckpoint();
+
+        // Emulating power off, OOM or disk overflow. Keeping data as is, with missed counters updates.
+        backups.forEach(node -> ((BlockableFileIOFactory)node.configuration().getDataStorageConfiguration()
+            .getFileIOFactory()).blocked = true);
+
+        List<String> backNames = backups.stream().map(Ignite::name).collect(Collectors.toList());
+
+        CountDownLatch rebalanceFinished = new CountDownLatch(1);

Review Comment:
   You should make sure that it was a historical rebalance.
   
   Use something like 
   ```
   LogListener lsnrRebalanceType = matches("fullPartitions=[], " +
               "histPartitions=[0]").times(backupNodes).build();
   ```
   
   where 0 is the partition number.
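
The check suggested above amounts to counting how many log lines contain a given substring and comparing that count with the number of backups. The sketch below shows that idea with plain JDK classes. It is a stand-in for illustration, not the Ignite test framework's `LogListener`; the class `SubstringCounter` is a hypothetical name, and the sample log lines used to exercise it are made up.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

/** Hypothetical stand-in for a log listener: counts lines that contain a substring. */
final class SubstringCounter implements Consumer<String> {
    /** Substring to look for in each log line. */
    private final String substr;

    /** Exact number of matching lines expected. */
    private final int expected;

    /** Matching lines seen so far; atomic because log lines may arrive from several threads. */
    private final AtomicInteger seen = new AtomicInteger();

    /** Creates a counter that expects exactly {@code expected} lines containing {@code substr}. */
    SubstringCounter(String substr, int expected) {
        this.substr = substr;
        this.expected = expected;
    }

    /** Feeds one log line to the counter. */
    @Override public void accept(String line) {
        if (line.contains(substr))
            seen.incrementAndGet();
    }

    /** @return {@code true} if the substring was seen exactly the expected number of times. */
    boolean check() {
        return seen.get() == expected;
    }
}
```

With two backups, the counter would be created with `"fullPartitions=[], histPartitions=[0]"` and an expected count of 2, fed every log line produced during rebalance, and checked once rebalance has finished.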


