Posted to issues@ignite.apache.org by "Alexey Kuznetsov (JIRA)" <ji...@apache.org> on 2018/05/28 15:26:00 UTC
[jira] [Comment Edited] (IGNITE-5968) Test fail in Ignite Cache 2: GridCachePartitionNotLoadedEventSelfTest.testPrimaryAndBackupDead

    [ https://issues.apache.org/jira/browse/IGNITE-5968?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16492783#comment-16492783 ] 

Alexey Kuznetsov edited comment on IGNITE-5968 at 5/28/18 3:25 PM:
-------------------------------------------------------------------

[~DmitriyGovorukhin] [~agoncharuk] 
The bug is that the "lost partition" event is fired only on the new primary node, not on the new backup node (after the old primary and backup nodes go down).

The partition loss policy is _IGNORE_. The test scenario is as follows:

{code:java}
startGrid(0);
startGrid(1);
startGrid(2);
startGrid(3);

ignite(2).events().localListen(lsnr1, EventType.EVT_CACHE_REBALANCE_PART_DATA_LOST);
ignite(3).events().localListen(lsnr2, EventType.EVT_CACHE_REBALANCE_PART_DATA_LOST);

cache.put(key1, key1); // node 0 is primary for key1, node 1 is backup for key1.

stopGrid(0, true);
stopGrid(1, true); // after both grids are stopped, the partition for key1 is lost.

// Node 2 is new primary node for key1, node 3 is new backup node for key1.

checkEventIsFired(lsnr1, lsnr2); // EVT_CACHE_REBALANCE_PART_DATA_LOST event is fired only on the new primary node.
{code}

When the 2 nodes holding the partition for key1 have crashed, the "lost partition" event is fired only on the new primary node (not on the new backup node).

The root cause of this bug is that the new primary node *does not set* the LOST state on the partitions;
instead it behaves as if no partition loss has happened and clears the partition loss state right away, see _GridDhtPartitionTopologyImpl#detectLostPartitions_.
The primary node then sends the partition map to the backup node, and the backup node detects *no* lost partitions. So no events are fired on the backup node.

One solution is to broadcast the partition map together with the lost partitions via _GridDhtPartitionsFullMessage_.
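
To make the message flow concrete, here is a minimal toy model of it in plain Java. It is only a sketch: none of the types below are Ignite classes, the names (Node, FullMap, reportLost, partition id 7) are hypothetical, and it does not reproduce the real logic of the topology code. It only shows why the backup sees nothing when the primary clears the loss state before sending its map, and how carrying the lost partitions in the message could change that.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model only: none of these types are Ignite classes. */
public class LostPartitionSketch {
    enum State { OWNING, LOST }

    /** Hypothetical stand-in for the full partition map message. */
    static final class FullMap {
        final Map<Integer, State> states;
        final Set<Integer> lostParts; // empty when the sender does not report losses

        FullMap(Map<Integer, State> states, Set<Integer> lostParts) {
            this.states = new HashMap<>(states);
            this.lostParts = new HashSet<>(lostParts);
        }
    }

    /** Hypothetical node that records the partitions it fired a "lost" event for. */
    static final class Node {
        final Map<Integer, State> states = new HashMap<>();
        final List<Integer> firedEvents = new ArrayList<>();

        /** New primary: detects the loss, fires the event, then clears the loss state. */
        FullMap detectAndBuildMap(Set<Integer> partsWithoutOwners, boolean reportLost) {
            for (int p : partsWithoutOwners) {
                firedEvents.add(p);
                states.put(p, State.OWNING); // loss state is cleared right away
            }
            Set<Integer> reported = reportLost ? partsWithoutOwners : new HashSet<Integer>();
            return new FullMap(states, reported);
        }

        /** New backup: learns about losses only from what the message carries. */
        void onFullMap(FullMap msg) {
            Set<Integer> lost = new HashSet<>(msg.lostParts);
            for (Map.Entry<Integer, State> e : msg.states.entrySet()) {
                if (e.getValue() == State.LOST)
                    lost.add(e.getKey());
            }
            states.putAll(msg.states);
            firedEvents.addAll(lost);
        }
    }

    /** Returns {events fired on primary, events fired on backup}. */
    static int[] run(boolean reportLost) {
        Node primary = new Node();
        Node backup = new Node();
        Set<Integer> lost = new HashSet<>();
        lost.add(7); // hypothetical partition id for key1
        backup.onFullMap(primary.detectAndBuildMap(lost, reportLost));
        return new int[] {primary.firedEvents.size(), backup.firedEvents.size()};
    }

    public static void main(String[] args) {
        int[] current = run(false);
        int[] proposed = run(true);
        System.out.println("current:  primary=" + current[0] + ", backup=" + current[1]);
        System.out.println("proposed: primary=" + proposed[0] + ", backup=" + proposed[1]);
    }
}
```

With reportLost set to false the backup's event list stays empty, which matches the behaviour described above. With reportLost set to true the backup fires an event for the same partition as the primary.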

Do you agree with this solution?


> Test fail in Ignite Cache 2: GridCachePartitionNotLoadedEventSelfTest.testPrimaryAndBackupDead
> ----------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-5968
>                 URL: https://issues.apache.org/jira/browse/IGNITE-5968
>             Project: Ignite
>          Issue Type: Test
>    Affects Versions: 2.1
>            Reporter: Dmitriy Govorukhin
>            Assignee: Alexey Kuznetsov
>            Priority: Major
>              Labels: MakeTeamcityGreenAgain
>             Fix For: 2.6
>
>
> java.util.concurrent.TimeoutException: Test has been timed out [test=testPrimaryAndBackupDead, timeout=300000]
>     at org.apache.ignite.testframework.junits.GridAbstractTest.runTest(GridAbstractTest.java:1949)
>     at junit.framework.TestCase.runBare(TestCase.java:141)
>     at junit.framework.TestResult$1.protect(TestResult.java:122)
>     at junit.framework.TestResult.runProtected(TestResult.java:142)
>     at junit.framework.TestResult.run(TestResult.java:125)
>     at junit.framework.TestCase.run(TestCase.java:129)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at junit.framework.TestSuite.runTest(TestSuite.java:255)
>     at junit.framework.TestSuite.run(TestSuite.java:250)
>     at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:84)


