Posted to issues@ignite.apache.org by "Vyacheslav Koptilin (Jira)" <ji...@apache.org> on 2022/04/01 09:08:00 UTC

[jira] [Updated] (IGNITE-16718) ItIgniteNodeRestartTest#testCfgGap is flaky

     [ https://issues.apache.org/jira/browse/IGNITE-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vyacheslav Koptilin updated IGNITE-16718:
-----------------------------------------
    Priority: Blocker  (was: Major)

> ItIgniteNodeRestartTest#testCfgGap is flaky
> -------------------------------------------
>
>                 Key: IGNITE-16718
>                 URL: https://issues.apache.org/jira/browse/IGNITE-16718
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Denis Chudov
>            Priority: Blocker
>              Labels: ignite-3
>
> ItIgniteNodeRestartTest#testCfgGap can be found in the ignite-16362 branch.
> The test fails because a get returns null for a previously upserted key.
> With the following slightly simplified test (one table instead of two, and one insertion instead of one hundred)
> {code:java}
> public void testCfgGap(TestInfo testInfo) {
>     final int nodes = 4;
>     for (int i = 0; i < nodes; i++) {
>         startNode(testInfo, i);
>     }
>     createTableWithData(CLUSTER_NODES.get(0), "t1", nodes);
>     String igniteName = CLUSTER_NODES.get(nodes - 1).name();
>     log.info("Stopping the node.");
>     IgnitionManager.stop(igniteName);
>     checkTableWithData(CLUSTER_NODES.get(0), "t1");
>     log.info("Starting the node.");
>     Ignite newNode = IgnitionManager.start(igniteName, null, workDir.resolve(igniteName));
>     CLUSTER_NODES.set(nodes - 1, newNode);
>     checkTableWithData(CLUSTER_NODES.get(0), "t1");
>     checkTableWithData(CLUSTER_NODES.get(nodes - 1), "t1");
> }
> private void checkTableWithData(Ignite ignite, String name) {
>     ... 
>     for (int i = 0; i < 1; i++) {
>       ...
>     }
> }
> private void createTableWithData(Ignite ignite, String name, int replicas) {
>     ...
>     for (int i = 0; i < 1; i++) {
>       ...
>     }
> }{code}
> an inconsistent read is reproduced under the following flow:
>  # table.keyValueView.put(k1)
>  ## PartitionListener#handleUpsertCommand on Node B
>  ## PartitionListener#handleUpsertCommand on Node C
>  ## PartitionListener#handleUpsertCommand on Node D
>  ## Note that the upsert command wasn't handled on Node A. That is fine in itself, because B, C and D form a majority.
>  # node D stop
>  # nodeA.table.keyValueView().get(k1)
>  ## PartitionListener#handleGetCommand on Node B // Means that B is the leader.
>  # node D start
>  ## PartitionListener#handleUpsertCommand on Node D // Inner raft rebalance
>  # nodeA.table.keyValueView().get(k1)
>  ## PartitionListener#handleGetCommand on Node B // Means that B is still the leader.
>  # nodeD.table.keyValueView().get(k1) 
>  ## PartitionListener#handleGetCommand on Node *A* // Means that the leader changed to A and, most importantly, the upsert command was never handled on Node A.
> I've checked this by adding
> {code:java}
> private void handleUpsertCommand(UpsertCommand cmd) {
>     System.out.println(">>> Upserted" + ((TxManagerImpl)txManager).clusterService.topologyService().localMember());
>     ...
> } {code}
> and
> {code:java}
> private SingleRowResponse handleGetCommand(GetCommand cmd) {
>     System.out.println(">>> Get" + ((TxManagerImpl)txManager).clusterService.topologyService().localMember());
>    ...
> } {code}
>  
> Further investigation items might be:
>  * Checking whether the k1 upsert was committed on node A. Committing an entry and applying it to the state machine are different steps, and according to Raft a node cannot be a leader while missing committed entries.
>  * Checking why the leader changed between reads.
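
The first investigation item rests on Raft's election restriction: a voter grants its vote only to a candidate whose log is at least as up-to-date as its own (last term compared first, then last index). The sketch below illustrates that rule together with the majority arithmetic for the four-node group. It is a simplified illustration and not the Ignite or JRaft implementation: the class RaftElectionSketch, its LogPosition type, and the concrete term and index values are all hypothetical, and it assumes every voter is in the same term and has not yet voted.

```java
import java.util.List;

public class RaftElectionSketch {
    /** Hypothetical holder for the term and index of a node's last log entry. */
    static final class LogPosition {
        final long lastTerm;
        final long lastIndex;

        LogPosition(long lastTerm, long lastIndex) {
            this.lastTerm = lastTerm;
            this.lastIndex = lastIndex;
        }
    }

    /** Smallest number of nodes that forms a majority in a group of n voters. */
    static int majority(int n) {
        return n / 2 + 1;
    }

    /** Vote check: the candidate's log must be at least as up-to-date as the voter's. */
    static boolean grantsVote(LogPosition voter, LogPosition candidate) {
        if (candidate.lastTerm != voter.lastTerm) {
            return candidate.lastTerm > voter.lastTerm;
        }
        return candidate.lastIndex >= voter.lastIndex;
    }

    /** Counts the candidate's own vote plus every granted vote and compares with the majority. */
    static boolean canWin(LogPosition candidate, List<LogPosition> others) {
        long votes = 1 + others.stream().filter(v -> grantsVote(v, candidate)).count();
        return votes >= majority(others.size() + 1);
    }

    public static void main(String[] args) {
        // Assumption for illustration: A's log lacks the upsert entry, B, C and D hold it at index 1.
        LogPosition a = new LogPosition(1, 0);
        LogPosition withEntry = new LogPosition(1, 1);

        System.out.println(canWin(a, List.of(withEntry, withEntry, withEntry))); // false
        System.out.println(canWin(withEntry, List.of(a, withEntry, withEntry))); // true
    }
}
```

Under these assumptions, a node whose log really lacked the upsert entry could not collect three votes out of four. If A nevertheless became leader, one possible explanation is that the entry was present in A's log but not yet applied to its state machine when the read was served, which is the distinction the first item asks to verify.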



--
This message was sent by Atlassian Jira
(v8.20.1#820001)