Posted to issues@ignite.apache.org by "Vyacheslav Koptilin (Jira)" <ji...@apache.org> on 2022/04/01 09:08:00 UTC
[jira] [Updated] (IGNITE-16718) ItIgniteNodeRestartTest#testCfgGap is flaky
[ https://issues.apache.org/jira/browse/IGNITE-16718?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vyacheslav Koptilin updated IGNITE-16718:
-----------------------------------------
Priority: Blocker (was: Major)
> ItIgniteNodeRestartTest#testCfgGap is flaky
> -------------------------------------------
>
> Key: IGNITE-16718
> URL: https://issues.apache.org/jira/browse/IGNITE-16718
> Project: Ignite
> Issue Type: Bug
> Reporter: Denis Chudov
> Priority: Blocker
> Labels: ignite-3
>
> ItIgniteNodeRestartTest#testCfgGap can be found in the ignite-16362 branch.
> The test fails because a read returns null for a previously upserted key.
> With the following slightly simplified test (one table instead of two, and one insertion instead of one hundred)
> {code:java}
> public void testCfgGap(TestInfo testInfo) {
>     final int nodes = 4;
>     for (int i = 0; i < nodes; i++) {
>         startNode(testInfo, i);
>     }
>     createTableWithData(CLUSTER_NODES.get(0), "t1", nodes);
>     String igniteName = CLUSTER_NODES.get(nodes - 1).name();
>     log.info("Stopping the node.");
>     IgnitionManager.stop(igniteName);
>     checkTableWithData(CLUSTER_NODES.get(0), "t1");
>     log.info("Starting the node.");
>     Ignite newNode = IgnitionManager.start(igniteName, null, workDir.resolve(igniteName));
>     CLUSTER_NODES.set(nodes - 1, newNode);
>     checkTableWithData(CLUSTER_NODES.get(0), "t1");
>     checkTableWithData(CLUSTER_NODES.get(nodes - 1), "t1");
> }
>
> private void checkTableWithData(Ignite ignite, String name) {
>     ...
>     for (int i = 0; i < 1; i++) {
>         ...
>     }
> }
>
> private void createTableWithData(Ignite ignite, String name, int replicas) {
>     ...
>     for (int i = 0; i < 1; i++) {
>         ...
>     }
> }{code}
> an inconsistent read is reproduced under the following flow:
> # table.keyValueView.put(k1)
> ## PartitionListener#handleUpsertCommand on Node B
> ## PartitionListener#handleUpsertCommand on Node C
> ## PartitionListener#handleUpsertCommand on Node D
> ## Note that the upsert command wasn't handled on Node A. That in itself is fine, because B, C and D form a majority.
> # node D stops
> # nodeA.table.keyValueView().get(k1)
> ## PartitionListener#handleGetCommand on Node B // Means that B is the leader.
> # node D starts
> ## PartitionListener#handleUpsertCommand on Node D // Inner raft rebalance
> # nodeA.table.keyValueView().get(k1)
> ## PartitionListener#handleGetCommand on Node B // Means that B is still the leader.
> # nodeD.table.keyValueView().get(k1)
> ## PartitionListener#handleGetCommand on Node *A* // Means that the leader changed to A and, importantly, the upsert command was never handled on Node A.
> I've checked this by adding
> {code:java}
> private void handleUpsertCommand(UpsertCommand cmd) {
>     System.out.println(">>> Upserted" + ((TxManagerImpl)txManager).clusterService.topologyService().localMember());
>     ...
> } {code}
> and
> {code:java}
> private SingleRowResponse handleGetCommand(GetCommand cmd) {
>     System.out.println(">>> Get" + ((TxManagerImpl)txManager).clusterService.topologyService().localMember());
>     ...
> } {code}
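The observed sequence can be condensed into a toy model. This is only a sketch, not Ignite code: the class StaleLeaderReadSketch, its Node type and the value "v1" are hypothetical, and it assumes that a read is answered from the state machine of whichever node currently acts as leader.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of the reported flow; none of these types exist in Ignite. */
class StaleLeaderReadSketch {
    /** A replica holding its own copy of the partition state machine. */
    static final class Node {
        final String name;
        final Map<String, String> applied = new HashMap<>();

        Node(String name) {
            this.name = name;
        }
    }

    /** Assumption: a read is served from the state machine of the current leader. */
    static String readFromLeader(Node leader, String key) {
        return leader.applied.get(key);
    }

    /** Replays the steps from the report and returns the three read results in order. */
    static List<String> replay() {
        Node a = new Node("A");
        Node b = new Node("B");
        Node c = new Node("C");
        Node d = new Node("D");

        // put(k1): the upsert is handled on B, C and D, but not on A.
        for (Node n : Arrays.asList(b, c, d)) {
            n.applied.put("k1", "v1");
        }

        // D stops; the read issued through A is handled by leader B.
        String first = readFromLeader(b, "k1");

        // D starts and handles the upsert again; B is still the leader.
        d.applied.put("k1", "v1");
        String second = readFromLeader(b, "k1");

        // The read issued through D is handled by A, which never handled the upsert.
        String third = readFromLeader(a, "k1");

        // Arrays.asList is used because List.of rejects null elements.
        return Arrays.asList(first, second, third);
    }

    public static void main(String[] args) {
        System.out.println(replay()); // prints [v1, v1, null]
    }
}
```

The last read is the one that fails in the test: the leader has changed to a node whose state machine does not contain k1.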
>
> Further investigation items might be:
> * Checking whether the k1 upsert was committed on Node A. Committing an entry and applying it to the state machine are different steps, and according to Raft a node that is missing committed entries cannot become the leader.
> * Checking why the leader changed between reads.
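For the first item, the relevant Raft rule is the election restriction: a voter grants its vote only to a candidate whose log is at least as up-to-date as its own. Below is a minimal sketch of that comparison. The class ElectionRestrictionSketch and its LogPosition type are illustrative and are not Ignite or JRaft classes, the log positions are made up, and the sketch deliberately omits the current-term and votedFor checks that a real vote handler also performs.

```java
import java.util.Arrays;
import java.util.List;

/** Sketch of Raft's "at least as up-to-date" vote check; illustrative types only. */
class ElectionRestrictionSketch {
    /** Term and index of the last entry in a node's log. */
    static final class LogPosition {
        final long lastTerm;
        final long lastIndex;

        LogPosition(long lastTerm, long lastIndex) {
            this.lastTerm = lastTerm;
            this.lastIndex = lastIndex;
        }

        /** Later last term wins; with equal terms the longer log wins. */
        boolean atLeastAsUpToDateAs(LogPosition other) {
            if (lastTerm != other.lastTerm) {
                return lastTerm > other.lastTerm;
            }
            return lastIndex >= other.lastIndex;
        }
    }

    /** The candidate's own vote plus every voter whose log is not ahead of the candidate's. */
    static int votesFor(LogPosition candidate, List<LogPosition> voters) {
        int votes = 1;
        for (LogPosition voter : voters) {
            if (candidate.atLeastAsUpToDateAs(voter)) {
                votes++;
            }
        }
        return votes;
    }

    /** A candidate wins only with a strict majority of the whole cluster. */
    static boolean canWin(LogPosition candidate, List<LogPosition> voters) {
        int clusterSize = voters.size() + 1;
        return votesFor(candidate, voters) > clusterSize / 2;
    }

    public static void main(String[] args) {
        // Hypothetical positions: B, C and D hold the upsert at term 1, index 1.
        LogPosition withUpsert = new LogPosition(1, 1);
        List<LogPosition> bcd = Arrays.asList(withUpsert, withUpsert, withUpsert);

        // A without the entry in its log collects only its own vote.
        System.out.println(canWin(new LogPosition(0, 0), bcd)); // prints false

        // A with the entry in its log (even if not yet applied) can win.
        System.out.println(canWin(new LogPosition(1, 1), bcd)); // prints true
    }
}
```

Under these assumptions, Node A could only have become leader if the upsert was already in its log. That points at the gap between an entry being in the log and handleUpsertCommand being invoked for it, which is worth checking before suspecting the election itself.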
--
This message was sent by Atlassian Jira
(v8.20.1#820001)