Posted to issues@ignite.apache.org by "Mirza Aliev (Jira)" <ji...@apache.org> on 2022/08/16 14:46:00 UTC
[jira] [Updated] (IGNITE-17286) Race between completing table creation and stopping TableManager

     [ https://issues.apache.org/jira/browse/IGNITE-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mirza Aliev updated IGNITE-17286:
---------------------------------
    Description: 
As IGNITE-17048 demonstrates, our tests sometimes fail with a message like the following:

java.lang.AssertionError: Raft groups are still running

The leftover Raft groups always relate to table partitions (and NOT metastorage/cmg).

It looks like this can happen when TableManager.stop() is called before some table creation has completed (on some Ignite node). As a result, TableManager.stop() does not see this table, so the table never gets stopped, and its Raft groups are left running forever.

Adding a delay to table creation completion

public void onSqlSchemaReady(long causalityToken) {
    if (Math.random() < 0.33) {
        try {
            Thread.sleep(1000);
        } catch (InterruptedException e) {
            // ignore
        }
    }

    LOG.info("SCHEMA READY FOR " + causalityToken);

    tablesByIdVv.complete(causalityToken);
}

makes the failure manifest itself easily.

The reproducer is in [https://github.com/gridgain/apache-ignite-3/tree/ignite-17286-repr]

To run the reproducer, just run ItComputeTest.executesColocatedByClassNameWithTupleKey()

It usually takes fewer than 10 iterations to hit the assertion.

UPD:
As a result, a set of busy locks was added to the table creation flow, in places like {{SqlSchemaManagerImpl}}, {{SchemaManager}} and others. Logic was also added to stop already started resources when a node is stopped in the middle of the table creation flow.
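
The busy-lock approach mentioned in the update can be illustrated with a minimal, self-contained sketch. It is not the actual Ignite code: SimpleBusyLock and SketchTableManager are hypothetical stand-ins written for this example, and a "Raft group" is modelled as a plain string in a set. The sketch only assumes the general idea that the completion step runs inside a busy section, stop() first blocks the lock, and a completion that arrives after the block releases whatever it had already started.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical busy lock: many concurrent busy sections, one final block(). */
final class SimpleBusyLock {
    private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
    private volatile boolean blocked;

    /** Returns false once block() was called; on true the caller must call leaveBusy(). */
    boolean enterBusy() {
        if (!lock.readLock().tryLock()) {
            return false; // block() is in progress on another thread
        }
        if (blocked) {
            lock.readLock().unlock();
            return false;
        }
        return true;
    }

    void leaveBusy() {
        lock.readLock().unlock();
    }

    /** Waits for in-flight busy sections to finish, then rejects all later ones. */
    void block() {
        lock.writeLock().lock();
        try {
            blocked = true;
        } finally {
            lock.writeLock().unlock();
        }
    }
}

/** Hypothetical, heavily simplified model of a table manager (not Ignite's TableManager). */
final class SketchTableManager {
    private final SimpleBusyLock busyLock = new SimpleBusyLock();
    private final Set<String> registeredTables = ConcurrentHashMap.newKeySet();

    /** Simulated Raft groups that are currently running, one per table for brevity. */
    final Set<String> runningRaftGroups = ConcurrentHashMap.newKeySet();

    /** First half of creation: the table's Raft group starts running. */
    void startCreation(String table) {
        runningRaftGroups.add(table);
    }

    /** Second half of creation: the table becomes visible to stop(). */
    boolean completeCreation(String table) {
        if (!busyLock.enterBusy()) {
            // The node is stopping: release what this creation already started.
            runningRaftGroups.remove(table);
            return false;
        }
        try {
            registeredTables.add(table);
            return true;
        } finally {
            busyLock.leaveBusy();
        }
    }

    /** Blocks new completions, waits for running ones, then stops every visible table. */
    void stop() {
        busyLock.block();
        for (String table : registeredTables) {
            runningRaftGroups.remove(table);
        }
        registeredTables.clear();
    }
}
```

In this model, calling startCreation("t1"), then stop(), then completeCreation("t1") leaves runningRaftGroups empty, because the late completion cleans up after itself. Without the busy-lock check, the same sequence would leave the "t1" group running, which corresponds to the leftover Raft groups described above.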


> Race between completing table creation and stopping TableManager
> ----------------------------------------------------------------
>
>                 Key: IGNITE-17286
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17286
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Mirza Aliev
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-alpha6
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h


