Posted to issues@ignite.apache.org by "Mirza Aliev (Jira)" <ji...@apache.org> on 2022/08/16 14:46:00 UTC
[jira] [Updated] (IGNITE-17286) Race between completing table creation and stopping TableManager
[ https://issues.apache.org/jira/browse/IGNITE-17286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mirza Aliev updated IGNITE-17286:
---------------------------------
Description:
As IGNITE-17048 demonstrates, our tests sometimes fail with a message like the following:
java.lang.AssertionError: Raft groups are still running
The leftover Raft groups always relate to table partitions (and NOT metastorage/cmg).
It looks like this can happen when TableManager.stop() is called before some table creation has completed (on some Ignite node). As a result, TableManager.stop() does not see this table, so the table does not get stopped, and its Raft groups are left running forever.
Adding a delay to table creation completion

    public void onSqlSchemaReady(long causalityToken) {
        // Artificial delay on roughly a third of the calls, to widen the race window.
        if (Math.random() < 0.33) {
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                // ignore
            }
        }

        LOG.info("SCHEMA READY FOR " + causalityToken);
        tablesByIdVv.complete(causalityToken);
    }

makes the failure manifest itself easily.
The reproducer is in [https://github.com/gridgain/apache-ignite-3/tree/ignite-17286-repr]
To run the reproducer, just run ItComputeTest.executesColocatedByClassNameWithTupleKey()
It usually takes less than 10 iterations to bump into the assertion.
UPD:
As a result, a set of busy locks was added to the table creation flow, in places such as {{SqlSchemaManagerImpl}}, {{SchemaManager}} and others. Logic was also added to stop the already created resources if the node is stopped in the middle of the table creation flow.
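
The idea behind that fix can be illustrated with a minimal, self-contained sketch. This is not the actual Ignite code: the class GuardedTableRegistry and its methods are hypothetical, and a plain ReentrantReadWriteLock stands in for the busy lock used in the project. The sketch only shows the pattern: creation completes inside a "busy" section, stop() waits for busy sections to finish, and a creation that arrives after stop() releases its own resources instead of leaking them.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

/** Hypothetical stand-in for a component that registers tables and stops them on shutdown. */
class GuardedTableRegistry {
    /** Read lock marks a "busy" section, write lock is taken by stop(). Assumed stand-in for a busy lock. */
    private final ReentrantReadWriteLock busyLock = new ReentrantReadWriteLock();

    /** Guarded by busyLock. */
    private boolean stopped;

    /** Table name mapped to the action that releases the table's resources (e.g. its Raft groups). */
    private final Map<String, Runnable> tables = new ConcurrentHashMap<>();

    /**
     * Completes table creation.
     *
     * @return true if the table was registered, false if the registry was already stopped,
     *     in which case the resources are released here so that nothing is left running.
     */
    boolean completeCreation(String name, Runnable releaseResources) {
        busyLock.readLock().lock();
        try {
            if (!stopped) {
                tables.put(name, releaseResources);
                return true;
            }
        } finally {
            busyLock.readLock().unlock();
        }

        // The node is stopping: stop() will never see this table, so clean up right away.
        releaseResources.run();
        return false;
    }

    /** Blocks until in-flight creations leave their busy sections, then releases every registered table. */
    void stop() {
        busyLock.writeLock().lock();
        try {
            stopped = true;
        } finally {
            busyLock.writeLock().unlock();
        }

        // No new registrations are possible once 'stopped' is set, so this iteration sees all tables.
        tables.values().forEach(Runnable::run);
        tables.clear();
    }
}
```

With this shape, a table whose creation completes before stop() is released by stop(), and a table whose creation completes after stop() is released by completeCreation() itself, so neither ordering leaves resources behind.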
> Race between completing table creation and stopping TableManager
> ----------------------------------------------------------------
>
> Key: IGNITE-17286
> URL: https://issues.apache.org/jira/browse/IGNITE-17286
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Assignee: Mirza Aliev
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-alpha6
>
> Time Spent: 0.5h
> Remaining Estimate: 0h