Posted to commits@cassandra.apache.org by "Caleb Rackliffe (Jira)" <ji...@apache.org> on 2021/03/02 03:55:00 UTC

[jira] [Comment Edited] (CASSANDRA-16181) 4.0 Quality: Replication Test Audit

    [ https://issues.apache.org/jira/browse/CASSANDRA-16181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293329#comment-17293329 ] 

Caleb Rackliffe edited comment on CASSANDRA-16181 at 3/2/21, 3:54 AM:
----------------------------------------------------------------------

Looking through the logs for {{shouldStreamHintsDuringDecomission}}, the migration coordinator on one of the nodes tries to submit a schema pull to the MIGRATION stage without checking whether the stage executor has been shut down, which it may have been as a result of the decommission. ({{StorageService#decommission()}} shuts down all the stage executors.)

{noformat}
ERROR [node1_isolatedExecutor:1] node1 2021-02-15 19:35:36,284 CassandraDaemon.java:579 - Exception in thread Thread[node1_NonPeriodicTasks:1,5,node1]
java.util.concurrent.RejectedExecutionException: ThreadPoolExecutor has shut down
at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor$1.rejectedExecution(DebuggableThreadPoolExecutor.java:72)
at java.base/java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825)
at java.base/java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355)
at org.apache.cassandra.concurrent.DebuggableThreadPoolExecutor.execute(DebuggableThreadPoolExecutor.java:176)
at java.base/java.util.concurrent.AbstractExecutorService.submit(AbstractExecutorService.java:118)
at org.apache.cassandra.concurrent.Stage.submit(Stage.java:129)
at org.apache.cassandra.schema.MigrationCoordinator.lambda$scheduleSchemaPull$2(MigrationCoordinator.java:362) 
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264) 
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) 
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) 
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) 
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) 
at java.base/java.lang.Thread.run(Thread.java:834)
{noformat}

This looks like a general issue with pool shutdown order during a decommission. [~e.dimitrova] [~adelapena] If this makes sense to you, I think the best option would be to create a separate Jira to make sure {{ScheduledExecutors.nonPeriodicTasks}} is shut down during {{StorageService#decommission()}} finalization, just as it is in {{StorageService#drain()}}. As long as that happens before we call {{Stage.shutdownNow()}}, a delayed schema pull shouldn't be able to sneak onto an already-closed MIGRATION stage executor.
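
To illustrate the ordering problem outside of Cassandra, here is a minimal sketch that uses plain {{java.util.concurrent}} executors as stand-ins for {{ScheduledExecutors.nonPeriodicTasks}} and the MIGRATION stage. The class name, method name, and 200 ms delay are hypothetical. The sketch also assumes the scheduler is stopped with {{shutdownNow()}} so that pending delayed tasks are discarded; how Cassandra's own executors handle pending delayed tasks on shutdown is not covered here.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical stand-alone sketch; not Cassandra code.
class ShutdownOrderSketch {

    /**
     * Schedules a delayed task that submits work to a "stage" executor, then
     * shuts both pools down in the requested order. Returns true if the
     * delayed task ran and its submission was rejected.
     */
    static boolean delayedSubmitRejected(boolean stopSchedulerFirst) {
        // Stand-in for ScheduledExecutors.nonPeriodicTasks (assumption).
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        // Stand-in for the MIGRATION stage executor (assumption).
        ExecutorService stage = Executors.newSingleThreadExecutor();
        AtomicBoolean rejected = new AtomicBoolean(false);
        Runnable schemaPull = () -> { };

        // The delayed "schema pull": submits to the stage without checking
        // whether the stage has already been shut down.
        scheduler.schedule(() -> {
            try {
                stage.submit(schemaPull);
            } catch (RejectedExecutionException e) {
                rejected.set(true);
            }
        }, 200, TimeUnit.MILLISECONDS);

        try {
            if (stopSchedulerFirst) {
                // Proposed order: discard pending delayed tasks, then stop the stage.
                scheduler.shutdownNow();
                scheduler.awaitTermination(5, TimeUnit.SECONDS);
                stage.shutdownNow();
            } else {
                // Problematic order: the stage is already closed when the delayed
                // task fires. A plain shutdown() of a JDK scheduled executor still
                // runs already-scheduled delayed tasks by default.
                stage.shutdownNow();
                scheduler.shutdown();
                scheduler.awaitTermination(5, TimeUnit.SECONDS);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return rejected.get();
    }

    public static void main(String[] args) {
        System.out.println("stage stopped first:     rejected=" + delayedSubmitRejected(false));
        System.out.println("scheduler stopped first: rejected=" + delayedSubmitRejected(true));
    }
}
```

With the stage stopped first, the delayed task should hit a {{RejectedExecutionException}}, much like the trace above. With the scheduler stopped first, the task never runs, so nothing reaches the closed stage.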


> 4.0 Quality: Replication Test Audit
> -----------------------------------
>
>                 Key: CASSANDRA-16181
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16181
>             Project: Cassandra
>          Issue Type: Task
>          Components: Test/unit
>            Reporter: Andres de la Peña
>            Assignee: Caleb Rackliffe
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 11h 20m
>  Remaining Estimate: 0h
>
> This is a subtask of CASSANDRA-15579 focusing on replication.
> I think that the main reference dtest for this is [replication_test.py|https://github.com/apache/cassandra-dtest/blob/master/replication_test.py]. We should identify which other tests cover this and identify what should be extended, similarly to what has been done with CASSANDRA-15977.
> The doc [here|https://docs.google.com/document/d/1yPbquhAALIkkTRMmyOv5cceD5N5sPFMB1O4iOd3O7FM/edit?usp=sharing] describes the existing state of testing around replication.



