Posted to commits@cassandra.apache.org by "David Capwell (Jira)" <ji...@apache.org> on 2021/04/13 16:56:00 UTC

[jira] [Commented] (CASSANDRA-16585) Periodic failures in *RepairCoordinator*Test caused by race condition with nodetool repair

    [ https://issues.apache.org/jira/browse/CASSANDRA-16585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17320325#comment-17320325 ] 

David Capwell commented on CASSANDRA-16585:
-------------------------------------------

Starting commit

CI Results (pending):
||Branch||Source||Circle CI||Jenkins||
|trunk|[branch|https://github.com/dcapwell/cassandra/tree/commit_remote_branch/CASSANDRA-16585-trunk-FAA5CE7A-C333-46EE-99DA-6E208404FAB8]|[build|https://app.circleci.com/pipelines/github/dcapwell/cassandra?branch=commit_remote_branch%2FCASSANDRA-16585-trunk-FAA5CE7A-C333-46EE-99DA-6E208404FAB8]|[build|https://ci-cassandra.apache.org/job/Cassandra-devbranch/646/]|


> Periodic failures in *RepairCoordinator*Test caused by race condition with nodetool repair
> ------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-16585
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16585
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI, Consistency/Repair, Test/dtest/java
>            Reporter: David Capwell
>            Assignee: David Capwell
>            Priority: Normal
>             Fix For: 4.0-rc
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Periodic failures in *RepairCoordinator*Test cause errors such as
> FullRepairCoordinatorNeighbourDownTest#validationParticipentCrashesAndComesBack[DATACENTER_AWARE/true] 
> {code}
> nodetool command [repair, distributed_test_keyspace, validationparticipentcrashesandcomesback_full_datacenter_aware_true, --dc-parallel, --full] Error message 'Some repair failed' does not contain any of [/127.0.0.2:7012 died]
> stdout:
> [2021-04-07 22:45:24,887] Starting repair command #10 (f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace distributed_test_keyspace with repair options (parallelism: dc_parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [validationparticipentcrashesandcomesback_full_datacenter_aware_true], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, optimise streams: false, ignore unreplicated keyspaces: false)
> [2021-04-07 22:45:32,864] Repair command #10 failed with error Repair session f1342ba0-97f2-11eb-9316-794aa6ab8411 for range [(-1,9223372036854775805], (9223372036854775805,-1]] failed with error Endpoint /127.0.0.2:7012 died
> [2021-04-07 22:45:32,887] After waiting for poll interval of 1 seconds queried for parent session status and discovered repair failed.
> [2021-04-07 22:45:32,887] Repair command #10 finished with error
> [2021-04-07 22:45:32,887] Some repair failed
> [2021-04-07 22:45:32,888] Repair command #10 finished with error
> stderr:
> error: Some repair failed
> -- StackTrace --
> java.io.IOException: Some repair failed
> at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
> at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
> at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
> at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
> at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
> at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
> at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
> at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
> at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)
> Notifications:
> Notification{type=START, src=repair:10, message=Starting repair command #10 (f129cb60-97f2-11eb-9316-794aa6ab8411), repairing keyspace distributed_test_keyspace with repair options (parallelism: dc_parallel, primary range: false, incremental: false, job threads: 1, ColumnFamilies: [validationparticipentcrashesandcomesback_full_datacenter_aware_true], dataCenters: [], hosts: [], previewKind: NONE, # of ranges: 2, pull repair: false, force repair: false, optimise streams: false, ignore unreplicated keyspaces: false)}
> Notification{type=ERROR, src=repair:10, message=Repair command #10 failed with error Repair session f1342ba0-97f2-11eb-9316-794aa6ab8411 for range [(-1,9223372036854775805], (9223372036854775805,-1]] failed with error Endpoint /127.0.0.2:7012 died}
> Notification{type=COMPLETE, src=repair:10, message=Repair command #10 finished with error}
> Error:
> java.io.IOException: Some repair failed
> at org.apache.cassandra.tools.RepairRunner.queryForCompletedRepair(RepairRunner.java:167)
> at org.apache.cassandra.tools.RepairRunner.run(RepairRunner.java:72)
> at org.apache.cassandra.tools.NodeProbe.repairAsync(NodeProbe.java:431)
> at org.apache.cassandra.tools.nodetool.Repair.execute(Repair.java:171)
> at org.apache.cassandra.tools.NodeTool$NodeToolCmd.runInternal(NodeTool.java:358)
> at org.apache.cassandra.tools.NodeTool$NodeToolCmd.run(NodeTool.java:343)
> at org.apache.cassandra.tools.NodeTool.execute(NodeTool.java:246)
> at org.apache.cassandra.distributed.impl.Instance$DTestNodeTool.execute(Instance.java:836)
> at org.apache.cassandra.distributed.impl.Instance.lambda$nodetoolResult$38(Instance.java:746)
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> There appears to be a race condition in nodetool repair: we query the error state before we receive the error notification, and then throw a generic error ("Some repair failed") rather than the specific one ("Endpoint /127.0.0.2:7012 died").
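
The race described above can be illustrated with a minimal, self-contained Java sketch. This is not the code in RepairRunner and not the fix proposed for this ticket; the class RepairErrorRace and all of its methods are hypothetical names invented for illustration. It assumes a poller thread that discovers the failure and a separate listener thread that delivers the specific error message, and it shows one possible mitigation: give the notification a short grace period before falling back to the generic message.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: none of these names come from Cassandra.
public class RepairErrorRace {
    static final String GENERIC_ERROR = "Some repair failed";

    // Completed by the (assumed) notification listener when an ERROR notification arrives.
    private final CompletableFuture<String> errorNotification = new CompletableFuture<>();

    // Called from the notification thread with the specific failure message.
    public void onErrorNotification(String message) {
        errorNotification.complete(message);
    }

    // Racy variant: the poller has seen a failed status and reports whatever
    // is known right now. If the notification has not arrived yet, the
    // specific cause is lost and only the generic message is returned.
    public String errorSeenByPollerRacy() {
        return errorNotification.getNow(GENERIC_ERROR);
    }

    // Mitigation sketch: wait up to the given grace period for the specific
    // error before falling back to the generic message.
    public String errorSeenByPollerWithGrace(long timeout, TimeUnit unit) {
        try {
            return errorNotification.get(timeout, unit);
        } catch (TimeoutException | ExecutionException e) {
            return GENERIC_ERROR;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for callers
            return GENERIC_ERROR;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // Poller wins the race: no notification has been delivered yet.
        RepairErrorRace racy = new RepairErrorRace();
        System.out.println("racy:  " + racy.errorSeenByPollerRacy());

        // Notification arrives slightly after the poller notices the failure.
        RepairErrorRace graceful = new RepairErrorRace();
        Thread listener = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            graceful.onErrorNotification("Endpoint /127.0.0.2:7012 died");
        });
        listener.start();
        System.out.println("grace: " + graceful.errorSeenByPollerWithGrace(5, TimeUnit.SECONDS));
        listener.join();
    }
}
```

In the racy variant the caller only ever sees the generic message when the poll completes first, which matches the test failure quoted above. The grace-period variant trades a bounded delay for a better chance of surfacing the specific cause.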


