Posted to commits@cassandra.apache.org by "Yuki Morishita (JIRA)" <ji...@apache.org> on 2015/01/06 20:33:35 UTC
[jira] [Commented] (CASSANDRA-8316) "Did not get positive replies from all endpoints" error on incremental repair

    [ https://issues.apache.org/jira/browse/CASSANDRA-8316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266614#comment-14266614 ] 

Yuki Morishita commented on CASSANDRA-8316:
-------------------------------------------

bq. 4. B finishes preparing and marks a bunch of sstables as being repaired

B does not mark sstables as repaired just for receiving a prepare message, does it?

As I understand it, the current issue is that a prepared repair session is left behind on the replica nodes when the prepare phase times out on the coordinator.
(In that case, the user can work around it by invoking "forceTerminateRepairSession" manually.)

I would prefer sending a cancel message, though adding a new message type may be difficult in a minor release. We would also have to make sure the message does not get dropped, since AntiEntropyStage may still be busy preparing when the cancel message arrives.

Alternatively, I think the right way to remove leftover sessions automatically is to track repair status, as we do in CASSANDRA-5839, and use that status to determine which prepared sessions can be removed.
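
To make the status-tracking idea concrete, here is a rough sketch in Python. It is not Cassandra code: the class, the status values, and the timeout are all hypothetical and only show how a replica could decide, from tracked status alone, which prepared sessions are safe to drop.

```python
import enum
import time


class RepairStatus(enum.Enum):
    # Hypothetical status values, not taken from Cassandra.
    PREPARED = "prepared"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"


class PreparedSessionRegistry:
    """Hypothetical replica-side registry of prepared repair sessions."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        # session id -> (status, time of the last status change)
        self._sessions = {}

    def update(self, session_id, status):
        """Record the latest known status for a session."""
        self._sessions[session_id] = (status, self._clock())

    def remove_stale(self, prepare_timeout_seconds):
        """Drop finished or failed sessions, and sessions that have sat in
        PREPARED for at least the timeout (the coordinator presumably gave up)."""
        now = self._clock()
        removed = []
        for session_id, (status, since) in list(self._sessions.items()):
            done = status in (RepairStatus.FINISHED, RepairStatus.FAILED)
            abandoned = (status is RepairStatus.PREPARED
                         and now - since >= prepare_timeout_seconds)
            if done or abandoned:
                del self._sessions[session_id]
                removed.append(session_id)
        return removed

    def active_sessions(self):
        """Return the ids of sessions still tracked, sorted."""
        return sorted(self._sessions)
```

A periodic task on each replica could call remove_stale() with a timeout comfortably longer than the coordinator's prepare timeout, so that a session the coordinator has already abandoned does not linger until someone terminates it by hand.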

Either way, I think we can defer the resolution to 3.0, unless I am underestimating the severity of the issue.

>  "Did not get positive replies from all endpoints" error on incremental repair
> ------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-8316
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-8316
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Core
>         Environment: cassandra 2.1.2
>            Reporter: Loic Lambiel
>            Assignee: Marcus Eriksson
>             Fix For: 2.1.3
>
>         Attachments: 0001-patch.patch, 8316-v2.patch, CassandraDaemon-2014-11-25-2.snapshot.tar.gz, CassandraDaemon-2014-12-14.snapshot.tar.gz, test.sh
>
>
> Hi,
> I've got an issue with incremental repairs on our 15-node production cluster running 2.1.2 (new cluster, not yet loaded, RF=3).
> After successfully performing an incremental repair (-par -inc) on 3 nodes, I started receiving "Repair failed with error Did not get positive replies from all endpoints." from nodetool on all remaining nodes:
> [2014-11-14 09:12:36,488] Starting repair command #3, repairing 108 ranges for keyspace xxxx (seq=false, full=false)
> [2014-11-14 09:12:47,919] Repair failed with error Did not get positive replies from all endpoints.
> All the nodes are up and running and the local system log shows that the repair commands got started and that's it.
> I've also noticed that soon after the repair, several nodes showed elevated CPU load indefinitely for no apparent reason (no tasks or queries, nothing in the logs). I then restarted C* on these nodes and retried the repair on several nodes; the repairs succeeded until the issue appeared again.
> I tried to reproduce this on our 3-node preproduction cluster, without success.
> It looks like I'm not the only one having this issue: http://www.mail-archive.com/user%40cassandra.apache.org/msg39145.html
> Any idea?
> Thanks
> Loic


