Posted to user@cassandra.apache.org by Juho Mäkinen <ju...@gmail.com> on 2014/10/30 11:37:11 UTC

"Did not get positive replies from all endpoints" error on incremental repair

I'm having problems running nodetool repair -inc -par -pr on my 2.1.1
cluster due to a "Did not get positive replies from all endpoints" error.

Here's an example output:
root@db08-3:~# nodetool repair -par -inc -pr
[2014-10-30 10:33:02,396] Nothing to repair for keyspace 'system'
[2014-10-30 10:33:02,420] Starting repair command #10, repairing 256 ranges for keyspace profiles (seq=false, full=false)
[2014-10-30 10:33:17,240] Repair failed with error Did not get positive replies from all endpoints.
[2014-10-30 10:33:17,263] Starting repair command #11, repairing 256 ranges for keyspace OpsCenter (seq=false, full=false)
[2014-10-30 10:33:32,242] Repair failed with error Did not get positive replies from all endpoints.
[2014-10-30 10:33:32,249] Starting repair command #12, repairing 256 ranges for keyspace system_traces (seq=false, full=false)
[2014-10-30 10:33:44,243] Repair failed with error Did not get positive replies from all endpoints.
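
The timestamps in this output can be turned into a per-keyspace summary of how long each repair command ran before failing. The following is a minimal Python sketch, assuming the line formats quoted above (other Cassandra versions may word these messages differently). The names summarize_repair_output and unwrap and the regular expressions are illustrative and not part of nodetool.

```python
import re
from datetime import datetime

# Patterns written against the nodetool lines quoted in this thread
# (assumption: other versions may format these messages differently).
START_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] Starting repair command #(?P<cmd>\d+), "
    r"repairing (?P<ranges>\d+) ranges for keyspace (?P<keyspace>\S+)"
)
FAIL_RE = re.compile(r"\[(?P<ts>[^\]]+)\] Repair failed with error (?P<error>.+)")


def parse_timestamp(text):
    """Parse a nodetool timestamp such as '2014-10-30 10:33:02,420'."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S,%f")


def unwrap(output):
    """Rejoin lines that a mail client wrapped: a line that does not start
    with '[' is treated as a continuation of the previous line."""
    lines = []
    for raw in output.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        if raw.startswith("[") or not lines:
            lines.append(raw)
        else:
            lines[-1] += " " + raw
    return lines


def summarize_repair_output(output):
    """Return (keyspace, seconds_until_failure, error) for each failed command."""
    failures = []
    current = None  # match object of the most recent "Starting repair" line
    for line in unwrap(output):
        started = START_RE.match(line)
        if started:
            current = started
            continue
        failed = FAIL_RE.match(line)
        if failed and current:
            elapsed = parse_timestamp(failed.group("ts")) - parse_timestamp(current.group("ts"))
            failures.append(
                (current.group("keyspace"), elapsed.total_seconds(), failed.group("error"))
            )
            current = None
    return failures


SAMPLE = """\
[2014-10-30 10:33:02,420] Starting repair command #10, repairing 256 ranges
for keyspace profiles (seq=false, full=false)
[2014-10-30 10:33:17,240] Repair failed with error Did not get positive
replies from all endpoints.
"""

for keyspace, seconds, error in summarize_repair_output(SAMPLE):
    print(f"{keyspace}: failed after {seconds:.3f}s ({error})")
```

Applied to the full output above, the gaps between start and failure are 14.82 s, 14.979 s, and 11.994 s for profiles, OpsCenter, and system_traces respectively.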

The local system log shows that the repair commands got started, but it
seems that they immediately get cancelled due to that error, which btw
doesn't appear in the Cassandra log.

I tried monitoring the logs on all machines in case another machine would
show some useful error, but so far I haven't found anything.

Any ideas where this error comes from?

 - Garo

Re: "Did not get positive replies from all endpoints" error on incremental repair

Posted by Robert Coli <rc...@eventbrite.com>.

On Fri, Oct 31, 2014 at 8:55 AM, Juho Mäkinen <ju...@gmail.com>
wrote:

> I can't yet call this conclusive, but it seems that I can't run
> incremental repairs on the current 2.1.1 and I'm still wondering if anybody
> else is experiencing the same problem.
>

You have repro steps; if I were you I would file a JIRA on
http://issues.apache.org.

=Rob

Re: "Did not get positive replies from all endpoints" error on incremental repair

Posted by Juho Mäkinen <ju...@gmail.com>.

I relaunched my cluster from scratch (for an unrelated reason). After the
relaunch I could run nodetool repair -par -inc -pr on the nodes without
issue, but pretty much the moment I started pushing production load
to the cluster I ran into the same problem again. I first opened a ticket
asking for more logging, but I'll most probably end up adding the logging
myself and then start digging into the actual root cause.

I also ran one nodetool repair -par (i.e. without incremental repair) and it
seems that the repair started. I guess I need to go over the sources to see
whether there's a different code path which would explain this.

I can't yet call this conclusive, but it seems that I can't run incremental
repairs on the current 2.1.1 and I'm still wondering if anybody else is
experiencing the same problem.

On Thu, Oct 30, 2014 at 1:14 PM, Juho Mäkinen <ju...@gmail.com>
wrote:

> No, the cluster seems to be performing just fine. It seems that the
> prepareForRepair callback() could be easily modified to print which node(s)
> are unable to respond, so that the debugging effort could be focused
> better. This of course doesn't help this case as it's not trivial to add
> the log lines and to roll it out to the entire cluster.
>
> The cluster is relatively young, containing only 450GB with RF=3 spread
> over nine nodes and I'm still practicing how to run incremental repairs on
> the cluster when I stumbled on this issue.
>
> On Thu, Oct 30, 2014 at 12:52 PM, Rahul Neelakantan <ra...@rahul.be>
> wrote:
>
>> It appears to come from the ActiveRepairService.prepareForRepair portion
>> of the Code.
>>
>> Are you sure all nodes are reachable from the node you are initiating
>> repair on, at the same time?
>>
>> Any Node up/down/died messages?
>>
>> Rahul Neelakantan

Re: "Did not get positive replies from all endpoints" error on incremental repair

Posted by Juho Mäkinen <ju...@gmail.com>.

No, the cluster seems to be performing just fine. It seems that the
prepareForRepair callback() could be easily modified to print which node(s)
are unable to respond, so that the debugging effort could be focused
better. This of course doesn't help this case as it's not trivial to add
the log lines and to roll it out to the entire cluster.
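
The idea of reporting which node(s) failed to respond can be sketched independently of Cassandra. The following Python example is only an illustration of the pattern (track the endpoints that were asked to prepare, then name the ones that never replied or replied negatively); it is not Cassandra's actual prepareForRepair code, and the class PrepareTracker, its methods, and the example addresses are all hypothetical.

```python
import threading


class PrepareTracker:
    """Tracks replies to a (hypothetical) prepare message sent to each endpoint."""

    def __init__(self, endpoints):
        self._pending = set(endpoints)   # endpoints that have not replied yet
        self._negative = set()           # endpoints that replied with a failure
        self._lock = threading.Lock()
        self._done = threading.Event()
        if not self._pending:
            self._done.set()

    def on_response(self, endpoint, ok=True):
        """Record a reply; would be called from the messaging callback."""
        with self._lock:
            self._pending.discard(endpoint)
            if not ok:
                self._negative.add(endpoint)
            if not self._pending:
                self._done.set()

    def wait(self, timeout):
        """Block until every endpoint replied or the timeout expires, then
        raise an error that names the endpoints responsible."""
        self._done.wait(timeout)
        with self._lock:
            missing = sorted(self._pending)
            negative = sorted(self._negative)
        if missing or negative:
            raise RuntimeError(
                "Did not get positive replies from all endpoints. "
                f"No reply from: {missing}; negative reply from: {negative}"
            )


# Example: three replicas are asked to prepare, one never answers.
tracker = PrepareTracker(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
tracker.on_response("10.0.0.1")
tracker.on_response("10.0.0.2")
try:
    tracker.wait(timeout=0.1)
except RuntimeError as exc:
    print(exc)
```

With a message of this shape, the failing node would be visible directly in the error instead of requiring a search through every node's log.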

The cluster is relatively young, containing only 450GB with RF=3 spread
over nine nodes, and I was still practicing how to run incremental repairs on
the cluster when I stumbled on this issue.

On Thu, Oct 30, 2014 at 12:52 PM, Rahul Neelakantan <ra...@rahul.be> wrote:

> It appears to come from the ActiveRepairService.prepareForRepair portion
> of the Code.
>
> Are you sure all nodes are reachable from the node you are initiating
> repair on, at the same time?
>
> Any Node up/down/died messages?
>
> Rahul Neelakantan

Re: "Did not get positive replies from all endpoints" error on incremental repair

Posted by Rahul Neelakantan <ra...@rahul.be>.

It appears to come from the ActiveRepairService.prepareForRepair portion of the code.

Are you sure all nodes are reachable from the node you are initiating repair on, at the same time?

Any Node up/down/died messages?

Rahul Neelakantan
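
One rough way to test the reachability question is to open TCP connections to every node at the same time from the node that initiates the repair. The following Python sketch assumes the inter-node port is Cassandra's default storage_port of 7000 (adjust it if cassandra.yaml overrides this); the function names and the host list are hypothetical. A successful connect only shows that the port accepts connections from this machine, not that the node will answer repair messages, so it complements rather than replaces nodetool status.

```python
import socket
from concurrent.futures import ThreadPoolExecutor

# Assumption: default inter-node port; change if storage_port is overridden.
STORAGE_PORT = 7000


def can_connect(host, port=STORAGE_PORT, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(hosts, port=STORAGE_PORT, timeout=3.0):
    """Probe all hosts concurrently and return those that refused or timed out."""
    hosts = list(hosts)
    if not hosts:
        return []
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        results = list(pool.map(lambda h: can_connect(h, port, timeout), hosts))
    return [host for host, ok in zip(hosts, results) if not ok]


if __name__ == "__main__":
    # Hypothetical addresses; replace with the cluster's real node list.
    nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
    down = unreachable(nodes, timeout=1.0)
    print("unreachable:", down if down else "none")
```

Probing all nodes in parallel matters here because the question is whether they are reachable at the same time, not one after another.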
