Posted to user@cassandra.apache.org by Brian Tarbox <ta...@cabotresearch.com> on 2014/07/01 18:24:44 UTC

nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

I have a six node cluster in AWS (repl:3) and recently noticed that repair
was hanging.  I've run with the "-pr" switch.

I see this output in the nodetool command line (and also in that node's
system.log):
 Starting repair command #9, repairing 256 ranges for keyspace dev_a

but then no other output.  And I see nothing in any of the other nodes' log
files.

Right now the application using C* is turned off so there is zero activity.
I've let it be in this state for up to 24 hours with nothing more logged.

Any suggestions?

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Brian Tarbox <ta...@cabotresearch.com>.

"For what purpose are you running repair?"   Because I read that we should!
:-)

We do delete data from one column family quite regularly...from the other
CFs occasionally.  We almost never run with less than 100% of our nodes up.

In this configuration do we *need* to run repair?

Thanks,

On Tue, Jul 1, 2014 at 2:57 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Tue, Jul 1, 2014 at 11:54 AM, Brian Tarbox <ta...@cabotresearch.com>
> wrote:
>
>> Given that an upgrade is (for various internal reasons) not an option at
>> this point...is there anything I can do to get repair working again?  I'll
>> also mention that I see this behavior from all nodes.
>>
>
> I think maybe increasing your phi tolerance for streaming timeouts might
> help.
>
> But basically, no. Repair has historically been quite broken in AWS. It
> was re-written in 2.0 along with the rest of streaming, and hopefully will
> soon stabilize and actually work.
>
> For what purpose are you running repair?
>
> =Rob
>

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Jul 1, 2014 at 11:54 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:

> Given that an upgrade is (for various internal reasons) not an option at
> this point...is there anything I can do to get repair working again?  I'll
> also mention that I see this behavior from all nodes.
>

I think maybe increasing your phi tolerance for streaming timeouts might
help.
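
The "phi tolerance" here presumably refers to the failure detector's
phi_convict_threshold setting in cassandra.yaml (an assumption; the thread
does not name the setting). A minimal sketch for checking what value a node
is currently using, assuming the setting is commented out by default and
that the default is 8. The function name effective_phi and the constant
DEFAULT_PHI are hypothetical helpers, not part of Cassandra:

```python
import re

# Assumed default used by Cassandra when the setting is left commented out.
DEFAULT_PHI = 8.0


def effective_phi(yaml_text):
    """Return the active phi_convict_threshold, ignoring commented-out lines."""
    match = re.search(
        r"^[ \t]*phi_convict_threshold[ \t]*:[ \t]*([0-9.]+)",
        yaml_text,
        re.MULTILINE,
    )
    return float(match.group(1)) if match else DEFAULT_PHI


# A commented-out line does not count, so the assumed default applies.
print(effective_phi("# phi_convict_threshold: 8\n"))
# An uncommented line overrides it.
print(effective_phi("cluster_name: dev\nphi_convict_threshold: 12\n"))
```

In practice the text would come from reading the node's cassandra.yaml; a
changed value only takes effect after the node is restarted.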

But basically, no. Repair has historically been quite broken in AWS. It was
re-written in 2.0 along with the rest of streaming, and hopefully will soon
stabilize and actually work.

For what purpose are you running repair?

=Rob

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Brian Tarbox <ta...@cabotresearch.com>.

Given that an upgrade is (for various internal reasons) not an option at
this point...is there anything I can do to get repair working again?  I'll
also mention that I see this behavior from all nodes.

Thanks.

On Tue, Jul 1, 2014 at 2:51 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Tue, Jul 1, 2014 at 11:09 AM, Brian Tarbox <ta...@cabotresearch.com>
> wrote:
>
>> We're running 1.2.13.
>>
>
> 1.2.17 contains a few streaming fixes which might help.
>
>
>> Any chance that doing a rolling-restart would help?
>>
>
> Probably not.
>
>
>> Would running without the "-pr" improve the odds?
>>
>
> No, that'd make it less likely to succeed.
>
> =Rob
>
>

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Jul 1, 2014 at 11:09 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:

> We're running 1.2.13.
>

1.2.17 contains a few streaming fixes which might help.

> Any chance that doing a rolling-restart would help?
>

Probably not.

> Would running without the "-pr" improve the odds?
>

No, that'd make it less likely to succeed.
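
A rough way to see why: with "-pr" a node repairs only the ranges it is
primary for, while without it the node repairs every range it holds a
replica of. The back-of-the-envelope sketch below assumes vnodes with 256
tokens per node (consistent with the "repairing 256 ranges" log line, but
not confirmed in the thread) and replicas spread evenly across nodes. The
function name is hypothetical:

```python
def ranges_per_repair(nodes, num_tokens, replication_factor, primary_only):
    """Average number of token ranges one node covers in a single repair run."""
    total_ranges = nodes * num_tokens
    if primary_only:
        # Each range has exactly one primary owner.
        return total_ranges / nodes
    # Each range is stored on replication_factor nodes.
    return total_ranges * replication_factor / nodes


with_pr = ranges_per_repair(6, 256, 3, primary_only=True)
without_pr = ranges_per_repair(6, 256, 3, primary_only=False)
print(with_pr, without_pr)  # 256.0 768.0
```

Under these assumptions a run without "-pr" covers about three times as
many ranges per node, so it is more work and has more chances to stall.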

=Rob

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Brian Tarbox <ta...@cabotresearch.com>.

Does this output from jstack indicate a problem?

"ReadRepairStage:12170" daemon prio=10 tid=0x00007f9dcc018800 nid=0x7361 waiting on condition [0x00007f9db540c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"ReadRepairStage:12169" daemon prio=10 tid=0x00007f9dd4009000 nid=0x7340 waiting on condition [0x00007f9db53cb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"ReadRepairStage:12168" daemon prio=10 tid=0x00007f9dd001d000 nid=0x733f waiting on condition [0x00007f9db51a6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)
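
For what it's worth, a thread parked inside ThreadPoolExecutor.getTask is a
pool worker waiting for its next task, which is the normal idle state and
not by itself a sign of a hang. A minimal sketch for flagging such threads
in a dump, so attention can go to the remaining ones. The helper names and
the shortened SAMPLE (taken from the first trace above) are illustrative
assumptions:

```python
import re


def split_threads(dump):
    """Split jstack text into one block per thread (each starts with a quoted name)."""
    blocks = re.split(r'\n(?=")', dump.strip())
    return [b for b in blocks if b.startswith('"')]


def thread_name(block):
    """Return the quoted thread name from the first line of a block."""
    return block.split('"')[1]


def is_idle_pool_worker(block):
    """True when the thread is waiting inside ThreadPoolExecutor.getTask."""
    waiting = "WAITING" in block  # matches both WAITING and TIMED_WAITING
    return waiting and "ThreadPoolExecutor.getTask" in block


SAMPLE = '''"ReadRepairStage:12170" daemon prio=10 tid=0x00007f9dcc018800 nid=0x7361 waiting on condition [0x00007f9db540c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.lang.Thread.run(Thread.java:744)
'''

for block in split_threads(SAMPLE):
    print(thread_name(block), is_idle_pool_worker(block))
```

The check is a plain substring match on the whole block, so it should also
work on dumps whose long lines were wrapped by a mail client.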




On Tue, Jul 1, 2014 at 2:09 PM, Brian Tarbox <ta...@cabotresearch.com>
wrote:

> We're running 1.2.13.
>
> Any chance that doing a rolling-restart would help?
>
> Would running without the "-pr" improve the odds?
>
> Thanks.
>
>
> On Tue, Jul 1, 2014 at 1:40 PM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
>> wrote:
>>
>>> I have a six node cluster in AWS (repl:3) and recently noticed that
>>> repair was hanging.  I've run with the "-pr" switch.
>>>
>>
>> It'll do that.
>>
>> What version of Cassandra?
>>
>> =Rob
>>
>>
>
>

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Brian Tarbox <ta...@cabotresearch.com>.

We're running 1.2.13.

Any chance that doing a rolling-restart would help?

Would running without the "-pr" improve the odds?

Thanks.

On Tue, Jul 1, 2014 at 1:40 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
> wrote:
>
>> I have a six node cluster in AWS (repl:3) and recently noticed that
>> repair was hanging.  I've run with the "-pr" switch.
>>
>
> It'll do that.
>
> What version of Cassandra?
>
> =Rob
>
>

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Robert Coli <rc...@eventbrite.com>.

On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:

> I have a six node cluster in AWS (repl:3) and recently noticed that repair
> was hanging.  I've run with the "-pr" switch.
>

It'll do that.

What version of Cassandra?

=Rob

Re: nodetool repair saying "starting" and then nothing, and nothing in any of the server logs either

Posted by Kevin Burton <bu...@spinn3r.com>.

If the boxes are idle, you could run jstack and look at the thread stacks…
perhaps something is locked somewhere.

Worth a shot.
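
A minimal sketch of that idea: capture a dump with jstack and count thread
states for the pools most likely to be involved in repair. The name
fragments in REPAIR_HINTS are assumptions about how the relevant pools are
named and should be adjusted to whatever the dump actually shows. The
function names are hypothetical, and capture_dump needs a JDK's jstack on
the PATH and the Cassandra process id:

```python
import subprocess
from collections import Counter

# Assumed name fragments for repair-related thread pools; adjust as needed.
REPAIR_HINTS = ("AntiEntropy", "Validation", "Stream")


def capture_dump(pid):
    """Run `jstack <pid>` and return its text output."""
    result = subprocess.run(
        ["jstack", str(pid)], capture_output=True, text=True, check=True
    )
    return result.stdout


def summarize(dump, hints=REPAIR_HINTS):
    """Count (pool name, thread state) pairs for threads whose name contains a hint."""
    counts = Counter()
    name = None
    for line in dump.splitlines():
        if line.startswith('"'):
            # Thread header line: the name is the first quoted string.
            name = line.split('"')[1]
        elif "java.lang.Thread.State:" in line and name is not None:
            state = line.split("State:")[1].split()[0]
            if any(hint in name for hint in hints):
                counts[(name.split(":")[0], state)] += 1
            name = None
    return counts
```

Usage would look like summarize(capture_dump(pid)), run on each node while
the repair appears stuck. Threads in BLOCKED state, or the same stack
showing up unchanged across several dumps, would be the ones to look at
first.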


On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:

> I have a six node cluster in AWS (repl:3) and recently noticed that repair
> was hanging.  I've run with the "-pr" switch.
>
> I see this output in the nodetool command line (and also in that node's
> system.log):
>  Starting repair command #9, repairing 256 ranges for keyspace dev_a
>
> but then no other output.  And I see nothing in any of the other nodes'
> log files.
>
> Right now the application using C* is turned off so there is zero activity.
> I've let it be in this state for up to 24 hours with nothing more logged.
>
> Any suggestions?
>


