Posted to user@cassandra.apache.org by Brian Tarbox <ta...@cabotresearch.com> on 2014/07/01 18:24:44 UTC
nodetool repair saying "starting" and then nothing, and nothing in
any of the server logs either
I have a six node cluster in AWS (repl:3) and recently noticed that repair
was hanging. I've run with the "-pr" switch.
I see this output in the nodetool command line (and also in that node's
system.log):
Starting repair command #9, repairing 256 ranges for keyspace dev_a
but then no other output. And I see nothing in any of the other nodes' log
files.
Right now the application using C* is turned off so there is zero activity.
I've let it be in this state for up to 24 hours with nothing more logged.
Any suggestions?
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Brian Tarbox <ta...@cabotresearch.com>.
"For what purpose are you running repair?" Because I read that we should!
:-)
We do delete data from one column family quite regularly...from the other
CFs occasionally. We almost never run with less than 100% of our nodes up.
In this configuration do we *need* to run repair?
Thanks,
On Tue, Jul 1, 2014 at 2:57 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Tue, Jul 1, 2014 at 11:54 AM, Brian Tarbox <ta...@cabotresearch.com>
> wrote:
>
>> Given that an upgrade is (for various internal reasons) not an option at
>> this point...is there anything I can do to get repair working again? I'll
>> also mention that I see this behavior from all nodes.
>>
>
> I think maybe increasing your phi tolerance for streaming timeouts might
> help.
>
> But basically, no. Repair has historically been quite broken in AWS. It
> was re-written in 2.0 along with the rest of streaming, and hopefully will
> soon stabilize and actually work.
>
> For what purpose are you running repair?
>
> =Rob
>
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Jul 1, 2014 at 11:54 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:
> Given that an upgrade is (for various internal reasons) not an option at
> this point...is there anything I can do to get repair working again? I'll
> also mention that I see this behavior from all nodes.
>
I think maybe increasing your phi tolerance for streaming timeouts might
help.
But basically, no. Repair has historically been quite broken in AWS. It was
re-written in 2.0 along with the rest of streaming, and hopefully will soon
stabilize and actually work.
For what purpose are you running repair?
=Rob
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Brian Tarbox <ta...@cabotresearch.com>.
Given that an upgrade is (for various internal reasons) not an option at
this point...is there anything I can do to get repair working again? I'll
also mention that I see this behavior from all nodes.
Thanks.
On Tue, Jul 1, 2014 at 2:51 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Tue, Jul 1, 2014 at 11:09 AM, Brian Tarbox <ta...@cabotresearch.com>
> wrote:
>
>> We're running 1.2.13.
>>
>
> 1.2.17 contains a few streaming fixes which might help.
>
>
>> Any chance that doing a rolling-restart would help?
>>
>
> Probably not.
>
>
>> Would running without the "-pr" improve the odds?
>>
>
> No, that'd make it less likely to succeed.
>
> =Rob
>
>
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Jul 1, 2014 at 11:09 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:
> We're running 1.2.13.
>
1.2.17 contains a few streaming fixes which might help.
> Any chance that doing a rolling-restart would help?
>
Probably not.
> Would running without the "-pr" improve the odds?
>
No, that'd make it less likely to succeed.
=Rob
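
Since "-pr" repairs only the ranges a node is primary for, it has to be run on every node in turn to cover the whole ring. A minimal Python sketch of that loop, with a timeout so that a repair which prints "Starting repair command" and then goes silent is at least reported. The helper names (repair_command, repair_cluster), the host list and the timeout value are assumptions for illustration, not something from this thread; only the nodetool invocation itself (nodetool -h <host> repair -pr <keyspace>) is the real CLI.

```python
import subprocess


def repair_command(host, keyspace, primary_range=True):
    """Build the nodetool argument list for repairing one keyspace on one node."""
    cmd = ["nodetool", "-h", host, "repair"]
    if primary_range:
        cmd.append("-pr")  # repair only the ranges this node is primary for
    cmd.append(keyspace)
    return cmd


def repair_cluster(hosts, keyspace, timeout_s, run=subprocess.run):
    """Run a -pr repair on each host in turn and collect the hosts that fail.

    Note: on timeout only the local nodetool process is killed; the repair
    session on the server side may still be running (or still be stuck).
    """
    problems = {}
    for host in hosts:
        try:
            result = run(repair_command(host, keyspace), timeout=timeout_s)
            if result.returncode != 0:
                problems[host] = "exit code %d" % result.returncode
        except subprocess.TimeoutExpired:
            problems[host] = "no completion within %d s" % timeout_s
    return problems
```

Usage would look something like repair_cluster(["10.0.0.1", "10.0.0.2"], "dev_a", timeout_s=6 * 3600), where the addresses and the six-hour limit are placeholders. An empty result means every node's repair returned normally within the limit.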
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Brian Tarbox <ta...@cabotresearch.com>.
Does this output from jstack indicate a problem?

"ReadRepairStage:12170" daemon prio=10 tid=0x00007f9dcc018800 nid=0x7361 waiting on condition [0x00007f9db540c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"ReadRepairStage:12169" daemon prio=10 tid=0x00007f9dd4009000 nid=0x7340 waiting on condition [0x00007f9db53cb000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

"ReadRepairStage:12168" daemon prio=10 tid=0x00007f9dd001d000 nid=0x733f waiting on condition [0x00007f9db51a6000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for <0x0000000613e049d8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:226)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2082)
        at java.util.concurrent.LinkedBlockingQueue.poll(LinkedBlockingQueue.java:467)
        at java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1068)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:744)

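
On reading a dump like this one: a thread in TIMED_WAITING whose stack runs through ThreadPoolExecutor.getTask and LinkedBlockingQueue.poll is a pool worker waiting for its next task, so these three frames on their own do not show a hang. ReadRepairStage also serves read repair, not the repair started by nodetool; the pools involved in nodetool repair are, as far as I know, the ones whose names start with "AntiEntropy", so treat that name as an assumption to check against your own dump. Below is a minimal Python sketch for summarizing a whole dump per pool, so that anything not idle stands out. The helper names (parse_threads, is_idle_worker, summarize) are hypothetical, and the "idle" test is only a heuristic based on the getTask frame.

```python
import re
from collections import Counter

# First line of a jstack thread entry, e.g. "ReadRepairStage:12170" daemon ...
THREAD_HEADER = re.compile(r'^"(?P<name>[^"]+)"')
THREAD_STATE = re.compile(r'java\.lang\.Thread\.State:\s+(?P<state>[A-Z_]+)')

# Heuristic: a worker parked in getTask is waiting for work, not doing any.
IDLE_MARKER = "java.util.concurrent.ThreadPoolExecutor.getTask"


def parse_threads(dump):
    """Split a jstack dump into dicts holding name, state and frame lines."""
    threads = []
    current = None
    for raw in dump.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = THREAD_HEADER.match(line)
        if header:
            current = {"name": header.group("name"), "state": None, "frames": []}
            threads.append(current)
            continue
        if current is None:
            continue  # preamble before the first thread entry
        state = THREAD_STATE.search(line)
        if state:
            current["state"] = state.group("state")
        else:
            current["frames"].append(line)
    return threads


def is_idle_worker(thread):
    """True if any frame of the thread is ThreadPoolExecutor.getTask."""
    return any(IDLE_MARKER in frame for frame in thread["frames"])


def summarize(dump):
    """Count threads per (pool name, thread state, idle?)."""
    counts = Counter()
    for thread in parse_threads(dump):
        pool = thread["name"].split(":")[0]
        counts[(pool, thread["state"], is_idle_worker(thread))] += 1
    return counts
```

To use it, save the output of jstack <pid> to a file and pass the file's text to summarize(). Entries whose last element is False are the ones worth reading in full.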
On Tue, Jul 1, 2014 at 2:09 PM, Brian Tarbox <ta...@cabotresearch.com>
wrote:
> We're running 1.2.13.
>
> Any chance that doing a rolling-restart would help?
>
> Would running without the "-pr" improve the odds?
>
> Thanks.
>
>
> On Tue, Jul 1, 2014 at 1:40 PM, Robert Coli <rc...@eventbrite.com> wrote:
>
>> On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
>> wrote:
>>
>>> I have a six node cluster in AWS (repl:3) and recently noticed that
>>> repair was hanging. I've run with the "-pr" switch.
>>>
>>
>> It'll do that.
>>
>> What version of Cassandra?
>>
>> =Rob
>>
>>
>
>
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Brian Tarbox <ta...@cabotresearch.com>.
We're running 1.2.13.
Any chance that doing a rolling-restart would help?
Would running without the "-pr" improve the odds?
Thanks.
On Tue, Jul 1, 2014 at 1:40 PM, Robert Coli <rc...@eventbrite.com> wrote:
> On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
> wrote:
>
>> I have a six node cluster in AWS (repl:3) and recently noticed that
>> repair was hanging. I've run with the "-pr" switch.
>>
>
> It'll do that.
>
> What version of Cassandra?
>
> =Rob
>
>
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:
> I have a six node cluster in AWS (repl:3) and recently noticed that repair
> was hanging. I've run with the "-pr" switch.
>
It'll do that.
What version of Cassandra?
=Rob
Re: nodetool repair saying "starting" and then nothing, and nothing
in any of the server logs either
Posted by Kevin Burton <bu...@spinn3r.com>.
If the boxes are idle, you could use jstack and look at the stack… perhaps
it's locked somewhere.
Worth a shot.
On Tue, Jul 1, 2014 at 9:24 AM, Brian Tarbox <ta...@cabotresearch.com>
wrote:
> I have a six node cluster in AWS (repl:3) and recently noticed that repair
> was hanging. I've run with the "-pr" switch.
>
> I see this output in the nodetool command line (and also in that node's
> system.log):
> Starting repair command #9, repairing 256 ranges for keyspace dev_a
>
> but then no other output. And I see nothing in any of the other node's
> log files.
>
> Right now the application using C* is turned off so there is zero activity.
> I've let it be in this state for up to 24 hours with nothing more logged.
>
> Any suggestions?
>