Posted to user@cassandra.apache.org by Bill Au <bi...@gmail.com> on 2012/04/24 23:55:31 UTC

nodetool repair hanging

I am running 1.0.8.  I am adding a new data center to an existing cluster.
Following the steps outlined in another thread on the mailing list, things went
fine except for the last step, which is to run repair on all the nodes in
the new data center.  Repair seems to be hanging indefinitely.  There is no
activity in system.log.  I did notice that the node being repaired is
requesting ranges from nodes in both the existing and the new data center.
Since there is no data in the new data center initially, I thought that
might be why repair is hanging.  So I broke out of the repair with
Control-C after waiting for a while.  I do see data being added to the new
nodes.  When I ran repair a second time, it still hung.

Why is repair hanging?  Is it safe to use Control-C to break out of it?
How do I recover from this?

Bill

Re: nodetool repair hanging

Posted by Bill Au <bi...@gmail.com>.
My cluster is very small (300 MB), and compaction was taking more than 2 hours.

I ended up bouncing all the nodes.  After that, I was able to run repair
on all nodes, and each one took less than a minute.
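Running repair on each node of the new data center one at a time, and timing each run, can be scripted. The following is a minimal Python sketch, assuming `nodetool` is on the PATH and accepts `-h <host> repair [keyspace]`; the function names and host names are hypothetical. It deliberately sets no timeout: as noted elsewhere in this thread, killing the nodetool client may not stop the repair on the server side.

```python
import subprocess
import time
from typing import Dict, List, Optional


def repair_command(host: str, keyspace: Optional[str] = None) -> List[str]:
    """Build the argv for `nodetool -h <host> repair [keyspace]`."""
    cmd = ["nodetool", "-h", host, "repair"]
    if keyspace is not None:
        cmd.append(keyspace)
    return cmd


def repair_all(hosts: List[str], keyspace: Optional[str] = None) -> Dict[str, float]:
    """Repair each host in turn and return elapsed wall-clock seconds per host."""
    elapsed: Dict[str, float] = {}
    for host in hosts:
        start = time.monotonic()
        # No timeout on purpose: killing the nodetool client may leave the
        # repair running on the server, so judge progress with
        # compactionstats/netstats instead. check=True raises on failure.
        subprocess.run(repair_command(host, keyspace), check=True)
        elapsed[host] = time.monotonic() - start
    return elapsed
```

For example, `repair_all(["dc2-node1", "dc2-node2"])` (hypothetical host names) would repair those two nodes sequentially and return how long each took.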

If this happens again I will be sure to run compactionstats and netstats.
Thanks for that tip.

Bill

On Wed, Apr 25, 2012 at 11:49 AM, Gregg Ulrich <gu...@netflix.com> wrote:

> How much data do you have and how long is "a while"?  In my experience
> repairs can take a very long time.  Check to see if validation compactions
> are running (nodetool compactionstats) or if files are streaming (nodetool
> netstats).  If either of those are in progress then your repair should be
> running.  I've seen 12 node, 50G clusters take days to repair to a new data
> center.
>
> Not sure if 1.0 is different but in 0.X I don't believe killing the
> nodetool process stops the repair.  When we need to stop a repair we have
> bounced all of the participating nodes.  I've been told that there is no
> harm in stopping repairs.
>

Re: nodetool repair hanging

Posted by Gregg Ulrich <gu...@netflix.com>.
How much data do you have, and how long is "a while"?  In my experience, repairs can take a very long time.  Check whether validation compactions are running (nodetool compactionstats) or files are streaming (nodetool netstats).  If either is in progress, your repair should still be running.  I've seen 12-node, 50 GB clusters take days to repair to a new data center.
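The check above can be sketched in Python as a heuristic. This is a minimal sketch, not a documented interface: it assumes `nodetool` is on the PATH, that `compactionstats` lists an in-progress validation compaction on a line containing "Validation", and that `netstats` lists active streams under "Streaming to:" / "Streaming from:" headings. These output formats are assumptions that may differ between Cassandra versions, and the function names are hypothetical.

```python
import subprocess


def run_nodetool(host: str, subcommand: str) -> str:
    """Run `nodetool -h <host> <subcommand>` and return its stdout."""
    result = subprocess.run(
        ["nodetool", "-h", host, subcommand],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def validation_running(compactionstats_output: str) -> bool:
    # Assumption: an in-progress validation compaction (the Merkle-tree
    # build that repair triggers) appears on a line containing "Validation".
    return any("validation" in line.lower()
               for line in compactionstats_output.splitlines())


def streaming_active(netstats_output: str) -> bool:
    # Assumption: active streams are listed under "Streaming to:" or
    # "Streaming from:" headings; an idle node prints neither.
    return any(line.strip().startswith(("Streaming to", "Streaming from"))
               for line in netstats_output.splitlines())


def repair_looks_active(host: str) -> bool:
    """Heuristic: repair is probably still working if either check is true."""
    return (validation_running(run_nodetool(host, "compactionstats"))
            or streaming_active(run_nodetool(host, "netstats")))
```

Calling `repair_looks_active("localhost")` periodically while repair runs gives a rough answer to "is it hanging or just slow?" without relying on system.log.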

Not sure if 1.0 is different, but in 0.x I don't believe killing the nodetool process stops the repair.  When we need to stop a repair, we have bounced all of the participating nodes.  I've been told that there is no harm in stopping repairs.
