Posted to user@cassandra.apache.org by Prem Yadav <ip...@gmail.com> on 2014/10/14 15:46:51 UTC

repair getting stuck

Hi,
This is an issue we have faced a couple of times now.

Every once in a while OpsCenter throws an error that the repair service failed
due to errors. In the logs we can see multiple lines like:

 Repair task (<Node nodename='-5517036565151358111'>,
(-6964720218971987043L, -6963882488374905088L), set([tables])) timed out
after 3600 seconds.

Manually running "nodetool repair -pr" on that node just hangs and
doesn't do anything.
Once we restart DSE, the repair job starts fine.
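
To see which nodes and token ranges keep timing out, log lines of the shape quoted above can be pulled apart with a short script. This is only a minimal sketch: the regular expression is inferred from the single sample line in this message rather than from any documented OpsCenter log format, and the function name parse_repair_timeouts is made up for illustration.

```python
import re

# Pattern inferred from the one sample line quoted in this thread
# (an assumption, not a documented OpsCenter log format).
TIMEOUT_RE = re.compile(
    r"Repair task \(<Node nodename='(?P<node>-?\d+)'>,\s*"
    r"\((?P<start>-?\d+)L?,\s*(?P<end>-?\d+)L?\),\s*"
    r"set\(\[(?P<tables>.*?)\]\)\)\s*timed out\s+after\s+(?P<seconds>\d+)\s+seconds"
)


def parse_repair_timeouts(log_text):
    """Return one dict per timed-out repair task found in log_text."""
    results = []
    for m in TIMEOUT_RE.finditer(log_text):
        results.append({
            "node": m.group("node"),
            "start_token": int(m.group("start")),
            "end_token": int(m.group("end")),
            "tables": m.group("tables"),
            "timeout_seconds": int(m.group("seconds")),
        })
    return results


if __name__ == "__main__":
    sample = (
        "Repair task (<Node nodename='-5517036565151358111'>, "
        "(-6964720218971987043L, -6963882488374905088L), set([tables])) "
        "timed out after 3600 seconds."
    )
    for task in parse_repair_timeouts(sample):
        print(task)
```

Counting the results per node or per token range would show whether the same ranges get stuck each time.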

Any ideas?

Thanks

Re: repair getting stuck

Posted by Robert Coli <rc...@eventbrite.com>.
On Tue, Oct 14, 2014 at 6:46 AM, Prem Yadav <ip...@gmail.com> wrote:

> Every once in a while OpsCenter throws an error that the repair service failed
> due to errors. In the logs we can see multiple lines like:
>
>  Repair task (<Node nodename='-5517036565151358111'>,
> (-6964720218971987043L, -6963882488374905088L), set([tables])) timed out
> after 3600 seconds.
>
> Manually running "nodetool repair -pr" on that node just hangs and
> doesn't do anything.
> Once we restart DSE, the repair job starts fine.
>

Repairs (streams, really) are fragile in all versions up to 2.1. In theory
the remaining edge cases are being squashed in 2.1.

I don't know what OpsCenter is doing, but this is likely Yet Another Case
of "repair hangs". Basically you need to restart some subset of the affected
nodes.
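
One way to notice a hung manual repair without watching a terminal is to wrap the command in a timeout. The following is a minimal sketch, not something from this thread: the helper name run_with_timeout and the 3600-second limit (borrowed from the OpsCenter message) are illustrative assumptions. It only detects that the nodetool client did not return; stopping the client does not necessarily clear a stuck repair on the node itself, so a restart may still be needed.

```python
import subprocess


def run_with_timeout(cmd, timeout_seconds):
    """Run cmd; return its exit code, or None if it did not finish in time."""
    try:
        # subprocess.run kills the child process when the timeout expires
        # and then raises TimeoutExpired.
        completed = subprocess.run(cmd, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return None
    return completed.returncode


if __name__ == "__main__":
    # 3600 s mirrors the timeout in the OpsCenter message; choose your own limit.
    rc = run_with_timeout(["nodetool", "repair", "-pr"], 3600)
    if rc is None:
        print("repair did not finish in time; the node may need a restart")
    else:
        print("repair exited with code", rc)
```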

https://issues.apache.org/jira/browse/CASSANDRA-3486
and
https://issues.apache.org/jira/browse/CASSANDRA-6651 et al for background

=Rob