
Posted to user@cassandra.apache.org by David Koblas <da...@koblas.com> on 2012/10/12 01:09:22 UTC

Repair Failing due to bad network

I'm trying to bring up a new datacenter. While I probably could have 
brought things up another way, I've now got a DC with Cassandra ready 
and keys allocated.  The problem is that I cannot get a repair to 
complete, since it appears that some part of my network decides to 
restart all connections twice a day (6am and 2pm - ok, 5 minutes 
before).

So when I start a repair job, it usually gets a ways into things before 
one of the nodes goes DOWN, then back up.  What I don't see is the 
repair restarting; it just stops.

Is there a workaround for this case, or is there something else I could 
be doing?

--david

Re: Repair Failing due to bad network

Posted by Rob Coli <rc...@palominodb.com>.

https://issues.apache.org/jira/browse/CASSANDRA-3483

Is directly on point for the use case in question, and introduces
the "rebuild" concept.

https://issues.apache.org/jira/browse/CASSANDRA-3487
https://issues.apache.org/jira/browse/CASSANDRA-3112

Are for improvements in repair sessions.

https://issues.apache.org/jira/browse/CASSANDRA-4767

Is for unambiguous indication of repair session status.

=Rob

-- 
=Robert Coli
AIM&GTALK - rcoli@palominodb.com
YAHOO - rcoli.palominob
SKYPE - rcoli_palominodb

Re: Repair Failing due to bad network

Posted by David Koblas <da...@koblas.com>.

Jim,

Great idea - though it doesn't look like it's in 1.1.3 (which is what 
I'm running).

My lame idea of the morning is that I'm going to just read the whole 
keyspace with QUORUM reads to force read repairs - the unfortunate truth 
is that this is about 2B reads...
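
For a rough sense of scale, here is a back-of-the-envelope estimate of how long about 2B reads would take. The read rates are illustrative assumptions, not measurements from this cluster, and the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def scan_duration_days(total_reads, reads_per_second):
    """Days needed to issue total_reads at a sustained reads_per_second."""
    return total_reads / reads_per_second / SECONDS_PER_DAY

# About 2B reads (from the message above) at a few assumed sustained rates.
for rate in (1_000, 5_000, 20_000):
    days = scan_duration_days(2_000_000_000, rate)
    print(f"{rate:>6} reads/s -> {days:.1f} days")
```

At an assumed 5,000 reads/s this comes to a little under five days, so the scan would also span several of the twice-daily network resets and would need to tolerate them.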

--david

On 10/11/12 4:51 PM, Jim Cistaro wrote:
> I am not aware of any built-in mechanism for retrying repairs.  I believe
> you will have to build that into your process.
>
> As for reducing the time of each repair command to fit in your windows:
>
> If you have multiple reasonable size column families, and are not already
> doing this, one approach might be to do repairs on a per cf basis.  This
> will break your repairs up into smaller chunks that might fit in the
> window.
>
> If you are not doing -pr (primary range), using that on each node causes
> the repair command to only repair the primary range on the node (not the
> ones it is replicating).
>
> Depending on your version, there is also
> https://issues.apache.org/jira/browse/CASSANDRA-3912 which might help you
> - but I have no experience using this feature.
>
> jc
>

Re: Repair Failing due to bad network

Posted by Jim Cistaro <jc...@netflix.com>.

I am not aware of any built-in mechanism for retrying repairs.  I believe
you will have to build that into your process.
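
A minimal sketch of what "building that into your process" could look like: a wrapper that reruns a command until it exits cleanly. Because the repair in this thread "just stops" rather than failing, the sketch treats a timeout as a failure. The function name, attempt count, timeout, and wait time are all assumptions, and killing the nodetool client on timeout may not cancel the repair session on the server side.

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, timeout_s=4 * 3600, wait_s=600,
                     runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits 0.

    Returns the attempt number that succeeded, or 0 if every attempt failed.
    runner and sleep are injectable so the loop can be exercised without
    a real cluster.
    """
    for attempt in range(1, attempts + 1):
        try:
            result = runner(cmd, timeout=timeout_s)
            if result.returncode == 0:
                return attempt
        except subprocess.TimeoutExpired:
            # A hung repair never exits on its own, so a timeout counts
            # as a failed attempt.
            pass
        if attempt < attempts:
            sleep(wait_s)  # give the network time to settle before retrying
    return 0

# Hypothetical usage (host and keyspace names are placeholders):
# run_with_retries(["nodetool", "-h", "node1", "repair", "my_keyspace"])
```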

As for reducing the time of each repair command to fit in your windows:

If you have multiple reasonable size column families, and are not already
doing this, one approach might be to do repairs on a per cf basis.  This
will break your repairs up into smaller chunks that might fit in the
window.
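
To decide whether a chunk fits in the window, one option is to compute the time left before the next network reset. This sketch assumes the resets happen at 05:55 and 13:55 local time (from "6am and 2pm - ok 5 minutes before" in the original question); the function names, the safety margin, and any duration estimate you feed in are hypothetical.

```python
import datetime as dt

# Assumed reset times, taken from the original question.
RESET_TIMES = (dt.time(5, 55), dt.time(13, 55))

def time_until_next_reset(now, reset_times=RESET_TIMES):
    """Return the timedelta from now (naive local time) to the next reset."""
    candidates = []
    for t in reset_times:
        candidate = dt.datetime.combine(now.date(), t)
        if candidate <= now:
            candidate += dt.timedelta(days=1)  # already passed today
        candidates.append(candidate)
    return min(candidates) - now

def fits_before_reset(now, estimated, margin=dt.timedelta(minutes=15)):
    """True if a job of the estimated duration, plus margin, ends before the next reset."""
    return estimated + margin <= time_until_next_reset(now)

now = dt.datetime(2012, 10, 12, 14, 0)
print(time_until_next_reset(now))                    # time left in the overnight window
print(fits_before_reset(now, dt.timedelta(hours=8)))
```

With these assumed times the overnight window (about 16 hours) is roughly twice as long as the daytime one (about 8 hours), so the largest column families are best started just after the afternoon reset.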

If you are not doing -pr (primary range), using that on each node causes
the repair command to only repair the primary range on the node (not the
ones it is replicating).
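
Combining the two suggestions above, the following sketch builds one nodetool invocation per column family, each limited to the node's primary range with -pr. The host, keyspace, and column family names are placeholders, the helper name is hypothetical, and the exact argument order nodetool accepts may vary by version.

```python
def repair_commands(host, keyspace, column_families, primary_range=True):
    """Build one 'nodetool repair' argument list per column family."""
    commands = []
    for cf in column_families:
        cmd = ["nodetool", "-h", host, "repair"]
        if primary_range:
            cmd.append("-pr")  # repair only this node's primary range
        cmd += [keyspace, cf]
        commands.append(cmd)
    return commands

# Placeholder names, for illustration only.
for cmd in repair_commands("node1", "my_keyspace", ["users", "events"]):
    print(" ".join(cmd))
```

With -pr, the same set of commands has to be run on every node in turn for the whole ring to be covered.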

Depending on your version, there is also
https://issues.apache.org/jira/browse/CASSANDRA-3912 which might help you
- but I have no experience using this feature.

jc
