Posted to user@cassandra.apache.org by Stan Lemon <sl...@salesforce.com> on 2015/08/14 20:33:00 UTC

Parallel repairs

Is it safe to run repairs in parallel on multiple nodes in the same DC at
the same time, or is this discouraged?

I've got a pretty neglected cluster where repairs have not been run for
quite some time, and on average I'm seeing them take about 3.5 days to
complete per node. Just trying to figure out if I can shave some time off
the total life of this process by having more than one run at the same time.

Thanks for your help,
Stan

Re: Parallel repairs

Posted by Stan Lemon <sl...@salesforce.com>.
Gotcha, we are using vnodes - so I'll go sequentially through both
datacenters. Unfortunately that's going to take me two months to complete
repairs at this rate. :(

Thanks again for your help,
SL
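For reference, the sequential walk described above can be sketched as a small dry-run script; the host names, the ssh invocation, and the function name below are illustrative assumptions, not anything from this thread:

```shell
#!/bin/sh
# Walk every node in both datacenters, one node at a time.
# Host names are placeholders; substitute your own node lists.
DC1_NODES="dc1-node01 dc1-node02 dc1-node03"
DC2_NODES="dc2-node01 dc2-node02 dc2-node03"

repair_node() {
    # -par repairs the replicas' ranges in parallel; -pr restricts the run
    # to the node's primary ranges so ranges are not repaired twice.
    # The echo makes this a dry run; remove it to actually execute.
    echo "ssh $1 nodetool repair -par -pr"
}

for node in $DC1_NODES $DC2_NODES; do
    repair_node "$node"
done
```

Dropping the echo turns the dry run into the real walk.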


On Mon, Aug 24, 2015 at 5:17 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Mon, Aug 24, 2015 at 2:14 PM, Stan Lemon <sl...@salesforce.com> wrote:
>
>> I do have one other logistical question for you. My tables all have RF 2,
>> and my topology is set to have one rack each in two different datacenters.
>> Each datacenter has 12 nodes, for a grand total of 24.  I am wondering,
>> since I am using nodetool repair --parallel scripted to walk the cluster
>> sequentially, whether I can have two scripts walking the cluster, one per
>> datacenter, or if I need to go through datacenter 1 and then datacenter 2
>> before looping back and starting the walk again.
>>
>
> If you're using vnodes, I would run repair on one node at a time.
>
> =Rob
>
>
>

Re: Parallel repairs

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Aug 24, 2015 at 2:14 PM, Stan Lemon <sl...@salesforce.com> wrote:

> I do have one other logistical question for you. My tables all have RF 2,
> and my topology is set to have one rack each in two different datacenters.
> Each datacenter has 12 nodes, for a grand total of 24.  I am wondering,
> since I am using nodetool repair --parallel scripted to walk the cluster
> sequentially, whether I can have two scripts walking the cluster, one per
> datacenter, or if I need to go through datacenter 1 and then datacenter 2
> before looping back and starting the walk again.
>

If you're using vnodes, I would run repair on one node at a time.

=Rob

Re: Parallel repairs

Posted by Stan Lemon <sl...@salesforce.com>.
Rob,
Thanks for all the tips.  So I do have SSDs and I set the throttle to 0,
and have started using the --parallel mode. It's taking me about 2.5 days
to complete a node at this point.

I do have one other logistical question for you. My tables all have RF 2,
and my topology is set to have one rack each in two different datacenters.
Each datacenter has 12 nodes, for a grand total of 24.  I am wondering,
since I am using nodetool repair --parallel scripted to walk the cluster
sequentially, whether I can have two scripts walking the cluster, one per
datacenter, or if I need to go through datacenter 1 and then datacenter 2
before looping back and starting the walk again.

Thanks for all your help,
Stan




On Mon, Aug 17, 2015 at 4:55 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Mon, Aug 17, 2015 at 1:37 PM, Stan Lemon <sl...@salesforce.com> wrote:
>
>> I have not changed the compaction throttle value for our cluster. I've
>> not been sure how to gauge where I can take this value.  Any guidance
>> here would be extremely appreciated.
>>
>
> If you have SSDs, probably just set it to 0?
>
> =Rob
>

Re: Parallel repairs

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Aug 17, 2015 at 1:37 PM, Stan Lemon <sl...@salesforce.com> wrote:

> I have not changed the compaction throttle value for our cluster. I've not
> been sure how to gauge where I can take this value.  Any guidance here
> would be extremely appreciated.
>

If you have SSDs, probably just set it to 0?

=Rob
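For the record, unthrottling compaction is a one-liner on a live node, sketched below; note that nodetool only changes the running process, so the matching cassandra.yaml setting (compaction_throughput_mb_per_sec) would also need updating for the change to survive a restart:

```shell
# Lift the compaction throughput cap on the running node (0 = unlimited).
nodetool setcompactionthroughput 0
```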


Re: Parallel repairs

Posted by Stan Lemon <sl...@salesforce.com>.
On Mon, Aug 17, 2015 at 2:31 PM, Robert Coli <rc...@eventbrite.com> wrote:

> Have you unthrottled compaction and etc.? 10 days is a long time...
>

I have not changed the compaction throttle value for our cluster. I've not
been sure how to gauge where I can take this value.  Any guidance here
would be extremely appreciated.

Re: Parallel repairs

Posted by Robert Coli <rc...@eventbrite.com>.
On Mon, Aug 17, 2015 at 8:50 AM, Stan Lemon <sl...@salesforce.com> wrote:
>
> Thanks for the reply.  I do have vnodes.  I was not aware of the -par flag
> on the repair command; I was actually referring to having repair run on
> both nodes A & B at the same time.  It sounds, though, like I should be
> using this -par flag but still go node by node, sequentially, one after
> the other?
>

That's what I was suggesting, yes.


> Right now just running 'nodetool repair' on one node is taking ~3.5 days;
> with 24 nodes I am now starting to question whether I should be increasing
> gc_grace_seconds to be much longer than the default 10 days, since it's
> going to take longer than 10 days to repair the whole cluster.
>

Yep, I personally recommend setting it to 34 days and then kicking off
repair on the first of the month. That way you have between 3 and 7 days
for it to complete, which should in most cases be enough time.

Have you unthrottled compaction and etc.? 10 days is a long time...

=Rob
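If gc_grace_seconds is raised as suggested, it is set per table via CQL; the keyspace and table names below are placeholders. A quick sanity check on the arithmetic:

```shell
# gc_grace_seconds takes seconds, so convert the suggested 34 days.
GC_GRACE=$((34 * 24 * 60 * 60))
echo "$GC_GRACE"

# The per-table change would then be applied with something like
# (my_keyspace.my_table is a placeholder):
#   cqlsh -e "ALTER TABLE my_keyspace.my_table WITH gc_grace_seconds = $GC_GRACE;"
```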

Re: Parallel repairs

Posted by Stan Lemon <sl...@salesforce.com>.
Rob,
Thanks for the reply.  I do have vnodes.  I was not aware of the -par flag
on the repair command; I was actually referring to having repair run on
both nodes A & B at the same time.  It sounds, though, like I should be
using this -par flag but still go node by node, sequentially, one after
the other?

Right now just running 'nodetool repair' on one node is taking ~3.5 days;
with 24 nodes I am now starting to question whether I should be increasing
gc_grace_seconds to be much longer than the default 10 days, since it's
going to take longer than 10 days to repair the whole cluster.

Thanks,
Stan


On Fri, Aug 14, 2015 at 2:44 PM, Robert Coli <rc...@eventbrite.com> wrote:

> On Fri, Aug 14, 2015 at 11:33 AM, Stan Lemon <sl...@salesforce.com>
> wrote:
>
>> Is it safe to run repairs in parallel on multiple nodes in the same DC at
>> the same time, or is this discouraged?
>>
>
> If you have enough headroom, it's safe. It may impact latency.
>
> It also depends on whether you have vnodes or not. If you don't, and you
> use the -par option to repair, you will repair a set of nodes but not all
> nodes. If you do use vnodes, a -par repair effectively already repairs
> ranges on all nodes. If you do a rolling -par -pr on a cluster with
> vnodes, you'll probably be getting close to ideal parallelism anyway.
>
> =Rob
>
>

Re: Parallel repairs

Posted by Robert Coli <rc...@eventbrite.com>.
On Fri, Aug 14, 2015 at 11:33 AM, Stan Lemon <sl...@salesforce.com> wrote:

> Is it safe to run repairs in parallel on multiple nodes in the same DC at
> the same time, or is this discouraged?
>

If you have enough headroom, it's safe. It may impact latency.

It also depends on whether you have vnodes or not. If you don't, and you
use the -par option to repair, you will repair a set of nodes but not all
nodes. If you do use vnodes, a -par repair effectively already repairs
ranges on all nodes. If you do a rolling -par -pr on a cluster with
vnodes, you'll probably be getting close to ideal parallelism anyway.

=Rob