Posted to user@cassandra.apache.org by Vincent Rischmann <me...@vrischmann.me> on 2016/10/27 14:27:53 UTC

Tools to manage repairs

Hi,

we have two Cassandra 2.1.15 clusters at work and are having some
trouble with repairs.

Each cluster has 9 nodes, and the amount of data is not gigantic, but
some column families have 300+ GB of data.
We tried `nodetool repair` for these tables, but when we tested it, it
put too much load on the whole cluster and impacted our production apps.
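
(For reference, a per-table run is just `nodetool repair <keyspace> <table>`,
and the subrange form that tools like Reaper drive internally looks roughly
like the sketch below; keyspace, table and token values are placeholders.)

    # repair one table at a time, primary ranges only, one node at a time
    nodetool repair -pr my_keyspace my_table

    # or repair an explicit token subrange of that table
    nodetool repair -st -9223372036854775808 -et -4611686018427387904 my_keyspace my_table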

Next we saw https://github.com/spotify/cassandra-reaper , tried it and
had some success until recently. For the past 2 to 3 weeks it has never
completed a repair run, deadlocking itself somehow.

I know DSE includes a repair service, but I'm wondering how other
Cassandra users manage repairs?

Vincent.

Re: Tools to manage repairs

Posted by Jonathan Haddad <jo...@jonhaddad.com>.
Vincent: currently, big partitions will give you performance problems over
time, even if you're using paging & slicing by clustering keys.  Please read
the JIRAs that Alex linked to; they provide in-depth explanations as to
why, from some of the best Cassandra operators in the world :)

On Fri, Oct 28, 2016 at 9:50 AM Vincent Rischmann <me...@vrischmann.me> wrote:

> Well I only asked that because I wanted to make sure that we're not doing
> it wrong, because that's actually how we query stuff,  we always provide a
> cluster key or a range of cluster keys.
>
> But yes, I understand that compactions may suffer and/or there may be
> hidden bottlenecks because of big partitions, so it's definitely good to
> know, and I'll definitely work on reducing partition sizes.
>
> On Fri, Oct 28, 2016, at 06:32 PM, Edward Capriolo wrote:
>
>
>
> On Fri, Oct 28, 2016 at 11:21 AM, Vincent Rischmann <me...@vrischmann.me>
> wrote:
>
>
> Doesn't paging help with this ? Also if we select a range via the cluster
> key we're never really selecting the full partition. Or is that wrong ?
>
>
> On Fri, Oct 28, 2016, at 05:00 PM, Edward Capriolo wrote:
>
> Big partitions are an anti-pattern here is why:
>
> First Cassandra is not an analytic datastore. Sure it has some UDFs and
> aggregate UDFs, but the true purpose of the data store is to satisfy point
> reads. Operations have strict timeouts:
>
> # How long the coordinator should wait for read operations to complete
> read_request_timeout_in_ms: 5000
>
> # How long the coordinator should wait for seq or index scans to complete
> range_request_timeout_in_ms: 10000
>
> This means you need to be able to satisfy the operation in 5 seconds.
> Which is not only the "think time" for 1 server, but if you are doing a
> quorum the operation has to complete and compare on 2 or more servers.
> Beyond these cutoffs are thread pools which fill up and start dropping
> requests once full.
>
> Something has to give, either functionality or physics. Particularly the
> physics of aggregating an ever-growing data set across N replicas in less
> than 5 seconds.  How many 2ms point reads will be blocked by 50 ms queries
> etc.
>
> I do not see the technical limitations of big partitions on disk is the
> only hurdle to climb here.
>
>
> On Fri, Oct 28, 2016 at 10:39 AM, Alexander Dejanovski <
> alex@thelastpickle.com> wrote:
>
> Hi Eric,
>
> that would be https://issues.apache.org/jira/browse/CASSANDRA-9754 by
> Michael Kjellman and https://issues.apache.org/jira/browse/CASSANDRA-11206 by
> Robert Stupp.
> If you haven't seen it yet, Robert's summit talk on big partitions is
> totally worth it :
> Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
> Slides :
> http://www.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016
>
> Cheers,
>
>
> On Fri, Oct 28, 2016 at 4:09 PM Eric Evans <jo...@gmail.com>
> wrote:
>
> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
> <al...@thelastpickle.com> wrote:
> > A few patches are pushing the limits of partition sizes so we may soon be
> > more comfortable with big partitions.
>
> You don't happen to have Jira links to these handy, do you?
>
>
>
> --
> Eric Evans
> john.eric.evans@gmail.com
>
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
>
> "Doesn't paging help with this ? Also if we select a range via the
> cluster key we're never really selecting the full partition. Or is that
> wrong ?"
>
> What I am suggestion is that the data store has had this practical
> limitation on size of partition since inception. As a result the common use
> case is not to use it in such a way. For example, the compaction manager
> may not be optimized for this cases, queries running across large
> partitions may cause more contention or lots of young gen garbage , queries
> running across large partitions may occupy the slots of the read stage etc.
>
>
>
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201602.mbox/%3CCAJjpQyTS2eaCcRBVa=ZmM-hcBX5nF4ovC1enW+SFfGwvngOi7g@mail.gmail.com%3E
>
> I think there is possibly some more "little details" to discover. Not in a
> bad thing. I just do not think it you can hand-waive like a specific thing
> someone is working on now or paging solves it. If it was that easy it would
> be solved by now :)
>
>
>

Re: Tools to manage repairs

Posted by Vincent Rischmann <me...@vrischmann.me>.
Well, I only asked that because I wanted to make sure that we're not
doing it wrong, because that's actually how we query stuff: we always
provide a clustering key or a range of clustering keys.

But yes, I understand that compactions may suffer and/or there may be
hidden bottlenecks because of big partitions, so it's definitely good to
know, and I'll work on reducing partition sizes.

On Fri, Oct 28, 2016, at 06:32 PM, Edward Capriolo wrote:
>
>
> On Fri, Oct 28, 2016 at 11:21 AM, Vincent Rischmann
> <me...@vrischmann.me> wrote:
>> __
>> Doesn't paging help with this ? Also if we select a range via the
>> cluster key we're never really selecting the full partition. Or is
>> that wrong ?
>>
>>
>> On Fri, Oct 28, 2016, at 05:00 PM, Edward Capriolo wrote:
>>> Big partitions are an anti-pattern here is why:
>>>
>>> First Cassandra is not an analytic datastore. Sure it has some UDFs
>>> and aggregate UDFs, but the true purpose of the data store is to
>>> satisfy point reads. Operations have strict timeouts:
>>>
>>> # How long the coordinator should wait for read operations to
>>> # complete
>>> read_request_timeout_in_ms: 5000
>>>
>>> # How long the coordinator should wait for seq or index scans to
>>> # complete
>>> range_request_timeout_in_ms: 10000
>>>
>>> This means you need to be able to satisfy the operation in 5
>>> seconds. Which is not only the "think time" for 1 server, but if you
>>> are doing a quorum the operation has to complete and compare on 2 or
>>> more servers. Beyond these cutoffs are thread pools which fill up
>>> and start dropping requests once full.
>>>
>>> Something has to give, either functionality or physics. Particularly
>>> the physics of aggregating an ever-growing data set across N
>>> replicas in less than 5 seconds.  How many 2ms point reads will be
>>> blocked by 50 ms queries etc.
>>>
>>> I do not see the technical limitations of big partitions on disk is
>>> the only hurdle to climb here.
>>>
>>>
>>> On Fri, Oct 28, 2016 at 10:39 AM, Alexander Dejanovski
>>> <al...@thelastpickle.com> wrote:
>>>> Hi Eric,
>>>>
>>>> that would be https://issues.apache.org/jira/browse/CASSANDRA-9754
>>>> by Michael Kjellman and
>>>> https://issues.apache.org/jira/browse/CASSANDRA-11206 by Robert
>>>> Stupp.
>>>> If you haven't seen it yet, Robert's summit talk on big partitions
>>>> is totally worth it :
>>>> Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
>>>> Slides :
>>>> http://www.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016
>>>>
>>>> Cheers,
>>>>
>>>>
>>>> On Fri, Oct 28, 2016 at 4:09 PM Eric Evans
>>>> <jo...@gmail.com> wrote:
>>>>> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
>>>>> <al...@thelastpickle.com> wrote:
>>>>> > A few patches are pushing the limits of partition sizes so we
>>>>> > may soon be
>>>>> > more comfortable with big partitions.
>>>>>
>>>>> You don't happen to have Jira links to these handy, do you?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>  Eric Evans john.eric.evans@gmail.com
>>>>>
>>>>
>>>>
>>>> --
>>>> -----------------
>>>> Alexander Dejanovski
>>>> France
>>>> @alexanderdeja
>>>>
>>>> Consultant
>>>> Apache Cassandra Consulting
>>>> http://www.thelastpickle.com[1]
>>>>
>>>>
>>
>
> "Doesn't paging help with this ? Also if we select a range via the
> cluster key we're never really selecting the full partition. Or is
> that wrong ?"
>
> What I am suggestion is that the data store has had this practical
> limitation on size of partition since inception. As a result the
> common use case is not to use it in such a way. For example, the
> compaction manager may not be optimized for this cases, queries
> running across large partitions may cause more contention or lots of
> young gen garbage , queries running across large partitions may occupy
> the slots of the read stage etc.
>
>
> http://mail-archives.apache.org/mod_mbox/cassandra-user/201602.mbox/%3CCAJjpQyTS2eaCcRBVa=ZmM-hcBX5nF4ovC1enW+SFfGwvngOi7g@mail.gmail.com%3E
>
> I think there is possibly some more "little details" to discover. Not
> in a bad thing. I just do not think it you can hand-waive like a
> specific thing someone is working on now or paging solves it. If it
> was that easy it would be solved by now :)
>


Links:

  1. http://www.thelastpickle.com/

Re: Tools to manage repairs

Posted by Edward Capriolo <ed...@gmail.com>.
On Fri, Oct 28, 2016 at 11:21 AM, Vincent Rischmann <me...@vrischmann.me>
wrote:

> Doesn't paging help with this ? Also if we select a range via the cluster
> key we're never really selecting the full partition. Or is that wrong ?
>
>
> On Fri, Oct 28, 2016, at 05:00 PM, Edward Capriolo wrote:
>
> Big partitions are an anti-pattern here is why:
>
> First Cassandra is not an analytic datastore. Sure it has some UDFs and
> aggregate UDFs, but the true purpose of the data store is to satisfy point
> reads. Operations have strict timeouts:
>
> # How long the coordinator should wait for read operations to complete
> read_request_timeout_in_ms: 5000
>
> # How long the coordinator should wait for seq or index scans to complete
> range_request_timeout_in_ms: 10000
>
> This means you need to be able to satisfy the operation in 5 seconds.
> Which is not only the "think time" for 1 server, but if you are doing a
> quorum the operation has to complete and compare on 2 or more servers.
> Beyond these cutoffs are thread pools which fill up and start dropping
> requests once full.
>
> Something has to give, either functionality or physics. Particularly the
> physics of aggregating an ever-growing data set across N replicas in less
> than 5 seconds.  How many 2ms point reads will be blocked by 50 ms queries
> etc.
>
> I do not see the technical limitations of big partitions on disk is the
> only hurdle to climb here.
>
>
> On Fri, Oct 28, 2016 at 10:39 AM, Alexander Dejanovski <
> alex@thelastpickle.com> wrote:
>
> Hi Eric,
>
> that would be https://issues.apache.org/jira/browse/CASSANDRA-9754 by
> Michael Kjellman and https://issues.apache.org/jira/browse/CASSANDRA-11206 by
> Robert Stupp.
> If you haven't seen it yet, Robert's summit talk on big partitions is
> totally worth it :
> Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
> Slides : http://www.slideshare.net/DataStax/myths-of-big-partitions
> -robert-stupp-datastax-cassandra-summit-2016
>
> Cheers,
>
>
> On Fri, Oct 28, 2016 at 4:09 PM Eric Evans <jo...@gmail.com>
> wrote:
>
> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
> <al...@thelastpickle.com> wrote:
> > A few patches are pushing the limits of partition sizes so we may soon be
> > more comfortable with big partitions.
>
> You don't happen to have Jira links to these handy, do you?
>
>
> --
> Eric Evans
> john.eric.evans@gmail.com
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
"Doesn't paging help with this ? Also if we select a range via the cluster
key we're never really selecting the full partition. Or is that wrong ?"

What I am suggesting is that the data store has had this practical
limitation on partition size since inception. As a result, the common use
case is not to use it in such a way. For example, the compaction manager
may not be optimized for these cases; queries running across large
partitions may cause more contention or lots of young-gen garbage; and
queries running across large partitions may occupy the slots of the read
stage, etc.


http://mail-archives.apache.org/mod_mbox/cassandra-user/201602.mbox/%3CCAJjpQyTS2eaCcRBVa=ZmM-hcBX5nF4ovC1enW+SFfGwvngOi7g@mail.gmail.com%3E

I think there are possibly some more "little details" to discover. Not in a
bad way. I just do not think you can hand-wave it away as if some specific
thing someone is working on now, or paging, solves it. If it were that easy
it would be solved by now :)
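
(One way to actually see that effect, assuming a default packaged install
where the system log lives under /var/log/cassandra, is to watch
GCInspector's pause reports while those queries run; a quick sketch:)

    # GCInspector logs every notable GC pause; a burst of long ParNew pauses
    # while large-partition queries run points at the young-gen churn above
    tail -f /var/log/cassandra/system.log | grep GCInspector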

Re: Tools to manage repairs

Posted by Vincent Rischmann <me...@vrischmann.me>.
Doesn't paging help with this? Also, if we select a range via the
clustering key we're never really selecting the full partition. Or is
that wrong?


On Fri, Oct 28, 2016, at 05:00 PM, Edward Capriolo wrote:
> Big partitions are an anti-pattern here is why:
>
> First Cassandra is not an analytic datastore. Sure it has some UDFs
> and aggregate UDFs, but the true purpose of the data store is to
> satisfy point reads. Operations have strict timeouts:
>
> # How long the coordinator should wait for read operations to complete
> read_request_timeout_in_ms: 5000
>
> # How long the coordinator should wait for seq or index scans to
> # complete
> range_request_timeout_in_ms: 10000
>
> This means you need to be able to satisfy the operation in 5 seconds.
> Which is not only the "think time" for 1 server, but if you are doing
> a quorum the operation has to complete and compare on 2 or more
> servers. Beyond these cutoffs are thread pools which fill up and start
> dropping requests once full.
>
> Something has to give, either functionality or physics. Particularly
> the physics of aggregating an ever-growing data set across N replicas
> in less than 5 seconds.  How many 2ms point reads will be blocked by
> 50 ms queries etc.
>
> I do not see the technical limitations of big partitions on disk is
> the only hurdle to climb here.
>
>
> On Fri, Oct 28, 2016 at 10:39 AM, Alexander Dejanovski
> <al...@thelastpickle.com> wrote:
>> Hi Eric,
>>
>> that would be
>> https://issues.apache.org/jira/browse/CASSANDRA-9754 by
>> Michael Kjellman and
>> https://issues.apache.org/jira/browse/CASSANDRA-11206 by
>> Robert Stupp.
>> If you haven't seen it yet, Robert's summit talk on big partitions is
>> totally worth it :
>> Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
>> Slides :
>> http://www.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016
>>
>> Cheers,
>>
>>
>> On Fri, Oct 28, 2016 at 4:09 PM Eric Evans
>> <jo...@gmail.com> wrote:
>>> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
>>>  <al...@thelastpickle.com> wrote:
>>>  > A few patches are pushing the limits of partition sizes so we may
>>>  > soon be
>>>  > more comfortable with big partitions.
>>>
>>>  You don't happen to have Jira links to these handy, do you?
>>>
>>>
>>>  --
>>>  Eric Evans john.eric.evans@gmail.com
>>
>> --
>> -----------------
>> Alexander Dejanovski
>> France
>> @alexanderdeja
>>
>> Consultant
>> Apache Cassandra Consulting
>> http://www.thelastpickle.com[1]
>>


Links:

  1. http://www.thelastpickle.com/

Re: Tools to manage repairs

Posted by Edward Capriolo <ed...@gmail.com>.
Big partitions are an anti-pattern, and here is why:

First, Cassandra is not an analytic datastore. Sure, it has some UDFs and
aggregate UDFs, but the true purpose of the data store is to satisfy point
reads. Operations have strict timeouts:

# How long the coordinator should wait for read operations to complete
read_request_timeout_in_ms: 5000

# How long the coordinator should wait for seq or index scans to complete
range_request_timeout_in_ms: 10000

This means you need to be able to satisfy the operation in 5 seconds. That
is not only the "think time" for one server: if you are reading at quorum,
the operation has to complete and compare on 2 or more servers. Behind
these cutoffs are thread pools which fill up and start dropping requests
once full.

Something has to give, either functionality or physics: in particular, the
physics of aggregating an ever-growing data set across N replicas in less
than 5 seconds, and how many 2 ms point reads will be blocked by 50 ms
queries, etc.

I do not see the technical limitation of big partitions on disk as the
only hurdle to climb here.
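
(If you want to watch those pools fill up and requests get dropped,
`nodetool tpstats` shows the read stage and the dropped-message counters;
a quick look:)

    # ReadStage Active/Pending show queued reads; the "Dropped" section at
    # the bottom counts READ / RANGE_SLICE messages shed under load
    nodetool tpstats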




On Fri, Oct 28, 2016 at 10:39 AM, Alexander Dejanovski <
alex@thelastpickle.com> wrote:

> Hi Eric,
>
> that would be https://issues.apache.org/jira/browse/CASSANDRA-9754 by
> Michael Kjellman and https://issues.apache.org/jira/browse/CASSANDRA-11206 by
> Robert Stupp.
> If you haven't seen it yet, Robert's summit talk on big partitions is
> totally worth it :
> Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
> Slides : http://www.slideshare.net/DataStax/myths-of-big-
> partitions-robert-stupp-datastax-cassandra-summit-2016
>
> Cheers,
>
>
> On Fri, Oct 28, 2016 at 4:09 PM Eric Evans <jo...@gmail.com>
> wrote:
>
>> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
>> <al...@thelastpickle.com> wrote:
>> > A few patches are pushing the limits of partition sizes so we may soon
>> be
>> > more comfortable with big partitions.
>>
>> You don't happen to have Jira links to these handy, do you?
>>
>>
>> --
>> Eric Evans
>> john.eric.evans@gmail.com
>>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>

Re: Tools to manage repairs

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Hi Eric,

that would be https://issues.apache.org/jira/browse/CASSANDRA-9754 by
Michael Kjellman and https://issues.apache.org/jira/browse/CASSANDRA-11206 by
Robert Stupp.
If you haven't seen it yet, Robert's summit talk on big partitions is
totally worth it :
Video : https://www.youtube.com/watch?v=N3mGxgnUiRY
Slides :
http://www.slideshare.net/DataStax/myths-of-big-partitions-robert-stupp-datastax-cassandra-summit-2016

Cheers,


On Fri, Oct 28, 2016 at 4:09 PM Eric Evans <jo...@gmail.com>
wrote:

> On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
> <al...@thelastpickle.com> wrote:
> > A few patches are pushing the limits of partition sizes so we may soon be
> > more comfortable with big partitions.
>
> You don't happen to have Jira links to these handy, do you?
>
>
> --
> Eric Evans
> john.eric.evans@gmail.com
>
-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Tools to manage repairs

Posted by Eric Evans <jo...@gmail.com>.
On Thu, Oct 27, 2016 at 4:13 PM, Alexander Dejanovski
<al...@thelastpickle.com> wrote:
> A few patches are pushing the limits of partition sizes so we may soon be
> more comfortable with big partitions.

You don't happen to have Jira links to these handy, do you?


-- 
Eric Evans
john.eric.evans@gmail.com

Re: Tools to manage repairs

Posted by Jeff Jirsa <je...@crowdstrike.com>.
If you go above ~1GB, the primary symptom you’ll see is a LOT of garbage created on reads (CASSANDRA-9754 details this).

 

As redesigning a data model is often expensive (engineering time, reloading data, etc.), one workaround is to tune your JVM to better handle situations where you create a lot of trash. One method that can help is to use a much larger eden size than the default – up to 50% of your total heap size.

 

For example, if you were using an 8G heap and 2G eden, going to 3G or 4G eden (the new-gen heap size in cassandra-env.sh) MAY work better for you if you're reading from large partitions (it can also crash your server in some cases, so TEST IT IN A LAB FIRST).
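
(A minimal sketch of the two knobs involved, assuming the stock 2.1
cassandra-env.sh; the sizes below are examples only, not a recommendation:)

    # cassandra-env.sh -- illustrative values, test in a lab first
    MAX_HEAP_SIZE="8G"
    # HEAP_NEWSIZE is the young generation (eden + survivors); this pushes it
    # well past the default guidance, toward ~50% of the heap
    HEAP_NEWSIZE="4G"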

 

- Jeff

 

From: Alexander Dejanovski <al...@thelastpickle.com>
Reply-To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Date: Thursday, October 27, 2016 at 2:13 PM
To: "user@cassandra.apache.org" <us...@cassandra.apache.org>
Subject: Re: Tools to manage repairs

 

The "official" recommendation would be 100MB, but it's hard to give a precise answer.
Keeping it under the GB seems like a good target.
A few patches are pushing the limits of partition sizes so we may soon be more comfortable with big partitions.

Cheers

 

Le jeu. 27 oct. 2016 21:28, Vincent Rischmann <me...@vrischmann.me> a écrit :

Yeah that particular table is badly designed, I intend to fix it, when the roadmap allows us to do it :)

What is the recommended maximum partition size ?

 

Thanks for all the information.

 

 

On Thu, Oct 27, 2016, at 08:14 PM, Alexander Dejanovski wrote:

3.3GB is already too high, and it's surely not good to have well performing compactions. Still I know changing a data model is no easy thing to do, but you should try to do something here.

Anticompaction is a special type of compaction and if an sstable is being anticompacted, then any attempt to run validation compaction on it will fail, telling you that you cannot have an sstable being part of 2 repair sessions at the same time, so incremental repair must be run one node at a time, waiting for anticompactions to end before moving from one node to the other.

Be mindful of running incremental repair on a regular basis once you started as you'll have two separate pools of sstables (repaired and unrepaired) that won't get compacted together, which could be a problem if you want tombstones to be purged efficiently.

Cheers,

 

Le jeu. 27 oct. 2016 17:57, Vincent Rischmann <me...@vrischmann.me> a écrit :

 

Ok, I think we'll give incremental repairs a try on a limited number of CFs first and then if it goes well we'll progressively switch more CFs to incremental.

 

I'm not sure I understand the problem with anticompaction and validation running concurrently. As far as I can tell, right now when a CF is repaired (either via reaper, or via nodetool) there may be compactions running at the same time. In fact, it happens very often. Is it a problem ?

 

As far as big partitions, the biggest one we have is around 3.3Gb. Some less big partitions are around 500Mb and less.

 

 

On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:

Oh right, that's what they advise :)

I'd say that you should skip the full repair phase in the migration procedure as that will obviously fail, and just mark all sstables as repaired (skip 1, 2 and 6).

Anyway you can't do better, so take a leap of faith there.

 

Intensity is already very low and 10000 segments is a whole lot for 9 nodes, you should not need that many.

 

You can definitely pick which CF you'll run incremental repair on, and still run full repair on the rest.

If you pick our Reaper fork, watch out for schema changes that add incremental repair fields, and I do not advise to run incremental repair without it, otherwise you might have issues with anticompaction and validation compactions running concurrently from time to time.

 

One last thing : can you check if you have particularly big partitions in the CFs that fail to get repaired ? You can run nodetool cfhistograms to check that.

 

Cheers,

 

 

 

On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <me...@vrischmann.me> wrote:

 

Thanks for the response.

 

We do break up repairs between tables, we also tried our best to have no overlap between repair runs. Each repair has 10000 segments (purely arbitrary number, seemed to help at the time). Some runs have an intensity of 0.4, some have as low as 0.05.

 

Still, sometimes one particular app (which does a lot of read/modify/write batches in quorum) gets slowed down to the point we have to stop the repair run.

 

But more annoyingly, since 2 to 3 weeks as I said, it looks like runs don't progress after some time. Every time I restart reaper, it starts to repair correctly again, up until it gets stuck. I have no idea why that happens now, but it means I have to baby sit reaper, and it's becoming annoying.

 

Thanks for the suggestion about incremental repairs. It would probably be a good thing but it's a little challenging to setup I think. Right now running a full repair of all keyspaces (via nodetool repair) is going to take a lot of time, probably like 5 days or more. We were never able to run one to completion. I'm not sure it's a good idea to disable autocompaction for that long.

 

But maybe I'm wrong. Is it possible to use incremental repairs on some column family only ?

 

 

On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:

Hi Vincent,

 

most people handle repair with : 

- pain (by hand running nodetool commands)

- cassandra range repair : https://github.com/BrianGallew/cassandra_range_repair

- Spotify Reaper

- and OpsCenter repair service for DSE users

 

Reaper is a good option I think and you should stick to it. If it cannot do the job here then no other tool will.

 

You have several options from here :
 * Try to break up your repair table by table and see which ones actually get stuck
 * Check your logs for any repair/streaming error
 * Avoid repairing everything :
   * you may have expendable tables
   * you may have TTLed only tables with no deletes, accessed with QUORUM CL only
 * You can try to relieve repair pressure in Reaper by lowering repair intensity (on the tables that get stuck)
 * You can try adding steps to your repair process by putting a higher segment count in reaper (on the tables that get stuck)
 * And lastly, you can turn to incremental repair. As you're familiar with Reaper already, you might want to take a look at our Reaper fork that handles incremental repair : https://github.com/thelastpickle/cassandra-reaper
   If you go down that way, make sure you first mark all sstables as repaired before you run your first incremental repair, otherwise you'll end up in anticompaction hell (bad bad place) : https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html

Even if people say that's not necessary anymore, it'll save you from a very bad first experience with incremental repair.

Furthermore, make sure you run repair daily after your first inc repair run, in order to work on small sized repairs.

 

Cheers,

 

 

On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <me...@vrischmann.me> wrote:

 

Hi,

 

we have two Cassandra 2.1.15 clusters at work and are having some trouble with repairs.

 

Each cluster has 9 nodes, and the amount of data is not gigantic but some column families have 300+Gb of data.

We tried to use `nodetool repair` for these tables but at the time we tested it, it made the whole cluster load too much and it impacted our production apps.

 

Next we saw https://github.com/spotify/cassandra-reaper , tried it and had some success until recently. Since 2 to 3 weeks it never completes a repair run, deadlocking itself somehow.

 

I know DSE includes a repair service but I'm wondering how do other Cassandra users manage repairs ?

 

Vincent.

-- 

-----------------

Alexander Dejanovski

France

@alexanderdeja

 

Consultant

Apache Cassandra Consulting

http://www.thelastpickle.com

 

-- 

-----------------

Alexander Dejanovski

France

@alexanderdeja

 

Consultant

Apache Cassandra Consulting

http://www.thelastpickle.com

 

-- 

-----------------

Alexander Dejanovski

France

@alexanderdeja

 

Consultant

Apache Cassandra Consulting

http://www.thelastpickle.com

 

-- 

-----------------

Alexander Dejanovski

France

@alexanderdeja

 

Consultant

Apache Cassandra Consulting

http://www.thelastpickle.com


Re: Tools to manage repairs

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
The "official" recommendation would be 100MB, but it's hard to give a
precise answer.
Keeping it under the GB seems like a good target.
A few patches are pushing the limits of partition sizes so we may soon be
more comfortable with big partitions.
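
(A quick way to check where a table stands against that target, with
placeholder keyspace and table names:)

    # "Compacted partition maximum/mean bytes" give the partition size spread
    nodetool cfstats my_keyspace.my_table | grep "Compacted partition"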

Cheers

Le jeu. 27 oct. 2016 21:28, Vincent Rischmann <me...@vrischmann.me> a écrit :

> Yeah that particular table is badly designed, I intend to fix it, when the
> roadmap allows us to do it :)
> What is the recommended maximum partition size ?
>
> Thanks for all the information.
>
>
> On Thu, Oct 27, 2016, at 08:14 PM, Alexander Dejanovski wrote:
>
> 3.3GB is already too high, and it's surely not good to have well
> performing compactions. Still I know changing a data model is no easy thing
> to do, but you should try to do something here.
>
> Anticompaction is a special type of compaction and if an sstable is being
> anticompacted, then any attempt to run validation compaction on it will
> fail, telling you that you cannot have an sstable being part of 2 repair
> sessions at the same time, so incremental repair must be run one node at a
> time, waiting for anticompactions to end before moving from one node to the
> other.
>
> Be mindful of running incremental repair on a regular basis once you
> started as you'll have two separate pools of sstables (repaired and
> unrepaired) that won't get compacted together, which could be a problem if
> you want tombstones to be purged efficiently.
>
> Cheers,
>
> Le jeu. 27 oct. 2016 17:57, Vincent Rischmann <me...@vrischmann.me> a écrit :
>
>
> Ok, I think we'll give incremental repairs a try on a limited number of
> CFs first and then if it goes well we'll progressively switch more CFs to
> incremental.
>
> I'm not sure I understand the problem with anticompaction and validation
> running concurrently. As far as I can tell, right now when a CF is repaired
> (either via reaper, or via nodetool) there may be compactions running at
> the same time. In fact, it happens very often. Is it a problem ?
>
> As far as big partitions, the biggest one we have is around 3.3Gb. Some
> less big partitions are around 500Mb and less.
>
>
> On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:
>
> Oh right, that's what they advise :)
> I'd say that you should skip the full repair phase in the migration
> procedure as that will obviously fail, and just mark all sstables as
> repaired (skip 1, 2 and 6).
> Anyway you can't do better, so take a leap of faith there.
>
> Intensity is already very low and 10000 segments is a whole lot for 9
> nodes, you should not need that many.
>
> You can definitely pick which CF you'll run incremental repair on, and
> still run full repair on the rest.
> If you pick our Reaper fork, watch out for schema changes that add
> incremental repair fields, and I do not advise to run incremental repair
> without it, otherwise you might have issues with anticompaction and
> validation compactions running concurrently from time to time.
>
> One last thing : can you check if you have particularly big partitions in
> the CFs that fail to get repaired ? You can run nodetool cfhistograms to
> check that.
>
> Cheers,
>
>
>
> On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <me...@vrischmann.me>
> wrote:
>
>
> Thanks for the response.
>
> We do break up repairs between tables, we also tried our best to have no
> overlap between repair runs. Each repair has 10000 segments (purely
> arbitrary number, seemed to help at the time). Some runs have an intensity
> of 0.4, some have as low as 0.05.
>
> Still, sometimes one particular app (which does a lot of read/modify/write
> batches in quorum) gets slowed down to the point we have to stop the repair
> run.
>
> But more annoyingly, since 2 to 3 weeks as I said, it looks like runs
> don't progress after some time. Every time I restart reaper, it starts to
> repair correctly again, up until it gets stuck. I have no idea why that
> happens now, but it means I have to baby sit reaper, and it's becoming
> annoying.
>
> Thanks for the suggestion about incremental repairs. It would probably be
> a good thing but it's a little challenging to setup I think. Right now
> running a full repair of all keyspaces (via nodetool repair) is going to
> take a lot of time, probably like 5 days or more. We were never able to run
> one to completion. I'm not sure it's a good idea to disable autocompaction
> for that long.
>
> But maybe I'm wrong. Is it possible to use incremental repairs on some
> column family only ?
>
>
> On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
>
> Hi Vincent,
>
> most people handle repair with :
> - pain (by hand running nodetool commands)
> - cassandra range repair :
> https://github.com/BrianGallew/cassandra_range_repair
> - Spotify Reaper
> - and OpsCenter repair service for DSE users
>
> Reaper is a good option I think and you should stick to it. If it cannot
> do the job here then no other tool will.
>
> You have several options from here :
>
>    - Try to break up your repair table by table and see which ones
>    actually get stuck
>    - Check your logs for any repair/streaming error
>    - Avoid repairing everything :
>    - you may have expendable tables
>       - you may have TTLed only tables with no deletes, accessed with
>       QUORUM CL only
>       - You can try to relieve repair pressure in Reaper by lowering
>    repair intensity (on the tables that get stuck)
>    - You can try adding steps to your repair process by putting a higher
>    segment count in reaper (on the tables that get stuck)
>    - And lastly, you can turn to incremental repair. As you're familiar
>    with Reaper already, you might want to take a look at our Reaper fork that
>    handles incremental repair :
>    https://github.com/thelastpickle/cassandra-reaper
>    If you go down that way, make sure you first mark all sstables as
>    repaired before you run your first incremental repair, otherwise you'll end
>    up in anticompaction hell (bad bad place) :
>    https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>    Even if people say that's not necessary anymore, it'll save you from a
>    very bad first experience with incremental repair.
>    Furthermore, make sure you run repair daily after your first inc
>    repair run, in order to work on small sized repairs.
>
>
> Cheers,
>
>
> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <me...@vrischmann.me>
> wrote:
>
>
> Hi,
>
> we have two Cassandra 2.1.15 clusters at work and are having some trouble
> with repairs.
>
> Each cluster has 9 nodes, and the amount of data is not gigantic but some
> column families have 300+Gb of data.
> We tried to use `nodetool repair` for these tables but at the time we
> tested it, it made the whole cluster load too much and it impacted our
> production apps.
>
> Next we saw https://github.com/spotify/cassandra-reaper , tried it and
> had some success until recently. Since 2 to 3 weeks it never completes a
> repair run, deadlocking itself somehow.
>
> I know DSE includes a repair service but I'm wondering how do other
> Cassandra users manage repairs ?
>
> Vincent.
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Tools to manage repairs

Posted by Vincent Rischmann <me...@vrischmann.me>.
Yeah, that particular table is badly designed; I intend to fix it when
the roadmap allows us to :)
What is the recommended maximum partition size?

Thanks for all the information.


On Thu, Oct 27, 2016, at 08:14 PM, Alexander Dejanovski wrote:
> 3.3GB is already too high, and it's surely not good to have well
>   performing compactions. Still I know changing a data model is no
>   easy thing to do, but you should try to do something here.
> Anticompaction is a special type of compaction and if an sstable is
> being anticompacted, then any attempt to run validation compaction on
> it will fail, telling you that you cannot have an sstable being part
> of 2 repair sessions at the same time, so incremental repair must be
> run one node at a time, waiting for anticompactions to end before
> moving from one node to the other.
> Be mindful of running incremental repair on a regular basis once you
> started as you'll have two separate pools of sstables (repaired and
> unrepaired) that won't get compacted together, which could be a
> problem if you want tombstones to be purged efficiently.
> Cheers,
>
> Le jeu. 27 oct. 2016 17:57, Vincent Rischmann <me...@vrischmann.me>
> a écrit :
>> __
>> Ok, I think we'll give incremental repairs a try on a limited number
>> of CFs first and then if it goes well we'll progressively switch more
>> CFs to incremental.
>>
>> I'm not sure I understand the problem with anticompaction and
>> validation running concurrently. As far as I can tell, right now when
>> a CF is repaired (either via reaper, or via nodetool) there may be
>> compactions running at the same time. In fact, it happens very often.
>> Is it a problem ?
>>
>> As far as big partitions, the biggest one we have is around 3.3Gb.
>> Some less big partitions are around 500Mb and less.
>>
>>
>> On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:
>>> Oh right, that's what they advise :)
>>> I'd say that you should skip the full repair phase in the migration
>>> procedure as that will obviously fail, and just mark all sstables as
>>> repaired (skip 1, 2 and 6).
>>> Anyway you can't do better, so take a leap of faith there.
>>>
>>> Intensity is already very low and 10000 segments is a whole lot for
>>> 9 nodes, you should not need that many.
>>>
>>> You can definitely pick which CF you'll run incremental repair on,
>>> and still run full repair on the rest.
>>> If you pick our Reaper fork, watch out for schema changes that add
>>> incremental repair fields, and I do not advise to run incremental
>>> repair without it, otherwise you might have issues with
>>> anticompaction and validation compactions running concurrently from
>>> time to time.
>>>
>>> One last thing : can you check if you have particularly big
>>> partitions in the CFs that fail to get repaired ? You can run
>>> nodetool cfhistograms to check that.
>>>
>>> Cheers,
>>>
>>>
>>>
>>> On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <me...@vrischmann.me>
>>> wrote:
>>>> __
>>>> Thanks for the response.
>>>>
>>>> We do break up repairs between tables, we also tried our best to
>>>> have no overlap between repair runs. Each repair has 10000 segments
>>>> (purely arbitrary number, seemed to help at the time). Some runs
>>>> have an intensity of 0.4, some have as low as 0.05.
>>>>
>>>> Still, sometimes one particular app (which does a lot of
>>>> read/modify/write batches in quorum) gets slowed down to the point
>>>> we have to stop the repair run.
>>>>
>>>> But more annoyingly, since 2 to 3 weeks as I said, it looks like
>>>> runs don't progress after some time. Every time I restart reaper,
>>>> it starts to repair correctly again, up until it gets stuck. I have
>>>> no idea why that happens now, but it means I have to baby sit
>>>> reaper, and it's becoming annoying.
>>>>
>>>> Thanks for the suggestion about incremental repairs. It would
>>>> probably be a good thing but it's a little challenging to setup I
>>>> think. Right now running a full repair of all keyspaces (via
>>>> nodetool repair) is going to take a lot of time, probably like 5
>>>> days or more. We were never able to run one to completion. I'm not
>>>> sure it's a good idea to disable autocompaction for that long.
>>>>
>>>> But maybe I'm wrong. Is it possible to use incremental repairs on
>>>> some column family only ?
>>>>
>>>>
>>>> On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
>>>>> Hi Vincent,
>>>>>
>>>>> most people handle repair with :
>>>>> - pain (by hand running nodetool commands)
>>>>> - cassandra range repair :
>>>>>   https://github.com/BrianGallew/cassandra_range_repair
>>>>> - Spotify Reaper
>>>>> - and OpsCenter repair service for DSE users
>>>>>
>>>>> Reaper is a good option I think and you should stick to it. If it
>>>>> cannot do the job here then no other tool will.
>>>>>
>>>>> You have several options from here :
>>>>>  * Try to break up your repair table by table and see which ones
>>>>>    actually get stuck
>>>>>  * Check your logs for any repair/streaming error
>>>>>  * Avoid repairing everything :
>>>>>    * you may have expendable tables
>>>>>    * you may have TTLed only tables with no deletes, accessed with
>>>>>      QUORUM CL only
>>>>>  * You can try to relieve repair pressure in Reaper by lowering
>>>>>    repair intensity (on the tables that get stuck)
>>>>>  * You can try adding steps to your repair process by putting a
>>>>>    higher segment count in reaper (on the tables that get stuck)
>>>>>  * And lastly, you can turn to incremental repair. As you're
>>>>>    familiar with Reaper already, you might want to take a look at
>>>>>    our Reaper fork that handles incremental repair :
>>>>>    https://github.com/thelastpickle/cassandra-reaper If you go
>>>>>    down that way, make sure you first mark all sstables as
>>>>>    repaired before you run your first incremental repair,
>>>>>    otherwise you'll end up in anticompaction hell (bad bad place)
>>>>>    :
>>>>>    https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>>>>>    Even if people say that's not necessary anymore, it'll save you
>>>>>    from a very bad first experience with incremental repair.
>>>>>    Furthermore, make sure you run repair daily after your first
>>>>>    inc repair run, in order to work on small sized repairs.
>>>>>
>>>>> Cheers,
>>>>>
>>>>>
>>>>> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann
>>>>> <me...@vrischmann.me> wrote:
>>>>>> __
>>>>>> Hi,
>>>>>>
>>>>>> we have two Cassandra 2.1.15 clusters at work and are having some
>>>>>> trouble with repairs.
>>>>>>
>>>>>> Each cluster has 9 nodes, and the amount of data is not gigantic
>>>>>> but some column families have 300+Gb of data.
>>>>>> We tried to use `nodetool repair` for these tables but at the
>>>>>> time we tested it, it made the whole cluster load too much and it
>>>>>> impacted our production apps.
>>>>>>
>>>>>> Next we saw https://github.com/spotify/cassandra-reaper , tried
>>>>>> it and had some success until recently. Since 2 to 3 weeks it
>>>>>> never completes a repair run, deadlocking itself somehow.
>>>>>>
>>>>>> I know DSE includes a repair service but I'm wondering how do
>>>>>> other Cassandra users manage repairs ?
>>>>>>
>>>>>> Vincent.
>>>>> --
>>>>> -----------------
>>>>> Alexander Dejanovski
>>>>> France
>>>>> @alexanderdeja
>>>>>
>>>>> Consultant
>>>>> Apache Cassandra Consulting
>>>>> http://www.thelastpickle.com[1]
>>>>
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com[2]
>>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com[3]


Links:

  1. http://www.thelastpickle.com/
  2. http://www.thelastpickle.com/
  3. http://www.thelastpickle.com/

Re: Tools to manage repairs

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
3.3GB is already too high, and it surely doesn't help compactions perform
well. Still, I know changing a data model is no easy thing to do, but you
should try to do something here.

Anticompaction is a special type of compaction: if an sstable is being
anticompacted, then any attempt to run a validation compaction on it will
fail, telling you that you cannot have an sstable be part of 2 repair
sessions at the same time. So incremental repair must be run one node at a
time, waiting for anticompactions to end before moving from one node to
the other.

Be mindful of running incremental repair on a regular basis once you've
started, as you'll have two separate pools of sstables (repaired and
unrepaired) that won't get compacted together, which could be a problem if
you want tombstones to be purged efficiently.
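
(If you need to check which pool a given sstable is in, the bundled
sstablemetadata tool prints its repairedAt marker; a sketch with a
placeholder path:)

    # "Repaired at: 0" means the sstable is still in the unrepaired pool;
    # a non-zero timestamp means it went through incremental repair
    sstablemetadata /var/lib/cassandra/data/my_keyspace/my_table-*/*-Data.db | grep "Repaired at"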

Cheers,

Le jeu. 27 oct. 2016 17:57, Vincent Rischmann <me...@vrischmann.me> a écrit :

> Ok, I think we'll give incremental repairs a try on a limited number of
> CFs first and then if it goes well we'll progressively switch more CFs to
> incremental.
>
> I'm not sure I understand the problem with anticompaction and validation
> running concurrently. As far as I can tell, right now when a CF is repaired
> (either via reaper, or via nodetool) there may be compactions running at
> the same time. In fact, it happens very often. Is it a problem ?
>
> As far as big partitions, the biggest one we have is around 3.3Gb. Some
> less big partitions are around 500Mb and less.
>
>
> On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:
>
> Oh right, that's what they advise :)
> I'd say that you should skip the full repair phase in the migration
> procedure as that will obviously fail, and just mark all sstables as
> repaired (skip 1, 2 and 6).
> Anyway you can't do better, so take a leap of faith there.
>
> Intensity is already very low and 10000 segments is a whole lot for 9
> nodes, you should not need that many.
>
> You can definitely pick which CF you'll run incremental repair on, and
> still run full repair on the rest.
> If you pick our Reaper fork, watch out for schema changes that add
> incremental repair fields, and I do not advise to run incremental repair
> without it, otherwise you might have issues with anticompaction and
> validation compactions running concurrently from time to time.
>
> One last thing : can you check if you have particularly big partitions in
> the CFs that fail to get repaired ? You can run nodetool cfhistograms to
> check that.
>
> Cheers,
>
>
>
> On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <me...@vrischmann.me>
> wrote:
>
>
> Thanks for the response.
>
> We do break up repairs between tables, we also tried our best to have no
> overlap between repair runs. Each repair has 10000 segments (purely
> arbitrary number, seemed to help at the time). Some runs have an intensity
> of 0.4, some have as low as 0.05.
>
> Still, sometimes one particular app (which does a lot of read/modify/write
> batches in quorum) gets slowed down to the point we have to stop the repair
> run.
>
> But more annoyingly, since 2 to 3 weeks as I said, it looks like runs
> don't progress after some time. Every time I restart reaper, it starts to
> repair correctly again, up until it gets stuck. I have no idea why that
> happens now, but it means I have to baby sit reaper, and it's becoming
> annoying.
>
> Thanks for the suggestion about incremental repairs. It would probably be
> a good thing but it's a little challenging to setup I think. Right now
> running a full repair of all keyspaces (via nodetool repair) is going to
> take a lot of time, probably like 5 days or more. We were never able to run
> one to completion. I'm not sure it's a good idea to disable autocompaction
> for that long.
>
> But maybe I'm wrong. Is it possible to use incremental repairs on some
> column family only ?
>
>
> On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
>
> Hi Vincent,
>
> most people handle repair with :
> - pain (by hand running nodetool commands)
> - cassandra range repair :
> https://github.com/BrianGallew/cassandra_range_repair
> - Spotify Reaper
> - and OpsCenter repair service for DSE users
>
> Reaper is a good option I think and you should stick to it. If it cannot
> do the job here then no other tool will.
>
> You have several options from here :
>
>    - Try to break up your repair table by table and see which ones
>    actually get stuck
>    - Check your logs for any repair/streaming error
>    - Avoid repairing everything :
>    - you may have expendable tables
>       - you may have TTLed only tables with no deletes, accessed with
>       QUORUM CL only
>       - You can try to relieve repair pressure in Reaper by lowering
>    repair intensity (on the tables that get stuck)
>    - You can try adding steps to your repair process by putting a higher
>    segment count in reaper (on the tables that get stuck)
>    - And lastly, you can turn to incremental repair. As you're familiar
>    with Reaper already, you might want to take a look at our Reaper fork that
>    handles incremental repair :
>    https://github.com/thelastpickle/cassandra-reaper
>    If you go down that way, make sure you first mark all sstables as
>    repaired before you run your first incremental repair, otherwise you'll end
>    up in anticompaction hell (bad bad place) :
>    https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>    Even if people say that's not necessary anymore, it'll save you from a
>    very bad first experience with incremental repair.
>    Furthermore, make sure you run repair daily after your first inc
>    repair run, in order to work on small sized repairs.
>
>
> Cheers,
>
>
> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <me...@vrischmann.me>
> wrote:
>
>
> Hi,
>
> we have two Cassandra 2.1.15 clusters at work and are having some trouble
> with repairs.
>
> Each cluster has 9 nodes, and the amount of data is not gigantic but some
> column families have 300+Gb of data.
> We tried to use `nodetool repair` for these tables but at the time we
> tested it, it made the whole cluster load too much and it impacted our
> production apps.
>
> Next we saw https://github.com/spotify/cassandra-reaper , tried it and
> had some success until recently. Since 2 to 3 weeks it never completes a
> repair run, deadlocking itself somehow.
>
> I know DSE includes a repair service but I'm wondering how do other
> Cassandra users manage repairs ?
>
> Vincent.
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> --
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Tools to manage repairs

Posted by Vincent Rischmann <me...@vrischmann.me>.
Ok, I think we'll give incremental repairs a try on a limited number of
CFs first and then if it goes well we'll progressively switch more CFs
to incremental.

I'm not sure I understand the problem with anticompaction and
validation running concurrently. As far as I can tell, right now when a
CF is repaired (either via reaper, or via nodetool) there may be
compactions running at the same time. In fact, it happens very often.
Is that a problem?

As for big partitions, the biggest one we have is around 3.3GB. Other
large partitions are around 500MB or less.


On Thu, Oct 27, 2016, at 05:37 PM, Alexander Dejanovski wrote:
> Oh right, that's what they advise :)
> I'd say that you should skip the full repair phase in the migration
> procedure as that will obviously fail, and just mark all sstables as
> repaired (skip 1, 2 and 6).
> Anyway you can't do better, so take a leap of faith there.
>
> Intensity is already very low and 10000 segments is a whole lot for 9
> nodes, you should not need that many.
>
> You can definitely pick which CF you'll run incremental repair on, and
> still run full repair on the rest.
> If you pick our Reaper fork, watch out for schema changes that add
> incremental repair fields, and I do not advise to run incremental
> repair without it, otherwise you might have issues with anticompaction
> and validation compactions running concurrently from time to time.
>
> One last thing : can you check if you have particularly big partitions
> in the CFs that fail to get repaired ? You can run nodetool
> cfhistograms to check that.
>
> Cheers,
>
>
>
> On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann
> <me...@vrischmann.me> wrote:
>> __
>> Thanks for the response.
>>
>> We do break up repairs between tables, we also tried our best to have
>> no overlap between repair runs. Each repair has 10000 segments
>> (purely arbitrary number, seemed to help at the time). Some runs have
>> an intensity of 0.4, some have as low as 0.05.
>>
>> Still, sometimes one particular app (which does a lot of
>> read/modify/write batches in quorum) gets slowed down to the point we
>> have to stop the repair run.
>>
>> But more annoyingly, since 2 to 3 weeks as I said, it looks like runs
>> don't progress after some time. Every time I restart reaper, it
>> starts to repair correctly again, up until it gets stuck. I have no
>> idea why that happens now, but it means I have to baby sit reaper,
>> and it's becoming annoying.
>>
>> Thanks for the suggestion about incremental repairs. It would
>> probably be a good thing but it's a little challenging to setup I
>> think. Right now running a full repair of all keyspaces (via nodetool
>> repair) is going to take a lot of time, probably like 5 days or more.
>> We were never able to run one to completion. I'm not sure it's a good
>> idea to disable autocompaction for that long.
>>
>> But maybe I'm wrong. Is it possible to use incremental repairs on
>> some column family only ?
>>
>>
>> On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
>>> Hi Vincent,
>>>
>>> most people handle repair with :
>>> - pain (by hand running nodetool commands)
>>> - cassandra range repair :
>>>   https://github.com/BrianGallew/cassandra_range_repair
>>> - Spotify Reaper
>>> - and OpsCenter repair service for DSE users
>>>
>>> Reaper is a good option I think and you should stick to it. If it
>>> cannot do the job here then no other tool will.
>>>
>>> You have several options from here :
>>>  * Try to break up your repair table by table and see which ones
>>>    actually get stuck
>>>  * Check your logs for any repair/streaming error
>>>  * Avoid repairing everything :
>>>    * you may have expendable tables
>>>    * you may have TTLed only tables with no deletes, accessed with
>>>      QUORUM CL only
>>>  * You can try to relieve repair pressure in Reaper by lowering
>>>    repair intensity (on the tables that get stuck)
>>>  * You can try adding steps to your repair process by putting a
>>>    higher segment count in reaper (on the tables that get stuck)
>>>  * And lastly, you can turn to incremental repair. As you're
>>>    familiar with Reaper already, you might want to take a look at
>>>    our Reaper fork that handles incremental repair :
>>>    https://github.com/thelastpickle/cassandra-reaper If you go down
>>>    that way, make sure you first mark all sstables as repaired
>>>    before you run your first incremental repair, otherwise you'll
>>>    end up in anticompaction hell (bad bad place) :
>>>    https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>>>    Even if people say that's not necessary anymore, it'll save you
>>>    from a very bad first experience with incremental repair.
>>>    Furthermore, make sure you run repair daily after your first inc
>>>    repair run, in order to work on small sized repairs.
>>>
>>> Cheers,
>>>
>>>
>>> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <me...@vrischmann.me>
>>> wrote:
>>>> __
>>>> Hi,
>>>>
>>>> we have two Cassandra 2.1.15 clusters at work and are having some
>>>> trouble with repairs.
>>>>
>>>> Each cluster has 9 nodes, and the amount of data is not gigantic
>>>> but some column families have 300+Gb of data.
>>>> We tried to use `nodetool repair` for these tables but at the time
>>>> we tested it, it made the whole cluster load too much and it
>>>> impacted our production apps.
>>>>
>>>> Next we saw https://github.com/spotify/cassandra-reaper , tried it
>>>> and had some success until recently. Since 2 to 3 weeks it never
>>>> completes a repair run, deadlocking itself somehow.
>>>>
>>>> I know DSE includes a repair service but I'm wondering how do other
>>>> Cassandra users manage repairs ?
>>>>
>>>> Vincent.
>>> --
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com[1]
>>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com[2]


Links:

  1. http://www.thelastpickle.com/
  2. http://www.thelastpickle.com/

Re: Tools to manage repairs

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Oh right, that's what they advise :)
I'd say you should skip the full repair phase of the migration procedure,
since that will obviously fail in your case, and just mark all sstables as
repaired (i.e. skip steps 1, 2 and 6).
You can't do much better anyway, so take a leap of faith there.
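
For what it's worth, the "mark all sstables as repaired" part boils down
to something like this on each node, one table at a time (a rough sketch
only, with made-up paths and keyspace/table names; cross-check it against
the DataStax procedure linked above):

    # stop the node cleanly before touching sstable metadata
    nodetool drain && sudo service cassandra stop

    # flag every sstable of the table as repaired
    # (sstablerepairedset ships with Cassandra 2.1)
    find /var/lib/cassandra/data/my_keyspace/my_table-*/ \
      -name '*Data.db' > /tmp/sstables.txt
    sstablerepairedset --really-set --is-repaired -f /tmp/sstables.txt

    # start the node again
    sudo service cassandra start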

Intensity is already very low, and 10000 segments is a whole lot for 9
nodes; you should not need that many.

You can definitely pick which CFs you run incremental repair on, and still
run full repair on the rest.
If you pick our Reaper fork, watch out for the schema changes that add the
incremental repair fields. I don't advise running incremental repair
without the fork, otherwise you might have issues with anticompaction and
validation compactions running concurrently from time to time.
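
For the nodetool side of "incremental on some CFs, full on the rest",
that's roughly the following (2.1 syntax, made-up keyspace/table names;
on 2.1 incremental repair has to be run in parallel mode):

    # incremental repair on the tables you pick
    nodetool repair -par -inc my_keyspace my_incremental_table

    # regular full repair on the others
    nodetool repair -par my_keyspace my_other_table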

One last thing: can you check whether you have particularly big partitions
in the CFs that fail to get repaired? You can run nodetool cfhistograms to
check that.
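
Something along these lines (made-up names; the "Partition Size" column
of the output is the one to look at, especially Max and the 99th
percentile):

    nodetool cfhistograms my_keyspace my_suspect_table

    # cfstats should also show "Compacted partition maximum bytes" per table
    nodetool cfstats my_keyspace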

Cheers,



On Thu, Oct 27, 2016 at 5:24 PM Vincent Rischmann <me...@vrischmann.me> wrote:

> Thanks for the response.
>
> We do break up repairs between tables, we also tried our best to have no
> overlap between repair runs. Each repair has 10000 segments (purely
> arbitrary number, seemed to help at the time). Some runs have an intensity
> of 0.4, some have as low as 0.05.
>
> Still, sometimes one particular app (which does a lot of read/modify/write
> batches in quorum) gets slowed down to the point we have to stop the repair
> run.
>
> But more annoyingly, since 2 to 3 weeks as I said, it looks like runs
> don't progress after some time. Every time I restart reaper, it starts to
> repair correctly again, up until it gets stuck. I have no idea why that
> happens now, but it means I have to baby sit reaper, and it's becoming
> annoying.
>
> Thanks for the suggestion about incremental repairs. It would probably be
> a good thing but it's a little challenging to setup I think. Right now
> running a full repair of all keyspaces (via nodetool repair) is going to
> take a lot of time, probably like 5 days or more. We were never able to run
> one to completion. I'm not sure it's a good idea to disable autocompaction
> for that long.
>
> But maybe I'm wrong. Is it possible to use incremental repairs on some
> column family only ?
>
>
> On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
>
> Hi Vincent,
>
> most people handle repair with :
> - pain (by hand running nodetool commands)
> - cassandra range repair :
> https://github.com/BrianGallew/cassandra_range_repair
> - Spotify Reaper
> - and OpsCenter repair service for DSE users
>
> Reaper is a good option I think and you should stick to it. If it cannot
> do the job here then no other tool will.
>
> You have several options from here :
>
>    - Try to break up your repair table by table and see which ones
>    actually get stuck
>    - Check your logs for any repair/streaming error
>    - Avoid repairing everything :
>       - you may have expendable tables
>       - you may have TTLed only tables with no deletes, accessed with
>       QUORUM CL only
>    - You can try to relieve repair pressure in Reaper by lowering
>    repair intensity (on the tables that get stuck)
>    - You can try adding steps to your repair process by putting a higher
>    segment count in reaper (on the tables that get stuck)
>    - And lastly, you can turn to incremental repair. As you're familiar
>    with Reaper already, you might want to take a look at our Reaper fork that
>    handles incremental repair :
>    https://github.com/thelastpickle/cassandra-reaper
>    If you go down that way, make sure you first mark all sstables as
>    repaired before you run your first incremental repair, otherwise you'll end
>    up in anticompaction hell (bad bad place) :
>    https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>    Even if people say that's not necessary anymore, it'll save you from a
>    very bad first experience with incremental repair.
>    Furthermore, make sure you run repair daily after your first inc
>    repair run, in order to work on small sized repairs.
>
>
> Cheers,
>
>
> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <me...@vrischmann.me>
> wrote:
>
>
> Hi,
>
> we have two Cassandra 2.1.15 clusters at work and are having some trouble
> with repairs.
>
> Each cluster has 9 nodes, and the amount of data is not gigantic but some
> column families have 300+Gb of data.
> We tried to use `nodetool repair` for these tables but at the time we
> tested it, it made the whole cluster load too much and it impacted our
> production apps.
>
> Next we saw https://github.com/spotify/cassandra-reaper , tried it and
> had some success until recently. Since 2 to 3 weeks it never completes a
> repair run, deadlocking itself somehow.
>
> I know DSE includes a repair service but I'm wondering how do other
> Cassandra users manage repairs ?
>
> Vincent.
>
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
>
--
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com

Re: Tools to manage repairs

Posted by Vincent Rischmann <me...@vrischmann.me>.
Thanks for the response.

We do break up repairs by table, and we also tried our best to have no
overlap between repair runs. Each repair run has 10000 segments (a purely
arbitrary number that seemed to help at the time). Some runs have an
intensity of 0.4, some as low as 0.05.

Still, sometimes one particular app (which does a lot of
read/modify/write batches at QUORUM) gets slowed down to the point where
we have to stop the repair run.

But more annoyingly, as I said, for the past 2 to 3 weeks runs seem to
stop progressing after a while. Every time I restart Reaper it starts
repairing correctly again, up until it gets stuck. I have no idea why
that happens now, but it means I have to babysit Reaper, and it's
becoming annoying.

Thanks for the suggestion about incremental repairs. It would probably be
a good thing, but I think it's a little challenging to set up. Right now
a full repair of all keyspaces (via nodetool repair) would take a lot of
time, probably 5 days or more; we were never able to run one to
completion. I'm not sure it's a good idea to disable autocompaction for
that long.

But maybe I'm wrong. Is it possible to use incremental repair on some
column families only?


On Thu, Oct 27, 2016, at 05:02 PM, Alexander Dejanovski wrote:
> Hi Vincent,
>
> most people handle repair with :
> - pain (by hand running nodetool commands)
> - cassandra range repair :
>   https://github.com/BrianGallew/cassandra_range_repair
> - Spotify Reaper
> - and OpsCenter repair service for DSE users
>
> Reaper is a good option I think and you should stick to it. If it
> cannot do the job here then no other tool will.
>
> You have several options from here :
>  * Try to break up your repair table by table and see which ones
>    actually get stuck
>  * Check your logs for any repair/streaming error
>  * Avoid repairing everything :
>    * you may have expendable tables
>    * you may have TTLed only tables with no deletes, accessed with
>      QUORUM CL only
>  * You can try to relieve repair pressure in Reaper by lowering repair
>    intensity (on the tables that get stuck)
>  * You can try adding steps to your repair process by putting a higher
>    segment count in reaper (on the tables that get stuck)
>  * And lastly, you can turn to incremental repair. As you're familiar
>    with Reaper already, you might want to take a look at our Reaper
>    fork that handles incremental repair :
>    https://github.com/thelastpickle/cassandra-reaper If you go down
>    that way, make sure you first mark all sstables as repaired before
>    you run your first incremental repair, otherwise you'll end up in
>    anticompaction hell (bad bad place) :
>    https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
>    Even if people say that's not necessary anymore, it'll save you
>    from a very bad first experience with incremental repair.
>    Furthermore, make sure you run repair daily after your first inc
>    repair run, in order to work on small sized repairs.
>
> Cheers,
>
>
> On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann
> <me...@vrischmann.me> wrote:
>> Hi,
>>
>> we have two Cassandra 2.1.15 clusters at work and are having some
>> trouble with repairs.
>>
>> Each cluster has 9 nodes, and the amount of data is not gigantic but
>> some column families have 300+Gb of data.
>> We tried to use `nodetool repair` for these tables but at the time we
>> tested it, it made the whole cluster load too much and it impacted
>> our production apps.
>>
>> Next we saw https://github.com/spotify/cassandra-reaper , tried it
>> and had some success until recently. Since 2 to 3 weeks it never
>> completes a repair run, deadlocking itself somehow.
>>
>> I know DSE includes a repair service but I'm wondering how do other
>> Cassandra users manage repairs ?
>>
>> Vincent.
> --
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com[1]


Links:

  1. http://www.thelastpickle.com/

Re: Tools to manage repairs

Posted by Alexander Dejanovski <al...@thelastpickle.com>.
Hi Vincent,

most people handle repair with:
- pain (running nodetool commands by hand)
- cassandra_range_repair:
https://github.com/BrianGallew/cassandra_range_repair
- Spotify Reaper
- and the OpsCenter repair service for DSE users

Reaper is a good option I think and you should stick to it. If it cannot do
the job here then no other tool will.

You have several options from here :

   - Try to break up your repairs table by table and see which ones
   actually get stuck (there's a rough nodetool sketch after this list)
   - Check your logs for any repair/streaming error
   - Avoid repairing everything :
      - you may have expendable tables
      - you may have TTLed only tables with no deletes, accessed with
      QUORUM CL only
   - You can try to relieve repair pressure in Reaper by lowering repair
   intensity (on the tables that get stuck)
   - You can try adding steps to your repair process by putting a higher
   segment count in reaper (on the tables that get stuck)
   - And lastly, you can turn to incremental repair. As you're familiar
   with Reaper already, you might want to take a look at our Reaper fork that
   handles incremental repair :
   https://github.com/thelastpickle/cassandra-reaper
   If you go down that way, make sure you first mark all sstables as
   repaired before you run your first incremental repair, otherwise you'll end
   up in anticompaction hell (bad bad place) :
   https://docs.datastax.com/en/cassandra/2.1/cassandra/operations/opsRepairNodesMigration.html
   Even if people say that's not necessary anymore, it'll save you from a
   very bad first experience with incremental repair.
   Furthermore, make sure you run repair daily after your first inc repair
   run, in order to work on small sized repairs.
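
To make the first bullet concrete, the plain nodetool version of breaking
repair up looks roughly like this (2.1 syntax; keyspace/table names and
the token range are made up, and tools like Reaper or
cassandra_range_repair essentially automate the subrange variant):

    # repair one table, restricted to the ranges this node owns as primary
    nodetool repair -par -pr my_keyspace my_big_table

    # or repair one small token subrange at a time
    nodetool repair -par -st -9223372036854775808 -et -9000000000000000000 \
      my_keyspace my_big_table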


Cheers,


On Thu, Oct 27, 2016 at 4:27 PM Vincent Rischmann <me...@vrischmann.me> wrote:

Hi,

we have two Cassandra 2.1.15 clusters at work and are having some trouble
with repairs.

Each cluster has 9 nodes, and the amount of data is not gigantic but some
column families have 300+Gb of data.
We tried to use `nodetool repair` for these tables but at the time we
tested it, it made the whole cluster load too much and it impacted our
production apps.

Next we saw https://github.com/spotify/cassandra-reaper , tried it and had
some success until recently. Since 2 to 3 weeks it never completes a repair
run, deadlocking itself somehow.

I know DSE includes a repair service but I'm wondering how do other
Cassandra users manage repairs ?

Vincent.

-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com