Posted to user@cassandra.apache.org by Attila Wind <at...@swf.technology> on 2020/10/08 12:04:11 UTC

best pointers to learn Cassandra maintenance

Hey Guys,

We have started to notice that although Cassandra performance is awesome
in the beginning, over time
- as more and more data is present in the tables,
- more and more deletes create tombstones,
- the cluster gets less well balanced here and there,
performance can drop quickly and significantly...

After a ~1 year learning curve we had to realize that from time to time we
run into things like "running repairs" and "running compactions", and that
understanding tombstones (row and range), TTLs, etc. becomes critical as
data is growing.
But on the other hand we also often see lots of warnings... Like "if you
start Cassandra Reaper you can not stop doing that" ...

I feel a bit confused now, and so far never ran into an article which 
really deeply explains: why?
Why this? Why that? Why not this?

So I think the time has come for us in the team to start focusing on
these topics now. Invest time in better understanding. Really learn what
"repair" means, and all the consequences of it, etc.

So
Does anyone have any "you must read it" recommendations around these 
"long term maintenance" topics?
I mean really well explained blog post(s), article(s), book(s). Not some 
"half done" or  "I quickly write a post because it was too long ago when 
I blogged something..." things  :-)

Good pointers would be appreciated!

thanks

-- 
Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932

Re: best pointers to learn Cassandra maintenance

Posted by Erick Ramirez <er...@datastax.com>.

In Cassandra, "repair" refers to anti-entropy repairs. I think that's where
most of the confusion is. DBAs see the word "repair" and think it is a
one-off operation to "fix something broken". Users incorrectly think that
once it is fixed then there shouldn't be a need to repair again.

However in a distributed environment, the reality is that replicas can get
out of sync for whatever reason -- nodes going offline, nodes temporarily
unresponsive, nodes suffering from a hardware failure, etc. Entropy ensues.

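It is necessary to keep the data consistent across the cluster so we run
anti-entropy repairs. The recommendation is that you run repairs at least
once every gc_grace_seconds (GCGS), because tombstones become eligible for
removal after GCGS and a replica that missed a delete could otherwise bring
the deleted data back. GCGS by default is 10 days so a good rule of thumb is
to run repairs once a week.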
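
To make the arithmetic concrete, here is a minimal Python sketch that checks
whether a repair schedule fits inside gc_grace_seconds. It is not part of any
Cassandra tooling: the function names are made up for illustration, and it
assumes the default GCGS of 864000 seconds (10 days), which a table may
override.

```python
# Default gc_grace_seconds is 864000 (10 days); individual tables can override it.
DEFAULT_GC_GRACE_SECONDS = 864_000
SECONDS_PER_DAY = 86_400


def repair_schedule_is_safe(interval_days, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Return True if repairing every `interval_days` stays within GCGS.

    Hypothetical helper for illustration. It only compares the interval with
    GCGS and ignores how long a repair run itself takes, so leave some margin.
    """
    return interval_days * SECONDS_PER_DAY <= gc_grace_seconds


def slack_days(interval_days, gc_grace_seconds=DEFAULT_GC_GRACE_SECONDS):
    """Days of margin between the repair interval and GCGS (negative = too slow)."""
    return gc_grace_seconds / SECONDS_PER_DAY - interval_days


if __name__ == "__main__":
    for days in (7, 14):
        print(days, repair_schedule_is_safe(days), slack_days(days))
```

With the default GCGS, a weekly schedule leaves three days of margin for the
repair run to finish, while a two-week schedule does not fit.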

Let me address some of the points you raised.

> ... we run into things like "running repairs", "running compactions",
> understand tombstones (row and range), TTLs, etc etc becomes critical as
> data is growing.
>
Compactions are part of the normal operation of Cassandra. You shouldn't
however be manually running compactions. If you are, something is wrong and
it's most likely a band-aid solution to an underlying problem you need to
address.

> But on the other hand we also see often lots of warnings... Like "if you
> start Cassandra Reaper you can not stop doing that" ...
>
As above, you need to run repairs regularly. It isn't a one-off operation.
Reaper is a good tool for managing repairs in an automated fashion.

Here are some useful resources on repairs in Cassandra:
- Repair document @ the Apache website -
https://cassandra.apache.org/doc/latest/operating/repair.html
- DataStax Academy video on Repair -
https://www.youtube.com/watch?v=5V5rGDTHs20
- YouTube playlist on DataStax Academy Cassandra Operations course -
https://www.youtube.com/playlist?list=PL2g2h-wyI4SrHMlHBJVe_or_Ryek2THgQ
- DataStax Doc on when to run repairs -
https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsRepairNodesWhen.html

Cheers!

Re: best pointers to learn Cassandra maintenance

Posted by Jeff Jirsa <jj...@gmail.com>.

On Thu, Oct 8, 2020 at 5:31 AM Attila Wind <at...@swf.technology> wrote:

> Hey Guys,
>
> We already started to feel that however Cassandra performance is awesome
> in the beginning over time
> - as more and more data is present in the tables,
> - more and more deletes creating tombstones,
> - cluster gets here and there not that well balanced
> performance can drop quickly and significantly...
>
> After ~1 year of learning curve we had to realize that time by time we run
> into things like "running repairs", "running compactions", understand
> tombstones (row and range), TTLs, etc etc becomes critical as data is
> growing.
> But on the other hand we also see often lots of warnings... Like "if you
> start Cassandra Reaper you can not stop doing that" ...
>
> I feel a bit confused now, and so far never ran into an article which
> really deeply explains: why?
> Why this? Why that? Why not this?
>
I know you're asking in general, but let me describe why it's hard - for
repair, in particular, there's a ton of nuance. There are two types of
repair (full and incremental), and then different scopes (-pr for primary
range, start/end tokens for sub range, repairing all the ranges on a host,
etc).

With full repair, you compare all the data in a token range, stream
differences, and you're done. If you run the same command 30 seconds later,
it has to do the exact same amount of work.
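
As a rough illustration of the sub range scope combined with full repair,
here is a small Python sketch that splits one token range into pieces and
builds a `nodetool repair` command line for each piece. It only prints the
commands, it does not run them. The helper names, the keyspace name and the
token values are hypothetical. `-full`, `-st` and `-et` are the nodetool
repair options for full repair, start token and end token.

```python
def split_range(start, end, parts):
    """Split the token range (start, end] into `parts` contiguous subranges."""
    if parts < 1 or end <= start:
        raise ValueError("need parts >= 1 and start < end")
    width = end - start
    # Integer arithmetic keeps the bounds exact even for 64-bit tokens.
    bounds = [start + width * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def full_subrange_repair_commands(keyspace, start, end, parts):
    """Build one full-repair command per subrange (hypothetical helper)."""
    return [
        f"nodetool repair -full -st {st} -et {et} {keyspace}"
        for st, et in split_range(start, end, parts)
    ]


if __name__ == "__main__":
    # Hypothetical keyspace and token range, for illustration only.
    for cmd in full_subrange_repair_commands("my_keyspace", 0, 100, 4):
        print(cmd)
```

Because each command is a full repair of a small slice, you can stop after
any slice and pick up later without leaving anything in a half-marked state.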

With incremental repair, it uses clean/dirty bits on data files, and
optimizes so you don't have to scan as much data on subsequent runs. This
ALSO means you have 2 different sets of data files - clean and dirty - and
they won't ever compact together until you promote dirty files to clean
files! THAT is the magic bit of knowledge that most people don't describe
when they say "once you start running incremental repair, you can't stop".
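
A toy model of that split, in Python, may help. This is not Cassandra's
actual compaction code. The `SSTable` class and both functions are invented
for illustration, and the only rule modelled is the one above: files marked
repaired ("clean") and files not marked repaired ("dirty") live in separate
pools and only compact with files from the same pool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    name: str
    repaired: bool  # the "clean" bit set by incremental repair


def compaction_pools(sstables):
    """Group data files into the pools that are allowed to compact together."""
    pools = {"repaired": [], "unrepaired": []}
    for table in sstables:
        pools["repaired" if table.repaired else "unrepaired"].append(table.name)
    return pools


def promote(sstables, names):
    """Mark the named files as repaired, as a successful incremental repair would."""
    return [SSTable(t.name, True) if t.name in names else t for t in sstables]


if __name__ == "__main__":
    files = [SSTable("a", True), SSTable("b", False), SSTable("c", False)]
    print(compaction_pools(files))                        # "a" is alone in its pool
    print(compaction_pools(promote(files, {"b", "c"})))   # all three can now compact
```

If incremental repair stops running, nothing ever calls the equivalent of
`promote`, so the two pools keep drifting apart.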

If you're using Reaper for full subrange repairs, you could stop at any
time. But if you're doing it for incremental, and you stop, you need to
unset all the repaired bits on the data files or you end up with data that
can't be compacted.
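
For unsetting the repaired bits, Cassandra ships an offline tool called
`sstablerepairedset`. The Python sketch below only assembles the command line
for a list of data files, it does not run anything. The helper names and the
file path are hypothetical, and it assumes the node is stopped before the
tool is run, so check the documentation for your version before relying on
it.

```python
import shlex


def unset_repaired_command(data_files):
    """Build the argv for marking the given data files as unrepaired.

    Hypothetical helper: it only builds the command, it never executes it.
    """
    if not data_files:
        raise ValueError("need at least one data file")
    return ["sstablerepairedset", "--really-set", "--is-unrepaired", *data_files]


def as_shell(argv):
    """Render an argv list as a copy-pasteable shell line."""
    return " ".join(shlex.quote(part) for part in argv)


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    print(as_shell(unset_repaired_command(["/data/ks/t/nb-1-big-Data.db"])))
```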

The time it takes to type out every single one of these surprising edge
cases / nuances is just too high for anyone to do it for free. Some books
try, but many of them are incomplete or out of date. It's a shame.

One day, hopefully, the database matures to a point where you don't need to
know how repair works in order to run a cluster. Oct 8 2020 is not that
day.

>
> So I think the time has come for us in the team to start focusing on these
> topics now. Invest time to better understanding. Really learn what "repair"
> means, and all consequences of it, etc
>
> So
> Does anyone have any "you must read it" recommendations around these "long
> term maintenance" topics?
>
Unfortunately, not really. There are some notes here
https://cassandra.apache.org/doc/latest/operating/index.html but they're
imperfect. It may be good for people to keep adding docs.

> I mean really well explained blog post(s), article(s), book(s). Not some
> "half done" or  "I quickly write a post because it was too long ago when I
> blogged something..." things  :-)
>
> Good pointers would be appreciated!
>
> thanks
> --
> Attila Wind
>
> http://www.linkedin.com/in/attilaw
> Mobile: +49 176 43556932