Posted to user@cassandra.apache.org by Satoshi Hikida <sa...@gmail.com> on 2016/07/14 07:41:38 UTC

Questions about anti-entropy repair

Hi,

I have two questions about anti-entropy repair.

Q1:
According to the DataStax documentation, it's recommended to run a full
repair weekly or monthly. Is that still needed even if a repair with the
partitioner range option ("nodetool repair -pr", in C* v2.2+) is set to run
periodically on every node in the cluster?

References:
- DataStax, "When to run anti-entropy repair",
http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesWhen.html


Q2:
When I want to restore a node, is it good practice to repair it without
using the non-repaired snapshots/backups, since the repair process is so slow?

I've done some simple verification of anti-entropy repair and found that
the repair process takes much more time than simply transferring the
replica data from the existing nodes to the node being restored.

My verification settings are as follows:

- 3 node cluster (N1, N2, N3)
- 2 CPUs, 8GB memory, 500GB HDD for each node
- Replication Factor is 3
- C* version is 2.2.6
- Compaction strategy is LCS

I prepared the test data as follows:

- a snapshot (10GB, fully repaired) for N1, N2, N3
- 1GB of SSTables (from incremental backups) for N1, N2, N3
- another 1GB of SSTables for N1 and N2

I've measured repair time for two cases.

- Case 1: repair N3 with the snapshot and the 1GB of SSTables
- Case 2: repair N3 with the snapshot only

In case 1, N3 needs to repair 12GB (though actually only 1GB of data is
updated, because the snapshot has already been repaired) and receives 1GB of
data from N1 or N2. In case 2, N3 also needs to repair 12GB (actually it
just compares Merkle trees for the 10GB) and receives 2GB of data from N1 or
N2.

The result showed that case 2 was faster than case 1 (case 1: 6889 sec,
case 2: 4535 sec). I suspect the repair process is very slow, and that it
would be better to repair a node without the non-repaired backed-up files
(snapshots or incremental backups) if the other replica nodes exist.

So... I guess if I just have non-repaired backups, what's the point of
using them? Looks like there's no merit... Am I missing something?

Regards,
Satoshi

Re: Questions about anti-entropy repair

Posted by Ryan Svihla <rs...@foundev.pro>.
I would say that only repairing when there is a known problem has a couple of logical issues, off the top of my head:

1. You're assuming hints are successfully being delivered within their time window. I've never really found any solid indication of that myself.
2. Unless you're using CL ALL, you really have no indication whether the other replicas (the ones not needed to satisfy the consistency level) succeeded the write on the initial attempt.

Now if you're using CL LOCAL_QUORUM you'll have reasonable consistency, and chances are pretty good that you eventually hit your RF anyway with read_repair, so I get the thought process behind what you're saying, Daemeon.

Likewise, I've seen well-sized clusters with steady, good workloads generally behave pretty well and not need to stream a lot of data during repair. But because of 1 and 2, even with good monitoring, that's a bit "running with scissors" for my taste, as I'm not confident there is enough monitoring coverage to ever guarantee you're "mostly meeting RF" or not.

Running repair within gc_grace_seconds should be something your workload can handle anyway, or you're not sized correctly (else what happens when you need to run repair after a major event?), so why not just keep it running.

YMMV, and if someone has kept their cluster up and running and knows all the stuff to look for, kudos. I still view it as a cheap cost to CYA, and even after working with Cassandra for 3 years in a wide variety of pretty crazy situations, I'm not confident I could keep a cluster healthy without running repair consistently.
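
If it helps, here is a minimal sketch of the kind of check I mean, using the DataStax Python driver against the pre-3.0 system.schema_columnfamilies table (the contact point and the printed format are just placeholders to adapt). It lists gc_grace_seconds per table so you can verify your repair cadence actually fits inside the smallest value:

    from cassandra.cluster import Cluster  # DataStax Python driver

    # Hypothetical contact point; point it at one of your own nodes.
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()

    # In C* 2.2 the table options live in system.schema_columnfamilies
    # (system_schema.tables only exists from 3.0 onwards).
    rows = session.execute(
        "SELECT keyspace_name, columnfamily_name, gc_grace_seconds "
        "FROM system.schema_columnfamilies")

    smallest = None
    for row in rows:
        print("%s.%s gc_grace_seconds=%d" %
              (row.keyspace_name, row.columnfamily_name, row.gc_grace_seconds))
        if smallest is None or row.gc_grace_seconds < smallest:
            smallest = row.gc_grace_seconds

    # A full repair cycle should complete well within this many seconds.
    print("smallest gc_grace_seconds: %s" % smallest)
    cluster.shutdown()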

regards,

Ryan Svihla


Re: Questions about anti-entropy repair

Posted by daemeon reiydelle <da...@gmail.com>.
I don't know if my perspective on this will assist, so YMMV:

Summary

   1. Nodetool repairs are required when a node has issues and can't get
   its (e.g. hinted handoff) resync done: culprit: usually network, sometimes
   container/vm, rarely disk.
   2. Scripts to do partition-range repair are a pain to maintain, and you
   have to be CONSTANTLY checking for new keyspaces, parsing them, etc.
   GitHub project, anyone?
   3. Monitor/monitor/monitor: if you do a best practices job of actually
   monitoring the FULL stack, you only need to do repairs when the world goes
   south.
   4. Are you alerted when errors show up in the logs, the network goes
   wacky, etc.? No? Then you have to CYA by doing hail-mary passes with
   periodic nodetool repairs.
   5. Nodetool repair is a CYA for a cluster whose status is not well
   monitored.

Daemeon's thoughts:

Nodetool repair is not required for a cluster that is, and "always has been",
in a known good state. Monitoring of the relevant logs/network/disk/etc. is
the only way that I know of to assure this state. Because nodes can disappear
(e.g. on AWS, and on EVERY ONE OF my clients' infrastructures: screwed-up
networks), the cluster *can* get overloaded (network traffic), causing hinted
handoffs to hit all of the worst-case corner cases you could never hope to
see.

So, if you have good monitoring in place to assure that there is known good
cluster behaviour (network, disk, etc.), repairs are not required until you
are alerted that a cluster health problem has occurred. Partition-range
repair is a pain in various parts of the anatomy because one has to
CONSTANTLY be updating the scripts that generate the commands (I have not
seen a GitHub project around this, and would love to see responses that
point one out!).
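
To make the point concrete, here is a bare-bones sketch of the
keyspace-discovery half of such a script (assuming cqlsh and nodetool are on
the local node's PATH and there is no authentication; everything it leaves
out, like sub-range splitting, retries, and scheduling across nodes, is the
part that needs constant babysitting):

    import subprocess

    def list_keyspaces():
        # 'DESCRIBE KEYSPACES' prints keyspace names separated by whitespace.
        out = subprocess.check_output(["cqlsh", "-e", "DESCRIBE KEYSPACES"])
        return out.decode().split()

    def repair_keyspace(keyspace):
        # Partitioner-range repair of one keyspace on the local node.
        subprocess.check_call(["nodetool", "repair", "-pr", keyspace])

    if __name__ == "__main__":
        for ks in list_keyspaces():
            if ks.startswith("system"):
                continue  # leave the internal keyspaces alone
            repair_keyspace(ks)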



.......

Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872

Re: Questions about anti-entropy repair

Posted by Alain RODRIGUEZ <ar...@gmail.com>.
Hi Satoshi,


> Q1:
> According to the DataStax document, it's recommended to run full repair
> weekly or monthly. Is it needed even if repair with partitioner range
> option ("nodetool repair -pr", in C* v2.2+) is set to run periodically for
> every node in the cluster?
>

More accurately, you need to run a repair for each node and each table
within the gc_grace_seconds value defined at the table level, to ensure no
deleted data will return. Running this on a regular basis also ensures a
constantly low entropy in your cluster, allowing better consistency (if you
are not using strong consistency, such as CL.R&W = QUORUM).

A full repair means every piece of data has been repaired. On a 3 node
cluster with RF=3, running 'nodetool repair -pr' on the 3 nodes or
'nodetool repair' on one node are equivalent "full repairs". The best
approach is often to run repair with '-pr' on all the nodes indeed. This is
a full repair.
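
As an illustration only (the node addresses are hypothetical, and it assumes
password-less SSH plus nodetool on each node's PATH), running the primary
ranges one node at a time looks roughly like this:

    import subprocess

    # Hypothetical addresses for the 3 nodes; replace with your own.
    NODES = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]

    for node in NODES:
        # '-pr' repairs only the ranges this node is primary for, so doing
        # it once on every node covers the whole ring exactly once.
        print("repairing primary ranges on %s" % node)
        subprocess.check_call(["ssh", node, "nodetool", "repair", "-pr"])

Running the nodes one at a time rather than in parallel also keeps
validation compactions and streaming from piling up on the cluster.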

Is it a good practice to repair a node without using non-repaired snapshots
> when I want to restore a node because repair process is too slow?


I am sorry, this is unclear to me. But from "actually 1GB data is updated
because the snapshot is already repaired" I understand that you are using
incremental repairs (or that you think Cassandra repair uses them by
default, which is not the case in your version).
http://www.datastax.com/dev/blog/more-efficient-repairs

Also, be aware that repair is a PITA for all operators using Cassandra,
which has led to many attempts to improve things:

Range repair: https://github.com/BrianGallew/cassandra_range_repair
Reaper: https://github.com/spotify/cassandra-reaper
Ticket to automatically schedule / handle repairs in Cassandra:
https://issues.apache.org/jira/browse/CASSANDRA-10070
Ticket to switch to Mutation Based Repairs (MBR):
https://issues.apache.org/jira/browse/CASSANDRA-8911

And probably many more... There is a lot to read and try; repair is an
important yet non-trivial topic for any Cassandra operator.

C*heers,
-----------------------
Alain Rodriguez - alain@thelastpickle.com
France

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com



