Posted to user@cassandra.apache.org by A J <s5...@gmail.com> on 2011/06/30 22:25:29 UTC

Meaning of 'nodetool repair has to run within GCGraceSeconds'

I am a little confused about the reason why nodetool repair has to run
within GCGraceSeconds.

The documentation at:
http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
is not very clear to me.

How can a delete be 'unforgotten' if I don't run nodetool repair? (I
understand that if a node is down for more than GCGraceSeconds, I
should not bring it up without resyncing it completely. Otherwise
deletes may reappear: http://wiki.apache.org/cassandra/DistributedDeletes)
But I am not sure how exactly nodetool repair ties into this mechanism of
distributed deletes.

Thanks for any clarifications.

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by A J <s5...@gmail.com>.

Just confirming. Thanks for the clarification.

On Tue, Jul 12, 2011 at 10:53 AM, Peter Schuller
<pe...@infidyne.com> wrote:
>> From "Cassandra the definitive guide" - Basic Maintenance - Repair
>> "Running nodetool repair causes Cassandra to execute a major compaction.....
>> During a major compaction (see “Compaction” in the Glossary), the
>> server initiates a
>> TreeRequest/TreeReponse conversation to exchange Merkle trees with neighboring
>> nodes."
>>
>> So is this text from the book misleading ?
>
> It's just being a bit less specific (I suppose maybe misleading can be
> claimed). If you repair everything on a node, that will imply a
> validating compaction (i.e., do the read part of the compaction stage
> but don't merge to and write new sstables) which is expensive for the
> usual reasons with disk I/O; it's "major" since it covers all data.
> The data read is in fact used to calculate a merkle tree for
> comparison with neighbors, as claimed.
>
> --
> / Peter Schuller
>

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by Peter Schuller <pe...@infidyne.com>.

> From "Cassandra the definitive guide" - Basic Maintenance - Repair
> "Running nodetool repair causes Cassandra to execute a major compaction.....
> During a major compaction (see “Compaction” in the Glossary), the
> server initiates a
> TreeRequest/TreeReponse conversation to exchange Merkle trees with neighboring
> nodes."
>
> So is this text from the book misleading ?

It's just being a bit imprecise (I suppose you could call it
misleading). If you repair everything on a node, that implies a
validating compaction (i.e., it does the read part of the compaction
stage but doesn't merge and write new sstables), which is expensive for
the usual disk I/O reasons; it's "major" in that it covers all data.
The data read is in fact used to calculate a Merkle tree for
comparison with neighbors, as claimed.
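
A rough sketch of that idea, not Cassandra's actual implementation: each
replica reads its rows (writing nothing), hashes them into a fixed number
of leaves, and only the leaves whose hashes differ need to be streamed
between neighbors. The function names, the bucket count, and the use of
SHA-256 are assumptions made for illustration; real Merkle trees in
Cassandra are built over token ranges.

```python
import hashlib
from typing import Dict, List


def leaf_hashes(rows: Dict[str, str], leaves: int = 4) -> List[str]:
    """Read every row and fold it into one of `leaves` hash buckets."""
    buckets = [hashlib.sha256() for _ in range(leaves)]
    for key in sorted(rows):
        # Hypothetical bucketing: hash the key to pick a leaf.
        index = int(hashlib.sha256(key.encode()).hexdigest(), 16) % leaves
        buckets[index].update(f"{key}={rows[key]};".encode())
    return [bucket.hexdigest() for bucket in buckets]


def merkle_root(hashes: List[str]) -> str:
    """Combine leaf hashes pairwise until a single root hash remains."""
    level = list(hashes)
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])  # duplicate the last node on odd levels
        level = [
            hashlib.sha256((level[i] + level[i + 1]).encode()).hexdigest()
            for i in range(0, len(level), 2)
        ]
    return level[0]


def differing_leaves(local: Dict[str, str], remote: Dict[str, str],
                     leaves: int = 4) -> List[int]:
    """Return the indexes of leaves that two replicas disagree on."""
    mine = leaf_hashes(local, leaves)
    theirs = leaf_hashes(remote, leaves)
    if merkle_root(mine) == merkle_root(theirs):
        return []  # roots match, so nothing needs to be exchanged
    return [i for i, (m, t) in enumerate(zip(mine, theirs)) if m != t]
```

If the roots match, the replicas agree and no data moves; otherwise only
the rows behind the differing leaves have to be compared.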

-- 
/ Peter Schuller

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by A J <s5...@gmail.com>.

From "Cassandra: The Definitive Guide" - Basic Maintenance - Repair:
"Running nodetool repair causes Cassandra to execute a major compaction...
During a major compaction (see “Compaction” in the Glossary), the
server initiates a TreeRequest/TreeResponse conversation to exchange
Merkle trees with neighboring nodes."

So is this text from the book misleading?

On Fri, Jul 8, 2011 at 10:36 AM, Jonathan Ellis <jb...@gmail.com> wrote:
> that's an internal term meaning "background i/o," not sstable merging per se.
>
> On Fri, Jul 8, 2011 at 9:24 AM, A J <s5...@gmail.com> wrote:
>> I think node repair involves some compaction too. See the issue:
>> https://issues.apache.org/jira/browse/CASSANDRA-2811
>> It talks of 'validation compaction' being triggered concurrently
>> during node repair.
>>
>> On Thu, Jun 30, 2011 at 8:51 PM, Watanabe Maki <wa...@gmail.com> wrote:
>>> Repair doesn't compact. Those are different processes already.
>>>
>>> maki
>>>
>>>
>>> On 2011/07/01, at 7:21, A J <s5...@gmail.com> wrote:
>>>
>>>> Thanks all !
>>>> In other words, I think it is safe to say that a node as a whole can
>>>> be made consistent only on 'nodetool repair'.
>>>>
>>>> Has there been enough interest in providing anti-entropy without
>>>> compaction as a separate operation (nodetool repair does both) ?
>>>>
>>>>
>>>> On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>>> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>>> Read repair does NOT repair tombstones.
>>>>>
>>>>> It does, but you can't rely on RR to repair _all_ tombstones, because
>>>>> RR only happens if the row in question is requested by a client.
>>>>>
>>>>> --
>>>>> Jonathan Ellis
>>>>> Project Chair, Apache Cassandra
>>>>> co-founder of DataStax, the source for professional Cassandra support
>>>>> http://www.datastax.com
>>>>>
>>>
>>
>
>
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by Jonathan Ellis <jb...@gmail.com>.

That's an internal term meaning "background I/O," not sstable merging per se.

On Fri, Jul 8, 2011 at 9:24 AM, A J <s5...@gmail.com> wrote:
> I think node repair involves some compaction too. See the issue:
> https://issues.apache.org/jira/browse/CASSANDRA-2811
> It talks of 'validation compaction' being triggered concurrently
> during node repair.
>
> On Thu, Jun 30, 2011 at 8:51 PM, Watanabe Maki <wa...@gmail.com> wrote:
>> Repair doesn't compact. Those are different processes already.
>>
>> maki
>>
>>
>> On 2011/07/01, at 7:21, A J <s5...@gmail.com> wrote:
>>
>>> Thanks all !
>>> In other words, I think it is safe to say that a node as a whole can
>>> be made consistent only on 'nodetool repair'.
>>>
>>> Has there been enough interest in providing anti-entropy without
>>> compaction as a separate operation (nodetool repair does both) ?
>>>
>>>
>>> On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>>> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>> Read repair does NOT repair tombstones.
>>>>
>>>> It does, but you can't rely on RR to repair _all_ tombstones, because
>>>> RR only happens if the row in question is requested by a client.
>>>>
>>>> --
>>>> Jonathan Ellis
>>>> Project Chair, Apache Cassandra
>>>> co-founder of DataStax, the source for professional Cassandra support
>>>> http://www.datastax.com
>>>>
>>
>



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by A J <s5...@gmail.com>.

I think node repair involves some compaction too. See the issue:
https://issues.apache.org/jira/browse/CASSANDRA-2811
It talks of 'validation compaction' being triggered concurrently
during node repair.

On Thu, Jun 30, 2011 at 8:51 PM, Watanabe Maki <wa...@gmail.com> wrote:
> Repair doesn't compact. Those are different processes already.
>
> maki
>
>
> On 2011/07/01, at 7:21, A J <s5...@gmail.com> wrote:
>
>> Thanks all !
>> In other words, I think it is safe to say that a node as a whole can
>> be made consistent only on 'nodetool repair'.
>>
>> Has there been enough interest in providing anti-entropy without
>> compaction as a separate operation (nodetool repair does both) ?
>>
>>
>> On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>>> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>> Read repair does NOT repair tombstones.
>>>
>>> It does, but you can't rely on RR to repair _all_ tombstones, because
>>> RR only happens if the row in question is requested by a client.
>>>
>>> --
>>> Jonathan Ellis
>>> Project Chair, Apache Cassandra
>>> co-founder of DataStax, the source for professional Cassandra support
>>> http://www.datastax.com
>>>
>

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by Watanabe Maki <wa...@gmail.com>.

Repair doesn't compact. Those are different processes already.

maki


On 2011/07/01, at 7:21, A J <s5...@gmail.com> wrote:

> Thanks all !
> In other words, I think it is safe to say that a node as a whole can
> be made consistent only on 'nodetool repair'.
> 
> Has there been enough interest in providing anti-entropy without
> compaction as a separate operation (nodetool repair does both) ?
> 
> 
> On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> Read repair does NOT repair tombstones.
>> 
>> It does, but you can't rely on RR to repair _all_ tombstones, because
>> RR only happens if the row in question is requested by a client.
>> 
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by A J <s5...@gmail.com>.

Thanks all!
In other words, I think it is safe to say that a node as a whole can
be made consistent only by running 'nodetool repair'.

Has there been enough interest in providing anti-entropy without
compaction as a separate operation (nodetool repair does both)?

On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> Read repair does NOT repair tombstones.
>
> It does, but you can't rely on RR to repair _all_ tombstones, because
> RR only happens if the row in question is requested by a client.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by AJ <aj...@dude.podzone.net>.

It would be helpful if this were automated somehow.

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by Edward Capriolo <ed...@gmail.com>.

On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:

> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com>
> wrote:
> > Read repair does NOT repair tombstones.
>
> It does, but you can't rely on RR to repair _all_ tombstones, because
> RR only happens if the row in question is requested by a client.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Doh! Right. I was thinking about range scans and read repair.
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Alternative-to-repair-td6098108.html

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by A J <s5...@gmail.com>.

Never mind. I see the issue with this. I will be able to catch the
writes as failed only if I set CL=ALL. For other CLs, I may not know
that it failed on some node.
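
A minimal sketch of why that is, assuming three replicas and the usual
quorum size of replication_factor / 2 + 1. The helper names are
hypothetical and are not part of any Cassandra client API.

```python
from typing import List


def required_acks(replication_factor: int, consistency_level: str) -> int:
    """Number of replica acknowledgements needed for a write to succeed."""
    if consistency_level == "ONE":
        return 1
    if consistency_level == "QUORUM":
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unsupported consistency level: {consistency_level}")


def write_reported_ok(acks: List[bool], consistency_level: str) -> bool:
    """True if the client sees success, given which replicas acknowledged."""
    return sum(acks) >= required_acks(len(acks), consistency_level)


# One of three replicas missed the write.
acks = [True, True, False]
results = {cl: write_reported_ok(acks, cl) for cl in ("ONE", "QUORUM", "ALL")}
```

With one replica missing the write, the client is told the write
succeeded at ONE and QUORUM; only at ALL does it learn that something
went wrong, so a list of "failed writes" kept at lower levels would be
incomplete.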

On Mon, Jul 11, 2011 at 2:33 PM, A J <s5...@gmail.com> wrote:
> Instead of doing nodetool repair, is it not a cheaper operation to
> keep tab of failed writes (be it deletes or inserts or updates) and
> read these failed writes at a set frequency in some batch job ? By
> reading them, RR would get triggered and they would get to a
> consistent state.
>
> Because these would targeted reads (only for those that failed during
> writes), it should be a shorter list and quick to repair (than
> nodetool repair).
>
>
> On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:
>> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> Read repair does NOT repair tombstones.
>>
>> It does, but you can't rely on RR to repair _all_ tombstones, because
>> RR only happens if the row in question is requested by a client.
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by A J <s5...@gmail.com>.

Instead of running nodetool repair, would it not be cheaper to
keep track of failed writes (be they deletes, inserts, or updates) and
read them back at a set frequency in some batch job? Reading them
would trigger RR and bring them to a consistent state.

Because these would be targeted reads (only for the writes that
failed), the list should be short and quicker to repair than a full
nodetool repair.

On Thu, Jun 30, 2011 at 5:27 PM, Jonathan Ellis <jb...@gmail.com> wrote:
> On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> Read repair does NOT repair tombstones.
>
> It does, but you can't rely on RR to repair _all_ tombstones, because
> RR only happens if the row in question is requested by a client.
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com
>

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by Jonathan Ellis <jb...@gmail.com>.

On Thu, Jun 30, 2011 at 3:47 PM, Edward Capriolo <ed...@gmail.com> wrote:
> Read repair does NOT repair tombstones.

It does, but you can't rely on RR to repair _all_ tombstones, because
RR only happens if the row in question is requested by a client.
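
A small illustration of that limitation, as a toy model rather than
Cassandra's code: replicas are plain dicts, a cell is a
(timestamp, value) pair, and a value of None stands for a tombstone. All
names here are made up for the example.

```python
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, Optional[str]]  # (timestamp, value); None marks a tombstone
Replica = Dict[str, Cell]


def read_repair(replicas: List[Replica], key: str) -> Optional[Cell]:
    """Read one key, pick the newest version, and push it to every replica."""
    versions = [replica[key] for replica in replicas if key in replica]
    if not versions:
        return None
    newest = max(versions, key=lambda cell: cell[0])
    for replica in replicas:
        replica[key] = newest
    return newest


def stale_keys(replicas: List[Replica]) -> List[str]:
    """Keys on which the replicas still disagree."""
    keys = set().union(*replicas)
    return sorted(
        k for k in keys if len({replica.get(k) for replica in replicas}) > 1
    )


def demo() -> Tuple[List[str], List[str]]:
    # Both deletes reached A and B, but C missed them.
    a: Replica = {"k1": (200, None), "k2": (200, None)}
    b: Replica = {"k1": (200, None), "k2": (200, None)}
    c: Replica = {"k1": (100, "v1"), "k2": (100, "v2")}
    replicas = [a, b, c]
    before = stale_keys(replicas)
    read_repair(replicas, "k1")  # a client happens to read only k1
    return before, stale_keys(replicas)
```

In this model the tombstone for k1 reaches C because a client read it,
while k2 stays inconsistent until something else (such as nodetool
repair) touches it.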

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by Edward Capriolo <ed...@gmail.com>.

On Thu, Jun 30, 2011 at 4:25 PM, A J <s5...@gmail.com> wrote:

> I am little confused of the reason why nodetool repair has to run
> within GCGraceSeconds.
>
> The documentation at:
> http://wiki.apache.org/cassandra/Operations#Frequency_of_nodetool_repair
> is not very clear to me.
>
> How can a delete be 'unforgotten' if I don't run nodetool repair? (I
> understand that if a node is down for more than GCGraceSeconds, I
> should not get it up without resynching is completely. Otherwise
> deletes may reappear.http://wiki.apache.org/cassandra/DistributedDeletes
> )
> But not sure how exactly nodetool repair ties into this mechanism of
> distributed deletes.
>
> Thanks for any clarifications.
>

Read repair does NOT repair tombstones. Failed writes/tombstones with
TimedOutException do not get hinted even if HH is on:
https://issues.apache.org/jira/browse/CASSANDRA-2034. Thus tombstones can
get lost.

Because of this, the only way to find lost tombstones is to run
anti-entropy repair. If you do not repair within the GC grace period, a
node could lose a tombstone and the row could be read repaired and
resurrected.

In our case we are lucky: we delete rows when they get old and stale. While
it is not great if a deleted row reappears, it is not harmful, so I can live
with less repairing than most.

Re: Meaning of 'nodetool repair has to run within GCGraceSeconds'

Posted by Konstantin Naryshkin <ko...@a-bb.net>.

As I understand it, this has to do with a node being up but missing the delete message (remember, if you apply the delete at CL.QUORUM, you can have almost half the replicas miss it and still succeed). Imagine that you have 3 nodes A, B, and C, each of which has a column 'foo' with a value 'bar'. Their state would be:
A: 'foo':'bar'     B: 'foo':'bar'     C: 'foo':'bar'

We attempt to delete column 'foo', and it succeeds on nodes A and B (meaning that we succeeded on CL.QUORUM). Unfortunately the packet going to node C runs afoul of the network gods and gets zapped in transit. The state is now:
A: 'foo':deleted     B: 'foo':deleted     C: 'foo':'bar'

If we try a read at this point, at CL.QUORUM, we are guaranteed to get at least one record that 'foo' was deleted and because of timestamps we know to tell the client as much.

After GCGraceSeconds and a compaction, the state of the nodes will be:
A: None     B: None     C: 'foo':'bar'

Some time later, we attempt a read and just happen to get C's response first. The response will be that 'foo' is storing 'bar'. Not only that, but read repair happens as well, so the state will become:
A: 'foo':'bar'     B: 'foo':'bar'     C: 'foo':'bar'

We have the infamous undelete.
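
The sequence above can be replayed with a toy simulation. This is only a
sketch under simplifying assumptions: replicas are dicts, a tombstone is
a cell whose value is None, anti-entropy repair is modeled as "sync every
key to its newest version", and GC_GRACE_SECONDS is set to 864000 (ten
days) for illustration. None of the names come from Cassandra itself.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

GC_GRACE_SECONDS = 864000  # illustrative value (ten days)


@dataclass(frozen=True)
class Cell:
    value: Optional[str]  # None marks a tombstone
    timestamp: int        # write time in seconds


Replica = Dict[str, Cell]


def compact(replica: Replica, now: int) -> None:
    """Purge tombstones older than GC_GRACE_SECONDS, as compaction would."""
    expired = [
        key for key, cell in replica.items()
        if cell.value is None and now - cell.timestamp > GC_GRACE_SECONDS
    ]
    for key in expired:
        del replica[key]


def read_with_repair(replicas: List[Replica], key: str) -> Optional[Cell]:
    """Return the newest cell any replica holds and write it to all of them."""
    cells = [replica[key] for replica in replicas if key in replica]
    if not cells:
        return None
    newest = max(cells, key=lambda cell: cell.timestamp)
    for replica in replicas:
        replica[key] = newest
    return newest


def anti_entropy_repair(replicas: List[Replica]) -> None:
    """Stand-in for nodetool repair: bring every key to its newest version."""
    for key in set().union(*replicas):
        read_with_repair(replicas, key)


def simulate(repair_before_gc: bool) -> Optional[str]:
    """Replay the A/B/C scenario and return what a later read of 'foo' sees."""
    a, b, c = ({"foo": Cell("bar", 100)} for _ in range(3))
    replicas = [a, b, c]

    # The delete succeeds at QUORUM on A and B; the message to C is lost.
    tombstone = Cell(None, 200)
    a["foo"] = tombstone
    b["foo"] = tombstone

    if repair_before_gc:
        anti_entropy_repair(replicas)  # C receives the tombstone in time

    # GCGraceSeconds passes and every node compacts.
    now = 200 + GC_GRACE_SECONDS + 1
    for replica in replicas:
        compact(replica, now)

    result = read_with_repair(replicas, "foo")
    return result.value if result else None
```

Without a repair inside the grace period, simulate(False) returns 'bar'
(the undelete); with one, simulate(True) returns None because every
replica held the tombstone before it was purged.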
